[jira] [Commented] (SPARK-16998) select($"column1", explode($"column2")) is extremely slow
[ https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16258340#comment-16258340 ]

Ruslan Dautkhanov commented on SPARK-16998:
-------------------------------------------

Can somebody please help review PR 19683 in SPARK-21657? There is still a lot of room for improvement in explode code generation for arrays of structs.

> select($"column1", explode($"column2")) is extremely slow
> ---------------------------------------------------------
>
>                 Key: SPARK-16998
>                 URL: https://issues.apache.org/jira/browse/SPARK-16998
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: TobiasP
>            Assignee: Herman van Hovell
>             Fix For: 2.2.0
>
> Using a Dataset containing 10.000 rows, each containing null and an array of
> 5.000 Ints, I observe the following performance (in local mode):
> {noformat}
> scala> time(ds.select(explode($"value")).sample(false, 0.001, 1).collect)
> 1.219052 seconds
>
> res9: Array[org.apache.spark.sql.Row] = Array([3761], [3766], [3196])
>
> scala> time(ds.select($"dummy", explode($"value")).sample(false, 0.001, 1).collect)
> 20.219447 seconds
>
> res5: Array[org.apache.spark.sql.Row] = Array([null,3761], [null,3766], [null,3196])
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
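The {noformat} transcript in the issue description calls a `time(...)` helper that is never defined in the report. A minimal plain-Scala sketch of such a helper (the name and printed format are assumptions, not part of Spark) could look like:

```scala
// Hypothetical sketch of the time(...) helper used in the transcript above.
// Not part of Spark; it simply times a by-name block with System.nanoTime.
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block          // evaluate the expression being measured
  val t1 = System.nanoTime()
  println(f"${(t1 - t0) / 1e9}%.6f seconds")
  result
}
```

With a helper like this, `time(ds.select(explode($"value")).collect)` returns the collected rows while printing the wall-clock time, matching the shape of the output shown in the description.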
[ https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15681374#comment-15681374 ]

Takeshi Yamamuro commented on SPARK-16998:
------------------------------------------

[~hvanhovell] Since SPARK-15214 improves this query by ~11x, I think we can also close this ticket:
https://github.com/apache/spark/pull/13065/files#diff-b7bf86a20a79d572f81093300568db6eR152
[ https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439336#comment-15439336 ]

Takeshi Yamamuro commented on SPARK-16998:
------------------------------------------

Can we link this ticket to SPARK-15214?
[ https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439331#comment-15439331 ]

Takeshi Yamamuro commented on SPARK-16998:
------------------------------------------

Yea, no problem. Thanks!
[ https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15439276#comment-15439276 ]

Herman van Hovell commented on SPARK-16998:
-------------------------------------------

[~maropu] Do you mind if I do it myself? I already started hacking.
[ https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438429#comment-15438429 ]

Takeshi Yamamuro commented on SPARK-16998:
------------------------------------------

If there is no objection, I'll pick up the PR.
[ https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437598#comment-15437598 ]

Herman van Hovell commented on SPARK-16998:
-------------------------------------------

I still have a code generation PR lying around: https://github.com/apache/spark/pull/13065
I could bring it up to date.
[ https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437130#comment-15437130 ]

Takeshi Yamamuro commented on SPARK-16998:
------------------------------------------

I checked performance:
{code}
$ ./bin/spark-shell --master=local[1]

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

def timer[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  val t1 = System.nanoTime()
  println("Elapsed time: " + ((t1 - t0) / 1e9) + "s")
  result
}

val numArray = X
val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
val schema = StructType(StructField("c0", IntegerType) :: StructField("c1", ArrayType(IntegerType)) :: Nil)
val rdd = sc.parallelize(0 :: Nil, 1).flatMap { _ =>
  (0 until 1000).map(j => Row(j, (0 until numArray).toArray))
}
val df = sqlCtx.createDataFrame(rdd, schema).cache

// Materialize the cache first so the timed run measures only the explode
df.queryExecution.executedPlan(0).execute().foreach(x => Unit)
timer {
  df.select($"c0", explode($"c1")).queryExecution.executedPlan(2).execute().foreach(x => Unit)
}
{code}
Performance results are as follows:
{code}
numArray: Int =  1024, Elapsed time: 0.485094303s
numArray: Int =  2048, Elapsed time: 1.78344344s
numArray: Int =  4096, Elapsed time: 7.037558308s
numArray: Int =  8192, Elapsed time: 26.498065697s
numArray: Int = 16384, Elapsed time: 117.13229056s
{code}
The elapsed time grows super-linearly, roughly quadrupling each time `numArray` doubles. It seems the root cause of this bottleneck is the many object copies (JoinedRow) made in `GenerateExec` with join=true. However, there is no simple way to fix this.
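The JoinedRow-copy bottleneck described in the benchmark comment can be illustrated with a toy, plain-Scala sketch. This is not Spark's actual `GenerateExec` code, and the names (`ToyRow`, `explodeWithOuter`) are made up for illustration: the point is that once the exploded values must be joined back to the other selected columns, one fresh row object is allocated per array element, so the copy work is proportional to rows × array length.

```scala
// Toy model (not Spark code): exploding an array column while keeping
// the other columns forces one row copy per emitted element.
case class ToyRow(fields: Vector[Any]) {
  // Appending the exploded value allocates a fresh row each time,
  // analogous to the per-element JoinedRow copies in GenerateExec.
  def copyWith(extra: Any): ToyRow = ToyRow(fields :+ extra)
}

def explodeWithOuter(rows: Seq[ToyRow], arr: ToyRow => Seq[Int]): Seq[ToyRow] =
  rows.flatMap(r => arr(r).map(v => r.copyWith(v)))
```

With 1,000 input rows of `numArray` elements each, a scheme like this emits 1,000 × `numArray` fresh row objects, which is consistent with the extra column (`$"c0"` / `$"dummy"`) dominating the runtime in the measurements above, whereas `select(explode($"value"))` alone can skip the join entirely.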