[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16222312#comment-16222312 ]
Ohad Raviv edited comment on SPARK-21657 at 10/27/17 12:53 PM:
---------------------------------------------------------------
Hi,
Just ran a profiler for this code:
{code:scala}
val BASE = 100000000
val N = 100000
val df = sc.parallelize(List(("1234567890", (BASE to (BASE+N)).map(x => (x.toString, (x+1).toString, (x+2).toString, (x+3).toString)).toList))).toDF("c1", "c_arr")
val df_exploded = df.select(expr("c1"), explode($"c_arr").as("c2"))
df_exploded.write.mode("overwrite").format("json").save("/tmp/blah_explode")
{code}
It looks like [~srowen] is right: most of the time (72.1%) is spent in scala.collection.immutable.List.apply(), inside org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext() (100%).
I logged the generated code and found the problematic part:
{code:java}
if (serializefromobject_funcResult1 != null) {
  serializefromobject_value5 = (scala.collection.immutable.List) serializefromobject_funcResult1;
} else {
  serializefromobject_isNull5 = true;
}
...
while (serializefromobject_loopIndex < serializefromobject_dataLength) {
  MapObjects_loopValue0 = (scala.Tuple4) (serializefromobject_value5.apply(serializefromobject_loopIndex));
{code}
That indexed List.apply() call inside the while loop is what causes the quadratic time complexity. However, I'm not sure where the code is that produces a List rather than an Array for the exploded column.
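As a minimal illustration of why that generated loop degrades (plain Scala, outside Spark; the object and helper names below are mine, not from the generated code): List.apply(i) walks the linked list from its head, so indexing every element in a while loop does O(n^2) work overall, while the same loop over an Array stays O(n).

{code:scala}
// Standalone sketch (not Spark's generated code): the same index-based while loop
// over a linked List vs. an Array. List.apply(i) is O(i), so the List loop is quadratic.
object ListApplyCost {
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label%-6s ${(System.nanoTime() - start) / 1e6}%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val n = 100000
    val asList: List[Int]   = (0 until n).toList
    val asArray: Array[Int] = (0 until n).toArray

    // Mirrors the generated `while (loopIndex < dataLength) { ... value.apply(loopIndex) ... }` pattern.
    time("List")  { var i = 0; var s = 0L; while (i < n) { s += asList(i);  i += 1 }; s }
    time("Array") { var i = 0; var s = 0L; while (i < n) { s += asArray(i); i += 1 }; s }
  }
}
{code}

With n = 100000 the List loop performs roughly n^2/2 node traversals versus n array reads, which lines up with the 72.1% of time reported in List.apply() above.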
> Spark has exponential time complexity to explode(array of structs)
> ------------------------------------------------------------------
>
>                 Key: SPARK-21657
>                 URL: https://issues.apache.org/jira/browse/SPARK-21657
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0
>            Reporter: Ruslan Dautkhanov
>              Labels: cache, caching, collections, nested_types, performance, pyspark, sparksql, sql
>         Attachments: ExponentialTimeGrowth.PNG, nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection (0.5m) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + table_name).cache()
> print cached_df.count()
> {code}
> The script generates a number of tables with the same total number of records across all nested collections (see the `scaling` variable in the loops). The `scaling` variable scales up how many nested elements each record has, but scales down the number of records in the table by the same factor, so the total number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to explode the nested collections (!) of 8k records.
> After 1000 elements in a nested collection, time grows exponentially.
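The attached script itself is not reproduced here; as a rough sketch of the experiment it describes, written in Scala rather than pyspark (the column names follow the quoted query, but the sizes, loop bounds, and timing code are assumptions, not the attachment): keep the total number of nested elements fixed, grow the per-record collection with `scaling`, and time the explode plus count.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Rough sketch of the scaling experiment: the total number of nested elements stays
// constant while each record's collection grows by `scaling`, so only the
// per-record collection size changes between runs.
object ExplodeScalingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("explode-scaling").getOrCreate()
    import spark.implicits._

    val total = 500000  // total elements across all nested collections
    for (scaling <- Seq(10, 100, 1000, 10000)) {  // the reporter saw 7 hours at 50,000
      val numRecords = total / scaling
      val df = (1 to numRecords)
        .map(id => (id, (1 to scaling).map(i => (i.toString, (i + 1).toString)).toList))
        .toDF("individ", "amft")

      val start = System.nanoTime()
      val exploded = df.select($"individ", explode($"amft")).cache()
      val rows = exploded.count()
      println(f"scaling=$scaling%-6d exploded rows=$rows%-8d took ${(System.nanoTime() - start) / 1e9}%.1f s")
      exploded.unpersist()
    }
    spark.stop()
  }
}
{code}

If the timings climb super-linearly as `scaling` grows while the exploded row count stays flat, that matches the behavior shown in the attached plot.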