[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16222312#comment-16222312 ]
Ohad Raviv edited comment on SPARK-21657 at 10/27/17 12:53 PM:
---------------------------------------------------------------
Hi,
Just ran a profiler for this code:
{code:scala}
val BASE = 100000000
val N = 100000
val df = sc.parallelize(List(("1234567890", (BASE to (BASE+N)).map(x => (x.toString, (x+1).toString, (x+2).toString, (x+3).toString)).toList))).toDF("c1", "c_arr")
val df_exploded = df.select(expr("c1"), explode($"c_arr").as("c2"))
df_exploded.write.mode("overwrite").format("json").save("/tmp/blah_explode")
{code}
It looks like [~srowen] is right: most of the time (72.1%) is spent in scala.collection.immutable.List.apply(), inside org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext() (100%).
I logged the generated code and found the problematic part:
{code:java}
if (serializefromobject_funcResult1 != null) {
  serializefromobject_value5 = (scala.collection.immutable.List) serializefromobject_funcResult1;
} else {
  serializefromobject_isNull5 = true;
}
...
while (serializefromobject_loopIndex < serializefromobject_dataLength) {
  MapObjects_loopValue0 = (scala.Tuple4) (serializefromobject_value5.apply(serializefromobject_loopIndex));
{code}
That indexed List.apply() call inside the while loop is what causes the quadratic time complexity. However, I'm not sure where the code is that produces a List rather than an Array for the exploded column.
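As a minimal illustration of why that generated loop degrades (plain Scala, outside Spark; the object and helper names below are mine, not from the generated code): List.apply(i) walks the linked list from its head, so indexing every element in a while loop does O(n^2) work overall, while the same loop over an Array stays O(n).

{code:scala}
// Standalone sketch (not Spark's generated code): the same index-based while loop
// over a linked List vs. an Array. List.apply(i) is O(i), so the List loop is quadratic.
object ListApplyCost {
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label%-6s ${(System.nanoTime() - start) / 1e6}%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val n = 100000
    val asList: List[Int]   = (0 until n).toList
    val asArray: Array[Int] = (0 until n).toArray

    // Mirrors the generated `while (loopIndex < dataLength) { ... value.apply(loopIndex) ... }` pattern.
    time("List")  { var i = 0; var s = 0L; while (i < n) { s += asList(i);  i += 1 }; s }
    time("Array") { var i = 0; var s = 0L; while (i < n) { s += asArray(i); i += 1 }; s }
  }
}
{code}

With n = 100000 the List loop performs roughly n^2/2 node traversals versus n array reads, which lines up with the 72.1% of time reported in List.apply() above.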
> Spark has exponential time complexity to explode(array of structs)
> ------------------------------------------------------------------
>
>                 Key: SPARK-21657
>                 URL: https://issues.apache.org/jira/browse/SPARK-21657
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0
>            Reporter: Ruslan Dautkhanov
>              Labels: cache, caching, collections, nested_types, performance, pyspark, sparksql, sql
>         Attachments: ExponentialTimeGrowth.PNG, nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection (0.5m) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + table_name).cache()
> print cached_df.count()
> {code}
> The script generates a number of tables with the same total number of records across all nested collections (see the `scaling` variable in the loops). The `scaling` variable scales up how many nested elements each record has, but scales down the number of records in the table by the same factor, so the total number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to explode the nested collections (!) of 8k records.
> After 1000 elements in a nested collection, time grows exponentially.
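The attached script itself is not reproduced here; as a rough sketch of the experiment it describes, written in Scala rather than pyspark (the column names follow the quoted query, but the sizes, loop bounds, and timing code are assumptions, not the attachment): keep the total number of nested elements fixed, grow the per-record collection with `scaling`, and time the explode plus count.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Rough sketch of the scaling experiment: the total number of nested elements stays
// constant while each record's collection grows by `scaling`, so only the
// per-record collection size changes between runs.
object ExplodeScalingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("explode-scaling").getOrCreate()
    import spark.implicits._

    val total = 500000  // total elements across all nested collections
    for (scaling <- Seq(10, 100, 1000, 10000)) {  // the reporter saw 7 hours at 50,000
      val numRecords = total / scaling
      val df = (1 to numRecords)
        .map(id => (id, (1 to scaling).map(i => (i.toString, (i + 1).toString)).toList))
        .toDF("individ", "amft")

      val start = System.nanoTime()
      val exploded = df.select($"individ", explode($"amft")).cache()
      val rows = exploded.count()
      println(f"scaling=$scaling%-6d exploded rows=$rows%-8d took ${(System.nanoTime() - start) / 1e9}%.1f s")
      exploded.unpersist()
    }
    spark.stop()
  }
}
{code}

If the timings climb super-linearly as `scaling` grows while the exploded row count stays flat, that matches the behavior shown in the attached plot.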