[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224365#comment-16224365 ]

Ohad Raviv commented on SPARK-21657:
------------------------------------

After further investigation I believe my assessment is correct: the former 
case creates a generator with join=true while the latter uses join=false, as you 
can see in the plans above (I also confirmed this while debugging). This causes 
the very long array of 100k elements to be duplicated 100k times and only pruned 
afterwards, because its column is not in the final projection. 
I'm not sure what the best way to address this issue is - perhaps amend the 
Generate operator according to the projection. A rough sketch of the two cases 
is below.
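To make the difference concrete, here is a minimal sketch of the two query 
shapes I am assuming correspond to the plans above (column and table names are 
taken from the attached reproducer script):

{code}
# Case 1 (assumed): other columns are selected next to explode(), so the
# analyzer produces Generate with join=true and every generated row carries
# the whole input row - including the 100k-element array column, which is
# only pruned by a later projection.
slow_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + table_name)

# Case 2 (assumed): only the exploded output is selected, so Generate is
# planned with join=false and the big array column is not duplicated per
# generated row.
fast_df = sqlc.sql('select explode(amft) from ' + table_name)
{code}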
In the meantime, in our case, I worked around it by manually adding the outer 
fields into each struct of the array and then exploding only the array. It's an 
ugly solution, but it reduces our query time from 6 hours to about 2 minutes; a 
rough sketch is below.
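Here is roughly what that looks like (not our exact code: the element schema of 
amft and the output field names are assumptions, and the repacking goes through 
a plain Python UDF so it runs on Spark 2.x):

{code}
from pyspark.sql import functions as F, types as T

# Assumed result type of the repacked array - adjust to the real schema of amft.
packed_type = T.ArrayType(T.StructType([
    T.StructField("individ", T.LongType()),
    T.StructField("hholdid", T.LongType()),
    T.StructField("amt",     T.DoubleType()),
]))

def _repack(individ, hholdid, amft):
    # Copy the outer fields into every struct of the array.
    # x[0] assumes each struct holds a single amount field - adjust as needed.
    return [(individ, hholdid, x[0]) for x in (amft or [])]

repack = F.udf(_repack, packed_type)

df = sqlc.table(table_name)

# Explode only the repacked array: the outer fields now travel inside each
# struct, so the generator no longer has to join every generated row back to
# the wide input row.
exploded = (df
    .select(repack('individ', 'hholdid', 'amft').alias('packed'))
    .select(F.explode('packed').alias('rec'))
    .select('rec.individ', 'rec.hholdid', 'rec.amt'))
{code}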

> Spark has exponential time complexity to explode(array of structs)
> ------------------------------------------------------------------
>
>                 Key: SPARK-21657
>                 URL: https://issues.apache.org/jira/browse/SPARK-21657
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0
>            Reporter: Ruslan Dautkhanov
>              Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
>         Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5m) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the 
> loops). The `scaling` variable scales up the number of nested elements in 
> each record, but scales down the number of records in the table by the same 
> factor, so the total number of records stays the same.
> Time grows exponentially (notice the log-10 scale on the vertical axis):
> !ExponentialTimeGrowth.PNG!
> At a scaling factor of 50,000 (see the attached pyspark script), it took 7 
> hours to explode the nested collections (\!) of 8k records.
> Beyond 1,000 elements per nested collection, time grows exponentially.


