[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16117051#comment-16117051 ]
Ruslan Dautkhanov commented on SPARK-21657: ------------------------------------------- Absolutely, this is a real use case. We have a lot of production data that rely on that kind of schema for BI reporting. Other Hadoop sql engines, including Hive and Impala scale its time to explode nested collections linearly. Spark has exponential complexity to explode nested collection. There is definitely a room for improvement, as after ~40k+ records in a nested collection, most time of the job is spent in exploding; after ~200k+ records in a nested collection, Spark is not usable. > Spark has exponential time complexity to explode(array of structs) > ------------------------------------------------------------------ > > Key: SPARK-21657 > URL: https://issues.apache.org/jira/browse/SPARK-21657 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL > Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0 > Reporter: Ruslan Dautkhanov > Attachments: ExponentialTimeGrowth.PNG, > nested-data-generator-and-test.py > > > It can take up to half a day to explode a modest-sizes nested collection > (0.5m). > On a recent Xeon processors. > See attached pyspark script that reproduces this problem. > {code} > cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + > table_name).cache() > print sqlc.count() > {code} > This script generate a number of tables, with the same total number of > records across all nested collection (see `scaling` variable in loops). > `scaling` variable scales up how many nested elements in each record, but by > the same factor scales down number of records in the table. So total number > of records stays the same. > Time grows exponentially (notice log-10 vertical axis scale): > !ExponentialTimeGrowth.PNG! > At scaling 50,000 it took 7 hours to explode the nested collections (\!) of > 8k records. > After 1000 elements in nested collection, time grows exponentially. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org