[ https://issues.apache.org/jira/browse/SPARK-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Armbrust resolved SPARK-7637.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.0

> StructType.merge slow with large denormalised tables O(N^2)
> -----------------------------------------------------------
>
>                 Key: SPARK-7637
>                 URL: https://issues.apache.org/jira/browse/SPARK-7637
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.1
>            Reporter: Rowan Chattaway
>            Priority: Minor
>             Fix For: 1.5.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> StructType.merge does a linear scan through the left schema and, for each
> element, scans the right schema. This results in an O(N^2) algorithm.
> I have found this to be very slow when dealing with large denormalised
> parquet files.
> I would like to make a small change to this function to map the fields of
> both the left and right schemas by name, resulting in O(N).
> This gives a sizable performance increase for large denormalised schemas.
> 10000x10000 column merge
> 2891ms Original
> 32ms with mapped field approach.
> This merge can be called many times, depending upon the number of files
> whose schemas you need to merge, compounding the performance cost.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
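
The difference between the two approaches described in the issue can be illustrated with a minimal sketch. This is not Spark's actual StructType.merge code: the `Field` tuple, the function names `merge_quadratic` and `merge_mapped`, and the rule that matching fields must have identical types are all assumptions made for illustration, and field names are assumed to be unique within each schema.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a schema field: (name, data type).
Field = Tuple[str, str]


def merge_quadratic(left: List[Field], right: List[Field]) -> List[Field]:
    """Nested scans: every left field triggers a scan of the right schema."""
    merged: List[Field] = []
    for name, dtype in left:
        # Linear search through the right schema for a field of the same name.
        match = next((f for f in right if f[0] == name), None)
        if match is not None and match[1] != dtype:
            raise ValueError(f"conflicting types for field {name!r}")
        merged.append((name, dtype))
    for name, dtype in right:
        # Another linear search, this time through the left schema.
        if not any(f[0] == name for f in left):
            merged.append((name, dtype))
    return merged


def merge_mapped(left: List[Field], right: List[Field]) -> List[Field]:
    """Index both schemas by name first, so each lookup is O(1) on average."""
    right_by_name: Dict[str, str] = dict(right)
    left_names = {name for name, _ in left}
    merged: List[Field] = []
    for name, dtype in left:
        other = right_by_name.get(name)
        if other is not None and other != dtype:
            raise ValueError(f"conflicting types for field {name!r}")
        merged.append((name, dtype))
    # Append right-only fields, preserving their original order.
    merged.extend((n, t) for n, t in right if n not in left_names)
    return merged
```

Both functions return the left fields followed by the fields that appear only on the right; the mapped version simply replaces the inner scans with hash lookups, which is where the reduction from O(N^2) to O(N) comes from.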