[jira] [Assigned] (SPARK-7637) StructType.merge slow with large denormalised tables O(N^2)
[ https://issues.apache.org/jira/browse/SPARK-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7637:
-----------------------------------

    Assignee: Apache Spark

> StructType.merge slow with large denormalised tables O(N^2)
> -----------------------------------------------------------
>
>                 Key: SPARK-7637
>                 URL: https://issues.apache.org/jira/browse/SPARK-7637
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.1
>            Reporter: Rowan Chattaway
>            Assignee: Apache Spark
>            Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> StructType.merge does a linear scan through the left schema and, for each element, scans the right schema. This results in an O(N^2) algorithm.
> I have found this to be very slow when dealing with large denormalised parquet files.
> I would like to make a small change to this function to map the fields of both the left and right schemas, which brings the merge down to O(N).
> This gives a sizable performance improvement for large denormalised schemas.
> 1x1 column merge:
> 2891ms original
> 32ms with the mapped-field approach.
> This merge can be called many times, depending on the number of files whose schemas you need to merge, compounding the performance cost.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-7637) StructType.merge slow with large denormalised tables O(N^2)
[ https://issues.apache.org/jira/browse/SPARK-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7637:
-----------------------------------

    Assignee:     (was: Apache Spark)

> StructType.merge slow with large denormalised tables O(N^2)
> -----------------------------------------------------------
>
>                 Key: SPARK-7637
>                 URL: https://issues.apache.org/jira/browse/SPARK-7637
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.1
>            Reporter: Rowan Chattaway
>            Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> StructType.merge does a linear scan through the left schema and, for each element, scans the right schema. This results in an O(N^2) algorithm.
> I have found this to be very slow when dealing with large denormalised parquet files.
> I would like to make a small change to this function to map the fields of both the left and right schemas, which brings the merge down to O(N).
> This gives a sizable performance improvement for large denormalised schemas.
> 1x1 column merge:
> 2891ms original
> 32ms with the mapped-field approach.
> This merge can be called many times, depending on the number of files whose schemas you need to merge, compounding the performance cost.
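
For readers of this thread, the difference between the nested scan and the mapped-field approach described in the issue can be sketched roughly as follows. This is not Spark's code and not the reporter's patch: it is a language-neutral illustration in Python, and the `Field` tuple, both function names, and the rule "the same name with a different type is an error" are simplifying assumptions (the real StructType.merge works on StructField objects and merges nested types). It also assumes field names are unique within each schema.

```python
from typing import List, Tuple

# Hypothetical simplified field: (name, data_type). Not Spark's StructField.
Field = Tuple[str, str]


def merge_quadratic(left: List[Field], right: List[Field]) -> List[Field]:
    """Nested-scan merge: every lookup walks a whole list, so O(N^2) overall."""
    merged = []
    for name, dtype in left:
        # Linear scan of the right schema for each left field.
        match = next((r for r in right if r[0] == name), None)
        if match is not None and match[1] != dtype:
            raise ValueError(f"conflicting types for field {name!r}")
        merged.append((name, dtype))
    for name, dtype in right:
        # Linear scan of the left schema for each right field.
        if not any(l_name == name for l_name, _ in left):
            merged.append((name, dtype))
    return merged


def merge_mapped(left: List[Field], right: List[Field]) -> List[Field]:
    """Mapped-field merge: index both sides once, then O(1) lookups, O(N) overall."""
    right_by_name = dict(right)                # name -> type for the right schema
    left_names = {name for name, _ in left}    # names present in the left schema
    merged = []
    for name, dtype in left:
        other = right_by_name.get(name)
        if other is not None and other != dtype:
            raise ValueError(f"conflicting types for field {name!r}")
        merged.append((name, dtype))
    for name, dtype in right:
        if name not in left_names:
            merged.append((name, dtype))
    return merged


left_schema = [("id", "long"), ("name", "string")]
right_schema = [("name", "string"), ("age", "int")]
merged_schema = merge_mapped(left_schema, right_schema)
```

Both functions keep the left schema's field order and append fields that appear only on the right, so they return the same result; only the lookup cost differs. Because the merge runs once per file whose schema is folded in, the per-call saving is multiplied by the number of files.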