[jira] [Assigned] (SPARK-7637) StructType.merge slow with large denormalised tables O(N^2)

2015-05-19 Thread Apache Spark (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7637:
---

Assignee: Apache Spark

 StructType.merge slow with large denormalised tables O(N^2)
 --

 Key: SPARK-7637
 URL: https://issues.apache.org/jira/browse/SPARK-7637
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Rowan Chattaway
Assignee: Apache Spark
Priority: Minor
   Original Estimate: 24h
  Remaining Estimate: 24h

 StructType.merge does a linear scan through the left schema and, for each
 element, scans the right schema. This results in an O(N^2) algorithm.
 I have found this to be very slow when dealing with large denormalised
 Parquet files.
 I would like to make a small change to this function: index the fields of
 both the left and right schemas in maps keyed by field name, so that each
 lookup is constant time on average and the merge as a whole is O(N).
 This gives a sizable performance improvement for large denormalised schemas:
 1x1 column merge
 2891ms original
 32ms with the mapped-field approach.
 This merge can be called many times, depending on the number of files whose
 schemas need merging, which compounds the performance cost.
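
 To illustrate the difference between the two approaches described above,
 here is a minimal sketch in plain Python. It is not Spark's code: `Field`,
 `merge_field`, `merge_quadratic` and `merge_mapped` are hypothetical names
 made up for this example, field types are reduced to strings, and field
 names are assumed to be unique within each schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    # Hypothetical stand-in for a schema field; not Spark's StructField.
    name: str
    dtype: str


def merge_field(lf, rf):
    # Simplified rule for this sketch: matching fields must have equal types.
    if rf is None or rf.dtype == lf.dtype:
        return lf
    raise ValueError(
        f"incompatible types for field {lf.name!r}: {lf.dtype} vs {rf.dtype}"
    )


def merge_quadratic(left, right):
    # For each left field, scan the whole right schema: O(N^2) comparisons.
    merged = []
    for lf in left:
        match = next((rf for rf in right if rf.name == lf.name), None)
        merged.append(merge_field(lf, match))
    # Append right-only fields, again scanning the left schema for each one.
    for rf in right:
        if not any(lf.name == rf.name for lf in left):
            merged.append(rf)
    return merged


def merge_mapped(left, right):
    # Index both schemas by name once; each lookup is then O(1) on average,
    # so the whole merge is O(N).
    right_by_name = {f.name: f for f in right}
    left_names = {f.name for f in left}
    merged = [merge_field(lf, right_by_name.get(lf.name)) for lf in left]
    merged.extend(rf for rf in right if rf.name not in left_names)
    return merged


left = [Field("a", "int"), Field("b", "string")]
right = [Field("b", "string"), Field("c", "double")]
result = merge_mapped(left, right)
```

 Both functions return the left fields in their original order followed by
 the fields that appear only in the right schema, so the mapped version is
 meant as a drop-in replacement for the quadratic one under the assumptions
 stated above.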



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7637) StructType.merge slow with large denormalised tables O(N^2)

2015-05-19 Thread Apache Spark (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7637:
---

Assignee: (was: Apache Spark)

 StructType.merge slow with large denormalised tables O(N^2)
 --

 Key: SPARK-7637
 URL: https://issues.apache.org/jira/browse/SPARK-7637
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Rowan Chattaway
Priority: Minor
   Original Estimate: 24h
  Remaining Estimate: 24h

 StructType.merge does a linear scan through the left schema and, for each
 element, scans the right schema. This results in an O(N^2) algorithm.
 I have found this to be very slow when dealing with large denormalised
 Parquet files.
 I would like to make a small change to this function: index the fields of
 both the left and right schemas in maps keyed by field name, so that each
 lookup is constant time on average and the merge as a whole is O(N).
 This gives a sizable performance improvement for large denormalised schemas:
 1x1 column merge
 2891ms original
 32ms with the mapped-field approach.
 This merge can be called many times, depending on the number of files whose
 schemas need merging, which compounds the performance cost.


