[jira] [Updated] (SPARK-11500) Not deterministic order of columns when using merging schemas.
[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-11500:
---------------------------------
Description:

When executing {{sqlContext.read.option("mergeSchema", "true").parquet(pathOne, pathTwo).printSchema()}}, the order of columns is not deterministic and can differ between runs.

This happens because of {{FileStatusCache}} in {{HadoopFsRelation}} (which {{ParquetRelation}} extends). {{FileStatusCache.listLeafFiles()}} returns a {{Set[FileStatus]}}, which discards the order of the underlying {{Array[FileStatus]}}.

So, after retrieving the list of leaf files, including {{_metadata}} and {{_common_metadata}}, {{ParquetRelation.mergeSchemasInParallel()}} merges (separately, and only if necessary) the {{Set}}s of {{_metadata}}, {{_common_metadata}}, and part-files. The merged schema therefore ends up with a different column order each time: it leads with the columns of whichever file happens to come first, followed by the columns that only the other files have. I think this can be resolved by using {{LinkedHashSet}}.
> Not deterministic order of columns when using merging schemas.
> --------------------------------------------------------------
>
>          Key: SPARK-11500
>          URL: https://issues.apache.org/jira/browse/SPARK-11500
>      Project: Spark
>   Issue Type: Bug
>   Components: SQL
>     Reporter: Hyukjin Kwon

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
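The ordering issue described above is easy to see with plain Scala collections. The sketch below is a toy illustration, not Spark's actual code: simple file-name strings stand in for {{FileStatus}} entries. A default hash-based {{Set}} makes no iteration-order guarantee, while {{LinkedHashSet}} still deduplicates but iterates in insertion order, which would keep an order-sensitive schema merge deterministic.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for the leaf-file FileStatus entries.
val leafFiles = Seq("_metadata", "_common_metadata",
                    "part-00002", "part-00000", "part-00001")

// A plain HashSet gives no iteration-order guarantee, so downstream
// consumers may see the files in an order unrelated to insertion order.
val unordered = mutable.HashSet.empty[String] ++= leafFiles

// LinkedHashSet holds the same elements but iterates in insertion order.
val ordered = mutable.LinkedHashSet.empty[String] ++= leafFiles
```

Both sets contain the same elements; only {{ordered.toSeq}} is guaranteed to reproduce the original file order, which is the property the proposed fix relies on.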
[jira] [Updated] (SPARK-11500) Not deterministic order of columns when using merging schemas.
[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-11500:
---------------------------------
Description: (the following was appended to the description above)

In a simple view: if file A has fields 1, 2, 3 and file B has fields 3, 4, 5, we cannot be sure which columns show up first, because the file order is not deterministic:
1. Read the file list (A and B).
2. The resulting order is non-deterministic (A then B, or B then A), as described above.
3. The retrieved schemas are merged with {{reduceOption}} (which should perhaps be {{reduceLeftOption}} or {{reduceRightOption}}).
4. The output columns are 1, 2, 3, 4, 5 for A then B, but 3, 4, 5, 1, 2 for B then A.
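The four steps above can be sketched with a toy schema merge. This is not Spark's {{StructType}} merging; {{mergeTwo}} is a hypothetical helper written only under the assumption the description makes, namely that merging keeps the left operand's columns first and appends the right operand's new columns:

```scala
// Toy "schema": an ordered list of column names. mergeTwo keeps the left
// schema's columns and appends the right schema's columns it lacks.
def mergeTwo(left: Seq[String], right: Seq[String]): Seq[String] =
  left ++ right.filterNot(left.contains)

val schemaA = Seq("c1", "c2", "c3")
val schemaB = Seq("c3", "c4", "c5")

// The merged column order depends entirely on which file is listed first,
// so a non-deterministic file order yields a non-deterministic schema.
val aFirst = Seq(schemaA, schemaB).reduceOption(mergeTwo).get
val bFirst = Seq(schemaB, schemaA).reduceOption(mergeTwo).get
```

With A first the result is c1, c2, c3, c4, c5; with B first it is c3, c4, c5, c1, c2 — exactly the two outcomes in step 4.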
[jira] [Updated] (SPARK-11500) Not deterministic order of columns when using merging schemas.
[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-11500:
-------------------------------
Assignee: Hyukjin Kwon
[jira] [Updated] (SPARK-11500) Not deterministic order of columns when using merging schemas.
[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-11500:
-------------------------------------
Target Version/s: 1.6.0
[jira] [Updated] (SPARK-11500) Not deterministic order of columns when using merging schemas.
[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-11500:
-------------------------------
Fix Version/s: 1.6.0

> Fix For: 1.6.0, 1.7.0