[jira] [Updated] (SPARK-11500) Not deterministic order of columns when using merging schemas.
[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-11500:
---------------------------------
Description:

When executing {{sqlContext.read.option("mergeSchema", "true").parquet(pathOne, pathTwo).printSchema()}}, the order of columns is not deterministic and can differ between runs.

This happens because of {{FileStatusCache}} in {{HadoopFsRelation}} (which {{ParquetRelation}} extends). {{FileStatusCache.listLeafFiles()}} returns a {{Set[FileStatus]}}, which discards the order of the underlying {{Array[FileStatus]}}.

So, after retrieving the list of leaf files, including {{_metadata}} and {{_common_metadata}}, {{ParquetRelation.mergeSchemasInParallel()}} merges (separately, and only if necessary) the {{Set}}s of {{_metadata}}, {{_common_metadata}}, and part-files. The merged schema therefore ends up with a different column order each time: it leads with the columns of whichever file happens to come first, followed by the columns that only the other files have. I think this can be resolved by using {{LinkedHashSet}}.
> Not deterministic order of columns when using merging schemas.
> --------------------------------------------------------------
>
>          Key: SPARK-11500
>          URL: https://issues.apache.org/jira/browse/SPARK-11500
>      Project: Spark
>   Issue Type: Bug
>   Components: SQL
>     Reporter: Hyukjin Kwon

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
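The ordering issue described above is easy to see with plain Scala collections. The sketch below is a toy illustration, not Spark's actual code: simple file-name strings stand in for {{FileStatus}} entries. A default hash-based {{Set}} makes no iteration-order guarantee, while {{LinkedHashSet}} still deduplicates but iterates in insertion order, which would keep an order-sensitive schema merge deterministic.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for the leaf-file FileStatus entries.
val leafFiles = Seq("_metadata", "_common_metadata",
                    "part-00002", "part-00000", "part-00001")

// A plain HashSet gives no iteration-order guarantee, so downstream
// consumers may see the files in an order unrelated to insertion order.
val unordered = mutable.HashSet.empty[String] ++= leafFiles

// LinkedHashSet holds the same elements but iterates in insertion order.
val ordered = mutable.LinkedHashSet.empty[String] ++= leafFiles
```

Both sets contain the same elements; only {{ordered.toSeq}} is guaranteed to reproduce the original file order, which is the property the proposed fix relies on.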
[jira] [Updated] (SPARK-11500) Not deterministic order of columns when using merging schemas.
[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-11500:
---------------------------------
Description: (the following was appended to the description above)

In a simple view: if file A has fields 1, 2, 3 and file B has fields 3, 4, 5, we cannot be sure which columns show up first, because the file order is not deterministic:
1. Read the file list (A and B).
2. The resulting order is non-deterministic (A then B, or B then A), as described above.
3. The retrieved schemas are merged with {{reduceOption}} (which should perhaps be {{reduceLeftOption}} or {{reduceRightOption}}).
4. The output columns are 1, 2, 3, 4, 5 for A then B, but 3, 4, 5, 1, 2 for B then A.
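The four steps above can be sketched with a toy schema merge. This is not Spark's {{StructType}} merging; {{mergeTwo}} is a hypothetical helper written only under the assumption the description makes, namely that merging keeps the left operand's columns first and appends the right operand's new columns:

```scala
// Toy "schema": an ordered list of column names. mergeTwo keeps the left
// schema's columns and appends the right schema's columns it lacks.
def mergeTwo(left: Seq[String], right: Seq[String]): Seq[String] =
  left ++ right.filterNot(left.contains)

val schemaA = Seq("c1", "c2", "c3")
val schemaB = Seq("c3", "c4", "c5")

// The merged column order depends entirely on which file is listed first,
// so a non-deterministic file order yields a non-deterministic schema.
val aFirst = Seq(schemaA, schemaB).reduceOption(mergeTwo).get
val bFirst = Seq(schemaB, schemaA).reduceOption(mergeTwo).get
```

With A first the result is c1, c2, c3, c4, c5; with B first it is c3, c4, c5, c1, c2 — exactly the two outcomes in step 4.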
[jira] [Updated] (SPARK-11500) Not deterministic order of columns when using merging schemas.
[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-11500:
-------------------------------
Assignee: Hyukjin Kwon
[jira] [Updated] (SPARK-11500) Not deterministic order of columns when using merging schemas.
[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-11500:
-------------------------------------
Target Version/s: 1.6.0
[jira] [Updated] (SPARK-11500) Not deterministic order of columns when using merging schemas.
[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-11500:
-------------------------------
Fix Version/s: 1.6.0

> Fix For: 1.6.0, 1.7.0