[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows

2016-05-31 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308754#comment-15308754
 ] 

Cheng Lian commented on SPARK-6859:
---

Yeah, thanks. I'm closing it.

> Parquet File Binary column statistics error when reuse byte[] among rows
> 
>
> Key: SPARK-6859
> URL: https://issues.apache.org/jira/browse/SPARK-6859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.3.0, 1.4.0
>Reporter: Yijie Shen
>Priority: Minor
> Fix For: 2.0.0
>
>
> Suppose I create a dataRDD that extends RDD[Row], where every row is a 
> GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is 
> reused across rows, but its content changes each time. When I convert it to a 
> DataFrame and save it as a Parquet file, the file's row-group statistics 
> (max & min) for the Binary column are wrong.
>
> Here is the reason: in Parquet, BinaryStatistics keeps max & min as 
> parquet.io.api.Binary references, and Spark SQL generates a new Binary backed 
> by the same Array[Byte] passed in from the row:
>
> max: Binary --(references)--> ByteArrayBackedBinary --(backed by)--> Array[Byte]
>
> Therefore, each time Parquet updates the row group's statistics, max & min 
> always refer to the same Array[Byte], whose content is new each time. When 
> Parquet finally writes them into the file, the last row's content is saved as 
> both max and min.
>
> This looks like a Parquet bug, since it is Parquet's responsibility to update 
> statistics correctly, but I'm not quite sure. Should I report it as a bug in 
> the Parquet JIRA?
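The aliasing described in this report can be reproduced without Spark or Parquet at all. Below is a minimal, hypothetical model of the behavior (`RefStats` and `ReusedBufferDemo` are invented names, not the real parquet-mr classes): statistics that store *references* to the min/max byte arrays, fed from a single reused buffer.

```java
class ReusedBufferDemo {
    public static void main(String[] args) {
        // Reuse a single buffer across "rows", as the reported RDD does.
        byte[] buf = new byte[1];
        RefStats stats = new RefStats();
        for (byte b : new byte[] {5, 1, 9}) {
            buf[0] = b;
            stats.update(buf);
        }
        // min and max both alias buf, so both report the last value written,
        // even though the true min is 1 and the true max is 9.
        System.out.println("min=" + stats.min[0] + " max=" + stats.max[0]);
        // prints min=9 max=9
    }
}

// Invented stand-in for Parquet's BinaryStatistics: it keeps references
// to the current min/max byte arrays instead of copying their contents.
class RefStats {
    byte[] min;
    byte[] max;

    // Lexicographic comparison of unsigned bytes, the way Parquet compares
    // binary values.
    static int cmp(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    void update(byte[] v) {
        if (min == null || cmp(v, min) < 0) min = v; // stores the reference!
        if (max == null || cmp(v, max) > 0) max = v;
    }
}
```

After the first row, `min` and `max` already point at `buf`, so every later comparison is `cmp(buf, buf) == 0` and the stored "extremes" silently track whatever the buffer last held.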



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows

2016-05-31 Thread Ian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308660#comment-15308660
 ] 

Ian commented on SPARK-6859:


Is this one fixed along with SPARK-9876?




[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows

2015-10-16 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961318#comment-14961318
 ] 

Cheng Lian commented on SPARK-6859:
---

This issue was left unresolved because Parquet filter push-down wasn't enabled 
by default. As of 1.5, though, it is turned on by default, so I opened 
SPARK-11153 to disable filter push-down for strings and binaries.
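For reference, users on affected versions can also turn Parquet filter push-down off themselves via the existing SQLConf flag, for example in spark-defaults.conf (this disables push-down for all types, not just binary):

```
spark.sql.parquet.filterPushdown  false
```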




[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows

2015-04-13 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492667#comment-14492667
 ] 

Cheng Lian commented on SPARK-6859:
---

[~rdblue] pointed out one fact that I missed in PARQUET-251: we need to work 
out a way to ignore (binary) min/max stats for all existing data.

So on the Spark SQL side, we have to disable filter push-down for binary columns.




[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows

2015-04-12 Thread Yijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491820#comment-14491820
 ] 

Yijie Shen commented on SPARK-6859:
---

I opened a JIRA ticket in Parquet: 
[PARQUET-251|https://issues.apache.org/jira/browse/PARQUET-251]




[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows

2015-04-12 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491575#comment-14491575
 ] 

Cheng Lian commented on SPARK-6859:
---

A better approach could be a defensive copy while inserting byte arrays into 
Parquet, so that we don't suffer a read-performance regression.
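A minimal sketch of that defensive-copy idea, with invented names (`CopyingStats`, `DefensiveCopyDemo`) rather than the actual parquet-mr code: clone the incoming bytes before the statistics keep them, so later mutation of the caller's reused buffer cannot corrupt min/max. The cost is one extra allocation per written value, confined to the write path.

```java
class DefensiveCopyDemo {
    public static void main(String[] args) {
        // The same reused-buffer pattern that triggers the bug...
        byte[] buf = new byte[1];
        CopyingStats stats = new CopyingStats();
        for (byte b : new byte[] {5, 1, 9}) {
            buf[0] = b;
            stats.update(buf);
        }
        // ...now yields correct statistics, because each extreme owns its bytes.
        System.out.println("min=" + stats.min[0] + " max=" + stats.max[0]);
        // prints min=1 max=9
    }
}

// Hypothetical statistics holder that defensively copies on update.
class CopyingStats {
    byte[] min;
    byte[] max;

    static int cmp(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    void update(byte[] shared) {
        byte[] v = shared.clone(); // defensive copy on the write path only
        if (min == null || cmp(v, min) < 0) min = v;
        if (max == null || cmp(v, max) > 0) max = v;
    }
}
```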




[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows

2015-04-12 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491558#comment-14491558
 ] 

Cheng Lian commented on SPARK-6859:
---

For 1.3 and prior versions, this issue isn't that serious, since strings are 
immutable. But in 1.4 we are adding a mutable UTF8String ([PR 
#5350|https://github.com/apache/spark/pull/5350]).




[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows

2015-04-12 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491548#comment-14491548
 ] 

Cheng Lian commented on SPARK-6859:
---

[~yijieshen] Thanks for reporting! And yes, please also open a JIRA ticket for 
Parquet and link it with this one so that it's easier to track.

[~marmbrus] I guess we should disable pushing down filters involving the binary 
type until this bug is fixed in Parquet.
