[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308754#comment-15308754 ]

Cheng Lian commented on SPARK-6859:
---

Yea, thanks. I'm closing it.

> Parquet File Binary column statistics error when reuse byte[] among rows
>
>                 Key: SPARK-6859
>                 URL: https://issues.apache.org/jira/browse/SPARK-6859
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0
>            Reporter: Yijie Shen
>            Priority: Minor
>             Fix For: 2.0.0
>
> Suppose I create a dataRDD that extends RDD[Row], where each row is a
> GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is
> reused across rows, but its content changes for every row. When I convert the
> RDD to a DataFrame and save it as a Parquet file, the file's row-group
> statistics (max & min) for the Binary column are wrong.
>
> Here is the reason: in Parquet, BinaryStatistics just keeps max & min as
> parquet.io.api.Binary references, and Spark SQL generates a new Binary backed
> by the very same Array[Byte] passed in from the row:
>
>     max: Binary --reference--> ByteArrayBackedBinary --backed by--> Array[Byte]
>
> Therefore, each time Parquet updates the row group's statistics, max & min
> always refer to the same Array[Byte], whose content is new each time. When
> Parquet writes them into the file, the last row's content is saved as both
> max and min.
>
> This looks like a Parquet bug, because it is Parquet's responsibility to
> update statistics correctly, but I'm not quite sure. Should I report it as a
> bug in the Parquet JIRA?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
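The aliasing the report describes can be reproduced outside Parquet with a toy statistics tracker. This is a minimal sketch of the failure mode only, not the real parquet.io.api.Binary / BinaryStatistics code; all class and method names below are illustrative:

```java
// Toy model of the bug: a statistics tracker that, like Parquet's
// BinaryStatistics at the time, keeps min/max as *references* to the
// caller's byte[] instead of copying them.
class NaiveBinaryStats {
    byte[] min, max;

    void update(byte[] value) {
        // BUG: stores the reference; if the caller mutates the array
        // later, min and max silently change with it.
        if (min == null || compare(value, min) < 0) min = value;
        if (max == null || compare(value, max) > 0) max = value;
    }

    // Unsigned lexicographic comparison, like Parquet's binary ordering.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }
}

public class ReusedBufferDemo {
    public static void main(String[] args) {
        NaiveBinaryStats stats = new NaiveBinaryStats();
        byte[] buffer = new byte[3]; // one buffer reused for every "row"
        byte[][] rows = {{'c','a','t'}, {'z','o','o'}, {'a','n','t'}};
        for (byte[] row : rows) {
            // Overwrite in place, like the reused Array[Byte] in the report.
            System.arraycopy(row, 0, buffer, 0, 3);
            stats.update(buffer);
        }
        // Both min and max now alias the buffer, so both report the LAST row.
        System.out.println(new String(stats.min)); // "ant"
        System.out.println(new String(stats.max)); // "ant"
        System.out.println(stats.min == buffer && stats.max == buffer); // true
    }
}
```

Parquet's eventual fix for PARQUET-251 made the copy-vs-alias decision explicit at the point where a Binary is constructed; the sketch above only models the failure mode, not that fix.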
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308660#comment-15308660 ]

Ian commented on SPARK-6859:

Is this one fixed along with SPARK-9876?
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961318#comment-14961318 ]

Cheng Lian commented on SPARK-6859:
---

This issue was left unresolved because Parquet filter push-down wasn't enabled by default. But now in 1.5 it's turned on by default. Opened SPARK-11153 to disable filter push-down for strings and binaries.
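Until a build with the SPARK-11153 change is available, a session-level workaround is to disable Parquet filter push-down entirely, so the corrupted binary min/max statistics are never consulted. `spark.sql.parquet.filterPushdown` is the real Spark SQL option for this; the job script name below is a placeholder:

```shell
# Turn off Parquet filter push-down for one job (my_job.py is hypothetical):
spark-submit --conf spark.sql.parquet.filterPushdown=false my_job.py
```

This trades away all Parquet predicate push-down for correctness, so it is only worthwhile while reading files written with the buggy statistics.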
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492667#comment-14492667 ]

Cheng Lian commented on SPARK-6859:
---

[~rdblue] pointed out a fact that I missed in PARQUET-251: we need to work out a way to ignore (binary) min/max stats for all existing data. So on the Spark SQL side, we have to disable filter push-down for binary columns.
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491820#comment-14491820 ]

Yijie Shen commented on SPARK-6859:
---

I opened a JIRA ticket in Parquet: [PARQUET-251|https://issues.apache.org/jira/browse/PARQUET-251]
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491575#comment-14491575 ]

Cheng Lian commented on SPARK-6859:
---

A better way could be a defensive copy while inserting byte arrays into Parquet, so that we don't suffer a read performance regression.
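Sketched out, the defensive-copy idea amounts to cloning the bytes at update time so the statistics own their own buffers and the read path is untouched. Names here are illustrative, not Spark's actual converter code:

```java
// Fixed variant of a min/max tracker: clone() before storing, so later
// mutation of the caller's reused buffer cannot corrupt the statistics.
class CopyingBinaryStats {
    byte[] min, max;

    void update(byte[] value) {
        if (min == null || compare(value, min) < 0) min = value.clone(); // defensive copy
        if (max == null || compare(value, max) > 0) max = value.clone();
    }

    // Unsigned lexicographic comparison, like Parquet's binary ordering.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }
}

public class DefensiveCopyDemo {
    public static void main(String[] args) {
        CopyingBinaryStats stats = new CopyingBinaryStats();
        byte[] buffer = new byte[3]; // reused across rows, as in the report
        for (byte[] row : new byte[][] {{'c','a','t'}, {'z','o','o'}, {'a','n','t'}}) {
            System.arraycopy(row, 0, buffer, 0, 3);
            stats.update(buffer);
        }
        System.out.println(new String(stats.min)); // "ant" -- correct despite buffer reuse
        System.out.println(new String(stats.max)); // "zoo"
    }
}
```

A copy is made only when a new min or max is found, so the write-path overhead stays small, which is the appeal over copy-on-read.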
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491558#comment-14491558 ]

Cheng Lian commented on SPARK-6859:
---

For 1.3 and prior versions this issue isn't that serious, since strings are immutable. But in 1.4 we are adding a mutable UTF8String ([PR #5350|https://github.com/apache/spark/pull/5350]).
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491548#comment-14491548 ]

Cheng Lian commented on SPARK-6859:
---

[~yijieshen] Thanks for reporting! And yes, please also open a JIRA ticket for Parquet and link it with this one so that it's easier to track.

[~marmbrus] I guess we should disable pushing down filters involving the binary type until this bug is fixed in Parquet.