[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308754#comment-15308754 ]

Cheng Lian commented on SPARK-6859:
-----------------------------------

Yea, thanks. I'm closing it.

> Parquet File Binary column statistics error when reuse byte[] among rows
> ------------------------------------------------------------------------
>
>                 Key: SPARK-6859
>                 URL: https://issues.apache.org/jira/browse/SPARK-6859
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0
>            Reporter: Yijie Shen
>            Priority: Minor
>             Fix For: 2.0.0
>
> Suppose I create a dataRDD that extends RDD[Row], where each row is a
> GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is
> reused among rows but holds different content each time. When I convert it
> to a DataFrame and save it as a Parquet file, the file's row-group
> statistics (max & min) for the Binary column are wrong.
>
> Here is the reason: in Parquet, BinaryStatistics keeps max & min as
> parquet.io.api.Binary references, and Spark SQL generates a new Binary
> backed by the same Array[Byte] passed in from the row:
>
>               reference                          backed
>   max: Binary ----------> ByteArrayBackedBinary ----------> Array[Byte]
>
> Therefore, each time Parquet updates the row group's statistics, max & min
> still refer to the same Array[Byte], whose content changes with every row.
> When Parquet writes them to the file, the last row's content is saved as
> both max and min.
>
> It looks like a Parquet bug, because it is Parquet's responsibility to
> update statistics correctly, but I'm not quite sure. Should I report it as
> a bug in the Parquet JIRA?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
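The aliasing problem described in the issue can be sketched in plain Java, without Spark or Parquet. The `ReferenceTracker` and `CopyingTracker` classes below are hypothetical illustrations, not the real `BinaryStatistics` API: the first retains the `byte[]` reference it is handed (as a byte-array-backed `Binary` effectively does), the second takes a defensive copy.

```java
import java.util.Arrays;

public class ReusedBufferStats {
    // Hypothetical tracker mirroring the bug: it stores the byte[]
    // reference it is given, so min/max alias the caller's buffer.
    static class ReferenceTracker {
        byte[] min, max;
        void update(byte[] v) {
            // Once min/max alias v, every later comparison is against v's
            // *current* content, so the comparison is always 0 and the
            // stored "min"/"max" silently track the buffer's last content.
            if (min == null || Arrays.compare(v, min) < 0) min = v;
            if (max == null || Arrays.compare(v, max) > 0) max = v;
        }
    }

    // Fixed variant: a defensive copy breaks the alias.
    static class CopyingTracker {
        byte[] min, max;
        void update(byte[] v) {
            if (min == null || Arrays.compare(v, min) < 0) min = v.clone();
            if (max == null || Arrays.compare(v, max) > 0) max = v.clone();
        }
    }

    public static void main(String[] args) {
        byte[] buf = new byte[1];            // one buffer reused for every "row"
        ReferenceTracker bad = new ReferenceTracker();
        CopyingTracker good = new CopyingTracker();
        for (byte b : new byte[]{5, 9, 1}) {
            buf[0] = b;                      // overwrite the shared buffer in place
            bad.update(buf);
            good.update(buf);
        }
        // The aliasing tracker reports the last row's content as both bounds:
        System.out.println(bad.min[0] + " " + bad.max[0]);   // prints "1 1"
        System.out.println(good.min[0] + " " + good.max[0]); // prints "1 9"
    }
}
```

This matches the symptom in the report: the last row's content ends up recorded as both max and min, and the fix (applied in Parquet, per the 2.0.0 fix version) amounts to not letting the statistics hold onto a caller-owned, mutable buffer.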