[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181486#comment-17181486 ]
Robert Joseph Evans commented on SPARK-32672:
---------------------------------------------

I added some debugging to the compression code, and it looks like in the 8th CompressedBatch of 10,000 entries the number of nulls seen differed from the number expected: 619 expected and 618 seen. I'll try to debug this a bit more tomorrow.

> Data corruption in some cached compressed boolean columns
> ---------------------------------------------------------
>
> Key: SPARK-32672
> URL: https://issues.apache.org/jira/browse/SPARK-32672
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.6, 3.0.0, 3.0.1
> Reporter: Robert Joseph Evans
> Priority: Blocker
> Labels: correctness
> Attachments: bad_order.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on
> the order of the data. If I disable compression in the cache, the issue goes
> away. I was able to make this happen in 3.0.0. I am going to try to
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this
> to happen. As you can see, after the data is cached a single null value
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]
>
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-----+-----+
>
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
>
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-----+-----+
>
> scala>
> {code}
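To make the discrepancy between the two count tables above easier to see, here is a minimal sketch in plain Scala that diffs per-value counts taken before and after caching. The `CountDiff` object and its `changed` method are hypothetical names introduced only for illustration and are not part of Spark. The numbers are copied from the spark-shell session in the report, with `None` standing in for null.

```scala
// Hypothetical helper, not part of Spark: compares per-value row counts
// taken before and after caching and keeps only the values whose count moved.
object CountDiff {
  // Keys model the nullable boolean column b; None stands for null.
  type Counts = Map[Option[Boolean], Long]

  // Returns (after - before) for every value whose count changed.
  def changed(before: Counts, after: Counts): Map[Option[Boolean], Long] =
    (before.keySet ++ after.keySet).iterator
      .map(k => k -> (after.getOrElse(k, 0L) - before.getOrElse(k, 0L)))
      .filter { case (_, delta) => delta != 0L }
      .toMap

  // Counts copied from the spark-shell output in the report.
  val before: Counts = Map(None -> 7153L, Some(true) -> 54334L, Some(false) -> 54021L)
  val after: Counts  = Map(None -> 7152L, Some(true) -> 54334L, Some(false) -> 54022L)
}
```

Calling `CountDiff.changed(CountDiff.before, CountDiff.after)` leaves only the null and false entries, with deltas of -1 and +1, while the total row count is the same on both sides. That matches the description of a single null value turning into false rather than a row being dropped.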