[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns
[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182345#comment-17182345 ]

Robert Joseph Evans commented on SPARK-32672:
---------------------------------------------

Honestly, it is not a big deal what happened. I have worked on enough open-source projects to know that all of this is best-effort, run by volunteers. My involvement in the Spark project has not been frequent enough for a lot of people to know that I am a PMC member, and honestly I have not been involved enough lately to know the process myself, so I am happy to have people correct me or treat me as a contributor instead. The important thing is that we fixed the bug, and the fix should start to roll out soon.

> Data corruption in some cached compressed boolean columns
> ----------------------------------------------------------
>
>                 Key: SPARK-32672
>                 URL: https://issues.apache.org/jira/browse/SPARK-32672
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 2.4.7, 3.0.1, 3.1.0
>
>         Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on
> the order of the data. If I disable compression in the cache, the issue goes
> away. I was able to make this happen in 3.0.0. I am going to try to
> reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this
> to happen. As you can see, after the data is cached a single null value
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]
>
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-----+-----+
>
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
>
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-----+-----+
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns
[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182225#comment-17182225 ]

Hyukjin Kwon commented on SPARK-32672:
--------------------------------------

[~revans2] is a PMC member, and this is a correctness issue. This indeed is a blocker. I think the initial action was a mistake. I think [~cltlfcjin] was referring to:

{quote}
Set to Major or below; higher priorities are generally reserved for committers to set.
{quote}

I fully agree that ideally we should first evaluate what people are reporting and state the reason for any change. The problem is that we don't have a lot of manpower for triaging and managing JIRAs; there are just not many people who do it. Given this situation, I would like to encourage aggressive triage - there are many JIRAs whose priority is set incorrectly. For example, many JIRAs just ask questions and/or request investigation while setting the priority to Blocker, and such blockers matter for release managers. If we want a more fine-grained and ideal evaluation of JIRAs, I would encourage our PMC members to take a look more often.
[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns
[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182156#comment-17182156 ]

Dongjoon Hyun commented on SPARK-32672:
---------------------------------------

Thank you, [~revans2].
[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns
[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181910#comment-17181910 ]

Apache Spark commented on SPARK-32672:
--------------------------------------

User 'revans2' has created a pull request for this issue:
https://github.com/apache/spark/pull/29506
[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns
[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181909#comment-17181909 ]

Apache Spark commented on SPARK-32672:
--------------------------------------

User 'revans2' has created a pull request for this issue:
https://github.com/apache/spark/pull/29506
[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns
[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181873#comment-17181873 ]

Robert Joseph Evans commented on SPARK-32672:
---------------------------------------------

OK, reading through the code I understand what is happening now. The compression format ignores nulls, which are stored separately. As such, the bit set it stores covers only the non-null boolean values/bits, and the number of entries recorded in the compression format is the number of non-null boolean values. So the stopping condition on a batch decompress

{code}
while (visitedLocal < countLocal) {
{code}

skips all of the null values at the end. Because the length of the column is known ahead of time, those skipped slots fall back to the default value, which is false. I'll try to get a patch up shortly to fix this.
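[Editor's note] The failure mode described in the root-cause comment above can be modeled in a few lines. The following is a hypothetical Python sketch (names and structure are illustrative, not Spark's actual Scala code): nulls are tracked separately, only non-null values are packed, and the decode loop stops once it has visited `count` non-null values, so trailing null slots keep the column's pre-filled default of false.

```python
# Hypothetical model of the cache-compression bug (illustrative Python,
# not Spark's actual Scala code). Nulls are stored separately from the
# compressed payload, which holds only the non-null boolean values.

def compress(values):
    """values: list of True/False/None -> (null_positions, non_null_values)."""
    null_positions = {i for i, v in enumerate(values) if v is None}
    non_null = [v for v in values if v is not None]
    return null_positions, non_null

def decompress_buggy(null_positions, non_null, length):
    """Mimics the reported loop: stop once `count` non-null values are visited.

    Trailing nulls are never visited, so their slots keep the column's
    pre-filled default of False instead of being marked null."""
    out = [False] * length            # column pre-filled with the default value
    is_null = [False] * length
    visited, count = 0, len(non_null)
    pos = 0
    while visited < count:            # the stopping condition from the comment
        if pos in null_positions:
            is_null[pos] = True
        else:
            out[pos] = non_null[visited]
            visited += 1
        pos += 1
    # positions >= pos are never marked null, even when they should be
    return [None if is_null[i] else out[i] for i in range(length)]

data = [True, False, None, True, None]    # a trailing null triggers the bug
print(decompress_buggy(*compress(data), len(data)))
# -> [True, False, None, True, False]: the trailing null came back as False
```

With no nulls at the end of a batch the round trip is correct, which is consistent with the observation that the corruption is highly dependent on the order of the data.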
[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns
[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181868#comment-17181868 ]

Robert Joseph Evans commented on SPARK-32672:
---------------------------------------------

I was able to reduce the corruption down to just a single 10,000-row chunk and still get it to happen. I'll post a new parquet file soon that will hopefully make debugging a little simpler.

{code}
scala> val bad_order = spark.read.parquet("/home/roberte/src/rapids-plugin-4-spark/integration_tests/bad_order.snappy.parquet").selectExpr("b", "monotonically_increasing_id() as id").where(col("id")>=7 and col("id") < 8)
bad_order: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [b: boolean, id: bigint]

scala> bad_order.groupBy("b").count.show
+-----+-----+
|    b|count|
+-----+-----+
| null|  619|
| true| 4701|
|false| 4680|
+-----+-----+

scala> bad_order.cache()
res2: bad_order.type = [b: boolean, id: bigint]

scala> bad_order.groupBy("b").count.show
+-----+-----+
|    b|count|
+-----+-----+
| null|  618|
| true| 4701|
|false| 4681|
+-----+-----+
{code}
[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns
[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181867#comment-17181867 ]

Thomas Graves commented on SPARK-32672:
---------------------------------------

[~cltlfcjin] Please do not change the priority just because people are not committers. You should first evaluate what they are reporting, and if you don't think it's a blocker, we should state the reason. I looked at this after it was filed and added the correctness tag; it was already marked as Blocker, so I didn't need to change it. As you can see from https://spark.apache.org/contributing.html, correctness issues should be marked as a blocker at least until the issue is investigated and discussed.
[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns
[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181632#comment-17181632 ]

Jungtaek Lim commented on SPARK-32672:
--------------------------------------

Just FYI, he's a PMC member. And a correctness issue is normally a blocker unless there's some strong reason not to address it right away.
[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns
[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181553#comment-17181553 ]

Lantao Jin commented on SPARK-32672:
------------------------------------

Changed to Critical; Blocker is reserved for committers.
[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns
[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181486#comment-17181486 ]

Robert Joseph Evans commented on SPARK-32672:
---------------------------------------------

I added some debugging to the compression code, and it looks like in the 8th CompressedBatch of 10,000 entries the number of nulls seen was different from the number expected: 619 expected and 618 seen. I'll try to debug this a bit more tomorrow.
[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns
[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181478#comment-17181478 ]

Robert Joseph Evans commented on SPARK-32672:
---------------------------------------------

I did a little debugging and found that `BooleanBitSet$Encoder` is being used for compression. There are other data orderings that use the same encoder and produce correct results, though.
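[Editor's note] The general bit-set idea behind the encoder named above can be sketched as follows - a hypothetical Python model, not Spark's actual Scala implementation: pack only the boolean values into 64-bit words together with a count. The round trip is lossless in isolation, consistent with the observation that other data orderings through the same encoder produce correct results; the corruption comes from how separately stored nulls interact with the decompress stopping condition.

```python
# Illustrative sketch of a BooleanBitSet-style codec (not Spark's actual
# implementation): booleans are packed into 64-bit words, preceded by a
# count of how many values were stored.

WORD_SIZE = 64

def encode(booleans):
    """Pack a list of bools into (count, list of 64-bit words)."""
    words = [0] * ((len(booleans) + WORD_SIZE - 1) // WORD_SIZE)
    for i, b in enumerate(booleans):
        if b:
            words[i // WORD_SIZE] |= 1 << (i % WORD_SIZE)
    return len(booleans), words

def decode(count, words):
    """Unpack `count` bits back into a list of bools."""
    return [bool((words[i // WORD_SIZE] >> (i % WORD_SIZE)) & 1)
            for i in range(count)]

vals = [True, False, True] * 30           # 90 values span two 64-bit words
assert decode(*encode(vals)) == vals      # the round trip itself is lossless
```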
[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns
[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181468#comment-17181468 ]

Thomas Graves commented on SPARK-32672:
---------------------------------------

[~cloud_fan] [~ruifengz]
[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns
[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181466#comment-17181466 ]

Robert Joseph Evans commented on SPARK-32672:
---------------------------------------------

I verified that this is still happening on 3.1.0-SNAPSHOT too.
[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns
[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181459#comment-17181459 ]

Robert Joseph Evans commented on SPARK-32672:
---------------------------------------------

I verified that this is still happening on 3.0.2-SNAPSHOT.