[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-22 Thread Robert Joseph Evans (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182345#comment-17182345 ]

Robert Joseph Evans commented on SPARK-32672:
---------------------------------------------

Honestly, what happened is not a big deal.  I have worked on enough 
open-source projects to know that all of this is best-effort work run by 
volunteers.  On top of that, my involvement in the Spark project has not been 
frequent enough for a lot of people to know that I am a PMC member, and 
honestly I have not been involved enough lately to know the process myself.  
So I am happy to have people correct me or treat me as a contributor instead. 
The important thing is that we fixed the bug, and the fix should start to roll 
out soon.

> Data corruption in some cached compressed boolean columns
> ---------------------------------------------------------
>
>                 Key: SPARK-32672
>                 URL: https://issues.apache.org/jira/browse/SPARK-32672
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 2.4.7, 3.0.1, 3.1.0
>
>         Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>
>
> I found that when sorting some boolean data into the cache, the results 
> can change when the data is read back out.
> It needs to be a non-trivial amount of data, and it is highly dependent on 
> the order of the data.  If I disable compression in the cache, the issue 
> goes away.  I was able to make this happen in 3.0.0, and I am going to try 
> to reproduce it in other versions too.
> I'll attach the parquet file with boolean data in an order that causes this 
> to happen. As you can see, after the data is cached a single null value 
> switches over to false.
> {code}
> scala> val bad_order = spark.read.parquet("./bad_order.snappy.parquet")
> bad_order: org.apache.spark.sql.DataFrame = [b: boolean]
>
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7153|
> | true|54334|
> |false|54021|
> +-----+-----+
>
> scala> bad_order.cache()
> res1: bad_order.type = [b: boolean]
>
> scala> bad_order.groupBy("b").count.show
> +-----+-----+
> |    b|count|
> +-----+-----+
> | null| 7152|
> | true|54334|
> |false|54022|
> +-----+-----+
>
> scala>
> {code}
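Until a release containing the fix is available, a plausible workaround, 
assuming the corruption is confined to the cache compression path (which 
matches the observation above that disabling compression makes the issue go 
away), is to turn off in-memory columnar compression before caching:

{code}
// Workaround sketch: turn off compression for the in-memory columnar cache so
// the BooleanBitSet encoder is never exercised (trades memory for correctness).
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")
{code}

Note this has to be set before the data is cached; anything already cached 
would need to be unpersisted and cached again.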






[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Hyukjin Kwon (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182225#comment-17182225 ]

Hyukjin Kwon commented on SPARK-32672:
--------------------------------------

[~revans2] is a PMC member, and this is a correctness issue, so it is indeed a 
blocker. I think the initial action was a mistake.

I think [~cltlfcjin] was referring to:

{quote}
Set to Major or below; higher priorities are generally reserved for committers 
to set. 
{quote}

I fully agree that ideally we should first evaluate what reporters are filing 
and state the reason for any priority change.

The problem is that we don't have a lot of manpower for triaging and managing 
JIRAs; there are simply not many people who do it.
Given this situation, I would like to encourage aggressive triage - there are 
many JIRAs whose priority is set incorrectly.

For example, many JIRAs just ask questions or request investigation while 
setting the priority to Blocker, and such blockers matter to release managers.
If we want a more fine-grained and ideal evaluation of JIRAs, I would encourage 
our PMC members to take a look more often.


> Data corruption in some cached compressed boolean columns
> ---------------------------------------------------------
>
>                 Key: SPARK-32672
>                 URL: https://issues.apache.org/jira/browse/SPARK-32672
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 2.4.7, 3.0.1, 3.1.0
>
>         Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>






[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Dongjoon Hyun (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182156#comment-17182156 ]

Dongjoon Hyun commented on SPARK-32672:
---------------------------------------

Thank you, [~revans2]. 

> Data corruption in some cached compressed boolean columns
> ---------------------------------------------------------
>
>                 Key: SPARK-32672
>                 URL: https://issues.apache.org/jira/browse/SPARK-32672
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.3, 2.3.4, 2.4.6, 3.0.0, 3.0.1, 3.1.0
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>              Labels: correctness
>         Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>






[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Apache Spark (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181910#comment-17181910 ]

Apache Spark commented on SPARK-32672:
--------------------------------------

User 'revans2' has created a pull request for this issue:
https://github.com/apache/spark/pull/29506

> Data corruption in some cached compressed boolean columns
> ---------------------------------------------------------
>
>                 Key: SPARK-32672
>                 URL: https://issues.apache.org/jira/browse/SPARK-32672
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>              Labels: correctness
>         Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>









[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Robert Joseph Evans (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181873#comment-17181873 ]

Robert Joseph Evans commented on SPARK-32672:
---------------------------------------------

OK, reading through the code I understand what is happening now.  The 
compression format ignores nulls, which are stored separately, so the bit set 
it stores covers only the non-null boolean values.  The number of entries 
recorded in the compression format is the count of non-null boolean values.

So the stopping condition on a batch decompress,

{code}
while (visitedLocal < countLocal) {
{code}

skips all of the null values at the end.  But because the length of the column 
is known ahead of time, the rows past the last non-null value fall back to the 
column's default value, which is false.

I'll try to get a patch up shortly to fix this.
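To make the failure mode concrete, here is a minimal, self-contained sketch. 
This is not the actual Spark decoder (the real logic is in the BooleanBitSet 
compression scheme; the names and buffer layout here are simplified 
assumptions), but it shows how a decompress loop bounded by the non-null count 
leaves trailing nulls looking like false:

{code}
// Sketch only: models a bit set that covers just the non-null values.
object BooleanBitSetSketch {
  // Pack the non-null booleans into 64-bit words; count covers non-nulls only.
  def compress(values: Array[java.lang.Boolean]): (Array[Long], Int) = {
    val nonNull = values.filter(_ != null).map(_.booleanValue)
    val words = new Array[Long]((nonNull.length + 63) / 64)
    for (i <- nonNull.indices if nonNull(i)) words(i / 64) |= 1L << (i % 64)
    (words, nonNull.length)
  }

  def decompress(words: Array[Long], count: Int, nulls: Set[Int],
                 rowCount: Int): Array[java.lang.Boolean] = {
    // The output column is pre-sized to rowCount with the default value: false.
    val out = Array.fill[java.lang.Boolean](rowCount)(false)
    var visited = 0 // non-null values decoded so far
    var row = 0
    while (visited < count) { // stops right after the last non-null value...
      if (nulls.contains(row)) {
        out(row) = null
      } else {
        out(row) = ((words(visited / 64) >> (visited % 64)) & 1L) == 1L
        visited += 1
      }
      row += 1
    }
    out // ...so null rows after it are never marked null and read as false
  }
}
{code}

For example, compressing {{Array[java.lang.Boolean](true, null)}} yields 
count = 1; decompress then visits only row 0, so row 1 is never marked null 
and reads back as false - the same null-to-false shift seen in the counts 
above.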

> Data corruption in some cached compressed boolean columns
> ---------------------------------------------------------
>
>                 Key: SPARK-32672
>                 URL: https://issues.apache.org/jira/browse/SPARK-32672
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>              Labels: correctness
>         Attachments: bad_order.snappy.parquet, small_bad.snappy.parquet
>






[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Robert Joseph Evans (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181868#comment-17181868 ]

Robert Joseph Evans commented on SPARK-32672:
---------------------------------------------

So I was able to reduce the corruption down to just a single 10,000-row chunk 
and still get it to happen. I'll post a new parquet file soon that should make 
debugging a little simpler.

{code}
scala> val bad_order = spark.read.parquet("/home/roberte/src/rapids-plugin-4-spark/integration_tests/bad_order.snappy.parquet").selectExpr("b", "monotonically_increasing_id() as id").where(col("id") >= 70000 and col("id") < 80000)
bad_order: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [b: boolean, id: bigint]

scala> bad_order.groupBy("b").count.show
+-----+-----+
|    b|count|
+-----+-----+
| null|  619|
| true| 4701|
|false| 4680|
+-----+-----+

scala> bad_order.cache()
res2: bad_order.type = [b: boolean, id: bigint]

scala> bad_order.groupBy("b").count.show
+-----+-----+
|    b|count|
+-----+-----+
| null|  618|
| true| 4701|
|false| 4681|
+-----+-----+
{code}
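For reference, the reduced chunk can be written back out as its own file with 
something along these lines (a sketch; the attached small_bad.snappy.parquet 
may have been produced differently):

{code}
// Sketch: save just the bad 10,000-row chunk as a standalone parquet file,
// dropping the helper id column and forcing a single output file.
bad_order.select("b").repartition(1)
  .write.mode("overwrite").parquet("./small_bad.snappy.parquet")
{code}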

> Data corruption in some cached compressed boolean columns
> ---------------------------------------------------------
>
>                 Key: SPARK-32672
>                 URL: https://issues.apache.org/jira/browse/SPARK-32672
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>              Labels: correctness
>         Attachments: bad_order.snappy.parquet
>






[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-21 Thread Thomas Graves (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181867#comment-17181867 ]

Thomas Graves commented on SPARK-32672:
---------------------------------------

[~cltlfcjin] Please do not change the priority just because the reporter is 
not a committer. You should first evaluate what they are reporting, and if you 
don't think it's a blocker, then we should state the reason.

I looked at this after it was filed and added the correctness label; it was 
already marked as Blocker, so I didn't need to change it.  As you can see from 
https://spark.apache.org/contributing.html, correctness issues should be 
marked as a blocker at least until they have been investigated and discussed.

> Data corruption in some cached compressed boolean columns
> ---------------------------------------------------------
>
>                 Key: SPARK-32672
>                 URL: https://issues.apache.org/jira/browse/SPARK-32672
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>              Labels: correctness
>         Attachments: bad_order.snappy.parquet
>






[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Jungtaek Lim (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181632#comment-17181632 ]

Jungtaek Lim commented on SPARK-32672:
--------------------------------------

Just FYI, he's a PMC member. And a correctness issue is normally a blocker 
unless there's some strong reason not to address it right now.

> Data corruption in some cached compressed boolean columns
> ---------------------------------------------------------
>
>                 Key: SPARK-32672
>                 URL: https://issues.apache.org/jira/browse/SPARK-32672
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>            Reporter: Robert Joseph Evans
>            Priority: Critical
>              Labels: correctness
>         Attachments: bad_order.snappy.parquet
>






[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Lantao Jin (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181553#comment-17181553 ]

Lantao Jin commented on SPARK-32672:
------------------------------------

Changed to Critical; Blocker is reserved for committers.

> Data corruption in some cached compressed boolean columns
> ---------------------------------------------------------
>
>                 Key: SPARK-32672
>                 URL: https://issues.apache.org/jira/browse/SPARK-32672
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>              Labels: correctness
>         Attachments: bad_order.snappy.parquet
>






[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181486#comment-17181486 ]

Robert Joseph Evans commented on SPARK-32672:
---------------------------------------------

I added some debugging to the compression code, and it looks like in the 8th 
CompressedBatch of 10,000 entries the number of nulls seen was different from 
the number expected: 619 expected, but 618 seen.

I'll try to debug this a bit more tomorrow.
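For anyone following along, here is a sketch of how to pin down which 
10,000-row batch disagrees from the shell. It assumes 10,000 matches the 
default spark.sql.inMemoryColumnarStorage.batchSize, that the boolean column 
is named b as in the attached file, and that the file is read as a single 
partition so monotonically_increasing_id() yields contiguous ids:

{code}
// Sketch: compare per-batch null counts before and after caching.
import org.apache.spark.sql.functions._

val df = spark.read.parquet("./bad_order.snappy.parquet")
  .select(col("b"), monotonically_increasing_id().as("id"))

val batches = (0L to df.count() / 10000L).map { i =>
  df.where(col("id") >= i * 10000L and col("id") < (i + 1) * 10000L)
}

val before = batches.map(_.where(col("b").isNull).count())
df.cache().count() // materialize the compressed in-memory columns
val after = batches.map(_.where(col("b").isNull).count())

before.zip(after).zipWithIndex.foreach { case ((b, a), i) =>
  if (b != a) println(s"batch $i: expected $b nulls, saw $a")
}
{code}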

> Data corruption in some cached compressed boolean columns
> ---------------------------------------------------------
>
>                 Key: SPARK-32672
>                 URL: https://issues.apache.org/jira/browse/SPARK-32672
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0, 3.0.1
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>              Labels: correctness
>         Attachments: bad_order.snappy.parquet
>






[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181478#comment-17181478 ]

Robert Joseph Evans commented on SPARK-32672:
---------------------------------------------

I did a little debugging and found that {{BooleanBitSet$Encoder}} is being 
used for compression.  There are other data orderings that use the same 
encoder and produce correct results, though.
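A quick harness along these lines (a sketch; checkRoundTrip is a hypothetical 
helper written for this thread, not a Spark API, and it assumes the boolean 
column is named b) makes it easy to test other orderings against the same 
encoder:

{code}
// Sketch: verify that a boolean column survives a cache round trip unchanged.
def checkRoundTrip(df: org.apache.spark.sql.DataFrame): Unit = {
  val before = df.groupBy("b").count().collect().toSet
  df.cache().count() // force the compressed cache to be built
  val after = df.groupBy("b").count().collect().toSet
  df.unpersist()
  println(if (before == after) "round trip OK" else s"MISMATCH: $before vs $after")
}
{code}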

> Data corruption in some cached compressed boolean columns
> ---------------------------------------------------------
>
>                 Key: SPARK-32672
>                 URL: https://issues.apache.org/jira/browse/SPARK-32672
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0, 3.0.1
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>              Labels: correctness
>         Attachments: bad_order.snappy.parquet
>






[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Thomas Graves (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181468#comment-17181468 ]

Thomas Graves commented on SPARK-32672:
---------------------------------------

[~cloud_fan] [~ruifengz]

> Data corruption in some cached compressed boolean columns
> ---------------------------------------------------------
>
>                 Key: SPARK-32672
>                 URL: https://issues.apache.org/jira/browse/SPARK-32672
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0, 3.0.1
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>              Labels: correctness
>         Attachments: bad_order.snappy.parquet
>






[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181466#comment-17181466 ]

Robert Joseph Evans commented on SPARK-32672:
---------------------------------------------

I verified that this is still happening on 3.1.0-SNAPSHOT too.

> Data corruption in some cached compressed boolean columns
> ---------------------------------------------------------
>
>                 Key: SPARK-32672
>                 URL: https://issues.apache.org/jira/browse/SPARK-32672
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: bad_order.snappy.parquet
>






[jira] [Commented] (SPARK-32672) Data corruption in some cached compressed boolean columns

2020-08-20 Thread Robert Joseph Evans (Jira)


[ https://issues.apache.org/jira/browse/SPARK-32672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181459#comment-17181459 ]

Robert Joseph Evans commented on SPARK-32672:
---------------------------------------------

I verified that this is still happening on 3.0.2-SNAPSHOT.

> Data corruption in some cached compressed boolean columns
> ---------------------------------------------------------
>
>                 Key: SPARK-32672
>                 URL: https://issues.apache.org/jira/browse/SPARK-32672
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>         Attachments: bad_order.snappy.parquet
>


