[ https://issues.apache.org/jira/browse/SPARK-48950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883967#comment-17883967 ]
Thomas Newton commented on SPARK-48950: --------------------------------------- I've been doing a bit more experimentation and have some more info to share.

Firstly, I've attached dumps of some of the outputs I got from running my repro script: [^corrupt_data_examples.zip]. Checking these, we can see that the bug most often occurs in a plain double type column, but there is also one occurrence in the array-of-double column.

I also did some testing using a local Spark session on a single node, and I was able to reproduce there too. NOTE: to reproduce, it's important to enable task retries, e.g. `SparkSession.builder.master("local[12, 10]")`. Reproducing is also generally a bit slower than on a multi-node cluster, I assume because the code simply runs more slowly.

I also tried reproducing with the data stored on the local filesystem. In that case I have not been able to reproduce, so I suspect the corrupt data originates in hadoop-azure when reading from Azure Blob Storage. Maybe the fix is just that we need to be using checksums (from what I could tell, Spark does not make use of the Parquet checksums), but I would really like to understand why reverting [https://github.com/apache/spark/pull/39950] solves, or at least reduces the frequency of, the problem.
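As an aside, the "uniform random bits interpreted as a double" theory from the issue description can be illustrated with a small standalone Python sketch (this is purely illustrative and not part of the attached repro script): reinterpreting uniformly random 64-bit patterns as IEEE-754 doubles mostly yields values with extreme exponents, much like the corrupt values quoted in the issue.

```python
import random
import struct

def bits_to_double(bits: int) -> float:
    """Reinterpret a 64-bit integer's bit pattern as an IEEE-754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

random.seed(0)
samples = [bits_to_double(random.getrandbits(64)) for _ in range(1000)]

# Drop NaN (x != x) and infinities before looking at magnitudes.
finite = [x for x in samples if x == x and abs(x) != float("inf")]

# Uniform random bit patterns mostly decode to doubles with extreme
# exponents (very large or very small magnitudes), similar in flavour
# to the corrupt values reported in this issue.
extreme = [x for x in finite if abs(x) > 1e30 or 0 < abs(x) < 1e-30]
print(len(extreme) / len(finite))
```

If the corruption really is arbitrary bytes landing in a double column, this also explains why a magnitude-based filter (one that should return zero rows on clean data) catches most, though not necessarily all, corrupt values.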
> Corrupt data from parquet scans
> -------------------------------
>
> Key: SPARK-48950
> URL: https://issues.apache.org/jira/browse/SPARK-48950
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2
> Environment: Spark 3.5.0
> Running on kubernetes
> Using Azure Blob storage with hierarchical namespace enabled
> Reporter: Thomas Newton
> Priority: Major
> Labels: correctness
> Attachments: corrupt_data_examples.zip, example_task_errors.txt, generate_data_to_reproduce_spark-48950.ipynb, job_dag.png, reproduce_spark-48950.py, sql_query_plan.png
>
> It's very rare and non-deterministic, but since Spark 3.5.0 we have started seeing a correctness bug in Parquet scans when using the vectorized reader. We've noticed this on double type columns, where occasionally small groups of rows (typically 10s to 100s) are replaced with wild values like `-1.29996470e+029, 3.56717569e-184, 7.23323243e+307, -1.05929677e+045, -7.60562076e+240, -3.18088886e-064, 2.89435993e-116`. I think this is the result of interpreting uniform random bits as a double type. Most of my testing has been on an array-of-double column, but we have also seen it on un-nested plain double type columns.
> I've been testing this by adding a filter that should return zero results but returns non-zero results if the Parquet scan has problems. I've attached screenshots of this from the Spark UI.
> I did a `git bisect` and found that the problem starts with [https://github.com/apache/spark/pull/39950], but I haven't yet understood why. It's possible that this change is fine but reveals a problem elsewhere. I also noticed [https://github.com/apache/spark/pull/44853], which appears to be a different implementation of the same thing, so maybe that could help.
> It's not a major problem by itself, but another symptom appears to be that Parquet scan tasks fail at a rate of approximately 0.03%, with errors like those in the attached `example_task_errors.txt`. If I revert [https://github.com/apache/spark/pull/39950], I get exactly 0 task failures on the same test.
>
> The problem seems to be somewhat dependent on how the Parquet files happen to be organised on blob storage, so I don't yet have a reproduction I can share that doesn't depend on private data.
>
> I tested on a pre-release 4.0.0 and the problem was still present.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)