[ https://issues.apache.org/jira/browse/SPARK-48950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870190#comment-17870190 ]
Kent Yao commented on SPARK-48950:
----------------------------------

Thank you [~Tom_Newton] for the additional inputs. I'd retarget this to 3.5.3 to unblock 3.5.2. WDYT? [~dongjoon]

> Corrupt data from parquet scans
> -------------------------------
>
>                 Key: SPARK-48950
>                 URL: https://issues.apache.org/jira/browse/SPARK-48950
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 3.5.0, 4.0.0, 3.5.1
>         Environment: Spark 3.5.0
>                      Running on kubernetes
>                      Using Azure Blob storage with hierarchical namespace enabled
>            Reporter: Thomas Newton
>            Priority: Major
>              Labels: correctness
>         Attachments: example_task_errors.txt, job_dag.png, sql_query_plan.png
>
> It's very rare and non-deterministic, but since Spark 3.5.0 we have started seeing a correctness bug in parquet scans when using the vectorized reader.
>
> We've noticed this on double type columns, where occasionally small groups of rows (typically 10s to 100s) are replaced with wild values like `-1.29996470e+029, 3.56717569e-184, 7.23323243e+307, -1.05929677e+045, -7.60562076e+240, -3.18088886e-064, 2.89435993e-116`. I think this is the result of interpreting uniform random bits as a double type. Most of my testing has been on an array of double type column, but we have also seen it on un-nested plain double type columns.
>
> I've been testing this by adding a filter that should return zero results, but will return non-zero results if the parquet scan has problems. I've attached screenshots of this from the Spark UI.
>
> I did a `git bisect` and found that the problem starts with [https://github.com/apache/spark/pull/39950], but I haven't yet understood why. It's possible that this change is fine but reveals a problem elsewhere. I did also notice [https://github.com/apache/spark/pull/44853], which appears to be a different implementation of the same thing, so maybe that could help.
> It's not a major problem by itself, but another symptom appears to be that parquet scan tasks fail at a rate of approximately 0.03% with errors like those in the attached `example_task_errors.txt`. If I revert [https://github.com/apache/spark/pull/39950] I get exactly 0 task failures on the same test.
>
> The problem seems to be somewhat dependent on how the parquet files happen to be organised on blob storage, so I don't yet have a reproduction I can share that doesn't depend on private data.
>
> I tested on a pre-release 4.0.0 and the problem was still present.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
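[Editorial note] The reporter's hypothesis above — that the corrupt values look like uniform random bits reinterpreted as IEEE-754 doubles — can be sanity-checked outside Spark. The sketch below is plain Python (not the reporter's actual test, which used a filter on a Spark DataFrame): decoding random 64-bit patterns as doubles produces the same signature of wildly scattered exponents (e.g. magnitudes near 1e+307 alongside 1e-184) seen in the report.

```python
import math
import random
import struct

def random_bits_as_double(rng: random.Random) -> float:
    """Reinterpret a uniform random 64-bit pattern as an IEEE-754 double."""
    bits = rng.getrandbits(64)
    # pack as unsigned 64-bit int, unpack the same bytes as a double
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [random_bits_as_double(rng) for _ in range(1000)]

# Uniform bit patterns make the 11-bit exponent field uniform too, so the
# decoded magnitudes are spread across ~616 decades -- matching the
# "crazy values" signature from the bug report rather than any plausible
# real-world data distribution.
finite = [v for v in samples if math.isfinite(v) and v != 0.0]
huge = sum(1 for v in finite if abs(v) > 1e100)
tiny = sum(1 for v in finite if abs(v) < 1e-100)
print(f"finite: {len(finite)}, |v| > 1e100: {huge}, |v| < 1e-100: {tiny}")
```

With uniform bits, roughly a third of samples land above 1e100 in magnitude and a third below 1e-100, which is why a filter on physically impossible magnitudes (as the reporter describes) is an effective corruption detector: it should match zero rows on clean data.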