[ https://issues.apache.org/jira/browse/SPARK-48950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879306#comment-17879306 ]
Bruce Robbins commented on SPARK-48950:
---------------------------------------

By the way, there was a vectorization-related correctness issue in 3.5.0 and 3.5.1:
{noformat}
spark-sql (default)> select version();
3.5.1 fd86f85e181fc2dc0f50a096855acf83a6cc5d9c
Time taken: 0.043 seconds, Fetched 1 row(s)
spark-sql (default)> drop table if exists t1;
Time taken: 0.127 seconds
spark-sql (default)> create table t1 using parquet as select * from values (named_struct('f1', array(1, 2, 3), 'f2', array(1, null, 2))) as (value);
Time taken: 0.197 seconds
spark-sql (default)> select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;
{"f1":[1.0,2.0,3.0],"f2":[1,0,2]}
Time taken: 0.112 seconds, Fetched 1 row(s)
spark-sql (default)>
{noformat}
The 0 in the second slot of field f2 is wrong (it should be null). I believe this was fixed by SPARK-48019 for 3.5.2.

As far as I know, that correctness bug affected only nested values. I see that the reporter reverted SPARK-42388 and the problem seemed to disappear, but maybe that just changed timing or memory layout enough to make the bug less noticeable? Either that, or the reporter's issue really is tied to SPARK-42388. Hopefully I have not muddied the waters.

> Corrupt data from parquet scans
> -------------------------------
>
>                 Key: SPARK-48950
>                 URL: https://issues.apache.org/jira/browse/SPARK-48950
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 3.5.0, 4.0.0, 3.5.1
>         Environment: Spark 3.5.0
>                      Running on kubernetes
>                      Using Azure Blob storage with hierarchical namespace enabled
>            Reporter: Thomas Newton
>            Priority: Major
>              Labels: correctness
>         Attachments: example_task_errors.txt, job_dag.png, sql_query_plan.png
>
> It's very rare and non-deterministic, but since Spark 3.5.0 we have started seeing a correctness bug in parquet scans when using the vectorized reader.
> We've noticed this on double type columns where occasionally small groups of rows (typically 10s to 100s) are replaced with crazy values like `-1.29996470e+029, 3.56717569e-184, 7.23323243e+307, -1.05929677e+045, -7.60562076e+240, -3.18088886e-064, 2.89435993e-116`. I think this is the result of interpreting uniform random bits as a double type. Most of my testing has been on an array of double type column, but we have also seen it on un-nested plain double type columns.
> I've been testing this by adding a filter that should return zero results but will return non-zero results if the parquet scan has problems. I've attached screenshots of this from the Spark UI.
> I did a `git bisect` and found that the problem starts with [https://github.com/apache/spark/pull/39950], but I haven't yet understood why. It's possible that this change is fine but reveals a problem elsewhere. I also noticed [https://github.com/apache/spark/pull/44853], which appears to be a different implementation of the same thing, so maybe that could help.
> It's not a major problem by itself, but another symptom appears to be that parquet scan tasks fail at a rate of approximately 0.03% with errors like those in the attached `example_task_errors.txt`. If I revert [https://github.com/apache/spark/pull/39950] I get exactly 0 task failures on the same test.
>
> The problem seems to be somewhat dependent on how the parquet files happen to be organised on blob storage, so I don't yet have a reproduction I can share that doesn't depend on private data.
> I tested on a pre-release 4.0.0 and the problem was still present.
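As a sketch of the zero-results filter check the reporter describes: assuming a table `t` with an `array<double>` column `vals` (both names hypothetical, since the real data is private) whose legitimate values are known to stay within some bounded range, a query like the following should always return 0, and a non-zero count on an affected version would point to corrupted doubles coming out of the scan:
{noformat}
-- Hypothetical sanity check; `t`, `vals`, and the 1e6 bound are
-- placeholders, not from the report. `exists` is Spark SQL's
-- higher-order array predicate (Spark 2.4+).
SELECT count(*)
FROM t
WHERE exists(vals, v -> abs(v) > 1e6 OR isnan(v));
-- Expected: 0. Non-zero indicates rows whose doubles fall outside
-- the known-valid range, i.e. the corruption described above.
{noformat}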
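If the regression really is confined to the vectorized reader path, one possible mitigation while on an affected version (a workaround sketch, not a fix) would be falling back to the non-vectorized Parquet reader via the existing `spark.sql.parquet.enableVectorizedReader` config, at the cost of scan performance:
{noformat}
-- Disable the vectorized Parquet reader for the current session.
SET spark.sql.parquet.enableVectorizedReader=false;
{noformat}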