[ https://issues.apache.org/jira/browse/SPARK-26859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-26859:
----------------------------------
    Summary: Fix field writer index bug in non-vectorized ORC deserializer  (was: Reading ORC files with explicit schema can result in wrong data)

> Fix field writer index bug in non-vectorized ORC deserializer
> --------------------------------------------------------------
>
>                 Key: SPARK-26859
>                 URL: https://issues.apache.org/jira/browse/SPARK-26859
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Ivan Vergiliev
>            Priority: Major
>              Labels: correctness
>
> There is a bug in the ORC deserialization code that, when triggered, results in completely wrong data being read. I've marked this as a Blocker as per the docs in https://spark.apache.org/contributing.html, since it's a data correctness issue.
>
> The bug is triggered when all of the following conditions are met:
> - the non-vectorized ORC reader is being used;
> - a schema is explicitly specified when reading the ORC file;
> - the provided schema has columns that are not present in the ORC file, and these columns are in the middle of the schema;
> - the ORC file being read contains null values in the columns after the ones added by the schema.
>
> When all of these are met:
> - the internal state of the ORC deserializer gets corrupted, and, as a result,
> - the null values from the ORC file end up being set on the wrong columns, not the ones they belong to, and
> - the old values from the null columns don't get cleared from the previous record.
>
> Here's a concrete example. Consider the following DataFrame:
> {code:scala}
> val rdd = sparkContext.parallelize(Seq((1, 2, "abc"), (4, 5, "def"), (8, 9, null)))
> val df = rdd.toDF("col1", "col2", "col3")
> {code}
> and the following schema:
> {code:scala}
> col1 int, col4 int, col2 int, col3 string
> {code}
> Notice the `col4 int` added in the middle that doesn't exist in the DataFrame. Saving this DataFrame to ORC and then reading it back with the specified schema should return the same values, with nulls for `col4`. Instead, we get the following back:
> {code:java}
> [1,null,2,abc]
> [4,null,5,def]
> [8,null,null,def]
> {code}
> Notice how the `def` from the second record doesn't get properly cleared and ends up in the third record as well; also, instead of `col2 = 9` in the last record as expected, we get the null that should have been in column 3. (A full end-to-end reproduction sketch is included at the end of this message.)
>
> *Impact*
> When this issue is triggered, completely wrong results are read from the ORC file. The set of conditions under which it is triggered is fairly narrow, so the set of affected users is probably limited. There may also be affected users who haven't noticed, precisely because the conditions are so obscure.
>
> *Bug details*
> The issue is caused by calling `setNullAt` with a wrong index in `OrcDeserializer.scala:deserialize()`. I have a fix that I'll send out for review shortly. (A simplified illustration of the indexing mistake is sketched at the end of this message.)
>
> *Workaround*
> This bug is currently only triggered when new columns are added to the middle of the schema, so it can be worked around by only adding new columns at the end (see the example at the end of this message).
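For reference, here is a hedged end-to-end reproduction sketch of the steps described above. The session setup, the output path, and the config values are illustrative assumptions (the report only states that the non-vectorized ORC reader must be in use), not part of the original report:

{code:scala}
import org.apache.spark.sql.SparkSession

// Assumed setup: force the native, non-vectorized ORC reader, which the
// report names as one of the trigger conditions.
val spark = SparkSession.builder()
  .appName("SPARK-26859-repro")
  .master("local[*]")
  .config("spark.sql.orc.impl", "native")
  .config("spark.sql.orc.enableVectorizedReader", "false")
  .getOrCreate()
import spark.implicits._

// The DataFrame from the report.
val df = Seq((1, 2, "abc"), (4, 5, "def"), (8, 9, null))
  .toDF("col1", "col2", "col3")
df.write.mode("overwrite").orc("/tmp/spark-26859") // hypothetical path

// Read it back with the explicit schema that has `col4 int` in the middle.
val readBack = spark.read
  .schema("col1 int, col4 int, col2 int, col3 string")
  .orc("/tmp/spark-26859")
readBack.collect().foreach(println)
// Expected:               [1,null,2,abc] [4,null,5,def] [8,null,9,null]
// Observed with the bug:  [1,null,2,abc] [4,null,5,def] [8,null,null,def]
{code}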
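The following self-contained sketch mimics the indexing pattern described in the *Bug details* section; it is an illustration under stated assumptions, not the actual `OrcDeserializer` code. The idea: the deserializer keeps one index into the requested schema and a separate index over the field writers, which skip columns missing from the file; setting a null at the writer-relative index instead of the schema-relative index reproduces exactly the reported output:

{code:scala}
object Spark26859Sketch {
  def main(args: Array[String]): Unit = {
    // Requested schema: col1, col4, col2, col3; col4 is absent from the file.
    // fileColIds maps each schema position to its file column, -1 if missing.
    val fileColIds = Array(0, -1, 1, 2)
    // Two file records: (col1, col2, col3) = (4, 5, "def") and (8, 9, null).
    val records = Seq(Array[Any](4, 5, "def"), Array[Any](8, 9, null))

    // Like Spark's deserializer, reuse one mutable row across records.
    val resultRow = new Array[Any](fileColIds.length)
    for (record <- records) {
      var writerIndex = 0 // counts only the columns present in the file
      for (schemaIndex <- fileColIds.indices) {
        if (fileColIds(schemaIndex) == -1) {
          resultRow(schemaIndex) = null // column missing from file: always null
        } else {
          val value = record(fileColIds(schemaIndex))
          if (value == null) {
            // BUG (per the report): the null is set at the wrong index. Once a
            // missing column has pushed writerIndex out of sync with
            // schemaIndex, the null lands on the wrong field, and the stale
            // value at schemaIndex from the previous record is never cleared.
            resultRow(writerIndex) = null // fix: resultRow(schemaIndex) = null
          } else {
            resultRow(schemaIndex) = value
          }
          writerIndex += 1
        }
      }
      println(resultRow.mkString("[", ",", "]"))
    }
    // Prints [4,null,5,def] then [8,null,null,def] -- the reported wrong output.
  }
}
{code}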
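And a short sketch of the stated workaround, reusing the illustrative path and session from the reproduction above: with the new column appended at the end instead of inserted in the middle, the same file reads back correctly:

{code:scala}
// Workaround: append new columns at the end of the explicit schema.
val safeReadBack = spark.read
  .schema("col1 int, col2 int, col3 string, col4 int")
  .orc("/tmp/spark-26859") // same hypothetical path as above
safeReadBack.collect().foreach(println)
// [1,2,abc,null] [4,5,def,null] [8,9,null,null]
{code}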