Hi, Mark. That is one of the reasons why I left it out of the previous PR (below), and I'm focusing on the second approach: use OrcFileFormat with convertMetastoreOrc.
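For reference, on the user side that approach is just a configuration flip. A minimal sketch, assuming a Hive-enabled session (the table name `t` is a placeholder, not a real table):

    import org.apache.spark.sql.SparkSession

    // Assumes a Hive metastore is available; `t` is a placeholder ORC table.
    val spark = SparkSession.builder()
      .appName("orc-conversion-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // With this flag on, RelationConversions rewrites the metastore ORC
    // relation to use Spark's own OrcFileFormat instead of the
    // hive-exec 1.2.1 reader.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

    spark.sql("SELECT * FROM t").show()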
https://github.com/apache/spark/pull/19470
[SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC table instead of ORC file schema

With `convertMetastoreOrc=true`, Spark 2.3 will become more stable and faster. Also, that conversion is already the default way Spark handles Parquet.

BTW, thank you for looking at SPARK-22267. So far, I have not looked into that issue. If we have a fix for SPARK-22267 in Spark 2.3, that would be great!

Bests,
Dongjoon.

On Tue, Nov 14, 2017 at 3:46 AM, Mark Petruska <petruska.m...@gmail.com> wrote:
> Hi,
> I'm very new to Spark development and would like to get guidance from
> more experienced members. Sorry this email will be long as I try to
> explain the details.
>
> I started to investigate SPARK-22267
> <https://issues.apache.org/jira/browse/SPARK-22267> and added some test
> cases to highlight the problem in the PR
> <https://github.com/apache/spark/pull/19744>. Here are my findings:
>
> - For Parquet, the test case succeeds as expected.
>
> - For the SQL test case with ORC:
>   - when CONVERT_METASTORE_ORC is set to "true", the data fields are
>     presented in the desired order;
>   - when it is "false", the columns are read in the wrong order.
>   - Reason: when `isConvertible` returns true in `RelationConversions`,
>     the plan executes `convertToLogicalRelation`, which in turn uses
>     `OrcFileFormat` to read the data; otherwise it uses the classes in
>     "hive-exec:1.2.1".
>
> - The HadoopRDD test case was added to further investigate the
>   parameter values and discover a working combination, but
>   unfortunately no combination of "serialization.ddl" and "columns"
>   results in success. It seems that those fields have no effect on the
>   order of the resulting data fields.
>
> At this point I do not see any option to fix this issue without risking
> backward-compatibility problems. The possible actions (as I see them):
> - Link a new version of "hive-exec": surely this bug has been fixed in
>   a newer version.
> - Use `OrcFileFormat` for reading ORC data regardless of the setting of
>   CONVERT_METASTORE_ORC.
> - There is also an `OrcNewInputFormat` class in "hive-exec", but it
>   implements an InputFormat interface from a different package, so it
>   is incompatible with HadoopRDD at the moment.
>
> Please help me. Did I miss any viable options?
>
> Thanks,
> Mark
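P.S. For readers following along, here is a hedged sketch of the kind of reproduction described above (table and column names are illustrative, not the actual tests; those live in https://github.com/apache/spark/pull/19744):

    // Illustrative sketch only, assuming a Hive-enabled session and that
    // `spark.sql.hive.convertMetastoreOrc` can be toggled per session.
    // `orc_repro` is a made-up table name.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("spark-22267-sketch")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE TABLE orc_repro (c0 STRING, c1 STRING) STORED AS ORC")
    spark.sql("INSERT INTO orc_repro VALUES ('a', 'b')")

    // hive-exec path: per the findings above, the result columns can come
    // back in the wrong order.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
    spark.sql("SELECT c1, c0 FROM orc_repro").show()

    // OrcFileFormat path: the requested column order is honored.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
    spark.sql("SELECT c1, c0 FROM orc_repro").show()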