Re: SPARK-22267 issue: Spark SQL incorrectly reads ORC file when column order is different

2017-11-15 Thread Mark Petruska
  Hi Dongjoon,
Thanks for the info.
Unfortunately I did not find any means to fix the issue without
forcing CONVERT_METASTORE_ORC
or changing the ORC reader implementation.
Closing the PR, as it was only used to demonstrate the root cause.
Best regards,
Mark


Re: SPARK-22267 issue: Spark SQL incorrectly reads ORC file when column order is different

2017-11-14 Thread Dongjoon Hyun
Hi, Mark.

That is one of the reasons why I left it out of the previous PR
(below); I'm focusing on the second approach: use OrcFileFormat with
convertMetastoreOrc.

https://github.com/apache/spark/pull/19470
[SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC
table instead of ORC file schema

With `convertMetastoreOrc=true`, Spark 2.3 will become more stable and faster.
This is also how Spark handles Parquet by default.
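For reference, a minimal sketch of enabling the conversion at runtime; the
session setup is illustrative, but `spark.sql.hive.convertMetastoreOrc` is
the SQLConf key behind `CONVERT_METASTORE_ORC`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("convert-metastore-orc")   // illustrative app name
  .enableHiveSupport()
  .getOrCreate()

// Route scans of Hive metastore ORC tables through Spark's native
// OrcFileFormat instead of the hive-exec 1.2.1 SerDe path, mirroring
// what spark.sql.hive.convertMetastoreParquet does for Parquet.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
```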

BTW, thank you for looking at SPARK-22267; I have not been looking at that
issue myself so far.
If we can get a fix for SPARK-22267 into Spark 2.3, that would be great!

Bests,
Dongjoon.


SPARK-22267 issue: Spark SQL incorrectly reads ORC file when column order is different

2017-11-14 Thread Mark Petruska
  Hi,
I'm very new to Spark development and would like to get guidance from more
experienced members.
Sorry this email will be long, as I try to explain the details.

I started to investigate issue SPARK-22267 and added some test cases to
highlight the problem in the PR. Here are my findings:

- for Parquet, the test case succeeds as expected

- the SQL test case for ORC:
  - when CONVERT_METASTORE_ORC is set to "true", the data fields are
    presented in the desired order
  - when it is "false", the columns are read in the wrong order
  - Reason: when `isConvertible` returns true in `RelationConversions`,
    the plan executes `convertToLogicalRelation`, which in turn uses
    `OrcFileFormat` to read the data; otherwise it uses the classes in
    "hive-exec:1.2.1" (see the reproduction sketch after this list)

- the HadoopRDD test case was added to further investigate the parameter
  values and find a working combination, but unfortunately no combination
  of "serialization.ddl" and "columns" results in success; those fields do
  not seem to have any effect on the order of the resulting data fields
  (see the HadoopRDD sketch after this list)
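To make the SQL scenario concrete, here is a minimal reproduction sketch;
the table name, column names, and path are hypothetical, and a Hive-enabled
session is assumed:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-22267-repro")   // illustrative app name
  .enableHiveSupport()
  .getOrCreate()

// Write ORC data whose physical column order (c1, c0) differs from the
// schema the metastore table declares below (c0, c1).
spark.range(10)
  .selectExpr("id + 1 AS c1", "id AS c0")
  .write.mode("overwrite").orc("/tmp/spark22267")   // hypothetical path

spark.sql(
  """CREATE EXTERNAL TABLE t22267 (c0 BIGINT, c1 BIGINT)
    |STORED AS ORC LOCATION '/tmp/spark22267'""".stripMargin)

// With "false", the hive-exec SerDe maps columns by position, so the
// values of c0 and c1 come back swapped; with "true", OrcFileFormat
// resolves the columns correctly.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.sql("SELECT c0, c1 FROM t22267").show()
```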
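The HadoopRDD probe looked roughly like the following sketch; the path and
the property values are illustrative (they reuse the hypothetical names
from the reproduction sketch):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
import org.apache.hadoop.io.{NullWritable, Writable}
import org.apache.hadoop.mapred.{FileInputFormat, InputFormat, JobConf}

val conf = new JobConf(spark.sparkContext.hadoopConfiguration)
FileInputFormat.setInputPaths(conf, new Path("/tmp/spark22267"))

// The properties probed in the test; neither changed the field order.
conf.set("columns", "c0,c1")
conf.set("serialization.ddl", "struct t22267 { i64 c0, i64 c1 }")

// OrcInputFormat is typed on OrcStruct, so widen the value type to
// Writable to satisfy hadoopRDD.
val inputFormat = classOf[OrcInputFormat]
  .asInstanceOf[Class[InputFormat[NullWritable, Writable]]]

val rdd = spark.sparkContext.hadoopRDD(
  conf, inputFormat, classOf[NullWritable], classOf[Writable])

// Each value is an OrcStruct; its toString shows the order actually read.
rdd.take(5).foreach { case (_, row) => println(row) }
```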


At this point I do not see any option to fix this issue without risking
backward-compatibility problems.
The possible actions (as I see them):
- link against a newer version of "hive-exec": this bug has surely been
  fixed in a newer release
- use `OrcFileFormat` for reading ORC data regardless of the
  CONVERT_METASTORE_ORC setting
- also, there is an `OrcNewInputFormat` class in "hive-exec", but it
  implements the InputFormat interface from a different package
  (`org.apache.hadoop.mapreduce` rather than `org.apache.hadoop.mapred`),
  so it is incompatible with HadoopRDD at the moment (see the sketch below)
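For the last option, the mismatch is purely the package: HadoopRDD expects
the old `mapred` interface, while `OrcNewInputFormat` implements the new
`mapreduce` one. A sketch of reading it through `newAPIHadoopRDD`, the
mapreduce-API counterpart, instead (path illustrative; note this sidesteps
HadoopRDD rather than fixing it):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat
import org.apache.hadoop.io.{NullWritable, Writable}
import org.apache.hadoop.mapreduce.{InputFormat, Job}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

val job = Job.getInstance(spark.sparkContext.hadoopConfiguration)
FileInputFormat.addInputPath(job, new Path("/tmp/spark22267"))

// OrcNewInputFormat implements the mapreduce-package InputFormat, so it
// fits newAPIHadoopRDD (but not hadoopRDD); widen the value type as before.
val inputFormat = classOf[OrcNewInputFormat]
  .asInstanceOf[Class[InputFormat[NullWritable, Writable]]]

val rdd = spark.sparkContext.newAPIHadoopRDD(
  job.getConfiguration, inputFormat, classOf[NullWritable], classOf[Writable])

rdd.take(5).foreach { case (_, row) => println(row) }
```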

Please help me: did I miss any viable options?

Thanks,
Mark