Hi, Mark. That is one of the reasons why I left it out of the previous PR (below), and I'm focusing on the second approach: use OrcFileFormat with convertMetastoreOrc.
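For reference, on the user side that approach is just a configuration flip. A minimal sketch, assuming a Hive-enabled session (the table name `t` is a placeholder, not a real table):

    import org.apache.spark.sql.SparkSession

    // Assumes a Hive metastore is available; `t` is a placeholder ORC table.
    val spark = SparkSession.builder()
      .appName("orc-conversion-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // With this flag on, RelationConversions rewrites the metastore ORC
    // relation to use Spark's own OrcFileFormat instead of the
    // hive-exec 1.2.1 reader.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

    spark.sql("SELECT * FROM t").show()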
https://github.com/apache/spark/pull/19470
[SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC table instead of ORC file schema

With `convertMetastoreOrc=true`, Spark 2.3 will become more stable and faster. Also, that conversion is already the default way Spark handles Parquet.

BTW, thank you for looking at SPARK-22267. So far, I have not looked into that issue. If we have a fix for SPARK-22267 in Spark 2.3, that would be great!

Bests,
Dongjoon.

On Tue, Nov 14, 2017 at 3:46 AM, Mark Petruska <petruska.m...@gmail.com> wrote:
> Hi,
> I'm very new to Spark development and would like to get guidance from
> more experienced members. Sorry this email will be long as I try to
> explain the details.
>
> I started to investigate SPARK-22267
> <https://issues.apache.org/jira/browse/SPARK-22267> and added some test
> cases to highlight the problem in the PR
> <https://github.com/apache/spark/pull/19744>. Here are my findings:
>
> - For Parquet, the test case succeeds as expected.
>
> - For the SQL test case with ORC:
>   - when CONVERT_METASTORE_ORC is set to "true", the data fields are
>     presented in the desired order;
>   - when it is "false", the columns are read in the wrong order.
>   - Reason: when `isConvertible` returns true in `RelationConversions`,
>     the plan executes `convertToLogicalRelation`, which in turn uses
>     `OrcFileFormat` to read the data; otherwise it uses the classes in
>     "hive-exec:1.2.1".
>
> - The HadoopRDD test case was added to further investigate the
>   parameter values and discover a working combination, but
>   unfortunately no combination of "serialization.ddl" and "columns"
>   results in success. It seems that those fields have no effect on the
>   order of the resulting data fields.
>
> At this point I do not see any option to fix this issue without risking
> backward-compatibility problems. The possible actions (as I see them):
> - Link a new version of "hive-exec": surely this bug has been fixed in
>   a newer version.
> - Use `OrcFileFormat` for reading ORC data regardless of the setting of
>   CONVERT_METASTORE_ORC.
> - There is also an `OrcNewInputFormat` class in "hive-exec", but it
>   implements an InputFormat interface from a different package, so it
>   is incompatible with HadoopRDD at the moment.
>
> Please help me. Did I miss any viable options?
>
> Thanks,
> Mark
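P.S. For readers following along, here is a hedged sketch of the kind of reproduction described above (table and column names are illustrative, not the actual tests; those live in https://github.com/apache/spark/pull/19744):

    // Illustrative sketch only, assuming a Hive-enabled session and that
    // `spark.sql.hive.convertMetastoreOrc` can be toggled per session.
    // `orc_repro` is a made-up table name.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("spark-22267-sketch")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE TABLE orc_repro (c0 STRING, c1 STRING) STORED AS ORC")
    spark.sql("INSERT INTO orc_repro VALUES ('a', 'b')")

    // hive-exec path: per the findings above, the result columns can come
    // back in the wrong order.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
    spark.sql("SELECT c1, c0 FROM orc_repro").show()

    // OrcFileFormat path: the requested column order is honored.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
    spark.sql("SELECT c1, c0 FROM orc_repro").show()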