[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ramakrishna Prasad K S updated SPARK-32558:
-------------------------------------------
    Attachment: image-2020-08-09-14-07-00-521.png

ORC target files that Spark_3.0 produces do not work with Hive_2.1.1
(the work-around of using spark.sql.orc.impl=hive is also not working)
-----------------------------------------------------------------------------------------------------------------------------------------

                 Key: SPARK-32558
                 URL: https://issues.apache.org/jira/browse/SPARK-32558
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
         Environment: Spark 3.0 and a Hadoop cluster running Hive_2.1.1 (Linux Red Hat)
            Reporter: Ramakrishna Prasad K S
            Priority: Major
         Attachments: image-2020-08-09-14-07-00-521.png

Steps to reproduce the issue:
-----------------------------

Download Spark_3.0 from https://spark.apache.org/downloads.html

Step 1) Create an ORC file from spark-shell using the default Spark_3.0 native ORC implementation.

{code}
[linuxuser1@irlrhellinux1 bin]$ ./spark-shell
Welcome to Spark version 3.0.0
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated. Type :help for more information.

scala> spark.sql("set spark.sql.orc.impl").show()
+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

scala> spark.sql("CREATE table df_table(col1 string,col2 string)")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into df_table values('col1val1','col2val1')")
org.apache.spark.sql.DataFrame = []

scala> val dFrame = spark.sql("select * from df_table")
dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> dFrame.show()
+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
{code}
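To see what the native writer actually records in the ORC footer, the file from Step 1 can be read back in the same spark-shell session. This is only a rough sketch: the part-file name below is an example (use whichever part file Step 1 produced), and it assumes the org.apache.orc classes bundled with Spark 3.0 are visible on the spark-shell classpath.

{code}
scala> import org.apache.hadoop.conf.Configuration
scala> import org.apache.hadoop.fs.Path
scala> import org.apache.orc.OrcFile

scala> // Example path: substitute the part file that Step 1 actually wrote
scala> val orcFile = new Path("file:///export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table/part-00000-xxxx.snappy.orc")

scala> val reader = OrcFile.createReader(orcFile, OrcFile.readerOptions(new Configuration()))

scala> reader.getFileVersion    // ORC file format version stored in the footer
scala> reader.getWriterVersion  // writer version id; per the stack trace in Step 2 below, Hive 2.1.1's OrcFile$WriterVersion.from() cannot map the id written here
{code}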
Step 2) Copy the ORC files created in Step 1 to HDFS /tmp on a Hadoop cluster that has Hive_2.1.1 (for example CDH_6.x) and run the following command to read the metadata of the ORC file. As you can see below, it fails to fetch the metadata.

{code}
[adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc
Processing data file /tmp/df_table/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc [length: 414]
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
        at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
        at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
        at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
        at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
        at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
        at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
        at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
        at org.apache.orc.tools.FileDump.main(FileDump.java:154)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
{code}

Step 3) Now create an ORC file using the Hive implementation (as suggested by Spark in https://spark.apache.org/docs/latest/sql-migration-guide.html, by setting spark.sql.orc.impl to hive).

{code}
scala> spark.sql("set spark.sql.orc.impl=hive")
res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()
+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')")
res8: org.apache.spark.sql.DataFrame = []

scala> val dFrame2 = spark.sql("select * from df_table2")
dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
{code}
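The same footer check used after Step 1 can be pointed at the Step 3 output to see whether switching spark.sql.orc.impl to hive changes the writer version recorded in the file. Again, only a rough sketch with an example part-file name:

{code}
scala> import org.apache.hadoop.conf.Configuration
scala> import org.apache.hadoop.fs.Path
scala> import org.apache.orc.OrcFile

scala> // Example path: substitute the part file that Step 3 actually wrote
scala> val orcFile2 = new Path("file:///export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2/part-00000-xxxx.snappy.orc")

scala> val reader2 = OrcFile.createReader(orcFile2, OrcFile.readerOptions(new Configuration()))

scala> reader2.getWriterVersion  // compare with the value reported for the Step 1 file
{code}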
Step 4) Copy the ORC files created in Step 3 to HDFS /tmp on a Hadoop cluster that has Hive_2.1.1 (for example CDH_6.x) and run the following command to read the metadata from the ORC files. As you can see below, it fails with the same exception, even after following the work-around suggested by Spark of setting spark.sql.orc.impl to hive.

{code}
[adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc
Processing data file /tmp/df_table2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc [length: 414]
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
        at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
        at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
        at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
        at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
        at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
        at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
        at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
        at org.apache.orc.tools.FileDump.main(FileDump.java:154)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
{code}

*Note: The same metadata fetch works fine with Hive_2.3 and later versions.*

*The main concern here is that setting {{spark.sql.orc.impl}} to hive does not produce ORC files that work with Hive_2.1.1 or below.* Can someone help here? Is there any other work-around available? Can this be looked into on priority? Thank you.

References:

https://spark.apache.org/docs/latest/sql-migration-guide.html (the work-around of setting spark.sql.orc.impl=hive is mentioned here, but it is not working; a session-level sketch of applying it is included after the references below):
{quote}
Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, spark.sql.orc.impl and spark.sql.orc.filterPushdown change their default values to native and true respectively. ORC files created by native ORC writer cannot be read by some old Apache Hive releases. Use spark.sql.orc.impl=hive to create the files shared with Hive 2.1.1 and older.
{quote}

https://issues.apache.org/jira/browse/SPARK-26932
https://issues.apache.org/jira/browse/HIVE-16683
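For reference, a rough sketch of applying the documented setting when the session is created, rather than via SET in spark-shell. The app name and output path below are only examples, and per the steps above this setting still does not produce files that Hive_2.1.1's orcfiledump can read.

{code}
import org.apache.spark.sql.SparkSession

// Apply the documented work-around at session-creation time instead of via SET.
val spark = SparkSession.builder()
  .appName("orc-hive-impl-workaround")   // example app name
  .config("spark.sql.orc.impl", "hive")  // work-around from the migration guide
  .getOrCreate()
import spark.implicits._

// Same two-column, one-row data as in the repro; the output path is only an example.
Seq(("col1val1", "col2val1")).toDF("col1", "col2")
  .write.format("orc")
  .save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2_builder")
{code}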