[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ramakrishna Prasad K S updated SPARK-32558:
-------------------------------------------
    Description: 
Steps to reproduce the issue:

------------------------------- 

 

Download Spark 3.0 from [https://spark.apache.org/downloads.html]

 

Step 1) Create an ORC file using the default Spark 3.0 native ORC writer from the spark-shell.

 

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

 

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table values('col1val1','col2val1')")

org.apache.spark.sql.DataFrame = []

 

scala> val dFrame = spark.sql("select * from df_table")
dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

 

scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

 

Step 2) Copy the ORC files created in Step 1 to HDFS /tmp on a Hadoop cluster that has Hive 2.1.1 (for example, CDH 6.x), then run the following command to read the metadata from the ORC file. As shown below, it fails to fetch the metadata.
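(The copy itself is not shown in the transcript. Assuming the part files from Step 1 have already been transferred to the Hadoop edge node, a typical sequence would look like the following; the HDFS target path mirrors the one used in the dump command below.)

[adpqa@irlhadoop1 bug]$ hdfs dfs -mkdir -p /tmp/df_table
[adpqa@irlhadoop1 bug]$ hdfs dfs -put part-00000-*.snappy.orc /tmp/df_table/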

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc
Processing data file /tmp/df_table/part-00000-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc [length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

 

Step 3) Now create the ORC file using the Hive ORC implementation, as suggested by Spark in [https://spark.apache.org/docs/latest/sql-migration-guide.html], by setting spark.sql.orc.impl to hive. (A sketch of passing this setting at shell launch follows.)
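(Equivalently, a minimal sketch: the option can be supplied when launching the shell, or set through the session configuration, rather than via the SQL SET command used below.)

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell --conf spark.sql.orc.impl=hive

scala> spark.conf.set("spark.sql.orc.impl", "hive")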

 

scala> spark.sql("set spark.sql.orc.impl=hive")

res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

 

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

20/08/04 22:43:26 WARN HiveMetaStore: Location: file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2 specified for non-external table:df_table2
res5: org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []

 

scala> val dFrame2 = spark.sql("select * from df_table2")
dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

 

Step 4) Copy the ORC files created in Step 3 to HDFS /tmp on the same Hadoop cluster (Hive 2.1.1, for example CDH 6.x) and run the following command to read the metadata. As shown below, it fails with the same exception even after following the workaround suggested by Spark (setting spark.sql.orc.impl to hive).

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc

Processing data file 
/tmp/df_table2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
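(For comparison, the metadata of the same file can also be dumped with a standalone orc-tools build that matches the newer writer. This was not part of the original run, and the jar version below is illustrative only.)

[adpqa@irlhadoop1 bug]$ java -jar orc-tools-1.5.10-uber.jar meta /tmp/df_table2/part-00000-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc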

 

*Note: the same metadata fetch works fine from Hive 2.3 and later versions.*

*So the main concern here is that setting spark.sql.orc.impl to hive does not produce ORC files that work with Hive 2.1.1 or below. Can someone help here? Is there any other workaround available? Can this be looked into with priority? Thank you.*
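(A quick sanity check, not shown in the transcript above: reading the files back through Spark's own ORC reader. If this succeeds, as expected, it points at the ORC reader bundled with older Hive rather than at the files themselves. Sketch, using the Step 1 output path:)

scala> spark.read.format("orc").load("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table").show()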

 

References:
[https://spark.apache.org/docs/latest/sql-migration-guide.html] (the workaround of setting spark.sql.orc.impl=hive is mentioned here, but it does not work):

"Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, spark.sql.orc.impl and spark.sql.orc.filterPushdown change their default values to native and true respectively. ORC files created by native ORC writer cannot be read by some old Apache Hive releases. Use spark.sql.orc.impl=hive to create the files shared with Hive 2.1.1 and older."

https://issues.apache.org/jira/browse/SPARK-26932

https://issues.apache.org/jira/browse/HIVE-16683


> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32558
>                 URL: https://issues.apache.org/jira/browse/SPARK-32558
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>         Environment: Spark 3.0 on Linux and Hadoop cluster having Hive_2.1.1 
> version.
>            Reporter: Ramakrishna Prasad K S
>            Priority: Blocker
>             Fix For: 3.0.0
>



