[jira] [Comment Edited] (SPARK-32234) Spark sql commands are failing on select Queries for the orc tables
[ https://issues.apache.org/jira/browse/SPARK-32234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177708#comment-17177708 ]

Ramakrishna Prasad K S edited comment on SPARK-32234 at 8/14/20, 11:22 AM:
---

Thanks [~saurabhc100]. I am going ahead and merging these changes to our product, which is on Spark_3.0. I hope there is no regression or side effects due to these changes. I just wanted to know why this bug is still in the resolved state. Is any test still pending to be run? Thank you.

was (Author: ramks): Thanks [~saurabhc100]. I am going ahead and merging these changes to my local Spark_3.0 setup. I hope there is no regression or side effects due to these changes. I just wanted to know why this bug is still in the resolved state. Is any test still pending to be run? Thank you.

> Spark sql commands are failing on select Queries for the orc tables
>
> Key: SPARK-32234
> URL: https://issues.apache.org/jira/browse/SPARK-32234
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Saurabh Chawla
> Assignee: Saurabh Chawla
> Priority: Blocker
> Fix For: 3.0.1, 3.1.0
>
> Attachments: e17f6887c06d47f6a62c0140c1ad569c_00
>
> Spark sql commands are failing on select Queries for the orc tables.
>
> Steps to reproduce:
>
> {code:java}
> val table = """CREATE TABLE `date_dim` (
>   `d_date_sk` INT,
>   `d_date_id` STRING,
>   `d_date` TIMESTAMP,
>   `d_month_seq` INT,
>   `d_week_seq` INT,
>   `d_quarter_seq` INT,
>   `d_year` INT,
>   `d_dow` INT,
>   `d_moy` INT,
>   `d_dom` INT,
>   `d_qoy` INT,
>   `d_fy_year` INT,
>   `d_fy_quarter_seq` INT,
>   `d_fy_week_seq` INT,
>   `d_day_name` STRING,
>   `d_quarter_name` STRING,
>   `d_holiday` STRING,
>   `d_weekend` STRING,
>   `d_following_holiday` STRING,
>   `d_first_dom` INT,
>   `d_last_dom` INT,
>   `d_same_day_ly` INT,
>   `d_same_day_lq` INT,
>   `d_current_day` STRING,
>   `d_current_week` STRING,
>   `d_current_month` STRING,
>   `d_current_quarter` STRING,
>   `d_current_year` STRING)
> USING orc
> LOCATION '/Users/test/tpcds_scale5data/date_dim'
> TBLPROPERTIES (
>   'transient_lastDdlTime' = '1574682806')"""
> spark.sql(table).collect
>
> val u = """select date_dim.d_date_id from date_dim limit 5"""
> spark.sql(u).collect
> {code}
>
> Exception:
>
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, 192.168.0.103, executor driver):
> java.lang.ArrayIndexOutOfBoundsException: 1
>  at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
>  at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
>  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141)
>  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203)
>  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
>  at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620)
>  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
>  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
>  at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343)
>  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895)
>  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.run(Task.scala:133)
>  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448)
> {code}
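For context on the failure mode discussed in this thread: the ArrayIndexOutOfBoundsException in initBatch comes down to how the requested columns are matched against the ORC file's physical schema. ORC files written by Hive often carry only positional field names (_col0, _col1, ...) rather than the catalog's column names, as the later comments in this thread show, so a purely name-based match can produce a bad index. Below is a minimal illustrative sketch of the two resolution strategies, not Spark's actual code; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: not Spark's implementation.
public class OrcColumnResolution {

    // True when the file carries only Hive-style positional names (_col0, _col1, ...).
    static boolean isPositional(List<String> fileFields) {
        return !fileFields.isEmpty()
            && fileFields.stream().allMatch(f -> f.matches("_col\\d+"));
    }

    // Index of each requested catalog column within the file's schema.
    static int[] resolve(List<String> fileFields, List<String> requested,
                         List<String> catalogFields) {
        // Positional file: fall back to the catalog column's position;
        // otherwise match the real field names directly.
        List<String> lookup = isPositional(fileFields) ? catalogFields : fileFields;
        return requested.stream().mapToInt(lookup::indexOf).toArray();
    }

    public static void main(String[] args) {
        // Hive-written file: physical names are positional, resolve via the catalog.
        int[] byPosition = resolve(
            Arrays.asList("_col0", "_col1", "_col2"),
            Arrays.asList("d_date_id"),
            Arrays.asList("d_date_sk", "d_date_id", "d_date"));
        System.out.println(Arrays.toString(byPosition)); // [1]

        // File written with real field names: resolve by name directly.
        int[] byName = resolve(
            Arrays.asList("d_date_sk", "d_date_id", "d_date"),
            Arrays.asList("d_date"),
            Arrays.asList("d_date_sk", "d_date_id", "d_date"));
        System.out.println(Arrays.toString(byName)); // [2]
    }
}
```

Matching by name alone against a positional schema would return -1 for every catalog column, which is the kind of bad index that surfaces as an out-of-bounds access in a batch reader.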
[jira] [Commented] (SPARK-32234) Spark sql commands are failing on select Queries for the orc tables
[ https://issues.apache.org/jira/browse/SPARK-32234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177708#comment-17177708 ]

Ramakrishna Prasad K S commented on SPARK-32234:

Thanks [~saurabhc100]. I am going ahead and merging these changes to my local Spark_3.0 setup. I hope there is no regression or side effects due to these changes. I just wanted to know why this bug is still in the resolved state. Is any test still pending to be run? Thank you.

> Spark sql commands are failing on select Queries for the orc tables
>
> The reason behind this is that initBatch is not getting the schema that is needed to find out the column value in OrcFi
[jira] [Commented] (SPARK-32234) Spark sql commands are failing on select Queries for the orc tables
[ https://issues.apache.org/jira/browse/SPARK-32234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173924#comment-17173924 ]

Ramakrishna Prasad K S commented on SPARK-32234:

[~saurabhc100], has the fix for this bug been verified? We are observing the same issue as reported here after upgrading to Spark_3.0 and would like to patch the fix into our product. Our ORC source file contains three fields, ___col1 string,_col2 string,_col3 string, and reading it fails with the below exception:

java.lang.ArrayIndexOutOfBoundsException: 1
 at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:183)
 at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:216)

Thanks and Regards,
Ramakrishna
[jira] [Comment Edited] (SPARK-32234) Spark sql commands are failing on select Queries for the orc tables
[ https://issues.apache.org/jira/browse/SPARK-32234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173924#comment-17173924 ]

Ramakrishna Prasad K S edited comment on SPARK-32234 at 8/9/20, 5:14 PM:
-

[~saurabhc100], has the fix for this bug been verified? We are observing the same issue as reported here after upgrading to Spark_3.0 and would like to patch the fix into our product. Our ORC source file contains three fields, _col1 string,_col2 string,_col3 string, and reading it fails with the below exception:

java.lang.ArrayIndexOutOfBoundsException: 1
 at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:183)
 at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:216)

Thanks and Regards,
Ramakrishna

was (Author: ramks): [~saurabhc100], has the fix for this bug been verified? We are observing the same issue as reported here after upgrading to Spark_3.0 and would like to patch the fix into our product. Our ORC source file contains three fields, ___col1 string,_col2 string,_col3 string, and reading it fails with the below exception:

java.lang.ArrayIndexOutOfBoundsException: 1
 at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:183)
 at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:216)

Thanks and Regards,
Ramakrishna
[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ramakrishna Prasad K S updated SPARK-32558:
---

Description:

Steps to reproduce the issue:
---
Download Spark_3.0 from [https://spark.apache.org/downloads.html]

Step 1) Create an ORC file using the default Spark_3.0 native API from the spark shell.

{code:java}
[linuxuser1@irlrhellinux1 bin]$ ./spark-shell
Welcome to Spark version 3.0.0
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("set spark.sql.orc.impl").show()
+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

scala> spark.sql("CREATE table df_table(col1 string,col2 string)")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into df_table values('col1val1','col2val1')")
org.apache.spark.sql.DataFrame = []

scala> val dFrame = spark.sql("select * from df_table")
dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> dFrame.show()
+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/ORC_File_Tgt1")
{code}

Step 2) Copy the ORC files created in Step (1) to HDFS /tmp on a Hadoop cluster that has Hive_2.1.1 (for example CDH_6.x) and run the following command to read the metadata from the ORC files. As you see below, it fails to fetch the metadata from the ORC file.

{code:java}
[adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/ORC_File_Tgt1/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc
Processing data file /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc [length: 414]
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
 at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
 at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
 at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
 at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
 at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
 at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
 at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
 at org.apache.orc.tools.FileDump.main(FileDump.java:154)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
{code}

Step 3) Now create an ORC file using the Hive API (as suggested by Spark in [https://spark.apache.org/docs/latest/sql-migration-guide.html], by setting spark.sql.orc.impl to hive).

{code:java}
scala> spark.sql("set spark.sql.orc.impl=hive")
res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()
+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')")
res8: org.apache.spark.sql.DataFrame = []

scala> val dFrame2 = spark.sql("select * from df_table2")
dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/ORC_File_Tgt2")
{code}

Step 4) Copy the ORC files created in Step (3) to HDFS /tmp on the same cluster and run the same command to read the metadata from the ORC files. As you see below, it fails with the same exception even after following the workaround suggested by Spark of setting spark.sql.orc.impl to hive.

{code:java}
[adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/ORC_File_Tgt2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc
Processing data file /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc [length: 414]
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
 at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
 at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
 at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
 at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
 at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
 at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
 at org.apache.
{code}
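The ArrayIndexOutOfBoundsException: 7 above is characteristic of an old reader meeting a newer writer-version id: OrcFile.WriterVersion.from in older Hive looks the id from the file footer up in a fixed table, and an id minted by a newer ORC writer falls off the end of that table (HIVE-16683, referenced later in this thread, makes the lookup tolerate unknown ids). The following is a rough illustrative sketch of the strict versus lenient lookup, not Hive's actual code; the version-name table here is a placeholder, not the exact Hive_2.1.1 contents.

```java
// Illustrative sketch, not Hive's actual code.
public class WriterVersionLookup {
    // Writer versions known to an old reader, indexed by the id stored in the
    // file footer. (Names are placeholders for what the old reader shipped with.)
    static final String[] KNOWN = {
        "ORIGINAL", "HIVE_8732", "HIVE_4243", "HIVE_12055",
        "HIVE_13083", "ORC_101", "ORC_135"
    };

    // Old behavior: direct array index, so an id written by a newer ORC
    // library (here 7) throws ArrayIndexOutOfBoundsException: 7.
    static String fromStrict(int id) {
        return KNOWN[id];
    }

    // HIVE-16683-style behavior: treat any unknown id as a future version
    // instead of failing.
    static String fromLenient(int id) {
        return (id >= 0 && id < KNOWN.length) ? KNOWN[id] : "FUTURE";
    }

    public static void main(String[] args) {
        System.out.println(fromLenient(6)); // ORC_135
        System.out.println(fromLenient(7)); // FUTURE
        try {
            fromStrict(7);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("old reader fails on unknown writer version");
        }
    }
}
```

This is why the spark.sql.orc.impl setting cannot help by itself: whichever writer implementation Spark_3.0 uses, the footer still records a writer version the Hive_2.1.1 reader does not know.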
[jira] [Comment Edited] (SPARK-32558) ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173779#comment-17173779 ]

Ramakrishna Prasad K S edited comment on SPARK-32558 at 8/9/20, 9:00 AM:
-

Hi [~dongjoon], thanks for pointing out that it is a Hive bug. I am already aware that there is a Hive bug related to this, which I have noted in the Jira description itself. But according to the latest Spark documentation, setting spark.sql.orc.impl=hive is supposed to generate ORC files that work with Hive_2.1.1 or below. That is why I raised this bug: the workaround mentioned in the Spark documentation was not working for me.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

!image-2020-08-09-14-07-00-521.png|width=725,height=91!

It is clearly stated there to use spark.sql.orc.impl=hive to create files that work with Hive_2.1.1 and older. (When Hive_2.1.1 does not have the fix for this issue, how is this a valid workaround?)

Regarding the way I am generating the ORC file, I do not agree with your comments. Please look at this closely. First I am creating a normal Spark SQL table and loading data into it with an insert query. Then I am reading all the data from the Spark SQL table into a Spark DataFrame. Finally I am writing the DataFrame content into a new ORC file (which works for all the file formats):

{code:java}
scala> spark.sql("CREATE table df_table(col1 string,col2 string)")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into df_table values('col1val1','col2val1')")
org.apache.spark.sql.DataFrame = []

scala> val dFrame = spark.sql("select * from df_table")
dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> dFrame.show()
+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
{code}

The source and target in this case are two independent entities, I believe. Once the data comes into a DataFrame, it can be written into any format such as ORC, Avro, or Parquet, irrespective of where the data in the DataFrame has come from. Please note, I am not creating an ORC Spark SQL table here and then trying to generate an ORC file from it; I am generating an ORC file using the native Spark_3.0 APIs.

scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

{color:#FF0000}/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2 is just a location on my linux machine. It is not where the df_table2 spark sql table is stored. Will update the description if it is confusing.{color}

How or from where the data comes into the DataFrame is irrelevant here, I believe; I am just loading simple string data as well. Even the below case is valid: data can be generated this way too and written into an ORC file, and the issue is observed with the file created in this case as well.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])

val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}

records.toDF().write.format("orc").save("/tmp/orc_tgt33")
{code}

We need to at least fix the Spark_3.0 documentation or explain in detail there what the purpose of the spark.sql.orc.impl=hive flag is. (If this flag cannot generate ORC files that are compatible with Hive_2.1.1 and below, then what is its use?) We are on Hive_2.1.1 in our product and we make calls to the Hive_2.1.1 ORC APIs. Regarding the tool I am using to retrieve the metadata, *hive --orcfiledump*, it internally calls the Hive_2.1.1 APIs, which is why I am using it: with it I am able to replicate the problem. I am following up with our Hadoop vendor for a back-port of HIVE-16683 to Hive_2.1.1. However, the confusing part in the Spark documentation needs to be fixed, I believe. Thank you. Ramakrishna
[jira] [Comment Edited] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173779#comment-17173779 ] Ramakrishna Prasad K S edited comment on SPARK-32558 at 8/9/20, 8:57 AM: - Hi [~dongjoon] Thanks for pointing that it is a Hive bug. I am already aware that there is a Hive bug related to this which I have put in the Jira description itself. But according to the latest spark documentation, it was mentioned that if you set spark.sql.orc.impl=hive, it would generate orc files that would work with Hive_2.1.1 or below. That is why I raised this bug because the workaround mentioned in spark documentation was not working for me. [https://spark.apache.org/docs/latest/sql-migration-guide.html] !image-2020-08-09-14-07-00-521.png|width=725,height=91! It is clearly mentioned here that use spark.sql.orc.impl=hive to create files that would work with Hive_2.1.1 and older. (when Hive_2.1.1 does not have the fix for this issue, how is this a valid workaround here??) And regarding the way I am generating the ORC file, I do not agree with your comments. Please look at this closely. First I am creating a normal spark sql table here and loading data to it using insert query. In the second half what I am doing is, reading all the data from the spark sql table and loading it into spark dataframe. 
[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna Prasad K S updated SPARK-32558: --- Attachment: image-2020-08-09-14-07-00-521.png > ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 > (work-around of using spark.sql.orc.impl=hive is also not working) > - > > Key: SPARK-32558 > URL: https://issues.apache.org/jira/browse/SPARK-32558 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. > (Linux Redhat) >Reporter: Ramakrishna Prasad K S >Priority: Major > Attachments: image-2020-08-09-14-07-00-521.png > > > Steps to reproduce the issue: > --- > Download Spark_3.0 from [https://spark.apache.org/downloads.html] > > Step 1) Create ORC File by using the default Spark_3.0 Native API from spark > shell . > {code} > [linuxuser1@irlrhellinux1 bin]$ ./spark-shell > Welcome to Spark version 3.0.0 > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191) > Type in expressions to have them evaluated. Type :help for more information. 
> scala> spark.sql("set spark.sql.orc.impl").show() > +-+ > | key| value| > +-+ > |spark.sql.orc.impl|native| > +-+ > > scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: > org.apache.spark.sql.DataFrame = [] > scala> spark.sql("insert into df_table values('col1val1','col2val1')") > org.apache.spark.sql.DataFrame = [] > scala> val dFrame = spark.sql("select * from df_table") dFrame: > org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> dFrame.show() > +-+ > | col1| col2| > +-+ > |col1val1|col2val1| > +-+ > scala> > dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table") > {code} > > Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop > cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following > command to analyze or read metadata from the ORC files. As you see below, it > fails to fetch the metadata from the ORC file. > {code} > adpqa@irlhadoop1 bug]$ hive --orcfiledump > /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc > Processing data file > /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc > [length: 414] > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7 > at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145) > at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74) > at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385) > at org.apache.orc.OrcFile.createReader(OrcFile.java:222) > at org.apache.orc.tools.FileDump.getReader(FileDump.java:255) > at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328) > at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307) > at org.apache.orc.tools.FileDump.main(FileDump.java:154) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
> {code}
> Step 3) Now create an ORC file using the Hive API (as suggested by Spark in [https://spark.apache.org/docs/latest/sql-migration-guide.html]) by setting spark.sql.orc.impl to hive.
> {code}
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-----+
> |               key|value|
> +------------------+-----+
> |spark.sql.orc.impl| hive|
> +------------------+-----+
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')")
> res8: org.apache.spark.sql.DataFrame = []
> scala> val dFrame2 = spark.sql("select * from df_table2")
> dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
> scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
> {code}
> Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following command to analyze or read metadata
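The `ArrayIndexOutOfBoundsException: 7` in the trace above is thrown while mapping the file's numeric writer-version id back to an enum constant: `OrcFile$WriterVersion.from` looks the id up in a table of versions known to the reader, and a reader built against an older ORC library has no entry for id 7, which a newer writer stamped into the file footer. A minimal Python sketch of that failure mode (the version names and ids shown are illustrative of the lookup pattern, not the actual ORC tables):

```python
# An old reader's lookup table knows writer-version ids 0..6 only; a file
# written with id 7 overruns the table, the analogue of the
# ArrayIndexOutOfBoundsException raised in OrcFile$WriterVersion.from.
OLD_READER_VERSIONS = [
    "ORIGINAL", "HIVE_8732", "HIVE_4243", "HIVE_12055",
    "HIVE_13083", "ORC_101", "ORC_135",
]  # ids 0..6 only

def writer_version_from(version_id: int) -> str:
    # Index-by-id lookup: unknown newer ids raise IndexError, Python's
    # counterpart of Java's ArrayIndexOutOfBoundsException.
    return OLD_READER_VERSIONS[version_id]

print(writer_version_from(6))   # a version the old reader knows
try:
    writer_version_from(7)      # id written by a newer ORC library
except IndexError:
    print("unknown writer version id: 7")
```

In other words, the failure is on the reading side: the file is valid, but the Hive_2.1.1 reader cannot interpret a writer version newer than itself.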
[jira] [Commented] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172805#comment-17172805 ] Ramakrishna Prasad K S commented on SPARK-32558: [~rohitmishr1484] [~hyukjin.kwon] Thanks for letting me know about setting the Priority and target version. Sorry for the same. Can someone help me with this issue? This is critical and the workaround mentioned in [https://spark.apache.org/docs/latest/sql-migration-guide.html] is also not working.
[jira] [Comment Edited] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172805#comment-17172805 ] Ramakrishna Prasad K S edited comment on SPARK-32558 at 8/7/20, 4:07 AM: - [~rohitmishr1484] [~hyukjin.kwon] Thanks for letting me know about setting the priority and target version. Sorry about the same. Can someone help me with this issue? This is critical and the workaround mentioned in [https://spark.apache.org/docs/latest/sql-migration-guide.html] is also not working. was (Author: ramks): [~rohitmishr1484] [~hyukjin.kwon] Thanks for letting me know about setting the Priority and target version. Sorry for the same. Can someone help me with this issue? This is critical and the workaround mentioned in [https://spark.apache.org/docs/latest/sql-migration-guide.html] is also not working.
[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna Prasad K S updated SPARK-32558:
---
Description:
Steps to reproduce the issue:
---
Download Spark_3.0 from [https://spark.apache.org/downloads.html]

Step 1) Create an ORC file using the default Spark_3.0 native API from the spark shell.

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell
Welcome to Spark version 3.0.0
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated. Type :help for more information.

scala> spark.sql("set spark.sql.orc.impl").show()
+------------------+--------+
|               key|   value|
+------------------+--------+
|spark.sql.orc.impl|*native*|
+------------------+--------+

scala> spark.sql("CREATE table df_table(col1 string,col2 string)")
res1: org.apache.spark.sql.DataFrame = []
scala> spark.sql("insert into df_table values('col1val1','col2val1')")
org.apache.spark.sql.DataFrame = []
scala> val dFrame = spark.sql("select * from df_table")
dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> dFrame.show()
+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+
scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following command to analyze or read metadata from the ORC files. As you see below, it fails to fetch the metadata from the ORC file.
[adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc
Processing data file /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc [length: 414]
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
at org.apache.orc.tools.FileDump.main(FileDump.java:154)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

Step 3) Now create an ORC file using the Hive API (as suggested by Spark in [https://spark.apache.org/docs/latest/sql-migration-guide.html]) by setting spark.sql.orc.impl to hive.

scala> spark.sql("set spark.sql.orc.impl=hive")
res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> spark.sql("set spark.sql.orc.impl").show()
+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|*hive*|
+------------------+------+
scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
20/08/04 22:43:26 WARN HiveMetaStore: Location: file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2 specified for non-external table:df_table2
res5: org.apache.spark.sql.DataFrame = []
scala> spark.sql("insert into df_table2
values('col1val1','col2val1')")
res8: org.apache.spark.sql.DataFrame = []
scala> val dFrame2 = spark.sql("select * from df_table2")
dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following command to analyze or read metadata from the ORC files. As you see below, it fails with the same exception while fetching the metadata, even after following the workaround suggested by Spark of setting spark.sql.orc.impl to hive.

[adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc
Processing data file /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc [length: 414]
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
at org.apache.
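The "[length: 414]" reported by the dump before it crashes can be cross-checked without Hive at all: per the ORC file-format specification, an ORC file begins with the 3-byte magic `ORC` and its very last byte gives the length of the PostScript section, which is where the writer version that the old reader cannot map is stored. A hedged helper sketch (`inspect_orc_tail` is an illustrative name, not a real tool):

```python
import os

def inspect_orc_tail(path):
    """Return (has_orc_magic, postscript_length) for a candidate ORC file.

    Per the ORC spec: the file starts with the magic bytes b"ORC", and the
    final byte is the PostScript length; the PostScript carries the
    writerVersion field that an older reader fails to recognize.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        magic = f.read(3)                 # header magic at offset 0
        f.seek(-1, os.SEEK_END)
        ps_len = f.read(1)[0]             # last byte = PostScript length
    # A plausible ORC file has the magic and a PostScript that fits in it.
    return magic == b"ORC", ps_len if ps_len < size else None
```

Running this over the part files before copying them to the Hive cluster at least confirms they are structurally ORC; it cannot, of course, tell whether the destination reader knows the writer version recorded inside the PostScript.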
[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna Prasad K S updated SPARK-32558:
---
Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. (was: Spark 3.0 on Linux and Hadoop cluster having Hive_2.1.1 version.)

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version.
> Reporter: Ramakrishna Prasad K S
> Priority: Blocker
> Fix For: 3.0.0
[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna Prasad K S updated SPARK-32558:
---
Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. (Linux Redhat) (was: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version.)
[jira] [Updated] (SPARK-32558) ORC target files that Spark 3.0 produces do not work with Hive 2.1.1 (the workaround of setting spark.sql.orc.impl=hive also does not work)
[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna Prasad K S updated SPARK-32558: --- Description: Steps to reproduce the issue: --- Download Spark_3.0 from [https://spark.apache.org/downloads.html] Step 1) Create ORC File by using the default Spark_3.0 Native API from spark_3.0 spark shell . Launch Spark Shell: [linuxuser1@irlrhellinux1 bin]$ ./spark-shell Welcome to Spark version 3.0.0 Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191) Type in expressions to have them evaluated. Type :help for more information. scala> spark.sql("set spark.sql.orc.impl").show() -+ | key| value| + |spark.sql.orc.impl|*native*| -+ scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: org.apache.spark.sql.DataFrame = [] scala> spark.sql("insert into df_table values('col1val1','col2val1')") org.apache.spark.sql.DataFrame = [] scala> val dFrame = spark.sql("select * from df_table") dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string] scala> dFrame.show() + | col1| col2| + |col1val1|col2val1| -+ scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table") Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following command to analyze or read metadata from the ORC files. As you see below, it fails to fetch the metadata from the ORC file. 
adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc Processing data file /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc [length: 414] Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7 at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145) at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74) at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385) at org.apache.orc.OrcFile.createReader(OrcFile.java:222) at org.apache.orc.tools.FileDump.getReader(FileDump.java:255) at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328) at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307) at org.apache.orc.tools.FileDump.main(FileDump.java:154) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.RunJar.run(RunJar.java:313) at org.apache.hadoop.util.RunJar.main(RunJar.java:227) Step 3) Now Create ORC File using the Hive API (as suggested by Spark in [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting spark.sql.orc.impl as hive) scala> spark.sql("set spark.sql.orc.impl=hive") res6: org.apache.spark.sql.DataFrame = [key: string, value: string] scala> spark.sql("set spark.sql.orc.impl").show() --+ | key|value| +--- |spark.sql.orc.impl| *hive*| +-- scala> spark.sql("CREATE table df_table2(col1 string,col2 string)") 20/08/04 22:43:26 WARN HiveMetaStore: Location: [file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2] specified for non-external table:df_table2 res5: org.apache.spark.sql.DataFrame = [] scala> spark.sql("insert into df_table2 
values('col1val1','col2val1')") res8: org.apache.spark.sql.DataFrame = [] scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string] scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2") Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following command to analyze or read metadata from the ORC files. As you see below, it fails with the same exception to fetch the metadata even after following the workaround suggested by spark to set spark.sql.orc.impl to hive [adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc Processing data file /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc [length: 414] Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7 at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145) at org.apache.orc.impl.OrcTail.getWriterVersion(Or
[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna Prasad K S updated SPARK-32558: --- Description: Steps to reproduce the issue: --- Download Spark_3.0 from [https://spark.apache.org/downloads.html] Step 1) Create ORC File by using the default Spark_3.0 Native API from the spark shell . [linuxuser1@irlrhellinux1 bin]$ ./spark-shell Welcome to Spark version 3.0.0 Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191) Type in expressions to have them evaluated. Type :help for more information. scala> spark.sql("set spark.sql.orc.impl").show() -+ | key| value| + |spark.sql.orc.impl|*native*| -+ scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: org.apache.spark.sql.DataFrame = [] scala> spark.sql("insert into df_table values('col1val1','col2val1')") org.apache.spark.sql.DataFrame = [] scala> val dFrame = spark.sql("select * from df_table") dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string] scala> dFrame.show() + | col1| col2| + |col1val1|col2val1| -+ scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table") Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following command to analyze or read metadata from the ORC files. As you see below, it fails to fetch the metadata from the ORC file. 
adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc Processing data file /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc [length: 414] Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7 at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145) at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74) at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385) at org.apache.orc.OrcFile.createReader(OrcFile.java:222) at org.apache.orc.tools.FileDump.getReader(FileDump.java:255) at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328) at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307) at org.apache.orc.tools.FileDump.main(FileDump.java:154) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.RunJar.run(RunJar.java:313) at org.apache.hadoop.util.RunJar.main(RunJar.java:227) Step 3) Now Create ORC File using the Hive API (as suggested by Spark in [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting spark.sql.orc.impl as hive) scala> spark.sql("set spark.sql.orc.impl=hive") res6: org.apache.spark.sql.DataFrame = [key: string, value: string] scala> spark.sql("set spark.sql.orc.impl").show() --+ | key|value| +--- |spark.sql.orc.impl| *hive*| +-- scala> spark.sql("CREATE table df_table2(col1 string,col2 string)") 20/08/04 22:43:26 WARN HiveMetaStore: Location: [file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2] specified for non-external table:df_table2 res5: org.apache.spark.sql.DataFrame = [] scala> spark.sql("insert into df_table2 
values('col1val1','col2val1')") res8: org.apache.spark.sql.DataFrame = [] scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string] scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2") Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following command to analyze or read metadata from the ORC files. As you see below, it fails with the same exception to fetch the metadata even after following the workaround suggested by spark to set spark.sql.orc.impl to hive [adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc Processing data file /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc [length: 414] Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7 at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145) at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74) at org.a
[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna Prasad K S updated SPARK-32558: --- Description: Steps to reproduce the issue: --- Download Spark_3.0 [https://spark.apache.org/downloads.html] Step 1) Create ORC File by using the default Spark_3.0 Native API from spark_3.0 spark shell . Launch Spark Shell: [linuxuser1@irlrhellinux1 bin]$ ./spark-shell Welcome to Spark version 3.0.0 Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191) Type in expressions to have them evaluated. Type :help for more information. scala> spark.sql("set spark.sql.orc.impl").show() -+ | key| value| + |spark.sql.orc.impl|*native*| -+ scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: org.apache.spark.sql.DataFrame = [] scala> spark.sql("insert into df_table values('col1val1','col2val1')") org.apache.spark.sql.DataFrame = [] scala> val dFrame = spark.sql("select * from df_table") dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string] scala> dFrame.show() + | col1| col2| + |col1val1|col2val1| -+ scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table") Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following command to analyze or read metadata from the ORC files. As you see below, it fails to fetch the metadata from the ORC file. 
adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc Processing data file /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc [length: 414] Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7 at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145) at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74) at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385) at org.apache.orc.OrcFile.createReader(OrcFile.java:222) at org.apache.orc.tools.FileDump.getReader(FileDump.java:255) at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328) at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307) at org.apache.orc.tools.FileDump.main(FileDump.java:154) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.RunJar.run(RunJar.java:313) at org.apache.hadoop.util.RunJar.main(RunJar.java:227) Step 3) Now Create ORC File using the Hive API (as suggested by Spark in [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting spark.sql.orc.impl as hive) scala> spark.sql("set spark.sql.orc.impl=hive") res6: org.apache.spark.sql.DataFrame = [key: string, value: string] scala> spark.sql("set spark.sql.orc.impl").show() --+ | key|value| +--- |spark.sql.orc.impl| *hive*| +-- scala> spark.sql("CREATE table df_table2(col1 string,col2 string)") 20/08/04 22:43:26 WARN HiveMetaStore: Location: [file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2] specified for non-external table:df_table2 res5: org.apache.spark.sql.DataFrame = [] scala> spark.sql("insert into df_table2 
values('col1val1','col2val1')") res8: org.apache.spark.sql.DataFrame = [] scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string] scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2") Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following command to analyze or read metadata from the ORC files. As you see below, it fails with the same exception to fetch the metadata even after following the workaround suggested by spark to set spark.sql.orc.impl to hive [adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc Processing data file /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc [length: 414] Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7 at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145) at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTai
[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna Prasad K S updated SPARK-32558: --- Description: Steps to reproduce the issue: --- Download Spark_3.0 on Linux: Download Spark_3.0 on Linux: [https://spark.apache.org/downloads.html] Step 1) Create ORC File by using the default Spark_3.0 Native API from spark_3.0 spark shell . Launch Spark Shell: [linuxuser1@irlrhellinux1 bin]$ ./spark-shell Welcome to Spark version 3.0.0 Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191) Type in expressions to have them evaluated. Type :help for more information. scala> spark.sql("set spark.sql.orc.impl").show() -+ | key| value| + |spark.sql.orc.impl|*native*| -+ scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: org.apache.spark.sql.DataFrame = [] scala> spark.sql("insert into df_table values('col1val1','col2val1')") org.apache.spark.sql.DataFrame = [] scala> val dFrame = spark.sql("select * from df_table") dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string] scala> dFrame.show() + | col1| col2| + |col1val1|col2val1| -+ scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table") Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following command to analyze or read metadata from the ORC files. As you see below, it fails to fetch the metadata from the ORC file. 
adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc Processing data file /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc [length: 414] Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7 at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145) at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74) at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385) at org.apache.orc.OrcFile.createReader(OrcFile.java:222) at org.apache.orc.tools.FileDump.getReader(FileDump.java:255) at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328) at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307) at org.apache.orc.tools.FileDump.main(FileDump.java:154) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.RunJar.run(RunJar.java:313) at org.apache.hadoop.util.RunJar.main(RunJar.java:227) Step 3) Now Create ORC File using the Hive API (as suggested by Spark in [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting spark.sql.orc.impl as hive) scala> spark.sql("set spark.sql.orc.impl=hive") res6: org.apache.spark.sql.DataFrame = [key: string, value: string] scala> spark.sql("set spark.sql.orc.impl").show() --+ | key|value| +--- |spark.sql.orc.impl| *hive*| +-- scala> spark.sql("CREATE table df_table2(col1 string,col2 string)") 20/08/04 22:43:26 WARN HiveMetaStore: Location: [file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2] specified for non-external table:df_table2 res5: org.apache.spark.sql.DataFrame = [] scala> spark.sql("insert into df_table2 
values('col1val1','col2val1')") res8: org.apache.spark.sql.DataFrame = [] scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string] scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2") Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following command to analyze or read metadata from the ORC files. As you see below, it fails with the same exception to fetch the metadata even after following the workaround suggested by spark to set spark.sql.orc.impl to hive [adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc Processing data file /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc [length: 414] Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7 at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145) at org.apache.o
[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna Prasad K S updated SPARK-32558: --- Description: Steps to reproduce the issue: --- Download Spark_3.0 on Linux: Download Spark_3.0 on Linux: [https://spark.apache.org/downloads.html] Step 1) Create ORC File by using the default Spark_3.0 Native API from spark_3.0 spark shell . Launch Spark Shell: [linuxuser1@irlrhellinux1 bin]$ ./spark-shell Welcome to Spark version 3.0.0 Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191) Type in expressions to have them evaluated. Type :help for more information. scala> spark.sql("set spark.sql.orc.impl").show() -+ | key| value| + |spark.sql.orc.impl|*native*| -+ scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: org.apache.spark.sql.DataFrame = [] scala> spark.sql("insert into df_table values('col1val1','col2val1')") org.apache.spark.sql.DataFrame = [] scala> val dFrame = spark.sql("select * from df_table") dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string] scala> dFrame.show() + | col1| col2| + |col1val1|col2val1| -+ scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table") Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following command to analyze or read metadata from the ORC files. As you see below, it fails to fetch the metadata from the ORC file. 
adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc Processing data file /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc [length: 414] Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7 at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145) at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74) at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385) at org.apache.orc.OrcFile.createReader(OrcFile.java:222) at org.apache.orc.tools.FileDump.getReader(FileDump.java:255) at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328) at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307) at org.apache.orc.tools.FileDump.main(FileDump.java:154) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.RunJar.run(RunJar.java:313) at org.apache.hadoop.util.RunJar.main(RunJar.java:227) Step 3) Now Create ORC File using the Hive API (as suggested by Spark in [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting spark.sql.orc.impl as hive) scala> spark.sql("set spark.sql.orc.impl=hive") res6: org.apache.spark.sql.DataFrame = [key: string, value: string] scala> spark.sql("set spark.sql.orc.impl").show() --+ | key|value| +--- |spark.sql.orc.impl| *hive*| +-- scala> spark.sql("CREATE table df_table2(col1 string,col2 string)") 20/08/04 22:43:26 WARN HiveMetaStore: Location: [file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2] specified for non-external table:df_table2 res5: org.apache.spark.sql.DataFrame = [] scala> spark.sql("insert into df_table2 
values('col1val1','col2val1')") res8: org.apache.spark.sql.DataFrame = [] scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string] scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2") Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following command to analyze or read metadata from the ORC files. As you see below, it fails with the same exception to fetch the metadata even after following the workaround suggested by spark to set spark.sql.orc.impl to hive [adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc Processing data file /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc [length: 414] Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7 at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145) at org.apache.o
[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna Prasad K S updated SPARK-32558: --- Description: Steps to reproduce the issue: --- Download Spark_3.0 on Linux: Download Spark_3.0 on Linux: [https://spark.apache.org/downloads.html] Step 1) Create ORC File by using the default Spark_3.0 Native API from spark_3.0 spark shell . Launch Spark Shell: [linuxuser1@irlrhellinux1 bin]$ ./spark-shell Welcome to Spark version 3.0.0 Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191) Type in expressions to have them evaluated. Type :help for more information. scala> spark.sql("set spark.sql.orc.impl").show() -+ | key| value| + |spark.sql.orc.impl|native| -+ scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: org.apache.spark.sql.DataFrame = [] scala> spark.sql("insert into df_table values('col1val1','col2val1')") org.apache.spark.sql.DataFrame = [] scala> val dFrame = spark.sql("select * from df_table") dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string] scala> dFrame.show() + | col1| col2| + |col1val1|col2val1| -+ scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table") Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following command to analyze or read metadata from the ORC files. As you see below, it fails to fetch the metadata from the ORC file. 
adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc Processing data file /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc [length: 414] Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7 at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145) at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74) at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385) at org.apache.orc.OrcFile.createReader(OrcFile.java:222) at org.apache.orc.tools.FileDump.getReader(FileDump.java:255) at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328) at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307) at org.apache.orc.tools.FileDump.main(FileDump.java:154) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.RunJar.run(RunJar.java:313) at org.apache.hadoop.util.RunJar.main(RunJar.java:227) Step 3) Now Create ORC File using the Hive API (as suggested by Spark in [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting spark.sql.orc.impl as hive) scala> spark.sql("set spark.sql.orc.impl=hive") res6: org.apache.spark.sql.DataFrame = [key: string, value: string] scala> spark.sql("set spark.sql.orc.impl").show() ---+ | key|value| + |spark.sql.orc.impl| hive| +-- scala> spark.sql("CREATE table df_table2(col1 string,col2 string)") 20/08/04 22:43:26 WARN HiveMetaStore: Location: [file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2] specified for non-external table:df_table2 res5: org.apache.spark.sql.DataFrame = [] scala> spark.sql("insert into df_table2 
values('col1val1','col2val1')") res8: org.apache.spark.sql.DataFrame = [] scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string] scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2") Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following command to analyze or read metadata from the ORC files. As you see below, it fails with the same exception to fetch the metadata even after following the workaround suggested by spark to set spark.sql.orc.impl to hive [adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc Processing data file /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc [length: 414] Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7 at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145) at org.apache.orc
[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
[ https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramakrishna Prasad K S updated SPARK-32558:
---
Description:

Steps to reproduce the issue:

Download Spark_3.0 on Linux: [https://spark.apache.org/downloads.html]

Step 1) Create an ORC file using the default Spark_3.0 native ORC implementation from the spark_3.0 spark-shell.

Launch the Spark shell:

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell
Welcome to Spark version 3.0.0
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("set spark.sql.orc.impl").show()
+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

scala> spark.sql("CREATE table df_table(col1 string,col2 string)")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into df_table values('col1val1','col2val1')")
res2: org.apache.spark.sql.DataFrame = []

scala> val dFrame = spark.sql("select * from df_table")
dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> dFrame.show()
+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

scala> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

Step 2) Copy the ORC files created in Step (1) to HDFS /tmp on a Hadoop cluster that has Hive_2.1.1 (for example CDH_6.x) and run the following command to read the metadata from the ORC file. As you can see below, it fails to fetch the metadata:

[adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc
Processing data file /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc [length: 414]
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
at org.apache.orc.tools.FileDump.main(FileDump.java:154)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

Step 3) Now create an ORC file using the Hive implementation (as suggested by Spark in [https://spark.apache.org/docs/latest/sql-migration-guide.html]) by setting spark.sql.orc.impl to hive:

scala> spark.sql("set spark.sql.orc.impl=hive")
res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()
+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
20/08/04 22:43:26 WARN HiveMetaStore: Location: file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2 specified for non-external table:df_table2
res5: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')")
res8: org.apache.spark.sql.DataFrame = []

scala> val dFrame2 = spark.sql("select * from df_table2")
dFrame2: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

Step 4) Copy the ORC files created in Step (3) to HDFS /tmp on the same Hadoop cluster and run the same command to read the metadata. As you can see below, it fails with the same exception even after following the workaround suggested by Spark of setting spark.sql.orc.impl to hive:

[adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc
Processing data file /tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc [length: 414]
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
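Both traces fail inside org.apache.orc.OrcFile$WriterVersion.from. The ArrayIndexOutOfBoundsException: 7 is consistent with a lookup table indexed directly by the writer-version id recorded in the file: Spark_3.0 bundles a newer ORC writer whose version id (7) lies past the end of the table known to the older ORC reader shipped with Hive_2.1.1. The following is a minimal, self-contained Java sketch of that failure shape only; the version names and ids here are illustrative, not the actual ORC source.

```java
// Illustrative sketch of the failure mode, NOT the actual ORC source.
// An older reader resolves the writer-version id from the file footer by
// indexing a fixed table of the versions it knows about; any id written
// by a newer writer falls past the end of that table.
public class WriterVersionSketch {
    // Hypothetical table: pretend this reader only knows ids 0..6.
    static final String[] KNOWN = {
        "ORIGINAL", "HIVE_8732", "HIVE_4243", "HIVE_12055",
        "HIVE_13083", "ORC_101", "ORC_135"
    };

    // Mirrors the shape of a direct table lookup in an old reader.
    static String from(int id) {
        return KNOWN[id]; // ArrayIndexOutOfBoundsException for id >= 7
    }

    public static void main(String[] args) {
        System.out.println(from(6)); // a version this reader knows
        try {
            from(7); // id stamped by a newer writer
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unknown writer version id: " + e.getMessage());
        }
    }
}
```

Direct indexing like this is fragile by construction: it assumes no file will ever carry a version id newer than the reader itself, which is exactly the assumption a Spark_3.0 writer breaks for a Hive_2.1.1 reader.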
[jira] [Created] (SPARK-32558) ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
Ramakrishna Prasad K S created SPARK-32558:
---
Summary: ORC target files that Spark_3.0 produces do not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)
Key: SPARK-32558
URL: https://issues.apache.org/jira/browse/SPARK-32558
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.0
Environment: Spark 3.0 on Linux and a Hadoop cluster having Hive_2.1.1.
Reporter: Ramakrishna Prasad K S
Fix For: 3.0.0

Steps to reproduce: see the description above.
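For completeness, newer ORC readers avoid this class of crash by treating any writer-version id they do not recognize as coming from a future writer, rather than indexing the table blindly. The sketch below shows that forward-compatible shape; as above, the names, ids, and the FUTURE sentinel string are illustrative assumptions, not the actual ORC source.

```java
// Hedged sketch of a forward-compatible writer-version lookup, in the
// spirit of newer ORC readers: ids the reader does not know resolve to
// a FUTURE sentinel instead of raising ArrayIndexOutOfBoundsException,
// so metadata of files from newer writers can still be read.
public class FutureVersionSketch {
    // Hypothetical table of the versions this reader knows (ids 0..6).
    static final String[] KNOWN = {
        "ORIGINAL", "HIVE_8732", "HIVE_4243", "HIVE_12055",
        "HIVE_13083", "ORC_101", "ORC_135"
    };

    static String from(int id) {
        // Treat anything outside the known range as a future writer.
        if (id < 0 || id >= KNOWN.length) {
            return "FUTURE";
        }
        return KNOWN[id];
    }

    public static void main(String[] args) {
        System.out.println(from(6)); // known id resolves normally
        System.out.println(from(7)); // unknown id degrades gracefully
    }
}
```

With this shape, a Hive_2.1.1-era tool running the lookup on a Spark_3.0 file would report an unrecognized (future) writer instead of aborting with ArrayIndexOutOfBoundsException: 7.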