[jira] [Comment Edited] (SPARK-32234) Spark sql commands are failing on select Queries for the orc tables

2020-08-14 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177708#comment-17177708
 ] 

Ramakrishna Prasad K S edited comment on SPARK-32234 at 8/14/20, 11:22 AM:
---

Thanks [~saurabhc100]. I am going ahead and merging these changes into our 
product, which is on Spark_3.0. I hope there are no regressions or side effects 
due to these changes. I just wanted to know why this bug is still in the 
resolved state. Is any test still pending to be run? Thank you.


was (Author: ramks):
Thanks [~saurabhc100] I am going ahead and merging these changes to my local 
Spark_3.0 setup. I hope there is no regression or side effects due to these 
changes. Just wanted to know why this bug is still in resolved state. Is any 
test still pending to be run? Thank you.

> Spark sql commands are failing on select Queries for the  orc tables
> 
>
> Key: SPARK-32234
> URL: https://issues.apache.org/jira/browse/SPARK-32234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Saurabh Chawla
>Assignee: Saurabh Chawla
>Priority: Blocker
> Fix For: 3.0.1, 3.1.0
>
> Attachments: e17f6887c06d47f6a62c0140c1ad569c_00
>
>
> Spark sql commands are failing on select Queries for the orc tables
> Steps to reproduce
>  
> {code:java}
> val table = """CREATE TABLE `date_dim` (
>   `d_date_sk` INT,
>   `d_date_id` STRING,
>   `d_date` TIMESTAMP,
>   `d_month_seq` INT,
>   `d_week_seq` INT,
>   `d_quarter_seq` INT,
>   `d_year` INT,
>   `d_dow` INT,
>   `d_moy` INT,
>   `d_dom` INT,
>   `d_qoy` INT,
>   `d_fy_year` INT,
>   `d_fy_quarter_seq` INT,
>   `d_fy_week_seq` INT,
>   `d_day_name` STRING,
>   `d_quarter_name` STRING,
>   `d_holiday` STRING,
>   `d_weekend` STRING,
>   `d_following_holiday` STRING,
>   `d_first_dom` INT,
>   `d_last_dom` INT,
>   `d_same_day_ly` INT,
>   `d_same_day_lq` INT,
>   `d_current_day` STRING,
>   `d_current_week` STRING,
>   `d_current_month` STRING,
>   `d_current_quarter` STRING,
>   `d_current_year` STRING)
> USING orc
> LOCATION '/Users/test/tpcds_scale5data/date_dim'
> TBLPROPERTIES (
>   'transient_lastDdlTime' = '1574682806')"""
> spark.sql(table).collect
> val u = """select date_dim.d_date_id from date_dim limit 5"""
> spark.sql(u).collect
> {code}
>  
>  
> Exception
>  
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 
> (TID 2, 192.168.0.103, executor driver): 
> java.lang.ArrayIndexOutOfBoundsException: 1
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:133)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448)
>

[jira] [Commented] (SPARK-32234) Spark sql commands are failing on select Queries for the orc tables

2020-08-14 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177708#comment-17177708
 ] 

Ramakrishna Prasad K S commented on SPARK-32234:


Thanks [~saurabhc100]. I am going ahead and merging these changes to my local 
Spark_3.0 setup. I hope there are no regressions or side effects due to these 
changes. I just wanted to know why this bug is still in the resolved state. Is 
any test still pending to be run? Thank you.

> Spark sql commands are failing on select Queries for the  orc tables
> 
>
> Key: SPARK-32234
> URL: https://issues.apache.org/jira/browse/SPARK-32234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Saurabh Chawla
>Assignee: Saurabh Chawla
>Priority: Blocker
> Fix For: 3.0.1, 3.1.0
>
> Attachments: e17f6887c06d47f6a62c0140c1ad569c_00
>
>
> Spark sql commands are failing on select Queries for the orc tables
> Steps to reproduce
>  
> {code:java}
> val table = """CREATE TABLE `date_dim` (
>   `d_date_sk` INT,
>   `d_date_id` STRING,
>   `d_date` TIMESTAMP,
>   `d_month_seq` INT,
>   `d_week_seq` INT,
>   `d_quarter_seq` INT,
>   `d_year` INT,
>   `d_dow` INT,
>   `d_moy` INT,
>   `d_dom` INT,
>   `d_qoy` INT,
>   `d_fy_year` INT,
>   `d_fy_quarter_seq` INT,
>   `d_fy_week_seq` INT,
>   `d_day_name` STRING,
>   `d_quarter_name` STRING,
>   `d_holiday` STRING,
>   `d_weekend` STRING,
>   `d_following_holiday` STRING,
>   `d_first_dom` INT,
>   `d_last_dom` INT,
>   `d_same_day_ly` INT,
>   `d_same_day_lq` INT,
>   `d_current_day` STRING,
>   `d_current_week` STRING,
>   `d_current_month` STRING,
>   `d_current_quarter` STRING,
>   `d_current_year` STRING)
> USING orc
> LOCATION '/Users/test/tpcds_scale5data/date_dim'
> TBLPROPERTIES (
>   'transient_lastDdlTime' = '1574682806')"""
> spark.sql(table).collect
> val u = """select date_dim.d_date_id from date_dim limit 5"""
> spark.sql(u).collect
> {code}
>  
>  
> Exception
>  
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 
> (TID 2, 192.168.0.103, executor driver): 
> java.lang.ArrayIndexOutOfBoundsException: 1
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:133)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  
> The reason behind this is that initBatch is not getting the schema that is needed 
> to find out the column value in OrcFi

[jira] [Commented] (SPARK-32234) Spark sql commands are failing on select Queries for the orc tables

2020-08-09 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173924#comment-17173924
 ] 

Ramakrishna Prasad K S commented on SPARK-32234:


[~saurabhc100], has the fix for this bug been verified? We are observing the 
same issue as reported here after we upgraded to Spark_3.0 and would like to 
patch the fix into our product.

Our ORC source file contains three fields, _col1 string, _col2 string, _col3 
string, and reading it fails with the exception below:

java.lang.ArrayIndexOutOfBoundsException: 1
 at 
org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:183)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:216)
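
For context, the failing read on our side is essentially a plain ORC load along 
the lines of the sketch below (a minimal sketch only; the path is a placeholder 
and this is not our exact pipeline):
{code:java}
// Hypothetical minimal read of the problematic file (placeholder path).
// In our environment a read like this ends up in
// OrcColumnarBatchReader.initBatch and hits the
// ArrayIndexOutOfBoundsException shown above.
val df = spark.read.orc("/path/to/our/orc_source")
df.show(5)
{code}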

Thanks and Regards,

Ramakrishna

> Spark sql commands are failing on select Queries for the  orc tables
> 
>
> Key: SPARK-32234
> URL: https://issues.apache.org/jira/browse/SPARK-32234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Saurabh Chawla
>Assignee: Saurabh Chawla
>Priority: Blocker
> Fix For: 3.0.1, 3.1.0
>
> Attachments: e17f6887c06d47f6a62c0140c1ad569c_00
>
>
> Spark sql commands are failing on select Queries for the orc tables
> Steps to reproduce
>  
> {code:java}
> val table = """CREATE TABLE `date_dim` (
>   `d_date_sk` INT,
>   `d_date_id` STRING,
>   `d_date` TIMESTAMP,
>   `d_month_seq` INT,
>   `d_week_seq` INT,
>   `d_quarter_seq` INT,
>   `d_year` INT,
>   `d_dow` INT,
>   `d_moy` INT,
>   `d_dom` INT,
>   `d_qoy` INT,
>   `d_fy_year` INT,
>   `d_fy_quarter_seq` INT,
>   `d_fy_week_seq` INT,
>   `d_day_name` STRING,
>   `d_quarter_name` STRING,
>   `d_holiday` STRING,
>   `d_weekend` STRING,
>   `d_following_holiday` STRING,
>   `d_first_dom` INT,
>   `d_last_dom` INT,
>   `d_same_day_ly` INT,
>   `d_same_day_lq` INT,
>   `d_current_day` STRING,
>   `d_current_week` STRING,
>   `d_current_month` STRING,
>   `d_current_quarter` STRING,
>   `d_current_year` STRING)
> USING orc
> LOCATION '/Users/test/tpcds_scale5data/date_dim'
> TBLPROPERTIES (
>   'transient_lastDdlTime' = '1574682806')"""
> spark.sql(table).collect
> val u = """select date_dim.d_date_id from date_dim limit 5"""
> spark.sql(u).collect
> {code}
>  
>  
> Exception
>  
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 
> (TID 2, 192.168.0.103, executor driver): 
> java.lang.ArrayIndexOutOfBoundsException: 1
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:133)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.s

[jira] [Comment Edited] (SPARK-32234) Spark sql commands are failing on select Queries for the orc tables

2020-08-09 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173924#comment-17173924
 ] 

Ramakrishna Prasad K S edited comment on SPARK-32234 at 8/9/20, 5:14 PM:
-

[~saurabhc100], has the fix for this bug been verified? We are observing the 
same issue as reported here after we upgraded to Spark_3.0 and would like to 
patch the fix into our product.

Our ORC source file contains three fields, _col1 string, _col2 string, _col3 
string, and reading it fails with the exception below:

java.lang.ArrayIndexOutOfBoundsException: 1
 at 
org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:183)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:216)

Thanks and Regards,

Ramakrishna


was (Author: ramks):
[~saurabhc100] has the fix for this bug been verified? We are observing the 
same issue as reported here when we upgraded to Spark_3.0 and would like to 
patch the fix on our product.

Our ORC source file contains three fields:  ___col1 string,_col2 string,_col3 
string and reading it fails with the below exception:

java.lang.ArrayIndexOutOfBoundsException: 1
 at 
org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:183)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:216)

Thanks and Regards,

Ramakrishna

> Spark sql commands are failing on select Queries for the  orc tables
> 
>
> Key: SPARK-32234
> URL: https://issues.apache.org/jira/browse/SPARK-32234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Saurabh Chawla
>Assignee: Saurabh Chawla
>Priority: Blocker
> Fix For: 3.0.1, 3.1.0
>
> Attachments: e17f6887c06d47f6a62c0140c1ad569c_00
>
>
> Spark sql commands are failing on select Queries for the orc tables
> Steps to reproduce
>  
> {code:java}
> val table = """CREATE TABLE `date_dim` (
>   `d_date_sk` INT,
>   `d_date_id` STRING,
>   `d_date` TIMESTAMP,
>   `d_month_seq` INT,
>   `d_week_seq` INT,
>   `d_quarter_seq` INT,
>   `d_year` INT,
>   `d_dow` INT,
>   `d_moy` INT,
>   `d_dom` INT,
>   `d_qoy` INT,
>   `d_fy_year` INT,
>   `d_fy_quarter_seq` INT,
>   `d_fy_week_seq` INT,
>   `d_day_name` STRING,
>   `d_quarter_name` STRING,
>   `d_holiday` STRING,
>   `d_weekend` STRING,
>   `d_following_holiday` STRING,
>   `d_first_dom` INT,
>   `d_last_dom` INT,
>   `d_same_day_ly` INT,
>   `d_same_day_lq` INT,
>   `d_current_day` STRING,
>   `d_current_week` STRING,
>   `d_current_month` STRING,
>   `d_current_quarter` STRING,
>   `d_current_year` STRING)
> USING orc
> LOCATION '/Users/test/tpcds_scale5data/date_dim'
> TBLPROPERTIES (
>   'transient_lastDdlTime' = '1574682806')"""
> spark.sql(table).collect
> val u = """select date_dim.d_date_id from date_dim limit 5"""
> spark.sql(u).collect
> {code}
>  
>  
> Exception
>  
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 
> (TID 2, 192.168.0.103, executor driver): 
> java.lang.ArrayIndexOutOfBoundsException: 1
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapP

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-09 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

--- 

Download Spark_3.0 from [https://spark.apache.org/downloads.html]

 

Step 1) Create an ORC file using the default Spark_3.0 native API from the 
spark shell.
{code:java}
[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+-+
|               key| value| 
+-+
|spark.sql.orc.impl|native|
+-+
 
scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []
scala> spark.sql("insert into df_table values('col1val1','col2val1')")
org.apache.spark.sql.DataFrame = []

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> dFrame.show()

+-+
|    col1|    col2|
+-+
|col1val1|col2val1|
+-+

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/ORC_File_Tgt1")
{code}
 

Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.
{code:java}
[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/ORC_File_Tgt1/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc
 Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
at org.apache.orc.tools.FileDump.main(FileDump.java:154)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
{code}
Step 3) Now create an ORC file using the Hive API (as suggested by Spark in 
[https://spark.apache.org/docs/latest/sql-migration-guide.html], by setting 
spark.sql.orc.impl to hive).
{code:java}
scala> spark.sql("set spark.sql.orc.impl=hive")
res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> spark.sql("set spark.sql.orc.impl").show()

++
|               key|value| 
++
|spark.sql.orc.impl| hive|
++

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []
scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/ORC_File_Tgt2")
{code}
 

Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you can see below, it fails with 
the same exception while fetching the metadata, even after following the 
workaround suggested by Spark of setting spark.sql.orc.impl to hive.
{code:java}
[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/ORC_File_Tgt2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc
Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
at org.apache.

[jira] [Comment Edited] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-09 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173779#comment-17173779
 ] 

Ramakrishna Prasad K S edited comment on SPARK-32558 at 8/9/20, 9:00 AM:
-

Hi [~dongjoon]

Thanks for pointing out that it is a Hive bug. I am already aware that there is 
a Hive bug related to this, which I have noted in the Jira description itself.

But the latest Spark documentation mentions that if you set 
spark.sql.orc.impl=hive, Spark generates ORC files that work with Hive_2.1.1 or 
below. That is why I raised this bug: the workaround mentioned in the Spark 
documentation was not working for me.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

!image-2020-08-09-14-07-00-521.png|width=725,height=91!

It is clearly mentioned here to use spark.sql.orc.impl=hive to create files 
that work with Hive_2.1.1 and older. (When Hive_2.1.1 does not have the fix for 
this issue, how is this a valid workaround?)
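
For reference, a minimal sketch of the standard ways this flag can be set (the 
description above uses the in-session SET form):
{code:java}
// At spark-shell startup:
//   ./spark-shell --conf spark.sql.orc.impl=hive
// Or inside an already running session:
spark.sql("set spark.sql.orc.impl=hive")
spark.sql("set spark.sql.orc.impl").show()  // should now report 'hive'
{code}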

 

And regarding the way I am generating the ORC file, I do not agree with your 
comments. Please look at this closely.

First I am creating a normal Spark SQL table and loading data into it using an 
insert query. In the second half I am reading all the data from the Spark SQL 
table into a Spark DataFrame. Finally I am writing the DataFrame content into a 
new ORC file (this works for all the file formats):
{code:java}
scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []
 scala> spark.sql("insert into df_table values('col1val1','col2val1')")
 org.apache.spark.sql.DataFrame = []
scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]
 scala> dFrame.show()
-

    col1     col2
-

col1val1 col2val1
-
scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
{code}
The source and target in this case are two independent entities, I believe. 
Once the data is in a DataFrame, it can be written out to any format such as 
ORC, Avro, or Parquet, irrespective of where the data came from.
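
As an illustration of this point, a minimal sketch (the target paths below are 
placeholders; Avro would additionally need the spark-avro package on the 
classpath, so only ORC and Parquet are shown):
{code:java}
// The same DataFrame, read once from the Spark SQL table, written out in two
// formats; the write side does not depend on where the data came from.
val df = spark.sql("select * from df_table")
df.write.format("orc").save("/tmp/df_table_orc")          // placeholder path
df.write.format("parquet").save("/tmp/df_table_parquet")  // placeholder path
{code}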

Please note, here I am not trying to create an ORC Spark SQL table; I am trying 
to generate an ORC file using the native Spark_3.0 APIs:

scala> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2 is just a 
location on my Linux machine; it is not where the df_table2 Spark SQL table is 
stored.

How or from where the data gets into the DataFrame is irrelevant here, I 
believe; I am just loading simple string data.

Even the case below is valid: data can also be generated this way and written 
into an ORC file. The issue is observed with the file created in this case as 
well.

 
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])

val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}

records.toDF().write.format("orc").save("/tmp/orc_tgt33")
{code}
 

We need to at least fix the Spark_3.0 documentation or give a detailed 
explanation there of the purpose of the spark.sql.orc.impl=hive flag. (If this 
flag cannot generate ORC files that are compatible with Hive_2.1.1 and below, 
then what is its use?)

We are on the Hive_2.1.1 version in our product and we make calls to the 
Hive_2.1.1 ORC APIs. Regarding the tool I am using to retrieve metadata, *hive 
--orcfiledump*, it internally calls the Hive_2.1.1 APIs, which is why I am 
using it: with it I am able to replicate the problem.

I am following up with our Hadoop vendor for a back-port of HIVE-16683 to 
Hive_2.1.1. However, the confusing part in the Spark documentation needs to be 
fixed, I believe.

Thank you.

Ramakrishna

 


was (Author: ramks):
Hi [~dongjoon]

Thanks for pointing that it is a Hive bug. I am already aware that there is a 
Hive bug related to this which I have put in the Jira description itself.

But according to the latest spark documentation, it was mentioned that if you 
set spark.sql.orc.impl=hive, it would generate orc files that would work with 
Hive_2.1.1 or below. That is why I raised this bug because the workaround 
mentioned in spark documentation was not working for me.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

!image-2020-08-09-14-07-00-521.png|width=725,height=91!

It is clearly mentioned here that use spark.sql.orc.imp

[jira] [Comment Edited] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-09 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173779#comment-17173779
 ] 

Ramakrishna Prasad K S edited comment on SPARK-32558 at 8/9/20, 9:00 AM:
-

Hi [~dongjoon]

Thanks for pointing that it is a Hive bug. I am already aware that there is a 
Hive bug related to this which I have put in the Jira description itself.

But according to the latest spark documentation, it was mentioned that if you 
set spark.sql.orc.impl=hive, it would generate orc files that would work with 
Hive_2.1.1 or below. That is why I raised this bug because the workaround 
mentioned in spark documentation was not working for me.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

!image-2020-08-09-14-07-00-521.png|width=725,height=91!

It is clearly mentioned here that use spark.sql.orc.impl=hive to create files 
that would work with Hive_2.1.1 and older. (when Hive_2.1.1 does not have the 
fix for this issue, how is this a valid workaround here??)

 

And regarding the way I am generating the ORC file, I do not agree with your 
comments. Please look at this closely.

First I am creating a normal spark sql table here and loading data to it using 
insert query. In the second half what I am doing is, reading all the data from 
the spark sql table and loading it into spark dataframe. Finally I am writing 
the spark data frame content into a new ORC file: (which works for all the file 
formats)
{code:java}
scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []
 scala> spark.sql("insert into df_table values('col1val1','col2val1')")
 org.apache.spark.sql.DataFrame = []
scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]
 scala> dFrame.show()
-

    col1     col2
-

col1val1 col2val1
-
scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
{code}
The source and target in this case are two independent entities I believe. Once 
the data comes to a Dataframe, it can be written into any files like ORC, Avro, 
Parquet irrespective of where the data has come into data frame.

Please note, here I am not trying to create a ORC Spark SQL table and then I am 
trying to generate a ORC file using the Native Spark_3.0 APIs. 
 scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
 /export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2 
{color:#ff}this is just a location on my linux machine. It is not where the 
df_table2 spark sql table is stored.  Will update the description if it is 
confusing.{color}

How or from where the data comes to data frame is irrelevant I believe here, I 
am just loading simple string data as well.

Even the below case is valid, data can be generated this way also and written 
into a ORC file. The issue is observed with the file created in this case as 
well.

 
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])

val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}

records.toDF().write.format("orc").save("/tmp/orc_tgt33")
{code}
 

We need to at-least fix the Spark3.0 documentation or give detailed explanation 
there as to what is the purpose of spark.sql.orc.impl=hive flag..!(If this flag 
cannot generate ORC files that is not compatible with Hive_2.1.1 and below, 
then what is the usage of the same)

We are on Hive_2.1.1 version in our product and we make calls to Hive_2.1.1 ORC 
APIs. Regarding the tool I am using to retrieve metadata *hive --orcfiledump*, 
it internally calls Hive_2.1.1 APIs. Which is why I am using the same as with 
it I am able to replicate the problem.

I am following up with our Hadoop vendor to give a back-port of HIVE-16683 to 
Hive_2.1.1. However, the confusing part in the spark documentation needs to be 
fixed I believe.

Thank you.

Ramakrishna

 


was (Author: ramks):
Hi [~dongjoon]

Thanks for pointing that it is a Hive bug. I am already aware that there is a 
Hive bug related to this which I have put in the Jira description itself.

But according to the latest spark documentation, it was mentioned that if you 
set spark.sql.orc.impl=hive, it would generate orc files that would work with 
Hive_2.1.1 or below. That is why I raised this bug because the workaround 
mentioned in spark documentation was not working for me.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

!image-2020-08-09-14-07-00-521.png|width=725,height=91!

It i

[jira] [Comment Edited] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-09 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173779#comment-17173779
 ] 

Ramakrishna Prasad K S edited comment on SPARK-32558 at 8/9/20, 8:57 AM:
-

Hi [~dongjoon]

Thanks for pointing that it is a Hive bug. I am already aware that there is a 
Hive bug related to this which I have put in the Jira description itself.

But according to the latest spark documentation, it was mentioned that if you 
set spark.sql.orc.impl=hive, it would generate orc files that would work with 
Hive_2.1.1 or below. That is why I raised this bug because the workaround 
mentioned in spark documentation was not working for me.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

!image-2020-08-09-14-07-00-521.png|width=725,height=91!

It is clearly mentioned here that use spark.sql.orc.impl=hive to create files 
that would work with Hive_2.1.1 and older. (when Hive_2.1.1 does not have the 
fix for this issue, how is this a valid workaround here??)

 

And regarding the way I am generating the ORC file, I do not agree with your 
comments. Please look at this closely.

First I am creating a normal spark sql table here and loading data to it using 
insert query. In the second half what I am doing is, reading all the data from 
the spark sql table and loading it into spark dataframe. Finally I am writing 
the spark data frame content into a new ORC file: (which works for all the file 
formats)
{code:java}
scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []
 scala> spark.sql("insert into df_table values('col1val1','col2val1')")
 org.apache.spark.sql.DataFrame = []
scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]
 scala> dFrame.show()
-

    col1     col2
-

col1val1 col2val1
-
scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
{code}
The source and target in this case are two independent entities I believe. Once 
the data comes to a Dataframe, it can be written into any files like ORC, Avro, 
Parquet irrespective of where the data has come into data frame.

Please note, here I am not trying to create a ORC Spark SQL table and then I am 
trying to generate a ORC file using the Native Spark_3.0 APIs. 

How or from where the data comes to data frame is irrelevant I believe here, I 
am just loading simple string data as well.

Even the below case is valid, data can be generated this way also and written 
into a ORC file. The issue is observed with the file created in this case as 
well.

 
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])

val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}

records.toDF().write.format("orc").save("/tmp/orc_tgt33")
{code}
 

We need to at-least fix the Spark3.0 documentation or give detailed explanation 
there as to what is the purpose of spark.sql.orc.impl=hive flag..!(If this flag 
cannot generate ORC files that is not compatible with Hive_2.1.1 and below, 
then what is the usage of the same)

We are on Hive_2.1.1 version in our product and we make calls to Hive_2.1.1 ORC 
APIs. Regarding the tool I am using to retrieve metadata *hive --orcfiledump*, 
it internally calls Hive_2.1.1 APIs. Which is why I am using the same as with 
it I am able to replicate the problem.

I am following up with our Hadoop vendor to give a back-port of HIVE-16683 to 
Hive_2.1.1. However, the confusing part in the spark documentation needs to be 
fixed I believe.

Thank you.

Ramakrishna

 


was (Author: ramks):
Hi [~dongjoon]

Thanks for pointing that it is a Hive bug. I am already aware that there is a 
Hive bug related to this which I have put in the Jira description itself.

But according to the latest spark documentation, it was mentioned that if you 
set spark.sql.orc.impl=hive, it would generate orc files that would work with 
Hive_2.1.1 or below. That is why I raised this bug because the workaround 
mentioned in spark documentation was not working for me.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

!image-2020-08-09-14-07-00-521.png|width=725,height=91!

It is clearly mentioned here that use spark.sql.orc.impl=hive to create files 
that would work with Hive_2.1.1 and older. (when Hive_2.1.1 does not have the 
fix for this issue, how is this a valid workaround here??)

 

And regarding the way I am generating the ORC file, I do not agree with your 
comments. Please look at this closely.

First I am creating a 

[jira] [Comment Edited] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-09 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173779#comment-17173779
 ] 

Ramakrishna Prasad K S edited comment on SPARK-32558 at 8/9/20, 8:57 AM:
-

Hi [~dongjoon]

Thanks for pointing that it is a Hive bug. I am already aware that there is a 
Hive bug related to this which I have put in the Jira description itself.

But according to the latest spark documentation, it was mentioned that if you 
set spark.sql.orc.impl=hive, it would generate orc files that would work with 
Hive_2.1.1 or below. That is why I raised this bug because the workaround 
mentioned in spark documentation was not working for me.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

!image-2020-08-09-14-07-00-521.png|width=725,height=91!

It is clearly mentioned here that use spark.sql.orc.impl=hive to create files 
that would work with Hive_2.1.1 and older. (when Hive_2.1.1 does not have the 
fix for this issue, how is this a valid workaround here??)

 

And regarding the way I am generating the ORC file, I do not agree with your 
comments. Please look at this closely.

First I am creating a normal spark sql table here and loading data to it using 
insert query. In the second half what I am doing is, reading all the data from 
the spark sql table and loading it into spark dataframe. Finally I am writing 
the spark data frame content into a new ORC file: (which works for all the file 
formats)
{code:java}
scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []
 scala> spark.sql("insert into df_table values('col1val1','col2val1')")
 org.apache.spark.sql.DataFrame = []
scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]
 scala> dFrame.show()
-

    col1     col2
-

col1val1 col2val1
-
scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
{code}
The source and target in this case are two independent entities I believe. Once 
the data comes to a Dataframe, it can be written into any files like ORC, Avro, 
Parquet irrespective of where the data has come into data frame.

Please note, here I am not trying to create a ORC Spark SQL table and then I am 
trying to generate a ORC file using the Native Spark_3.0 APIs. 

How or from where the data comes to data frame is irrelevant I believe here, I 
am just loading simple string data as well.

Even the below case is valid, data can be generated this way also and written 
into a ORC file. The issue is observed with the file created in this case as 
well.

 
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])

val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}

records.toDF().write.format("orc").save("/tmp/orc_tgt33")
{code}
 

 

We need to at-least fix the Spark3.0 documentation or give detailed explanation 
there as to what is the purpose of spark.sql.orc.impl=hive flag..!(If this flag 
cannot generate ORC files that is not compatible with Hive_2.1.1 and below, 
then what is the usage of the same)

We are on Hive_2.1.1 version in our product and we make calls to Hive_2.1.1 ORC 
APIs. Regarding the tool I am using to retrieve metadata *hive --orcfiledump*, 
it internally calls Hive_2.1.1 APIs. Which is why I am using the same as with 
it I am able to replicate the problem.

I am following up with our Hadoop vendor to give a back-port of HIVE-16683 to 
Hive_2.1.1. However, the confusing part in the spark documentation needs to be 
fixed I believe.

Thank you.

Ramakrishna

 


was (Author: ramks):
Hi [~dongjoon]

Thanks for pointing that it is a Hive bug. I am already aware that there is a 
Hive bug related to this which I have put in the Jira description itself.

But according to the latest spark documentation, it was mentioned that if you 
set spark.sql.orc.impl=hive, it would generate orc files that would work with 
Hive_2.1.1 or below. That is why I raised this bug because the workaround 
mentioned in spark documentation was not working for me.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

!image-2020-08-09-14-07-00-521.png|width=725,height=91!

It is clearly mentioned here that use spark.sql.orc.impl=hive to create files 
that would work with Hive_2.1.1 and older. (when Hive_2.1.1 does not have the 
fix for this issue, how is this a valid workaround here??)

 

And regarding the way I am generating the ORC file, I do not agree with your 
comments. Please look at this closely.

First I am creating

[jira] [Comment Edited] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-09 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173779#comment-17173779
 ] 

Ramakrishna Prasad K S edited comment on SPARK-32558 at 8/9/20, 8:56 AM:
-

Hi [~dongjoon]

Thanks for pointing that it is a Hive bug. I am already aware that there is a 
Hive bug related to this which I have put in the Jira description itself.

But according to the latest spark documentation, it was mentioned that if you 
set spark.sql.orc.impl=hive, it would generate orc files that would work with 
Hive_2.1.1 or below. That is why I raised this bug because the workaround 
mentioned in spark documentation was not working for me.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

!image-2020-08-09-14-07-00-521.png|width=725,height=91!

It is clearly mentioned here that use spark.sql.orc.impl=hive to create files 
that would work with Hive_2.1.1 and older. (when Hive_2.1.1 does not have the 
fix for this issue, how is this a valid workaround here??)

 

And regarding the way I am generating the ORC file, I do not agree with your 
comments. Please look at this closely.

First I am creating a normal spark sql table here and loading data to it using 
insert query. In the second half what I am doing is, reading all the data from 
the spark sql table and loading it into spark dataframe. Finally I am writing 
the spark data frame content into a new ORC file: (which works for all the file 
formats)
{code:java}
scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []
 scala> spark.sql("insert into df_table values('col1val1','col2val1')")
 org.apache.spark.sql.DataFrame = []
scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]
 scala> dFrame.show()
-

    col1     col2
-

col1val1 col2val1
-
scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
{code}
The source and target in this case are two independent entities I believe. Once 
the data comes to a Dataframe, it can be written into any files like ORC, Avro, 
Parquet irrespective of where the data has come into data frame.

Please note, here I am not trying to create a ORC Spark SQL table. I am trying 
to generate a ORC file using the Native Spark_3.0 APIs. 

How or from where the data comes to data frame is irrelevant here, I am just 
loading simple string data.

Even the below case is valid, data can be generated this way also and written 
into a ORC file. The issue is observed with the file created in this case as 
well.

 
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])

val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}

records.toDF().write.format("orc").save("/tmp/orc_tgt33")
{code}
 

 

We need to at-least fix the Spark3.0 documentation or give detailed explanation 
there as to what is the purpose of spark.sql.orc.impl=hive flag..!(If this flag 
cannot generate ORC files that is not compatible with Hive_2.1.1 and below, 
then what is the usage of the same)

We are on Hive_2.1.1 version in our product and we make calls to Hive_2.1.1 ORC 
APIs. Regarding the tool I am using to retrieve metadata *hive --orcfiledump*, 
it internally calls Hive_2.1.1 APIs. Which is why I am using the same as with 
it I am able to replicate the problem.

I am following up with our Hadoop vendor to give a back-port of HIVE-16683 to 
Hive_2.1.1. However, the confusing part in the spark documentation needs to be 
fixed I believe.

Thank you.

Ramakrishna

 


was (Author: ramks):
Hi [~dongjoon]

Thanks for pointing that it is a Hive bug. I am already aware that there is a 
Hive bug related to this which I have put in the Jira description itself.

But according to the latest spark documentation, it was mentioned that if you 
set spark.sql.orc.impl=hive, it would generate orc files that would work with 
Hive_2.1.1 or below. That is why I raised this bug because the workaround 
mentioned in spark documentation was not working for me.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

!image-2020-08-09-14-07-00-521.png|width=725,height=91!

It is clearly mentioned here that use spark.sql.orc.impl=hive to create files 
that would work with Hive_2.1.1 and older. (when Hive_2.1.1 does not have the 
fix for this issue, how is this a valid workaround here??)

 

And regarding the way I am generating the ORC file, I do not agree with your 
comments. Please look at this closely.

First I am creating a normal spark sql table 

[jira] [Comment Edited] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-09 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173779#comment-17173779
 ] 

Ramakrishna Prasad K S edited comment on SPARK-32558 at 8/9/20, 8:52 AM:
-

Hi [~dongjoon]

Thanks for pointing that it is a Hive bug. I am already aware that there is a 
Hive bug related to this which I have put in the Jira description itself.

But according to the latest spark documentation, it was mentioned that if you 
set spark.sql.orc.impl=hive, it would generate orc files that would work with 
Hive_2.1.1 or below. That is why I raised this bug because the workaround 
mentioned in spark documentation was not working for me.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

!image-2020-08-09-14-07-00-521.png|width=725,height=91!

It is clearly mentioned here that use spark.sql.orc.impl=hive to create files 
that would work with Hive_2.1.1 and older. (when Hive_2.1.1 does not have the 
fix for this issue, how is this a valid workaround here??)

 

And regarding the way I am generating the ORC file, I do not agree with your 
comments. Please look at this closely.

First I am creating a normal spark sql table here and loading data to it using 
insert query. In the second half what I am doing is, reading all the data from 
the spark sql table and loading it into spark dataframe. Finally I am writing 
the spark data frame content into the ORC file: (which works for ant file 
formats)
{code:java}
scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []
 scala> spark.sql("insert into df_table values('col1val1','col2val1')")
 org.apache.spark.sql.DataFrame = []
scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]
 scala> dFrame.show()
-

    col1     col2
-

col1val1 col2val1
-
scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
{code}
The source and target in this case are two independent entities I believe. Once 
the data comes to a Dataframe, it can be written into any files like ORC, Avro, 
Parquet irrespective of where the data has come into data frame.

Please note, here I am not trying to create a ORC Spark SQL table. I am trying 
to generate a ORC file using the Native Spark_3.0 APIs. 

How or from where the data comes to data frame is irrelevant here, I am just 
loading simple string data.

Even the below case is valid, data can be generated this way also and written 
into a ORC file. The issue is observed with the file created in this case as 
well.

 
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])

val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}

records.toDF().write.format("orc").save("/tmp/orc_tgt33")
{code}
 

 

We need to at-least fix the Spark3.0 documentation or give detailed explanation 
there as to what is the purpose of spark.sql.orc.impl=hive flag..!(If this flag 
cannot generate ORC files that is not compatible with Hive_2.1.1 and below, 
then what is the usage of the same)

We are on Hive_2.1.1 version in our product and we make calls to Hive_2.1.1 ORC 
APIs. Regarding the tool I am using to retrieve metadata *hive --orcfiledump*, 
it internally calls Hive_2.1.1 APIs. Which is why I am using the same as with 
it I am able to replicate the problem.

I am following up with our Hadoop vendor to give a back-port of HIVE-16683 to 
Hive_2.1.1. However, the confusing part in the spark documentation needs to be 
fixed I believe.

Thank you.

Ramakrishna

 


was (Author: ramks):
Hi [~dongjoon]

Thanks for pointing that it is a Hive bug. I am already aware that there is a 
Hive bug related to this which I have put in the Jira description itself.

But according to the latest spark documentation, it was mentioned that if you 
set spark.sql.orc.impl=hive, it would generate orc files that would work with 
Hive_2.1.1 or below. That is why I raised this bug because the workaround 
mentioned in spark documentation was not working for me.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

!image-2020-08-09-14-07-00-521.png|width=725,height=91!

It is clearly mentioned here that use spark.sql.orc.impl=hive to create files 
that would work with Hive_2.1.1 and older. (when Hive_2.1.1 does not have the 
fix for this issue, how is this a valid workaround here??)

 

And regarding the way I am generating the ORC file, I do not agree with your 
comments. Please look at this closely.

First I am creating a normal spark sql table here a

[jira] [Comment Edited] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-09 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173779#comment-17173779
 ] 

Ramakrishna Prasad K S edited comment on SPARK-32558 at 8/9/20, 8:51 AM:
-

Hi [~dongjoon]

Thanks for pointing that it is a Hive bug. I am already aware that there is a 
Hive bug related to this which I have put in the Jira description itself.

But according to the latest spark documentation, it was mentioned that if you 
set spark.sql.orc.impl=hive, it would generate orc files that would work with 
Hive_2.1.1 or below. That is why I raised this bug because the workaround 
mentioned in spark documentation was not working for me.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

!image-2020-08-09-14-07-00-521.png|width=725,height=91!

It is clearly mentioned here that use spark.sql.orc.impl=hive to create files 
that would work with Hive_2.1.1 and older. (when Hive_2.1.1 does not have the 
fix for this issue, how is this a valid workaround here??)

 

And regarding the way I am generating the ORC file, I do not agree with your 
comments. Please look at this closely.

First I am creating a normal spark sql table here and loading data to it using 
insert query. In the second half what I am doing is, reading all the data from 
the spark sql table and loading it into spark dataframe. Finally I am writing 
the spark data frame content into the ORC file: (which works for ant file 
formats)
{code:java}
scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []
 scala> spark.sql("insert into df_table values('col1val1','col2val1')")
 org.apache.spark.sql.DataFrame = []
scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]
 scala> dFrame.show()
-

    col1     col2
-

col1val1 col2val1
-
scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
{code}

The source and the target are two independent entities here. Once the data is 
in a DataFrame, it can be written out to any format such as ORC, Avro or 
Parquet, irrespective of where the data in the DataFrame came from.
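
As a small illustration of that point (output paths below are hypothetical; 
Avro would work the same way but needs the external spark-avro package on the 
classpath, so only the built-in formats are shown):
{code:java}
// Sketch: the same DataFrame written out in two built-in formats.
// Output paths are hypothetical; the source of the DataFrame does not matter.
val dFrame = spark.sql("select * from df_table")

dFrame.write.format("orc").save("/tmp/df_table_orc")
dFrame.write.format("parquet").save("/tmp/df_table_parquet")
{code}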

Please note that I am not trying to create an ORC Spark SQL table here. I am 
trying to generate an ORC file using the native Spark_3.0 APIs.

How or from where the data gets into the DataFrame is irrelevant here; I am 
just loading simple string data.
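
For example, the same two-column string data can be built directly from a 
local collection, with no Spark SQL table involved at all (a sketch with a 
hypothetical output path):
{code:java}
// Sketch: build the DataFrame straight from a local Seq instead of a table.
// Output path is hypothetical.
import spark.implicits._

val localDf = Seq(("col1val1", "col2val1")).toDF("col1", "col2")
localDf.write.format("orc").save("/tmp/df_from_local_seq")
{code}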

The case below is also valid: data can be generated this way and written into 
an ORC file, and the issue is observed with the file created in this case as 
well.

 
{code:java}
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])

val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}

records.toDF().write.format("orc").save("/tmp/orc_tgt33")
{code}
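
As a side note, one sanity check that can be run on any of these outputs is to 
read the file back with Spark itself, to separate Spark-side problems from the 
Hive_2.1.1 reader (the sketch below uses the output path from the example 
above):
{code:java}
// Sketch: read the generated ORC output back with Spark's own reader as a
// sanity check on the file, independent of the Hive 2.1.1 tooling.
val checkDf = spark.read.orc("/tmp/orc_tgt33")
checkDf.printSchema()
checkDf.show(5)
{code}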
 

 

We need to at least fix the Spark 3.0 documentation or give a detailed 
explanation there of what the spark.sql.orc.impl=hive flag is actually for. If 
this flag cannot generate ORC files that are compatible with Hive_2.1.1 and 
below, then what is its use?

We are on Hive_2.1.1 in our product and we make calls to the Hive_2.1.1 ORC 
APIs. The tool I am using to retrieve metadata, hive --orcfiledump, internally 
calls the same Hive_2.1.1 APIs, which is why I am using it to reproduce the 
problem.
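
If it helps, the check that orcfiledump trips over can also be reproduced 
directly against the ORC Java API. The sketch below is an assumption on my 
part: it needs a recent org.apache.orc library on the classpath, and the same 
call path (OrcFile$WriterVersion.from) is where the 
ArrayIndexOutOfBoundsException reported in the issue description comes from 
when the older ORC classes used by Hive_2.1.1 read the file.
{code:java}
// Sketch: inspect the ORC file and writer versions via the ORC Java API.
// A recent org.apache.orc resolves the newer writer version; the ORC code
// used by Hive 2.1.1's orcfiledump does not know it and fails in
// OrcFile.WriterVersion.from(...), as the stack trace in the issue shows.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

val conf = new Configuration()
val reader = OrcFile.createReader(
  new Path("/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc"),
  OrcFile.readerOptions(conf))

println(reader.getFileVersion)    // ORC file format version
println(reader.getWriterVersion)  // writer version that older Hive cannot map
{code}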

I am following up with our Hadoop vendor to provide a back-port of HIVE-16683 
to Hive_2.1.1. However, I believe the confusing part of the Spark documentation 
still needs to be fixed.

Thank you.

Ramakrishna

 


was (Author: ramks):
Hi [~dongjoon]

Thanks for pointing out that this is a Hive bug. I am already aware of the 
related Hive bug, which I have mentioned in the Jira description itself.

But the latest Spark documentation says that if you set spark.sql.orc.impl=hive, 
Spark will generate ORC files that work with Hive_2.1.1 or below. That is why I 
raised this bug: the workaround mentioned in the Spark documentation is not 
working for me.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

!image-2020-08-09-14-07-00-521.png|width=725,height=91!

It clearly says to use spark.sql.orc.impl=hive to create files that work with 
Hive_2.1.1 and older. When Hive_2.1.1 does not have the fix for this issue, how 
is that a valid workaround?

 

And regarding the way I am generating the ORC file, I do not agree with your 
comments. Please look at this closely.

First I am creating a normal spark sql table here 

[jira] [Commented] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-09 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173779#comment-17173779
 ] 

Ramakrishna Prasad K S commented on SPARK-32558:


Hi [~dongjoon]

Thanks for pointing out that this is a Hive bug. I am already aware of the 
related Hive bug, which I have mentioned in the Jira description itself.

But the latest Spark documentation says that if you set spark.sql.orc.impl=hive, 
Spark will generate ORC files that work with Hive_2.1.1 or below. That is why I 
raised this bug: the workaround mentioned in the Spark documentation is not 
working for me.

[https://spark.apache.org/docs/latest/sql-migration-guide.html]

!image-2020-08-09-14-07-00-521.png|width=725,height=91!

It clearly says to use spark.sql.orc.impl=hive to create files that work with 
Hive_2.1.1 and older. When Hive_2.1.1 does not have the fix for this issue, how 
is that a valid workaround?

 

And regarding the way I am generating the ORC file, I do not agree with your 
comments. Please look at this closely.

First I create a normal Spark SQL table and load data into it with an insert 
query. In the second half I read all the data from that Spark SQL table into a 
Spark DataFrame. Finally I write the DataFrame contents out as an ORC file (the 
same approach works for any file format):
scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []
scala> spark.sql("insert into df_table values('col1val1','col2val1')")
org.apache.spark.sql.DataFrame = []

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> dFrame.show()

+-+
|    col1|    col2|
+-+
|col1val1|col2val1|
+-+

scala> 
dFrame.toDF().write.format(*"orc"*).save(*"/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table"*)
The source and the target are two independent entities here. Once the data is 
in a DataFrame, it can be written out to any format such as ORC, Avro or 
Parquet, irrespective of where the data in the DataFrame came from.

Please note that I am not trying to create an ORC Spark SQL table here. I am 
trying to generate an ORC file using the native Spark_3.0 APIs.

How or from where the data gets into the DataFrame is irrelevant here; I am 
just loading simple string data.

The case below is also valid: data can be generated this way and written into 
an ORC file, and the issue is observed with the file created in this case as 
well.

{code:java}
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])

val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}

records.toDF().write.format("orc").save("/tmp/orc_tgt33")
{code}

 

We need to at least fix the Spark 3.0 documentation or give a detailed 
explanation there of what the spark.sql.orc.impl=hive flag is actually for. If 
this flag cannot generate ORC files that are compatible with Hive_2.1.1 and 
below, then what is its use?

We are on Hive_2.1.1 in our product and we make calls to the Hive_2.1.1 ORC 
APIs. The tool I am using to retrieve metadata, hive --orcfiledump, internally 
calls the same Hive_2.1.1 APIs, which is why I am using it to reproduce the 
problem.

I am following up with our Hadoop vendor to provide a back-port of HIVE-16683 
to Hive_2.1.1. However, I believe the confusing part of the Spark documentation 
still needs to be fixed.

Thank you.

Ramakrishna

 

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Major
> Attachments: image-2020-08-09-14-07-00-521.png
>
>
> Steps to reproduce the issue:
> --- 
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
> shell .
> {code}
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.1

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-09 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Attachment: image-2020-08-09-14-07-00-521.png

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Major
> Attachments: image-2020-08-09-14-07-00-521.png
>
>
> Steps to reproduce the issue:
> --- 
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
> shell .
> {code}
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
> {code}
>  
> Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
> {code}
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
> {code}
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
> {code}
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-----+
> |               key|value|
> +------------------+-----+
> |spark.sql.orc.impl| hive|
> +------------------+-----+
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
> org.apache.spark.sql.DataFrame = []
> scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
> scala> 
> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
> {code} 
> Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata 

[jira] [Commented] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172805#comment-17172805
 ] 

Ramakrishna Prasad K S commented on SPARK-32558:


[~rohitmishr1484] [~hyukjin.kwon] Thanks for letting me know about setting the 
Priority and target version. Sorry about that. 

Can someone help me with this issue? This is critical and the workaround 
mentioned in [https://spark.apache.org/docs/latest/sql-migration-guide.html] is 
also not working.

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Major
>
> Steps to reproduce the issue:
> --- 
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
> shell .
> {code}
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
> {code}
>  
> Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
> {code}
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
> {code}
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
> {code}
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-----+
> |               key|value|
> +------------------+-----+
> |spark.sql.orc.impl| hive|
> +------------------+-----+
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
> org.apache.spark.sql.DataFrame = []
> scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
> scala> 
> dFrame2.toDF().write.format("orc").save("/export/home

[jira] [Comment Edited] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172805#comment-17172805
 ] 

Ramakrishna Prasad K S edited comment on SPARK-32558 at 8/7/20, 4:07 AM:
-

[~rohitmishr1484] [~hyukjin.kwon] Thanks for letting me know about setting the 
priority and target version. Sorry about that. 

Can someone help me with this issue? This is critical and the workaround 
mentioned in [https://spark.apache.org/docs/latest/sql-migration-guide.html] is 
also not working.


was (Author: ramks):
[~rohitmishr1484] [~hyukjin.kwon] Thanks for letting me know about setting the 
Priority and target version. Sorry for the same. 

Can someone help me with this issue? This is critical and the workaround 
mentioned in [https://spark.apache.org/docs/latest/sql-migration-guide.html] is 
also not working.

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Major
>
> Steps to reproduce the issue:
> --- 
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
> shell .
> {code}
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
> {code}
>  
> Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
> {code}
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
> {code}
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
> {code}
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-----+
> |               key|value|
> +------------------+-----+
> |spark.sql.orc.impl| hive|
> +------------------+-----+
> scala> s

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

--- 

Download Spark_3.0 from [https://spark.apache.org/downloads.html]

 

Step 1) Create ORC File by using the default Spark_3.0 Native API from the 
spark shell .

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+--------+
|               key|   value|
+------------------+--------+
|spark.sql.orc.impl|*native*|
+------------------+--------+

 

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table values('col1val1','col2val1')")

org.apache.spark.sql.DataFrame = []

 

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

 

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

 

 Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.

 

adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

 

Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
[https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
spark.sql.orc.impl as hive)

 

scala> spark.sql("set spark.sql.orc.impl=hive")

res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-------+
|               key|  value|
+------------------+-------+
|spark.sql.orc.impl| *hive*|
+------------------+-------+

 

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

20/08/04 22:43:26 WARN HiveMetaStore: Location: 
[file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
 specified for non-external table:df_table2 res5: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []

 

scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

 

 Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails with 
the same exception to fetch the metadata even after following the workaround 
suggested by spark to set spark.sql.orc.impl to hive

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc

Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

--- 

Download Spark_3.0 from [https://spark.apache.org/downloads.html]

 

Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
shell .

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+--------+
|               key|   value|
+------------------+--------+
|spark.sql.orc.impl|*native*|
+------------------+--------+

 

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table values('col1val1','col2val1')")

org.apache.spark.sql.DataFrame = []

 

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

 

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

 

 Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.

 

adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

 

Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
[https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
spark.sql.orc.impl as hive)

 

scala> spark.sql("set spark.sql.orc.impl=hive")

res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-------+
|               key|  value|
+------------------+-------+
|spark.sql.orc.impl| *hive*|
+------------------+-------+

 

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

20/08/04 22:43:26 WARN HiveMetaStore: Location: 
[file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
 specified for non-external table:df_table2 res5: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []

 

scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

 

 Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails with 
the same exception to fetch the metadata even after following the workaround 
suggested by spark to set spark.sql.orc.impl to hive

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc

Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version.  (was: 
Spark 3.0 on Linux and Hadoop cluster having Hive_2.1.1 version.)

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version.
>Reporter: Ramakrishna Prasad K S
>Priority: Blocker
> Fix For: 3.0.0
>
>
> Steps to reproduce the issue:
> --- 
>  
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from the 
> spark shell .
>  
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+--------+
> |               key|   value|
> +------------------+--------+
> |spark.sql.orc.impl|*native*|
> +------------------+--------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>  
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
>  
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
>  
>  Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
>  
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
>  
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
>  
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-------+
> |               key|  value|
> +------------------+-------+
> |spark.sql.orc.impl| *hive*|
> +------------------+-------+
>  
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> 20/08/04 22:43:26 WARN HiveMetaStore: Location: 
> [file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
>  specified for non-external table:df_table2 res5: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>  
> scala> 
> d

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. (Linux 
Redhat)  (was: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version.)

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Blocker
> Fix For: 3.0.0
>
>
> Steps to reproduce the issue:
> --- 
>  
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from the 
> spark shell .
>  
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+--------+
> |               key|   value|
> +------------------+--------+
> |spark.sql.orc.impl|*native*|
> +------------------+--------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
>  
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
>  
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
>  
>  Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
>  
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
>  
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
>  
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-------+
> |               key|  value|
> +------------------+-------+
> |spark.sql.orc.impl| *hive*|
> +------------------+-------+
>  
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> 20/08/04 22:43:26 WARN HiveMetaStore: Location: 
> [file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
>  specified for non-external table:df_table2 res5: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
> org.apache.spark.sql.DataFrame = []
>  
> scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: s

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

--- 

 

Download Spark_3.0 from [https://spark.apache.org/downloads.html]

 

Step 1) Create ORC File by using the default Spark_3.0 Native API from the 
spark shell .

Launch Spark Shell:

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+--------+
|               key|   value|
+------------------+--------+
|spark.sql.orc.impl|*native*|
+------------------+--------+

 

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table values('col1val1','col2val1')")

org.apache.spark.sql.DataFrame = []

 

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

 

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

 

 Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.

 

adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

 

Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
[https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
spark.sql.orc.impl as hive)

 

scala> spark.sql("set spark.sql.orc.impl=hive")

res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-------+
|               key|  value|
+------------------+-------+
|spark.sql.orc.impl| *hive*|
+------------------+-------+

 

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

20/08/04 22:43:26 WARN HiveMetaStore: Location: 
[file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
 specified for non-external table:df_table2 res5: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []

 

scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

 

 Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails with 
the same exception to fetch the metadata even after following the workaround 
suggested by spark to set spark.sql.orc.impl to hive

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc

Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

--- 

 

Download Spark_3.0 from [https://spark.apache.org/downloads.html]

 

Step 1) Create ORC File by using the default Spark_3.0 Native API from 
spark_3.0 spark shell .

Launch Spark Shell:

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+--------+
|               key|   value|
+------------------+--------+
|spark.sql.orc.impl|*native*|
+------------------+--------+

 

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table values('col1val1','col2val1')")

org.apache.spark.sql.DataFrame = []

 

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

 

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

 

 Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.

 

adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

 

Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
[https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
spark.sql.orc.impl as hive)

 

scala> spark.sql("set spark.sql.orc.impl=hive")

res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-------+
|               key|  value|
+------------------+-------+
|spark.sql.orc.impl| *hive*|
+------------------+-------+

 

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

20/08/04 22:43:26 WARN HiveMetaStore: Location: 
[file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
 specified for non-external table:df_table2 res5: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []

 

scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

 

 Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails with 
the same exception to fetch the metadata even after following the workaround 
suggested by spark to set spark.sql.orc.impl to hive

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc

Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(Or

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

--- 

 

Download Spark_3.0 from [https://spark.apache.org/downloads.html]

 

Step 1) Create ORC File by using the default Spark_3.0 Native API from the 
spark shell .

 

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+--------+
|               key|   value|
+------------------+--------+
|spark.sql.orc.impl|*native*|
+------------------+--------+

 

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table values('col1val1','col2val1')")

org.apache.spark.sql.DataFrame = []

 

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

 

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")

 

 Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.

 

adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)

at org.apache.orc.OrcFile.createReader(OrcFile.java:222)

at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)

at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)

at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)

at org.apache.orc.tools.FileDump.main(FileDump.java:154)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.hadoop.util.RunJar.run(RunJar.java:313)

at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

 

Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
[https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
spark.sql.orc.impl as hive)

 

scala> spark.sql("set spark.sql.orc.impl=hive")

res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-------+
|               key|  value|
+------------------+-------+
|spark.sql.orc.impl| *hive*|
+------------------+-------+

 

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

20/08/04 22:43:26 WARN HiveMetaStore: Location: 
[file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2|file://export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2]
 specified for non-external table:df_table2 res5: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []

 

scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

 

 Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails with 
the same exception to fetch the metadata even after following the workaround 
suggested by spark to set spark.sql.orc.impl to hive

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc

Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7

at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)

at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)

at org.a

[jira] [Updated] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prasad K S updated SPARK-32558:
---
Description: 
Steps to reproduce the issue:

--- 

 

Download Spark_3.0  [https://spark.apache.org/downloads.html]

 

Step 1) Create an ORC file using the default Spark 3.0 native ORC
implementation from the Spark 3.0 spark-shell.

Launch Spark Shell:

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell

Welcome to Spark version 3.0.0

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)

Type in expressions to have them evaluated. Type :help for more information.
 scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

 

scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table values('col1val1','col2val1')")

org.apache.spark.sql.DataFrame = []

 

scala> val dFrame = spark.sql("select * from df_table") dFrame: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> dFrame.show()

+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

 

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
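
For reference, the whole of Step 1 can also be driven as one pasteable Scala
block. This is only a sketch of the same commands shown above, assuming a fresh
spark-shell session; the table name and target path are the ones used in this
report:

// Sketch: Step 1 as a single block for a fresh spark-shell session.
// spark.sql.orc.impl is left at its Spark 3.0 default ("native").
spark.sql("CREATE TABLE df_table(col1 STRING, col2 STRING)")
spark.sql("INSERT INTO df_table VALUES ('col1val1', 'col2val1')")
val dFrame = spark.sql("SELECT * FROM df_table")
dFrame.show()
// Write the ORC file that will later be copied to the Hadoop cluster.
dFrame.write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")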

 

 Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
Processing data file 
/tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
    at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
    at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
    at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
    at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
    at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
    at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
    at org.apache.orc.tools.FileDump.main(FileDump.java:154)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

 

Step 3) Now create the ORC file using the Hive ORC implementation (as suggested
by Spark in [https://spark.apache.org/docs/latest/sql-migration-guide.html], by
setting spark.sql.orc.impl to hive)
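
If the same steps are driven from a standalone application rather than the
spark-shell transcript below, the workaround can equally be applied when the
session is built. A minimal sketch follows; the application name is
illustrative and not part of this report:

import org.apache.spark.sql.SparkSession

// Sketch: set the ORC implementation before any ORC data is written.
val spark = SparkSession.builder()
  .appName("spark3-orc-hive-impl-repro")   // illustrative name
  .config("spark.sql.orc.impl", "hive")    // the workaround suggested by the migration guide
  .enableHiveSupport()                     // needed for the CREATE TABLE / INSERT statements used here
  .getOrCreate()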

 

scala> spark.sql("set spark.sql.orc.impl=hive")

res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()

+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

 

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")

20/08/04 22:43:26 WARN HiveMetaStore: Location: file:/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/bin/spark-warehouse/df_table2 specified for non-external table:df_table2
res5: org.apache.spark.sql.DataFrame = []

 

scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []

 

scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]

 

scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")

 

Step 4) Copy the ORC files created in Step (3) to HDFS /tmp on a Hadoop cluster
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to
analyze or read metadata from the ORC files. As you can see below, it fails with
the same exception while fetching the metadata, even after following the
workaround suggested by Spark of setting spark.sql.orc.impl to hive.

 

[adpqa@irlhadoop1 bug]$ hive --orcfiledump 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc

Processing data file 
/tmp/df_table2/part-0-6d81ea27-ea5b-4f31-b1f7-47d805f98d3e-c000.snappy.orc 
[length: 414]

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
    at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
    at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTai

[jira] [Created] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-06 Thread Ramakrishna Prasad K S (Jira)
Ramakrishna Prasad K S created SPARK-32558:
--

 Summary: ORC target files that Spark_3.0 produces does not work 
with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not 
working)
 Key: SPARK-32558
 URL: https://issues.apache.org/jira/browse/SPARK-32558
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
 Environment: Spark 3.0 on Linux and Hadoop cluster having Hive_2.1.1 
version.
Reporter: Ramakrishna Prasad K S
 Fix For: 3.0.0


Steps to reproduce the issue:

 

 

Download Spark_3.0 on Linux: [https://spark.apache.org/downloads.html]


 

Step 1) Create an ORC file using the default Spark 3.0 native ORC
implementation from the Spark 3.0 spark-shell.

Launch Spark Shell:

[linuxuser1@irlrhellinux1 bin]$ ./spark-shell
Welcome to Spark version 3.0.0
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated. Type :help for more information.

scala> spark.sql("set spark.sql.orc.impl").show()
+------------------+------+
|               key| value|
+------------------+------+
|spark.sql.orc.impl|native|
+------------------+------+

scala> spark.sql("CREATE table df_table(col1 string,col2 string)")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into df_table values('col1val1','col2val1')")
20/08/04 22:40:18 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
res2: org.apache.spark.sql.DataFrame = []

scala> val dFrame = spark.sql("select * from df_table")
dFrame: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> dFrame.show()
+--------+--------+
|    col1|    col2|
+--------+--------+
|col1val1|col2val1|
+--------+--------+

scala> 
dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
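
Before copying the file to the Hadoop cluster (Step 2), it can help to confirm
that Spark itself reads the file back, so the failure can be narrowed down to
the Hive-side reader. This is only a minimal sketch, using the same target path
as above:

// Sketch: sanity-check that the freshly written ORC file is readable by Spark.
val readBack = spark.read.format("orc")
  .load("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
readBack.show()   // expected to show the single row inserted above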


 Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop cluster 
(which has Hive_2.1.1, for example CDH_6.x) and run the following command to 
analyze or read metadata from the ORC files. As you see below, it fails to 
fetch the metadata from the ORC file.  

[adpqa@irlhadoop1 bug]$ hive --orcfiledump /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc
Processing data file /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc [length: 414]
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
    at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
    at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
    at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
    at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
    at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
    at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
    at org.apache.orc.tools.FileDump.main(FileDump.java:154)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:227)

(The same exception occurs even after overriding spark.sql.orc.impl to hive.)

---
Step 3) Now create the ORC file using the Hive ORC implementation (as suggested
by Spark in https://spark.apache.org/docs/latest/sql-migration-guide.html, by
setting spark.sql.orc.impl to hive)
---
scala> spark.sql("set spark.sql.orc.impl=hive")
res6: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("set spark.sql.orc.impl").show()
+------------------+-----+
|               key|value|
+------------------+-----+
|spark.sql.orc.impl| hive|
+------------------+-----+

scala> spark.sql("CREATE table df_table2(col1 string,col2 string)") 20/08/04