[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2020-04-20 Thread Giri Dandu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087813#comment-17087813
 ] 

Giri Dandu commented on SPARK-27913:


[~hyukjin.kwon]

 

With the {{mergeSchema}} ORC option it is still *NOT* working in Spark 2.4.5:

 
{code:java}
scala> spark.conf.getAll
res15: Map[String,String] = 
Map(spark.driver.host -> 192.168.7.124, spark.sql.orc.mergeSchema -> true, 
spark.driver.port -> 54231, spark.repl.class.uri -> 
spark://192.168.7.124:54231/classes, spark.jars -> "", 
spark.repl.class.outputDir -> 
/private/var/folders/6h/nkpdlpcd0h34sq6x2fmz896wt2p_sp/T/spark-373735c9-6837-4734-bb13-e8457848a70e/repl-0852551f-cfa5-4b4a-aa2c-ac129818bbc2,
 spark.app.name -> Spark shell, spark.ui.showConsoleProgress -> true, 
spark.executor.id -> driver, spark.submit.deployMode -> client, spark.master -> 
local[*], spark.home -> /Users/gdandu/Downloads/spark-2.4.5-bin-hadoop2.7, 
spark.sql.catalogImplementation -> hive, spark.app.id -> local-1587393426045)

scala> spark.sql("drop table test_broken_orc");
res16: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create external table test_broken_orc(a struct<f1:int>) stored as orc location '/tmp/test_broken_2'");
res17: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into table test_broken_orc select named_struct(\"f1\", 
1)");
res18: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from test_broken_orc");
res19: org.apache.spark.sql.DataFrame = [a: struct<f1: int>]

scala> res19.show
+---+
|  a|
+---+
|[1]|
+---+

scala> spark.sql("drop table test_broken_orc");
res21: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create external table test_broken_orc(a struct<f1:int,f2:int>) stored as orc location '/tmp/test_broken_2'");
res22: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from test_broken_orc");
res23: org.apache.spark.sql.DataFrame = [a: struct<f1: int, f2: int>]

scala> res23.show
20/04/20 10:46:23 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230)
	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	...
{code}

[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2020-04-20 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087740#comment-17087740
 ] 

Hyukjin Kwon commented on SPARK-27913:
--

[~giri.dandu] can you try the {{mergeSchema}} option implemented in SPARK-11412?
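A minimal sketch of enabling it, assuming a Spark build where SPARK-11412 has landed (the path is the one from the repro above):

{code:scala}
// Per-read ORC option (sketch; assumes the mergeSchema option is available):
val df = spark.read
  .option("mergeSchema", "true")   // merge schemas across all ORC part files
  .orc("/tmp/test_broken_2")

// Or session-wide:
spark.conf.set("spark.sql.orc.mergeSchema", "true")
{code}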

> Spark SQL's native ORC reader implements its own schema evolution
> -
>
> Key: SPARK-27913
> URL: https://issues.apache.org/jira/browse/SPARK-27913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Owen O'Malley
>Priority: Major
>
> ORC's reader handles a wide range of schema evolution, but the Spark SQL 
> native ORC bindings do not provide the desired schema to the ORC reader. This 
> causes a regression when moving spark.sql.orc.impl from 'hive' to 'native'.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2020-04-17 Thread Giri Dandu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085852#comment-17085852
 ] 

Giri Dandu commented on SPARK-27913:


[~viirya] Sorry for late reply.

I re-ran the same test in Spark 2.4.5 and it *is NOT working*, but it works in 2.3.0. I get the same error in Spark 2.4.5:

 
{code:java}
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
{code}




[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2020-04-02 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073928#comment-17073928
 ] 

L. C. Hsieh commented on SPARK-27913:
-

As we support schema merging in ORC via SPARK-11412, is this still an issue?




[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2020-02-11 Thread Giri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034875#comment-17034875
 ] 

Giri commented on SPARK-27913:
--

This issue doesn't exist in *spark-3.0.0-preview2 and also in spark 2.3*. Will this fix be ported to the 2.4.x branch?

 

It appears that the issue is related to Spark using the schema from the ORC files 
instead of the one from the metastore; this causes a schema mismatch and an 
out-of-bounds exception when OrcDeserializer accesses a field that doesn't exist 
in the file.

 

I see logs like this:

 

{noformat}
20/02/11 14:30:38 INFO RecordReaderImpl: Reader schema not provided -- using file schema struct<a:struct<f1:int>>
20/02/11 14:30:38 INFO RecordReaderImpl: Reader schema not provided -- using file schema struct<a:struct<f1:int>>
{noformat}
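The log points at the fix direction: if the expected schema is handed to the ORC reader, ORC performs the schema evolution itself instead of falling back to the file schema. A hedged sketch against the plain ORC Java API (the schema literal and file path are hypothetical, mirroring the repro in this thread):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

Configuration conf = new Configuration();
// The schema the table expects (hypothetical, after the new column was added):
TypeDescription readerSchema =
    TypeDescription.fromString("struct<a:struct<f1:int,f2:int>>");
Reader reader = OrcFile.createReader(
    new Path("/tmp/test_broken_2/part-00000"), OrcFile.readerOptions(conf));
// Passing the reader schema here lets ORC map file fields onto it,
// filling fields that are missing from the file with nulls.
RecordReader rows = reader.rows(reader.options().schema(readerSchema));
{code}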

 




[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2020-01-23 Thread Ivelin Tchangalov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022354#comment-17022354
 ] 

Ivelin Tchangalov commented on SPARK-27913:
---

I'm curious if there's any progress or solution for this issue.




[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2019-06-09 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859450#comment-16859450
 ] 

Liang-Chi Hsieh commented on SPARK-27913:
-

But it seems the above reproducible example also fails when spark.sql.orc.impl 
is set to "hive"?

{code}
java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:201)
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getAs(rows.scala:35)
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.isNullAt(rows.scala:36)
	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.isNullAt$(rows.scala:36)
	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.isNullAt(rows.scala:195)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:56)
	at org.apache.spark.sql.hive.orc.OrcFileFormat$.$anonfun$unwrapOrcStructs$4(OrcFileFormat.scala:347)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:99)
{code}




[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2019-06-05 Thread Andrey Zinovyev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856592#comment-16856592
 ] 

Andrey Zinovyev commented on SPARK-27913:
-

A simple way to reproduce it:


{code:sql}
create external table test_broken_orc(a struct<f1:int>) stored as orc;
insert into table test_broken_orc select named_struct("f1", 1);
drop table test_broken_orc;
create external table test_broken_orc(a struct<f1:int,f2:int>) stored as orc;
select * from test_broken_orc;
{code}

The last statement fails with an exception:

{noformat}
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
	at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230)
	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
{noformat}

You can also reproduce it by removing a column, or by adding a column in the 
middle of the struct field. As far as I understand the current implementation, 
it supports by-name field resolution only at the top level of the ORC structure; 
everything deeper is resolved by index and is expected to match the reader 
schema exactly.
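The add-a-column-in-the-middle variant can be sketched the same way (table name hypothetical). Since inner fields are resolved by position rather than name, the read misbehaves instead of mapping f1/f2 correctly:

{code:sql}
create external table test_broken_orc2(a struct<f1:int,f2:int>) stored as orc;
insert into table test_broken_orc2 select named_struct("f1", 1, "f2", 2);
drop table test_broken_orc2;
-- recreate with a new field f0 inserted before f1:
create external table test_broken_orc2(a struct<f0:int,f1:int,f2:int>) stored as orc;
select * from test_broken_orc2;  -- inner fields matched by index, not by name
{code}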

