[jira] [Commented] (SPARK-24472) Orc RecordReaderFactory throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-24472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045096#comment-17045096 ]

Dongjoon Hyun commented on SPARK-24472:
----------------------------------------

I confirmed that this is fixed at 3.0.0-preview2.

{code}
scala> spark.sql("select * from orctest").show()
20/02/26 02:46:42 ERROR AcidUtils: Failed to get files with ID; using regular API: Only supported for DFS; got class org.apache.hadoop.hive.ql.io.ProxyLocalFileSystem
+---+---+
|  s|  i|
+---+---+
|abc|123|
+---+---+

scala> spark.version
res3: String = 3.0.0-preview2
{code}

> Orc RecordReaderFactory throws IndexOutOfBoundsException
> ---------------------------------------------------------
>
>                 Key: SPARK-24472
>                 URL: https://issues.apache.org/jira/browse/SPARK-24472
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>            Reporter: Shixiong Zhu
>            Priority: Major
>
> When the column number of the underlying file schema is greater than the column number of the table schema, Orc RecordReaderFactory will throw IndexOutOfBoundsException. "spark.sql.hive.convertMetastoreOrc" should be turned off to use HiveTableScanExec. Here is a reproducer:
> {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> Seq(("abc", 123, 123L)).toDF("s", "i", "l").write.partitionBy("i").format("orc").mode("append").save("/tmp/orctest")
> spark.sql("""
> CREATE EXTERNAL TABLE orctest(s string)
> PARTITIONED BY (i int)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> WITH SERDEPROPERTIES (
>   'serialization.format' = '1'
> )
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> LOCATION '/tmp/orctest'
> """)
> spark.sql("msck repair table orctest")
> spark.sql("set spark.sql.hive.convertMetastoreOrc=false")
> // Exiting paste mode, now interpreting.
>
> 18/06/05 15:34:52 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
>
> scala> spark.read.format("orc").load("/tmp/orctest").show()
> +---+---+---+
> |  s|  l|  i|
> +---+---+---+
> |abc|123|123|
> +---+---+---+
>
> scala> spark.sql("select * from orctest").show()
> 18/06/05 15:34:59 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.IndexOutOfBoundsException: toIndex = 2
>   at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
>   at java.util.ArrayList.subList(ArrayList.java:996)
>   at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
>   at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
>   at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
>   at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
>   at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
>   at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
>   at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
>   at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
>   at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   ...
> {code}
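As the reproducer log shows, the same files read cleanly through Spark's native ORC path; only the Hive record-reader path (reached once spark.sql.hive.convertMetastoreOrc is set to false) hits the exception. A minimal sketch of the workaround implied by the description, assuming the table and data created by the reproducer above:

{code}
// Sketch, not an official fix: with convertMetastoreOrc=true Spark converts
// the metastore ORC table to its native ORC reader instead of
// HiveTableScanExec, so Hive's RecordReaderFactory (the failing frame in the
// trace) is never invoked.
spark.sql("set spark.sql.hive.convertMetastoreOrc=true")
spark.sql("select * from orctest").show()

// Reading the files directly through the data source API bypasses the Hive
// reader as well; the log above shows this path already works.
spark.read.format("orc").load("/tmp/orctest").show()
{code}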
[jira] [Commented] (SPARK-24472) Orc RecordReaderFactory throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-24472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044744#comment-17044744 ]

Shreyas commented on SPARK-24472:
----------------------------------

Now that SPARK-23710 has been fixed, can this one be attended to? Thanks!
[jira] [Commented] (SPARK-24472) Orc RecordReaderFactory throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-24472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652969#comment-16652969 ]

Dongjoon Hyun commented on SPARK-24472:
----------------------------------------

Thank you, [~desmoon]. We know that, but it's not easy for Spark to upgrade its built-in Hive. There is an official JIRA issue for that: SPARK-23710.
[jira] [Commented] (SPARK-24472) Orc RecordReaderFactory throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-24472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652885#comment-16652885 ]

StephenZou commented on SPARK-24472:
-------------------------------------

Hive already fixed this in HIVE-11981, and HIVE-12625 brought the fix to the hive-1.3 branch. However, Spark's hive-orc is 1.2.1-spark2, so the internal Hive version needs to be upgraded to 1.3.
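The stack trace pins the failure to an ArrayList.subList call inside RecordReaderFactory.getSchemaOnRead. A minimal sketch of that failure pattern (hypothetical variable names, not Hive's actual code), matching the reproducer's one-column table read against a two-column file:

{code}
import java.util.ArrayList

// Hypothetical stand-in for the type list built from the table schema:
// the table declares a single non-partition column (s), while the reader
// asks for the file's two data columns (s, l).
val types = new ArrayList[String]()
types.add("string") // s

// ArrayList.subList range-checks toIndex against size(); subList(0, 2) on a
// one-element list throws java.lang.IndexOutOfBoundsException: toIndex = 2,
// the exact message in the trace above. HIVE-11981 hardens this code path.
types.subList(0, 2)
{code}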
[jira] [Commented] (SPARK-24472) Orc RecordReaderFactory throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-24472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503445#comment-16503445 ]

Dongjoon Hyun commented on SPARK-24472:
----------------------------------------

Thank you for pinging me, [~zsxwing].
[jira] [Commented] (SPARK-24472) Orc RecordReaderFactory throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-24472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502914#comment-16502914 ]

Shixiong Zhu commented on SPARK-24472:
---------------------------------------

cc [~dongjoon]
[jira] [Commented] (SPARK-24472) Orc RecordReaderFactory throws IndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-24472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502609#comment-16502609 ]

Shixiong Zhu commented on SPARK-24472:
---------------------------------------

cc [~cloud_fan]