You may hit SPARK-23355 (convertMetastore should not ignore table properties).
Since it's a known Spark issue for all Hive tables (Parquet/ORC), could you check that too?

Bests,
Dongjoon.

On 2018/03/28 01:00:55, Dongjoon Hyun <dongj...@apache.org> wrote:
> Hi, Eirik.
>
> For me, Spark 2.3 works correctly like the following. Could you give us some
> reproducible example?
>
> ```
> scala> sql("set spark.sql.orc.impl=native")
>
> scala> sql("set spark.sql.orc.compression.codec=zlib")
> res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
>
> scala> spark.range(10).write.orc("/tmp/zlib_test")
>
> scala> spark.read.orc("/tmp/zlib_test").show
> +---+
> | id|
> +---+
> |  8|
> |  9|
> |  5|
> |  0|
> |  3|
> |  4|
> |  6|
> |  7|
> |  1|
> |  2|
> +---+
>
> scala> sc.version
> res4: String = 2.3.0
> ```
>
> Bests,
> Dongjoon.
>
>
> On 2018/03/23 15:03:29, Eirik Thorsnes <eirik.thors...@uni.no> wrote:
> > Hi all,
> >
> > I'm trying the new ORC native in Spark 2.3
> > (org.apache.spark.sql.execution.datasources.orc).
> >
> > I've compiled Spark 2.3 from the git branch-2.3 as of March 20th.
> > I also get the same error for the Spark 2.2 from Hortonworks HDP 2.6.4.
> >
> > *NOTE*: the error only occurs with zlib compression, and I see that with
> > Snappy I get an extra log-line saying "OrcCodecPool: Got brand-new codec
> > SNAPPY". Perhaps zlib codec is never loaded/triggered in the new code?
> >
> > I can write using the new native codepath without errors, but *reading*
> > zlib-compressed ORC, either the newly written ORC-files *or* older
> > ORC-files written with Spark 2.2/1.6 I get the following exception.
> >
> > ======= cut =========
> > 2018-03-23 10:36:08,249 INFO FileScanRDD: Reading File path: hdfs://.../year=1999/part-r-00000-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc, range: 0-134217728, partition values: [1999]
> > 2018-03-23 10:36:08,326 INFO ReaderImpl: Reading ORC rows from hdfs://.../year=1999/part-r-00000-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc with {include: [true, true, true, true, true, true, true, true, true], offset: 0, length: 134217728}
> > 2018-03-23 10:36:08,326 INFO RecordReaderImpl: Reader schema not provided -- using file schema struct<datetime:timestamp,lon:float,lat:float,u10:smallint,v10:smallint,lcc:smallint,mcc:smallint,hcc:smallint>
> >
> > 2018-03-23 10:36:08,824 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> > java.nio.BufferUnderflowException
> >     at java.nio.Buffer.nextGetIndex(Buffer.java:500)
> >     at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:249)
> >     at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:248)
> >     at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:58)
> >     at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
> >     at org.apache.orc.impl.TreeReaderFactory$TimestampTreeReader.nextVector(TreeReaderFactory.java:976)
> >     at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:1815)
> >     at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1184)
> >     at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.scala:186)
> >     at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.scala:114)
> >     at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> >     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
> >     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
> >     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
> >     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
> >     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> >     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> >     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> >     at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
> >     at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
> >     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> >     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> >     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> >     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> >     at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> >     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> >     at org.apache.spark.scheduler.Task.run(Task.scala:108)
> >     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >     at java.lang.Thread.run(Thread.java:748)
> > ======= cut =========
> >
> > I have the following set in spark-defaults.conf:
> >
> > spark.sql.hive.convertMetastoreOrc true
> > spark.sql.orc.char.enabled true
> > spark.sql.orc.enabled true
> > spark.sql.orc.filterPushdown true
> > spark.sql.orc.impl native
> > spark.sql.orc.enableVectorizedReader true
> >
> >
> > If I set these to false and use the old hive reader (or specify the full
> > classname for the old hive reader in the spark-shell) I get results OK
> > with both new and old orc-files.
> >
> > If I use Snappy compression it works with the new reader without error.
> >
> > NOTE: I'm running on Hortonworks HDP 2.6.4 (Hadoop 2.7.3) and I also get
> > the same error for the Spark 2.2 there which I understand has many of
> > the patches from the Spark 2.3 branch.
> >
> > Should this be reported in the JIRA system?
> >
> > Regards,
> > Eirik
> >
> > --
> > Eirik Thorsnes
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org