Hi, Eirik. For me, Spark 2.3 works correctly, as shown below. Could you give us a reproducible example?
```
scala> sql("set spark.sql.orc.impl=native")

scala> sql("set spark.sql.orc.compression.codec=zlib")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.range(10).write.orc("/tmp/zlib_test")

scala> spark.read.orc("/tmp/zlib_test").show
+---+
| id|
+---+
|  8|
|  9|
|  5|
|  0|
|  3|
|  4|
|  6|
|  7|
|  1|
|  2|
+---+

scala> sc.version
res4: String = 2.3.0
```

Bests,
Dongjoon.

On 2018/03/23 15:03:29, Eirik Thorsnes <eirik.thors...@uni.no> wrote:
> Hi all,
>
> I'm trying the new native ORC reader in Spark 2.3
> (org.apache.spark.sql.execution.datasources.orc).
>
> I compiled Spark 2.3 from the git branch-2.3 as of March 20th.
> I also get the same error with Spark 2.2 from Hortonworks HDP 2.6.4.
>
> *NOTE*: the error only occurs with zlib compression. With Snappy I see
> an extra log line saying "OrcCodecPool: Got brand-new codec SNAPPY".
> Perhaps the zlib codec is never loaded/triggered in the new code?
>
> I can write using the new native codepath without errors, but when
> *reading* zlib-compressed ORC, either the newly written ORC files *or*
> older ORC files written with Spark 2.2/1.6, I get the following exception.
>
> ======= cut =========
> 2018-03-23 10:36:08,249 INFO FileScanRDD: Reading File path:
> hdfs://.../year=1999/part-r-00000-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc,
> range: 0-134217728, partition values: [1999]
> 2018-03-23 10:36:08,326 INFO ReaderImpl: Reading ORC rows from
> hdfs://.../year=1999/part-r-00000-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc
> with {include: [true, true, true, true, true, true, true, true, true],
> offset: 0, length: 134217728}
> 2018-03-23 10:36:08,326 INFO RecordReaderImpl: Reader schema not
> provided -- using file schema
> struct<datetime:timestamp,lon:float,lat:float,u10:smallint,v10:smallint,lcc:smallint,mcc:smallint,hcc:smallint>
>
> 2018-03-23 10:36:08,824 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.nio.BufferUnderflowException
>     at java.nio.Buffer.nextGetIndex(Buffer.java:500)
>     at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:249)
>     at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:248)
>     at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:58)
>     at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
>     at org.apache.orc.impl.TreeReaderFactory$TimestampTreeReader.nextVector(TreeReaderFactory.java:976)
>     at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:1815)
>     at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1184)
>     at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.scala:186)
>     at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.scala:114)
>     at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>     at org.apache.spark.scheduler.Task.run(Task.scala:108)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> ======= cut =========
>
> I have the following set in spark-defaults.conf:
>
> spark.sql.hive.convertMetastoreOrc true
> spark.sql.orc.char.enabled true
> spark.sql.orc.enabled true
> spark.sql.orc.filterPushdown true
> spark.sql.orc.impl native
> spark.sql.orc.enableVectorizedReader true
>
> If I set these to false and use the old Hive reader (or specify the full
> classname for the old Hive reader in the spark-shell), I get correct
> results with both new and old ORC files.
>
> If I use Snappy compression, the new reader works without error.
>
> NOTE: I'm running on Hortonworks HDP 2.6.4 (Hadoop 2.7.3), and I also get
> the same error with its Spark 2.2, which I understand has many of the
> patches from the Spark 2.3 branch.
>
> Should this be reported in the JIRA system?
>
> Regards,
> Eirik
>
> --
> Eirik Thorsnes
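
For reference, here is a minimal sketch of the fallback Eirik describes (reading through the old Hive ORC reader instead of the native one) as it might look in the spark-shell. Assumptions not confirmed in this thread: a Spark 2.3 build with Hive support, `org.apache.spark.sql.hive.orc.OrcFileFormat` as the legacy reader's fully qualified class name, and the `/tmp/zlib_test` path reused from the example above.

```
scala> // Option 1: switch the whole session back to the Hive ORC implementation.
scala> sql("set spark.sql.orc.impl=hive")

scala> spark.read.orc("/tmp/zlib_test").show

scala> // Option 2: name the legacy FileFormat class explicitly for a single read,
scala> // leaving spark.sql.orc.impl=native in place for everything else.
scala> spark.read.format("org.apache.spark.sql.hive.orc.OrcFileFormat").load("/tmp/zlib_test").show
```

Passing the fully qualified class name to `format()` should bypass the short-name resolution that `spark.sql.orc.impl` controls, so Option 2 can be used per-read without touching the session configuration.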