Hi all,

I'm trying the new native ORC support in Spark 2.3 (org.apache.spark.sql.execution.datasources.orc).
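To illustrate, a minimal version of what I'm doing looks like this in the spark-shell (sketch only; the path and columns are simplified stand-ins, my real schema is the one shown in the log further down):

    // Write zlib-compressed ORC with the new native writer; this succeeds.
    val df = spark.range(0, 1000000)
      .selectExpr("cast(id as timestamp) as datetime", "cast(id % 100 as float) as lon")
    df.write.mode("overwrite").option("compression", "zlib").orc("hdfs:///tmp/orc_zlib_test")

    // Reading it back with the native reader is what throws the
    // BufferUnderflowException below:
    spark.read.orc("hdfs:///tmp/orc_zlib_test").show()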
I've compiled Spark 2.3 from the git branch-2.3 as of March 20th. I also see the same error with the Spark 2.2 that ships with Hortonworks HDP 2.6.4 (Hadoop 2.7.3), which I understand has many of the patches from the Spark 2.3 branch backported.

*NOTE*: the error only occurs with zlib compression. With Snappy I see an extra log line saying "OrcCodecPool: Got brand-new codec SNAPPY". Perhaps the zlib codec is never loaded/triggered in the new code?

I can write using the new native codepath without errors, but when *reading* zlib-compressed ORC, either the newly written ORC files *or* older ORC files written with Spark 2.2/1.6, I get the following exception.

======= cut =========
2018-03-23 10:36:08,249 INFO FileScanRDD: Reading File path: hdfs://.../year=1999/part-r-00000-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc, range: 0-134217728, partition values: [1999]
2018-03-23 10:36:08,326 INFO ReaderImpl: Reading ORC rows from hdfs://.../year=1999/part-r-00000-2573bfff-1f18-47d9-b0fb-37dc216b8a99.orc with {include: [true, true, true, true, true, true, true, true, true], offset: 0, length: 134217728}
2018-03-23 10:36:08,326 INFO RecordReaderImpl: Reader schema not provided -- using file schema struct<datetime:timestamp,lon:float,lat:float,u10:smallint,v10:smallint,lcc:smallint,mcc:smallint,hcc:smallint>
2018-03-23 10:36:08,824 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.nio.BufferUnderflowException
    at java.nio.Buffer.nextGetIndex(Buffer.java:500)
    at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:249)
    at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:248)
    at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:58)
    at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
    at org.apache.orc.impl.TreeReaderFactory$TimestampTreeReader.nextVector(TreeReaderFactory.java:976)
    at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:1815)
    at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1184)
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.scala:186)
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.scala:114)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
======= cut =========

I have the following set in spark-defaults.conf:

    spark.sql.hive.convertMetastoreOrc true
    spark.sql.orc.char.enabled true
    spark.sql.orc.enabled true
    spark.sql.orc.filterPushdown true
    spark.sql.orc.impl native
    spark.sql.orc.enableVectorizedReader true

If I set these to false and use the old Hive reader (or specify the full class name of the old Hive reader in the spark-shell, see the sketch below), I get correct results with both new and old ORC files.
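The spark-shell workaround looks roughly like this (sketch only; I believe the full class name of the old Hive-based reader is org.apache.spark.sql.hive.orc.OrcFileFormat, and the path is a placeholder):

    // Force the old Hive ORC reader by naming its class directly,
    // bypassing spark.sql.orc.impl:
    val df = spark.read
      .format("org.apache.spark.sql.hive.orc.OrcFileFormat")
      .load("hdfs:///path/to/orc")  // placeholder path

    // Or flip the relevant configs at runtime instead of in spark-defaults.conf:
    spark.conf.set("spark.sql.orc.impl", "hive")
    spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")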
If I use Snappy compression it works with the new reader without error.

Should this be reported in the JIRA system?

Regards,
Eirik

--
Eirik Thorsnes

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org