Hello everyone,

I'm hitting the exception below when reading an ORC file with the default HiveContext, even after setting hive.exec.orc.default.buffer.size to 1517137. See below for details.

Is there another relevant buffer parameter, or another place where I should set it?

Any other ideas about what might be going wrong?

15/12/09 03:30:31 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, prtlap09): java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 1317137
    at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:193)
    at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:238)
    at java.io.InputStream.read(InputStream.java:101)
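The only other place I could think of to set it myself is directly on the SQLContext before reading; roughly something like this (just a sketch, untested, and I don't know whether it propagates any further than the -D option does):

// hypothetical alternative: set the ORC buffer size on the HiveContext itself before reading
sqlContext.setConf("hive.exec.orc.default.buffer.size", "1517137")
val df = sqlContext.read.orc("hdfs://myhdfsmaster/some_orc_file")
df.first()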


Details:

Starting spark-1.5.2-bin-hadoop2.6 without any hive-site.xml or default.xml in conf/:

spark-shell --master mesos://mymaster:5050 --driver-memory 10G \
  --driver-java-options="-Dspark.executor.memory=20g -Dhive.exec.orc.default.buffer.size=1517137"
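One thing I'm not sure about: the failing task runs on an executor (prtlap09), not on the driver, so the system property set via --driver-java-options might never reach the executor JVMs at all. A quick check I had in mind (sketch, not yet tried):

// diagnostic sketch: is the system property visible inside a task running on an executor?
sc.parallelize(1 to 1, 1).map(_ => sys.props.get("hive.exec.orc.default.buffer.size")).collect()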


The shell seems to accept the config parameter:

scala> sqlContext
res0: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@61d16653
scala> sqlContext.getAllConfs.get("hive.exec.orc.default.buffer.size")
res1: Option[String] = Some(1517137)
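I'm not sure whether this SQL conf is what the ORC input format actually consults on the executors, though. Something else I could check (sketch) is whether the value also shows up in the Hadoop Configuration that Spark passes to the input formats:

// diagnostic sketch: does the setting reach the Hadoop Configuration used for reads?
sc.hadoopConfiguration.get("hive.exec.orc.default.buffer.size")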


When reading an ORC file, some buffer turns out to be too small. The buffer above was the obvious candidate, since its default is 262144.

scala> val df = sqlContext.read.orc("hdfs://myhdfsmaster/some_orc_file")
15/12/09 03:30:18 INFO orc.OrcRelation: Listing hdfs://myhdfsmaster/some_orc_file on driver
df: org.apache.spark.sql.DataFrame = [_col0: string, _col1: int, _col2: timestamp]

scala> df.first()
..
15/12/09 03:30:26 INFO log.PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
..
15/12/09 03:30:26 INFO orc.OrcInputFormat: FooterCacheHitRatio: 0/1
..
15/12/09 03:30:26 INFO log.PerfLogger: </PERFLOG method=OrcGetSplits start=1449660626845 end=1449660626950 duration=105 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/12/09 03:30:31 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, prtlap09): java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 1317137
    at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:193)
    at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:238)
    at java.io.InputStream.read(InputStream.java:101)
    at org.spark-project.hive.shaded.com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
    at org.spark-project.hive.shaded.com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
    at org.spark-project.hive.shaded.com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter.<init>(OrcProto.java:10661)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter.<init>(OrcProto.java:10625)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10730)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10725)
    at org.spark-project.hive.shaded.com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
    at org.spark-project.hive.shaded.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
    at org.spark-project.hive.shaded.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
    at org.spark-project.hive.shaded.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
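Given that the error says "needed = 1317137", my guess is that the file was written with a compression buffer larger than the reader's 262144 default. If it helps with diagnosing, I could also inspect the file footer directly with the Hive ORC reader classes that ship with Spark; roughly like this (sketch from memory against the hive-exec 1.2 API, pointed at one of the part files under the path, so the file name below is made up):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.io.orc.OrcFile

// sketch: read the ORC footer and print the compression kind and buffer size the file was written with
val reader = OrcFile.createReader(
  new Path("hdfs://myhdfsmaster/some_orc_file/part-00000"),  // hypothetical part file name
  OrcFile.readerOptions(sc.hadoopConfiguration))
println(s"${reader.getCompression} / ${reader.getCompressionSize}")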

Any hint much appreciated!

Best regards,
Fabian


