Hello everyone,
I'm hitting the exception below when reading an ORC file with the default
HiveContext, after setting hive.exec.orc.default.buffer.size to 1517137.
See below for details.
Is there another buffer parameter that is relevant, or another place where
I could set it?
Any other ideas on what is going wrong?
15/12/09 03:30:31 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, prtlap09): java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 1317137
        at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:193)
        at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:238)
        at java.io.InputStream.read(InputStream.java:101)
Details:
Starting spark-1.5.2-bin-hadoop2.6 without any hive-site.xml or
default.xml in /conf:
spark-shell --master mesos://mymaster:5050 --driver-memory 10G
--driver-java-options="-Dspark.executor.memory=20g
-Dhive.exec.orc.default.buffer.size=1517137"
The shell seems to accept the config parameter:
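For reference, another way I could try to pass the same value is via --conf with the spark.hadoop. prefix, which Spark copies into the Hadoop Configuration. This is an untested sketch; I'm not sure whether it actually forwards this Hive setting to the executors:

```shell
# Hypothetical alternative invocation: pass the ORC buffer size through
# Spark's Hadoop-conf passthrough instead of a driver JVM system property.
spark-shell --master mesos://mymaster:5050 --driver-memory 10G \
  --conf spark.executor.memory=20g \
  --conf spark.hadoop.hive.exec.orc.default.buffer.size=1517137
```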
scala> sqlContext
res0: org.apache.spark.sql.SQLContext =
org.apache.spark.sql.hive.HiveContext@61d16653
scala> sqlContext.getAllConfs.get("hive.exec.orc.default.buffer.size")
res1: Option[String] = Some(1517137)
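Another place I could try setting it is a minimal hive-site.xml in Spark's conf/ directory (which I deliberately don't have yet). A sketch, assuming Spark's HiveContext picks this file up for executors as well:

```shell
# Hypothetical: create a minimal conf/hive-site.xml carrying only the
# ORC buffer-size property, so it is read from the classpath everywhere.
mkdir -p conf
cat > conf/hive-site.xml <<'EOF'
<configuration>
  <property>
    <name>hive.exec.orc.default.buffer.size</name>
    <value>1517137</value>
  </property>
</configuration>
EOF
```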
When reading an ORC file I hit some buffer being too small. The buffer
size above was the obvious candidate, since its default is also 262144.
scala> val df = sqlContext.read.orc("hdfs://myhdfsmaster/some_orc_file")
15/12/09 03:30:18 INFO orc.OrcRelation: Listing
hdfs://myhdfsmaster/some_orc_file on driver
df: org.apache.spark.sql.DataFrame = [_col0: string, _col1: int, _col2:
timestamp]
scala> df.first()
..
15/12/09 03:30:26 INFO log.PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
..
15/12/09 03:30:26 INFO orc.OrcInputFormat: FooterCacheHitRatio: 0/1
..
15/12/09 03:30:26 INFO log.PerfLogger: </PERFLOG method=OrcGetSplits start=1449660626845 end=1449660626950 duration=105 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/12/09 03:30:31 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, prtlap09): java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 1317137
        at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:193)
        at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:238)
        at java.io.InputStream.read(InputStream.java:101)
        at org.spark-project.hive.shaded.com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
        at org.spark-project.hive.shaded.com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
        at org.spark-project.hive.shaded.com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter.<init>(OrcProto.java:10661)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter.<init>(OrcProto.java:10625)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10730)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10725)
        at org.spark-project.hive.shaded.com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
        at org.spark-project.hive.shaded.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
        at org.spark-project.hive.shaded.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
        at org.spark-project.hive.shaded.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
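In case it helps diagnose: dumping the file's metadata should show the compression buffer size it was written with. A sketch, assuming Hive's ORC file dump utility is available on the box (the path is the same placeholder as above):

```shell
# Inspect the ORC file metadata; the "Compression size" line should show
# the buffer size the writer used, which I'd expect to exceed 262144 here.
hive --orcfiledump hdfs://myhdfsmaster/some_orc_file
```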
Any hint much appreciated!
Best regards,
Fabian