Hi,

We've recently moved one of our datasets to ORC, and we use Cascading and
Hive to read this data. We've had problems reading the data via Cascading
because of how the input splits are generated.
We read in a large number of files (thousands), each about 1GB in size. We
found that the split calculation took minutes on our cluster and often
didn't succeed at all (when our namenode was busy).
While digging through the code of
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' we figured out that if
we keep the files smaller than the ORC block size (256MB), the split
generation avoids a lot of namenode calls. We applied this workaround and
made our files smaller, which solved the problem: split calculation in our
job went from 10+ minutes to a couple of seconds and now always succeeds.
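For reference, here is a minimal sketch of the workaround, writing with the
Hive ORC writer API directly; 'MyRow', the example rows, and the row limit
are illustrative stand-ins for our actual Cascading sink and data volumes:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.orc.OrcFile;
    import org.apache.hadoop.hive.ql.io.orc.Writer;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.ObjectInspectorOptions;

    public class SmallOrcFiles {
      // Roll to a new file well before it can reach the 256MB ORC block
      // size; the limit here is a guess and must be tuned to the row width.
      private static final long MAX_ROWS_PER_FILE = 1_000_000L;

      public static class MyRow { // illustrative record type
        public String id;
        public long value;
        public MyRow(String id, long value) { this.id = id; this.value = value; }
      }

      private static List<MyRow> source() { // stand-in for the real record stream
        return Arrays.asList(new MyRow("a", 1), new MyRow("b", 2));
      }

      private static Writer newWriter(Configuration conf, ObjectInspector oi,
          Path dir, int part) throws java.io.IOException {
        return OrcFile.createWriter(new Path(dir, String.format("part-%05d.orc", part)),
            OrcFile.writerOptions(conf).inspector(oi));
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path dir = new Path(args[0]);
        ObjectInspector oi = ObjectInspectorFactory.getReflectionObjectInspector(
            MyRow.class, ObjectInspectorOptions.JAVA);
        int part = 0;
        long rows = 0;
        Writer writer = newWriter(conf, oi, dir, part++);
        for (MyRow row : source()) {
          writer.addRow(row);
          if (++rows >= MAX_ROWS_PER_FILE) { // keep each file under the block size
            writer.close();
            writer = newWriter(conf, oi, dir, part++);
            rows = 0;
          }
        }
        writer.close();
      }
    }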
This feels counterintuitive to us, as bigger files are usually better in HDFS.
We've also seen that running a Hive query on the same data does not
exhibit this problem. Internally, Hive seems to take a completely
different execution path: it does not use OrcInputFormat directly, but
'org.apache.hadoop.hive.ql.io.CombineHiveInputFormat' instead.
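If we read HiveConf correctly, this is Hive's default value for
'hive.input.format'; it can be inspected from the Hive CLI (the output
below is what we see on hive-0.14.0):

    hive> SET hive.input.format;
    hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat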

Can someone explain the reason for this difference, or shed some light on
the behaviour we are seeing? Any help would be greatly appreciated. We are
using hive-0.14.0.

Kind regards,
 Patrick

Here is the stack trace we would see when our Cascading job failed to
calculate the splits:
Caused by: java.lang.RuntimeException: serious problem
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$Context.waitForTasks(OrcInputFormat.java:478)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:949)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:974)
        at com.hotels.corc.mapred.CorcInputFormat.getSplits(CorcInputFormat.java:201)
        at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:200)
        at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:142)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:624)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:616)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:585)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:580)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:580)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:571)
        at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:106)
        at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:265)
        at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:184)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:146)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:48)
        ... 4 more
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:809)
