Hi,

We recently moved one of our datasets to ORC, and we use both Cascading and Hive to read this data. We have had problems reading the data via Cascading because of the way splits are generated. We read in a large number of files (thousands) of about 1 GB each, and we found that the split calculation took minutes on our cluster and often did not succeed at all (when our namenode was busy). Digging through the code of 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat', we figured out that if we make the files smaller than the ORC block size (256 MB), the code avoids a lot of namenode calls. We applied this workaround and made our files smaller, which solved the problem: split calculation in our job went from 10+ minutes to a couple of seconds and now always succeeds.

This feels counterintuitive to us, as bigger files are usually better in HDFS. We have also noticed that running a Hive query over the same data does not exhibit this problem. Internally, Hive seems to take a completely different execution path: it does not use OrcInputFormat directly but 'org.apache.hadoop.hive.ql.io.CombineHiveInputFormat' instead.
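For anyone who wants to reproduce the comparison between the two paths, something like the following Hive session should switch input formats (hive.input.format is the standard Hive property, and CombineHiveInputFormat is its default; whether HiveInputFormat shows the same slow split generation as our Cascading job is exactly what we are unsure about):

-- Default in Hive 0.14: combines many files per split
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Plain per-file input format, for comparing split generation behaviour
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;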
Can someone explain the reason for this difference or shed some light on the behaviour we are seeing? Any help will be greatly appreciated. We are using hive-0.14.0.

Kind regards,
Patrick

Here is the stack trace that we would see when our Cascading job failed to calculate the splits:

Caused by: java.lang.RuntimeException: serious problem
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$Context.waitForTasks(OrcInputFormat.java:478)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:949)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:974)
	at com.hotels.corc.mapred.CorcInputFormat.getSplits(CorcInputFormat.java:201)
	at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:200)
	at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:142)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:624)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:616)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:585)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:580)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:580)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:571)
	at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:106)
	at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:265)
	at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:184)
	at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:146)
	at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:48)
	... 4 more
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:809)