> On Feb 25, 2016, at 3:15 PM, Prasanth Jayachandran <pjayachand...@hortonworks.com> wrote:
>
> Hi Patrick,
>
> Can you paste the entire stack trace? It looks like the NPE happened during split generation, but the stack trace is incomplete, so I can't tell what caused it.
>
> In Hive 0.14.0, the stripe size was changed to 64MB. The default block size for ORC files is 256MB, so four stripes can fit in a block. ORC pads stripes to avoid stripes straddling HDFS blocks. During split calculation, the ORC footer, which contains stripe-level column statistics, is read to perform split pruning based on the predicate condition specified via a SARG (Search Argument).
>
> For example, assume column 'state' is sorted and the predicate condition is state = "CA":
> Stripe 1: min = AZ, max = FL
> Stripe 2: min = GA, max = MN
> Stripe 3: min = MS, max = SC
> Stripe 4: min = SD, max = WY
>
> In this case, only stripe 1 satisfies the predicate condition, so only one split, containing stripe 1, will be created.
>
> If there is a huge number of small files, footers from all of the files have to be read to do split pruning. If there are a few large files, only a few footers have to be read. Also, the finest split granularity is a stripe boundary, so having fewer large files has the additional advantage of reading less data during split pruning.
>
> If you can send me the full stack trace, I can tell what is causing the exception here. I will also let you know of any workaround, or the next Hive version with the fix.
>
> In more recent Hive versions (1.2.0 onwards), OrcInputFormat has strategies for deciding automatically when to read footers and when not to. You can configure the strategy you want based on the workload. With many small files, footers will not be read; with large files, footers will be read for split pruning.
> The default strategy chooses automatically when to read footers and when not to. It is configurable as well.
>
> Thanks,
> Prasanth
>
>> On Feb 25, 2016, at 7:08 AM, Patrick Duin <patd...@gmail.com> wrote:
>>
>> Hi,
>>
>> We've recently moved one of our datasets to ORC, and we use Cascading and Hive to read this data. We've had problems reading the data via Cascading because of the generation of splits.
>> We read in a large number of files (thousands), each about 1GB. We found that the split calculation took minutes on our cluster and often didn't succeed at all (when our namenode was busy).
>> While digging through the code of 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.class', we figured out that if we make the files smaller than the ORC block size (256MB), the code avoids lots of namenode calls. We applied this solution, made our files smaller, and that solved the problem. Split calculation in our job went from 10+ minutes to a couple of seconds and now always succeeds.
>> This feels counterintuitive to us, as bigger files are usually better in HDFS.
>> We've also seen that running a Hive query on the data does not present this problem. Internally, Hive seems to take a completely different execution path: it does not use the OrcInputFormat but instead 'org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.class'.
>>
>> Can someone explain the reason for this difference or shed some light on the behaviour we are seeing? Any help will be greatly appreciated. We are using hive-0.14.0.
>>
>> Kind regards,
>> Patrick
>>
>> Here is the stack trace that we would see when our Cascading job failed to calculate the splits:
>>
>> Caused by: java.lang.RuntimeException: serious problem
>>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$Context.waitForTasks(OrcInputFormat.java:478)
>>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:949)
>>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:974)
>>     at com.hotels.corc.mapred.CorcInputFormat.getSplits(CorcInputFormat.java:201)
>>     at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:200)
>>     at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:142)
>>     at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:624)
>>     at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:616)
>>     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
>>     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
>>     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:415)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
>>     at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:585)
>>     at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:580)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:415)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:580)
>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:571)
>>     at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:106)
>>     at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:265)
>>     at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:184)
>>     at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:146)
>>     at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:48)
>>     ... 4 more
>> Caused by: java.lang.NullPointerException
>>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:809)
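The min/max stripe pruning Prasanth walks through can be sketched as follows. This is a simplified illustration, not the actual OrcInputFormat implementation; the class, `StripeStats`, and `mayContain` are invented for the example. The idea is that a stripe can only contain rows matching state = "CA" if "CA" falls within the stripe's [min, max] statistics for that column.

```java
// Sketch of SARG-style stripe pruning against stripe-level min/max
// column statistics (hypothetical names, not Hive's real classes).
import java.util.ArrayList;
import java.util.List;

public class StripePruningSketch {

    // Hypothetical holder for one stripe's column statistics.
    static class StripeStats {
        final String min;
        final String max;
        StripeStats(String min, String max) {
            this.min = min;
            this.max = max;
        }
    }

    // A stripe may contain rows matching `state = value` only when
    // min <= value <= max (lexicographic comparison).
    static boolean mayContain(StripeStats stats, String value) {
        return stats.min.compareTo(value) <= 0 && stats.max.compareTo(value) >= 0;
    }

    public static void main(String[] args) {
        // The four stripes from the example in the thread.
        List<StripeStats> stripes = List.of(
                new StripeStats("AZ", "FL"),  // stripe 1
                new StripeStats("GA", "MN"),  // stripe 2
                new StripeStats("MS", "SC"),  // stripe 3
                new StripeStats("SD", "WY")); // stripe 4

        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < stripes.size(); i++) {
            if (mayContain(stripes.get(i), "CA")) {
                selected.add(i + 1); // 1-based stripe numbers
            }
        }
        System.out.println(selected); // prints [1]: only stripe 1 survives pruning
    }
}
```

Reading the footer statistics is what makes this pruning possible, which is why footer reads pay off for a few large files but become expensive across thousands of small ones.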
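The configurable strategy Prasanth mentions for Hive 1.2.0 onwards is, to the best of my knowledge, exposed via the hive.exec.orc.split.strategy property; the thread itself does not name the property, so verify it against your Hive version:

```sql
-- BI: do not read ORC footers; generate splits from file metadata only
-- (cheap split calculation, suited to many small files).
SET hive.exec.orc.split.strategy=BI;

-- ETL: read footers and apply SARG-based stripe pruning
-- (suited to a few large files).
SET hive.exec.orc.split.strategy=ETL;

-- HYBRID (the default): pick one of the above automatically, based on
-- the number and size of the files in each directory.
SET hive.exec.orc.split.strategy=HYBRID;
```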