Re: ORC file split calculation problems

Prasanth Jayachandran Thu, 25 Feb 2016 13:16:16 -0800

Hi Patrick

Can you paste the entire stack trace? It looks like an NPE happened during split 
generation, but the stack trace is too incomplete to tell what caused it.


In Hive 0.14.0, the stripe size was changed to 64MB. The default block size for 
ORC files is 256MB, so 4 stripes fit in a block. ORC pads stripes to keep them 
from straddling HDFS blocks. During split calculation, the ORC footer, which 
contains stripe-level column statistics, is read to perform split pruning based 
on the predicate condition specified via a SARG (Search Argument). 

For example, assume the column 'state' is sorted and the predicate condition is 
state = "CA":
Stripe 1: min = AZ max = FL
Stripe 2: min = GA max = MN
Stripe 3: min = MS max = SC
Stripe 4: min = SD max = WY

In this case, only stripe 1 satisfies the above predicate condition, so only 1 
split, containing stripe 1, will be created.
So if there is a huge number of small files, the footers of all files have to 
be read to do split pruning. If there are a few large files, only a few 
footers have to be read. Also, the minimum splittable position is a stripe 
boundary. So having fewer, larger files has the advantage of reading less data 
during split pruning. 
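
To make the pruning in the example above concrete, here is a minimal sketch in 
Java. It is only an illustration, not the actual OrcInputFormat code: 
StripePruningSketch, StripeStats and stripesForEquals are hypothetical names, 
and the real reader evaluates the SARG against the column statistics stored in 
the ORC footer rather than against a hand-built list.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only: these are NOT the real ORC reader classes.
final class StripePruningSketch {

    // Hypothetical holder for one column's min/max statistics in one stripe.
    static final class StripeStats {
        final int stripeId;
        final String min;
        final String max;

        StripeStats(int stripeId, String min, String max) {
            this.stripeId = stripeId;
            this.min = min;
            this.max = max;
        }
    }

    // Returns the ids of the stripes whose [min, max] range may contain the value.
    // Stripes outside the range cannot match an equality predicate and are pruned.
    static List<Integer> stripesForEquals(List<StripeStats> stripes, String value) {
        List<Integer> selected = new ArrayList<>();
        for (StripeStats s : stripes) {
            boolean mayContain = s.min.compareTo(value) <= 0
                    && value.compareTo(s.max) <= 0;
            if (mayContain) {
                selected.add(s.stripeId);
            }
        }
        return selected;
    }

    // The four stripes from the example in this email.
    static List<StripeStats> exampleStripes() {
        List<StripeStats> stripes = new ArrayList<>();
        stripes.add(new StripeStats(1, "AZ", "FL"));
        stripes.add(new StripeStats(2, "GA", "MN"));
        stripes.add(new StripeStats(3, "MS", "SC"));
        stripes.add(new StripeStats(4, "SD", "WY"));
        return stripes;
    }

    public static void main(String[] args) {
        // Only stripe 1 can contain "CA", so only one split is needed.
        System.out.println(stripesForEquals(exampleStripes(), "CA")); // prints [1]
    }
}
```

Note that the statistics only say a stripe may contain the value; the rows in 
a selected stripe still have to be read and filtered.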

If you can send me the full stack trace, I can tell what is causing the 
exception here. I will also let you know of any workaround or the next Hive 
version with the fix.

In more recent Hive versions (Hive 1.2.0 onwards), OrcInputFormat has 
strategies that decide automatically when to read footers and when not to. 
You can configure the strategy that you want based on the workload. With many 
small files, footers will not be read; with large files, footers will be read 
for split pruning.
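
A rough sketch of that decision in Java is below. It is a simplified model 
under stated assumptions, not the Hive implementation: the class, the enum 
values, and the 256MB cut-off on average file size are all hypothetical and 
chosen only to mirror the "small files vs. large files" rule described above. 
The setting that selects the strategy is believed to be 
hive.exec.orc.split.strategy (values HYBRID, BI, ETL); verify the name and the 
defaults against the configuration documentation for your Hive version.

```java
// Simplified model of choosing a split strategy; NOT the Hive implementation.
final class SplitStrategySketch {

    enum Strategy {
        SKIP_FOOTERS, // many small files: generate splits without reading footers
        READ_FOOTERS  // large files: read footers so stripes can be pruned
    }

    // Hypothetical cut-off, set to the 256MB ORC block size mentioned above.
    static final long SMALL_FILE_BYTES = 256L * 1024 * 1024;

    // Picks a strategy from the average size of the input files.
    static Strategy choose(long[] fileSizes) {
        if (fileSizes.length == 0) {
            return Strategy.SKIP_FOOTERS; // nothing to prune
        }
        long total = 0;
        for (long size : fileSizes) {
            total += size;
        }
        long average = total / fileSizes.length;
        return average > SMALL_FILE_BYTES
                ? Strategy.READ_FOOTERS
                : Strategy.SKIP_FOOTERS;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        // Files of about 1GB each, as in the original question.
        System.out.println(choose(new long[] {oneGb, oneGb, oneGb}));
    }
}
```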

Thanks
Prasanth

> On Feb 25, 2016, at 7:08 AM, Patrick Duin <patd...@gmail.com> wrote:
> 
> Hi,
>  
> We've recently moved one of our datasets to ORC and we use Cascading and Hive 
> to read this data. We've had problems reading the data via Cascading, because 
> of the generation of splits. 
> We read in a large number of files (thousands) and they are about 1GB each. 
> We found that the split calculation took minutes on our cluster and often 
> didn't succeed at all (when our namenode was busy). 
> When digging through the code of the 
> 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.class' we figured out that 
> if we make the files smaller than the ORC block size (256MB) the code would 
> avoid lots of namenode calls. We applied this solution and made our files 
> smaller and that solved the problem. Split calculation in our job went from 
> 10+ mins to a couple of seconds and always succeeds. 
> We feel it is counterintuitive as bigger files are usually better in HDFS. 
> We've also seen that doing a hive query on the data does not present this 
> problem. Internally Hive seems to take a completely different execution path 
> and is not using the OrcInputFormat but uses 
> 'org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.class'. 
> 
> Can someone explain the reason for this difference or shed some light on the 
> behaviour we are seeing? Any help will be greatly appreciated. We are using 
> hive-0.14.0.
> 
> Kind regards,
>  Patrick
> 
> Here is the stack-trace that we would see when our Cascading job failed to 
> calculate the splits:
> Caused by: java.lang.RuntimeException: serious problem
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$Context.waitForTasks(OrcInputFormat.java:478)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:949)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:974)
>         at com.hotels.corc.mapred.CorcInputFormat.getSplits(CorcInputFormat.java:201)
>         at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:200)
>         at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:142)
>         at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:624)
>         at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:616)
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:585)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:580)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:580)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:571)
>         at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:106)
>         at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:265)
>         at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:184)
>         at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:146)
>         at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:48)
>         ... 4 more
> Caused by: java.lang.NullPointerException
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:809)
