We are using Hive 0.14.
Our input file is around 100 GB uncompressed.
We are inserting this data into a Hive table stored as ORC with ZLIB compression.
While inserting, we also set the following two parameters:
SET hive.exec.reducers.max=10;
SET mapred.reduce.tasks=5;
The output ORC file produced is about 10 GB compressed.
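For illustration, the insert described above might look roughly like the sketch below. The table and column names (staging_table, orc_table, part_col, col1, col2) are placeholders, not our real schema. The DISTRIBUTE BY clause is an assumption added only to show a job that has a reduce stage, since the reducer settings have no effect on a map-only insert.

```sql
-- Hypothetical names: staging_table, orc_table, part_col, col1, col2.
-- DISTRIBUTE BY is an assumption, used here to force a reduce stage.
SET hive.exec.reducers.max=10;
SET mapred.reduce.tasks=5;

INSERT OVERWRITE TABLE orc_table PARTITION (part_col='some_value')
SELECT col1, col2
FROM staging_table
DISTRIBUTE BY col1;
```

As far as we understand, with a reduce stage and mapred.reduce.tasks=5 each reducer writes its own file, so one would expect up to 5 output files for the partition before any merge step runs.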
Questions:
1. How can we control the number of output ORC files?
2. How can we control the size of each ORC file generated?
3. When we get very big files, such as a 10 GB ORC file, and then query
the table, we get an exception in Hive (query and exception below).
4. Will setting hive.exec.orc.default.block.size or
hive.exec.orc.default.stripe.size to a lower value help to control the
output file size?
5. Is there any limitation in ORC on file size?
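For reference, the two properties named in question 4 would be set per session as sketched below. The values (64 MB and 256 MB, expressed in bytes) are illustrative only, and orc_table is a placeholder table name. The orc.stripe.size table property is the per-table counterpart of the stripe-size setting.

```sql
-- Session-level settings; values are in bytes and purely illustrative.
-- 67108864 bytes = 64 MB, 268435456 bytes = 256 MB.
SET hive.exec.orc.default.stripe.size=67108864;
SET hive.exec.orc.default.block.size=268435456;

-- Per-table alternative for the stripe size (orc_table is a placeholder).
ALTER TABLE orc_table SET TBLPROPERTIES ('orc.stripe.size'='67108864');
```

Both settings describe the layout inside a file (the stripe size and the HDFS block size used for ORC files), so this sketch shows the syntax only.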
We have the following Hive properties set in Ambari:
hive.merge.size.per.task 256000000
hive.merge.orcfile.stripe.level true
hive.merge.mapfiles true
hive.merge.mapredfiles true
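Since values set in Ambari can be overridden per session, the effective merge settings can be checked from the Hive CLI as sketched below. hive.merge.smallfiles.avgsize is not in the list above; it is included as an assumption because, as far as we know, the merge step only runs when the average output file size is below that threshold.

```sql
-- In Hive, SET with a property name and no value prints the current value.
SET hive.merge.size.per.task;
SET hive.merge.smallfiles.avgsize;
SET hive.merge.orcfile.stripe.level;
SET hive.merge.mapfiles;
SET hive.merge.mapredfiles;
```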
Query used to read the table:
Select * from table where partition="big_file_size"
Exception:
P-524264982-127.0.0.1-1429020129249:blk_1091744762_18097939): PathInfo{path=, state=UNUSABLE} is not usable for short circuit; giving up on BlockReaderLocal.
15/10/21 11:30:02 [LeaseRenewer:d760770@tdcdv2]: DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with renew id 1 executed
15/10/21 11:30:04 [ORC_GET_SPLITS #1]: ERROR orc.OrcInputFormat: Unexpected Exception
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.setIncludedColumns(OrcInputFormat.java:260)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:779)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Failed with exception java.io.IOException:java.lang.RuntimeException: serious problem
15/10/21 11:30:04 [main]: ERROR CliDriver: Failed with exception java.io.IOException:java.lang.RuntimeException: serious problem
java.io.IOException: java.lang.RuntimeException: serious problem
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:663)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:561)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1623)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:267)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:199)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:410)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:783)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.RuntimeException: serious problem
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$Context.waitForTasks(OrcInputFormat.java:478)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:949)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:974)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:442)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:588)
... 15 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.setIncludedColumns(OrcInputFormat.java:260)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:779)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/10/21 11:30:04 [main]: INFO exec.TableScanOperator: 0 finished. closing...
15/10/21 11:30:04 [main]: DEBUG exec.TableScanOperator: Closing child = SEL[2]
15/10/21 11:30:04 [main]: DEBUG exec.SelectOperator: allInitializedParentsAreClosed? parent.state = CLOSE
15/10/21 11:30:04 [main]: INFO exec.SelectOperator: 2 finished. closing...
15/10/21 11:30:04 [main]: DEBUG exec.SelectOperator: Closing child = LIM[3]