We are using Hive 0.14. Our input file size is around 100 GB uncompressed.
We are inserting this data into a Hive ORC-based table with ZLIB compression. While inserting, we also use the following two parameters:

SET hive.exec.reducers.max=10;
SET mapred.reduce.tasks=5;

The output ORC file produced is about 10 GB compressed.

Questions:
1. How can we control the number of output ORC files?
2. How can we control the size of the ORC files generated?
3. When we get very big files (around 10 GB ORC) and then query the table, we get an exception in Hive (query and exception below).
4. Will setting hive.exec.orc.default.block.size or hive.exec.orc.default.stripe.size to a lower value help control the output file size? Is there any limitation in ORC on file size?

We have the following Hive properties set in Ambari:

hive.merge.size.per.task 256000000
hive.merge.orcfile.stripe.level true
hive.merge.mapfiles true
hive.merge.mapredfiles true

Reading query:

Select * from table where partition="big_file_size"

Exception:

P-524264982-127.0.0.1-1429020129249:blk_1091744762_18097939): PathInfo{path=, state=UNUSABLE} is not usable for short circuit; giving up on BlockReaderLocal.
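On questions 1 and 2: as far as we understand, the number of output files roughly follows the number of reducers, so one option we are considering is letting Hive size reducers by data volume instead of capping them. A sketch of the settings involved (the property names are standard Hive properties; the values are illustrative guesses, not tested recommendations):

```sql
-- Let Hive derive the reducer count from data size instead of forcing 5 reducers
SET hive.exec.reducers.bytes.per.reducer=256000000;  -- aim for ~256 MB per reducer
SET hive.exec.reducers.max=999;                      -- do not cap reducers at 10

-- Smaller ORC stripes/blocks, relevant to question 4
SET hive.exec.orc.default.stripe.size=67108864;      -- 64 MB stripes
SET hive.exec.orc.default.block.size=134217728;      -- 128 MB HDFS block
```

Would this approach be the right way to get more, smaller ORC files, or do the merge properties above override it?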
15/10/21 11:30:02 [LeaseRenewer:d760770@tdcdv2]: DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with renew id 1 executed
15/10/21 11:30:04 [ORC_GET_SPLITS #1]: ERROR orc.OrcInputFormat: Unexpected Exception
java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.setIncludedColumns(OrcInputFormat.java:260)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:779)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Failed with exception java.io.IOException:java.lang.RuntimeException: serious problem
15/10/21 11:30:04 [main]: ERROR CliDriver: Failed with exception java.io.IOException:java.lang.RuntimeException: serious problem
java.io.IOException: java.lang.RuntimeException: serious problem
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:663)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:561)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1623)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:267)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:199)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:410)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:783)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.RuntimeException: serious problem
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$Context.waitForTasks(OrcInputFormat.java:478)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:949)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:974)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:442)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:588)
    ... 15 more
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.setIncludedColumns(OrcInputFormat.java:260)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.run(OrcInputFormat.java:779)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
15/10/21 11:30:04 [main]: INFO exec.TableScanOperator: 0 finished. closing...
15/10/21 11:30:04 [main]: DEBUG exec.TableScanOperator: Closing child = SEL[2]
15/10/21 11:30:04 [main]: DEBUG exec.SelectOperator: allInitializedParentsAreClosed? parent.state = CLOSE
15/10/21 11:30:04 [main]: INFO exec.SelectOperator: 2 finished. closing...
15/10/21 11:30:04 [main]: DEBUG exec.SelectOperator: Closing child = LIM[3]
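As a sanity check on the file-count question: if the number of output files tracks the reducer count, then a back-of-the-envelope calculation (using the 100 GB input / 10 GB compressed output figures above, and the 256 MB target matching hive.merge.size.per.task) suggests our caps of 10 and 5 reducers are far too low for 256 MB files:

```python
import math

# Figures taken from the question above
compressed_output_gb = 10      # observed ORC output, ZLIB compressed
target_file_mb = 256           # matches hive.merge.size.per.task (256000000 bytes)

# Reducers needed so each reducer writes roughly one target-sized file
compressed_output_mb = compressed_output_gb * 1024
reducers_needed = math.ceil(compressed_output_mb / target_file_mb)
print(reducers_needed)  # 40
```

With hive.exec.reducers.max=10 and mapred.reduce.tasks=5, we would instead expect a handful of multi-GB files, which is exactly what we see. Is this reasoning correct?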