Stephen Measmer created HIVE-16870:
--------------------------------------

             Summary: Give Hive the ability to suppress output of empty files
                 Key: HIVE-16870
                 URL: https://issues.apache.org/jira/browse/HIVE-16870
             Project: Hive
          Issue Type: Improvement
          Components: StorageHandler
            Reporter: Stephen Measmer


Today some Hive queries that use joins can write zero-byte files, particularly 
on large joins.  This can have a negative effect on HDFS, as it can lead to 
too many small files [1].

A Cloudera Community thread [2] suggests using LazyOutputFormat as the 
OutputFormat, because MapReduce can then be set to suppress the generation of 
empty (zero-byte) files.
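
To make the idea concrete, here is a minimal sketch of lazy output creation in 
plain Python. It is not Hadoop or Hive code: the names LazyFileWriter and 
run_task are hypothetical, and the only assumption carried over from the thread 
is that the output file is created on the first write rather than when the 
task starts.

```python
import os
import tempfile


class LazyFileWriter:
    """Hypothetical writer that creates its file only on the first write."""

    def __init__(self, path):
        self.path = path
        self._handle = None  # nothing is created on disk yet

    def write(self, record):
        # Open the file the first time a record actually arrives.
        if self._handle is None:
            self._handle = open(self.path, "w")
        self._handle.write(record + "\n")

    def close(self):
        # Closing a writer that never wrote leaves no file behind.
        if self._handle is not None:
            self._handle.close()
            self._handle = None


def run_task(path, records):
    """Hypothetical stand-in for one task writing its output partition.

    Returns True if an output file exists after the task finishes.
    """
    writer = LazyFileWriter(path)
    try:
        for record in records:
            writer.write(record)
    finally:
        writer.close()
    return os.path.exists(path)
```

A task that produces no records leaves no file behind, whereas an eager writer 
that opens its file up front would leave a zero-byte one.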

However, it is not possible to create a table in Hive whose OutputFormat is 
just LazyOutputFormat.  Below is what we found when testing:

create table mytable (fip int, state string, zip string, level int) STORED AS 
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 
'org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat';

------------
Error: Error while compiling statement: FAILED: SemanticException [Error 
10055]: Output Format must implement HiveOutputFormat, otherwise it should be 
either IgnoreKeyTextOutputFormat or SequenceFileOutputFormat 
(state=42000,code=10055)


[1] http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
[2] 
https://community.cloudera.com/t5/Batch-Processing-and-Workflow/how-to-suppress-mapper-output-files-if-the-output-file-does-not/td-p/29540



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
