[ https://issues.apache.org/jira/browse/HIVE-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803054#action_12803054 ]
Namit Jain commented on HIVE-1071:
----------------------------------

If the table happens to be bucketed, sampling queries may not work after concatenation. The offsets need to be stored (in the metastore) for the buckets, and the offsets should be used to calculate the splits.

> Making RCFile "concatenatable" to reduce the number of files of the output
> --------------------------------------------------------------------------
>
>                 Key: HIVE-1071
>                 URL: https://issues.apache.org/jira/browse/HIVE-1071
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Zheng Shao
>
> Hive automatically determines the number of reducers most of the time.
> Sometimes, we create a lot of small files.
> Hive has an option to "merge" those small files through a map-reduce job.
> Dhruba has an idea that can fix this even faster:
> if we can make RCFile concatenatable, then we can simply tell the namenode to
> "merge" these files.
> Pros: This approach does not do any I/O, so it's faster.
> Cons: We have to zero-fill the files to make sure they can be concatenated
> (all blocks except the last have to be full HDFS blocks).
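
To make the zero-fill and bucket-offset points concrete, here is a minimal sketch of the arithmetic only. It is not Hive or HDFS code: the function names are hypothetical, and it assumes one bucket per input file and that every file except the last is padded up to the next HDFS block boundary before concatenation.

```python
def padding_to_block(file_len, block_size):
    """Zero bytes needed to round file_len up to a multiple of block_size."""
    return (-file_len) % block_size


def concat_offsets(file_lens, block_size):
    """Return (start_offset, data_length) for each file inside the
    concatenated file.

    Assumption: every file except the last is zero-filled to a block
    boundary, so each file starts on a block boundary.
    """
    offsets = []
    pos = 0
    for i, n in enumerate(file_lens):
        offsets.append((pos, n))
        if i < len(file_lens) - 1:
            # Skip over the real data plus its zero-fill padding.
            pos += n + padding_to_block(n, block_size)
    return offsets


# Toy example with a 64-byte "block" and three bucket files.
layout = concat_offsets([100, 128, 30], block_size=64)
```

Under these assumptions, the (offset, length) pairs are the kind of per-bucket information that would have to be recorded, for example in the metastore, so that split calculation can still locate a single bucket after the files are merged.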