[jira] [Commented] (HIVE-1950) Block merge for RCFile
[ https://issues.apache.org/jira/browse/HIVE-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174764#comment-14174764 ]

Lefty Leverenz commented on HIVE-1950:
--------------------------------------

Doc note: [~prasanth_j] documented this in the wiki here:

* [DDL -- Alter Table/Partition Concatenate | https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate]

Block merge for RCFile
----------------------

        Key: HIVE-1950
        URL: https://issues.apache.org/jira/browse/HIVE-1950
    Project: Hive
 Issue Type: New Feature
   Reporter: He Yongqiang
   Assignee: He Yongqiang
    Fix For: 0.8.0
Attachments: HIVE-1950.1.patch, HIVE-1950.2.patch, HIVE-1950.3.patch, HIVE-1950.4.patch, HIVE-1950.5.patch, HIVE-1950.6.patch

In our environment, there are a lot of small files inside one partition/table. To reduce the namenode load, we have one dedicated housekeeping job that merges these files. Right now the merge is an 'insert overwrite' in Hive, which requires decompressing and recompressing the data.

This jira adds a command in Hive that does the merge without decompressing and recompressing the data, something like:

    alter table tbl_name [partition ()] merge files

In this jira the new command will only support RCFile, since it needs some new APIs in the file format.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
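The difference between a block merge and an 'insert overwrite' merge, as described in the issue, can be illustrated with a toy model. This is only a sketch under stated assumptions: a "file" here is simply a list of independently compressed blocks, and `BlockMergeSketch`, `compressBlock`, `decompressBlock`, `mergeBlocks`, and `readAll` are hypothetical names, not the RCFile on-disk format or any Hive API. The point is that merging appends the already compressed blocks unchanged, so no payload is inflated or deflated along the way.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Toy model only: a "file" is a list of independently compressed blocks.
// This is NOT the RCFile format and none of these names exist in Hive.
final class BlockMergeSketch {

    /** Compresses one chunk of rows into a self-contained block. */
    static byte[] compressBlock(String rows) {
        Deflater deflater = new Deflater();
        deflater.setInput(rows.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inflates one block back into its rows. */
    static String decompressBlock(byte[] block) {
        Inflater inflater = new Inflater();
        inflater.setInput(block);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated block: stop instead of spinning
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Block merge: append every small file's compressed blocks as-is. */
    static List<byte[]> mergeBlocks(List<List<byte[]>> smallFiles) {
        List<byte[]> merged = new ArrayList<>();
        for (List<byte[]> file : smallFiles) {
            merged.addAll(file); // the payload is never inflated or deflated
        }
        return merged;
    }

    /** Reads a whole file by inflating each block in order. */
    static String readAll(List<byte[]> file) {
        StringBuilder rows = new StringBuilder();
        for (byte[] block : file) {
            rows.append(decompressBlock(block));
        }
        return rows.toString();
    }

    public static void main(String[] args) {
        List<byte[]> fileA = Collections.singletonList(compressBlock("r1\nr2\n"));
        List<byte[]> fileB = Collections.singletonList(compressBlock("r3\n"));
        List<byte[]> merged = mergeBlocks(Arrays.asList(fileA, fileB));
        System.out.println("blocks in merged file: " + merged.size());
        System.out.print(readAll(merged));
    }
}
```

An 'insert overwrite' style merge would instead call `decompressBlock` on every input block and `compressBlock` on the combined rows, which is the CPU cost the issue wants to avoid.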
[jira] Commented: (HIVE-1950) Block merge for RCFile
[ https://issues.apache.org/jira/browse/HIVE-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12998583#comment-12998583 ]

He Yongqiang commented on HIVE-1950:
------------------------------------

It's a typo, and I fixed it in the new patch. HIVE_STATS_ATOMIC is an existing conf for stats.

Attachments: HIVE-1950.1.patch, HIVE-1950.2.patch, HIVE-1950.3.patch, HIVE-1950.4.patch, HIVE-1950.5.patch, HIVE-1950.6.patch

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (HIVE-1950) Block merge for RCFile
[ https://issues.apache.org/jira/browse/HIVE-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994610#comment-12994610 ]

Ning Zhang commented on HIVE-1950:
----------------------------------

Yongqiang, I'm still reviewing the new patch (.4), but I found that some of my comments have not been addressed (e.g., QTestUtil). Can you list which comments have been addressed and which have not (and the reasons)?

Attachments: HIVE-1950.1.patch, HIVE-1950.2.patch, HIVE-1950.3.patch, HIVE-1950.4.patch
[jira] Commented: (HIVE-1950) Block merge for RCFile
[ https://issues.apache.org/jira/browse/HIVE-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993806#comment-12993806 ]

Ning Zhang commented on HIVE-1950:
----------------------------------

Yongqiang, does the review board have the latest patch?

Attachments: HIVE-1950.1.patch, HIVE-1950.2.patch, HIVE-1950.3.patch
[jira] Commented: (HIVE-1950) Block merge for RCFile
[ https://issues.apache.org/jira/browse/HIVE-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993852#comment-12993852 ]

Ning Zhang commented on HIVE-1950:
----------------------------------

Yongqiang, the patch doesn't compile. Below are some initial review comments from me:

QTestUtil.java: 334: you may want to add the index tables that you want to keep to srcTables. Otherwise indexes that are created inside a test will not be cleaned up, which is a side effect.

StatsTask: a StatsTask is added in DDLSemanticAnalyzer for the merge task, but why is it set to do nothing?

ExecDriver: jobExecHelper is constructed in both the constructors and initialize(). Is there a reason?

checkFatalError: why was some code removed? Why remove METASTOREPWD?

DDLTask: move the semantic checks (index checking, archive checking, etc.) to DDLSemanticAnalyzer. Execution should only raise an exception for genuine runtime errors. In other words, the explain plan of the query should throw an exception if the table has indexes or is archived.

Attachments: HIVE-1950.1.patch, HIVE-1950.2.patch, HIVE-1950.3.patch
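The DDLTask comment above asks for index and archive checks to run during semantic analysis rather than at execution time, so that an explain of the statement already fails. A minimal sketch of that idea follows. Everything in it is hypothetical: `ConcatenateCheckSketch`, `TableInfo`, `SemanticError`, `analyze`, `explain`, and `execute` are illustrative stand-ins, not the real DDLSemanticAnalyzer or DDLTask code.

```java
// Hypothetical stand-ins for illustration; none of these are Hive classes.
final class ConcatenateCheckSketch {

    /** The only table properties the checks below look at. */
    static final class TableInfo {
        final String name;
        final boolean hasIndexes;
        final boolean isArchived;

        TableInfo(String name, boolean hasIndexes, boolean isArchived) {
            this.name = name;
            this.hasIndexes = hasIndexes;
            this.isArchived = isArchived;
        }
    }

    /** Raised while the statement is analyzed, before any task runs. */
    static final class SemanticError extends RuntimeException {
        SemanticError(String message) {
            super(message);
        }
    }

    /** Compile-time validation shared by explain and execute. */
    static void analyze(TableInfo table) {
        if (table.hasIndexes) {
            throw new SemanticError(table.name + " has indexes; merge is not allowed");
        }
        if (table.isArchived) {
            throw new SemanticError(table.name + " is archived; merge is not allowed");
        }
    }

    /** Producing a plan already goes through analysis, so bad input fails here. */
    static String explain(TableInfo table) {
        analyze(table);
        return "PLAN: merge files of " + table.name;
    }

    /** Execution assumes analysis passed and only deals with runtime failures. */
    static String execute(TableInfo table) {
        analyze(table);
        return "merged " + table.name;
    }

    public static void main(String[] args) {
        System.out.println(explain(new TableInfo("plain_tbl", false, false)));
        try {
            explain(new TableInfo("indexed_tbl", true, false));
        } catch (SemanticError e) {
            System.out.println("rejected at analysis time: " + e.getMessage());
        }
    }
}
```

Keeping the checks in one analysis step means a user sees the same error from explain as from running the statement, and no job is launched for a table that can never be merged.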
[jira] Commented: (HIVE-1950) Block merge for RCFile
[ https://issues.apache.org/jira/browse/HIVE-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992674#comment-12992674 ]

Namit Jain commented on HIVE-1950:
----------------------------------

1. Can you change merge_files to concatenate?
   alter table T concatenate;
2. Move RCFile check to SemanticAnalyzer from runtime.
3. More comments:
   DDLTask.java/mergeFiles
   RCFile: all the new functions etc.

Attachments: HIVE-1950.1.patch, HIVE-1950.2.patch
[jira] Commented: (HIVE-1950) Block merge for RCFile
[ https://issues.apache.org/jira/browse/HIVE-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992112#comment-12992112 ]

He Yongqiang commented on HIVE-1950:
------------------------------------

Review comments from internal review:

1) if stats are present, try to correct them
2) jobClose of RCFileMergeMapper should share the code in FileSinkOperator
3) move the original data to a dump loc first
4) remove getRecordWriter() and RCFileBlockMergeOutputFormat
5) ioCxt for input file changed
6) disable merge for archived table/partition and bucketized table/partition
7) comments
8) negative tests for hiveinputformat

Attachments: HIVE-1950.1.patch
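Item 3 in the list above does not say why the original data is moved to a dump location first. One plausible reading, and it is only an assumption, is that the original files stay recoverable if promoting the merged output fails. The sketch below shows that pattern with `java.nio.file` on a local filesystem; `DumpSwapSketch` and `swapIn` are hypothetical names, and the actual patch would go through Hadoop's filesystem layer rather than these calls.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch on the local filesystem; not the code from the patch.
final class DumpSwapSketch {

    /**
     * Replaces the directory at live with the directory at merged.
     * The original is parked at dump first and moved back if the swap fails.
     * Assumes dump does not exist yet and all paths share one filesystem.
     */
    static void swapIn(Path live, Path merged, Path dump) throws IOException {
        Files.move(live, dump); // 1. park the original data
        try {
            Files.move(merged, live); // 2. promote the merged output
        } catch (IOException e) {
            try {
                Files.move(dump, live); // 3. roll back to the original data
            } catch (IOException rollbackFailure) {
                e.addSuppressed(rollbackFailure);
            }
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("dump-swap");
        Path live = Files.createDirectory(root.resolve("partition"));
        Files.write(live.resolve("small-0"), new byte[] {1});
        Files.write(live.resolve("small-1"), new byte[] {2});
        Path merged = Files.createDirectory(root.resolve("merged-tmp"));
        Files.write(merged.resolve("merged-0"), new byte[] {1, 2});

        swapIn(live, merged, root.resolve("dump"));
        System.out.println("merged file in place: " + Files.exists(live.resolve("merged-0")));
        System.out.println("originals kept in dump: " + Files.exists(root.resolve("dump").resolve("small-0")));
    }
}
```

The dump directory can be deleted once the merged data has been verified, or used to restore the partition by hand.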
[jira] Commented: (HIVE-1950) Block merge for RCFile
[ https://issues.apache.org/jira/browse/HIVE-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992227#comment-12992227 ]

Ning Zhang commented on HIVE-1950:
----------------------------------

As discussed offline, this patch should be able to handle stats updates (by creating a StatsTask as a child task). Also, please keep in mind that the new MergeTask should be designed and implemented so that it is easy to reuse in the merge step of INSERT OVERWRITE.

Attachments: HIVE-1950.1.patch, HIVE-1950.2.patch
[jira] Commented: (HIVE-1950) Block merge for RCFile
[ https://issues.apache.org/jira/browse/HIVE-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990389#comment-12990389 ]

He Yongqiang commented on HIVE-1950:
------------------------------------

review board: https://reviews.apache.org/r/388/

Attachments: HIVE-1950.1.patch