[ https://issues.apache.org/jira/browse/HIVE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237047#comment-14237047 ]
Jihong Liu commented on HIVE-8966:
----------------------------------

Solution: if the last delta contains any file that matches the bucket file naming pattern but is not actually a bucket file, do not compact that delta. While a transaction batch is still open, its delta contains a side file named bucket_n_flush_length, which is not a bucket file. More generally, whenever the last delta holds a file with the bucket naming pattern that cannot be compacted, we should ignore that whole delta: compaction removes the delta afterwards, so if the delta cannot be compacted in its entirety it must be left as it is. In the scenario above, the second delta is therefore not compacted, and the cleaner does not remove it because its transaction id is higher than that of the newly created compaction output (base or delta).

We apply this check only to the last delta to cover the case where two or more transaction batches are open and the last one is closed first. If the last delta were compacted in that situation, the resulting base would carry a high transaction id, so the cleaner would remove all deltas and data could be lost. In that case at least one delta in the compaction list still contains a bucket_n_flush_length file; since we do not ignore it, the compaction fails automatically, nothing happens, and no data is lost. Compaction can then only succeed after all transaction batches are closed. That is not ideal, but at least no data is lost.

The patch is attached. It adds one method that tests whether the last delta needs to be removed from the delta list, and that method runs before the delta list is processed (see the sketch after the quoted issue below). After applying this patch no data is lost, and either major or minor compaction can run while data loading continues at the same time.

> Delta files created by hive hcatalog streaming cannot be compacted
> ------------------------------------------------------------------
>
>                 Key: HIVE-8966
>                 URL: https://issues.apache.org/jira/browse/HIVE-8966
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>    Affects Versions: 0.14.0
>         Environment: hive
>            Reporter: Jihong Liu
>            Assignee: Alan Gates
>            Priority: Critical
>             Fix For: 0.14.1
>
>         Attachments: HIVE-8966.patch
>
>
> hive hcatalog streaming also creates a file named bucket_n_flush_length in
> each delta directory, where "n" is the bucket number. compactor.CompactorMR
> thinks this file also needs to be compacted, but of course it cannot be, so
> compactor.CompactorMR does not continue with the compaction.
> In a test, after the bucket_n_flush_length file was removed, "alter table
> ... partition ... compact" finished successfully. If that file is not
> deleted, nothing is compacted.
> This is probably a high-severity bug. Both 0.13 and 0.14 have this issue.
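For illustration only, here is a minimal sketch of the kind of check described in the comment above, built on the standard Hadoop FileSystem API. The class name, method names, and file-name regex below are assumptions made up for this sketch; they are not the names used in HIVE-8966.patch.

{code:java}
import java.io.IOException;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LastDeltaCheck {
  // Matches side files such as bucket_00000_flush_length that hcatalog
  // streaming leaves in a delta while its transaction batch is still open.
  // (Hypothetical pattern for this sketch.)
  private static final Pattern FLUSH_LENGTH =
      Pattern.compile("bucket_[0-9]+_flush_length");

  // True if the last delta directory still holds a flush-length side file,
  // i.e. it belongs to an open transaction batch and cannot be compacted.
  static boolean lastDeltaIsOpen(FileSystem fs, Path lastDelta)
      throws IOException {
    for (FileStatus stat : fs.listStatus(lastDelta)) {
      if (FLUSH_LENGTH.matcher(stat.getPath().getName()).matches()) {
        return true;
      }
    }
    return false;
  }

  // Drop the last delta from the compaction list when it is still open,
  // so the remaining deltas can be compacted while the open delta is left
  // as it is for the cleaner to keep.
  static void pruneOpenLastDelta(FileSystem fs, List<Path> deltas)
      throws IOException {
    if (!deltas.isEmpty()
        && lastDeltaIsOpen(fs, deltas.get(deltas.size() - 1))) {
      deltas.remove(deltas.size() - 1);
    }
  }
}
{code}

In this sketch, pruneOpenLastDelta would run on the sorted delta list before it is handed to the compactor; since only the last (highest transaction id) delta can ever be dropped, the cleaner never removes a delta that was excluded from compaction.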