[jira] [Work logged] (HIVE-23703) Major QB compaction with multiple FileSinkOperators results in data loss and one original file

ASF GitHub Bot (Jira) Wed, 17 Jun 2020 05:36:16 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-23703?focusedWorklogId=447242&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-447242
 ]


ASF GitHub Bot logged work on HIVE-23703:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 17/Jun/20 12:35
            Start Date: 17/Jun/20 12:35
    Worklog Time Spent: 10m 
      Work Description: klcopp commented on a change in pull request #1134:
URL: https://github.com/apache/hive/pull/1134#discussion_r441510102



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java
##########
@@ -334,6 +334,18 @@ public void initializeBucketPaths(int filesIdx, String taskId, boolean isNativeT
         if (!isMmTable && !isDirectInsert) {
           if (!bDynParts && !isSkewedStoredAsSubDirectories) {
             finalPaths[filesIdx] = new Path(parent, taskWithExt);
+            if (conf.isCompactionTable()) {
+              // tables used in compaction are external and non-acid. We need to keep track of

Review comment:
       Done

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
##########
@@ -4123,9 +4125,28 @@ private static void copyFiles(final HiveConf conf, final FileSystem destFs,
           }
           throw new HiveException(e);
         }
-      } else {
+      else {

Review comment:
       Typo, done.

##########
File path: ql/src/test/org/apache/hadoop/hive/ql/metadata/TestHiveCopyFiles.java
##########
@@ -83,7 +83,8 @@ public void testRenameNewFilesOnSameFileSystem() throws IOException {
     FileSystem targetFs = targetPath.getFileSystem(hiveConf);
 
     try {
-      Hive.copyFiles(hiveConf, sourcePath, targetPath, targetFs, isSourceLocal, NO_ACID, false,null, false, false, false);
+      Hive.copyFiles(hiveConf, sourcePath, targetPath, targetFs, isSourceLocal, NO_ACID, false,null,

Review comment:
       Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 447242)
    Time Spent: 1h 10m  (was: 1h)

> Major QB compaction with multiple FileSinkOperators results in data loss and one original file
> ----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-23703
>                 URL: https://issues.apache.org/jira/browse/HIVE-23703
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Karen Coppage
>            Assignee: Karen Coppage
>            Priority: Critical
>              Labels: compaction, pull-request-available
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> h4. Problems
> Example:
> {code:java}
> drop table if exists tbl2;
> create transactional table tbl2 (a int, b int) clustered by (a) into 4 buckets stored as ORC TBLPROPERTIES('transactional'='true','transactional_properties'='default');
> insert into tbl2 values(1,2),(1,3),(1,4),(2,2),(2,3),(2,4);
> insert into tbl2 values(3,2),(3,3),(3,4),(4,2),(4,3),(4,4);
> insert into tbl2 values(5,2),(5,3),(5,4),(6,2),(6,3),(6,4);{code}
> In the example above, rows with a=2 and rows with a=6 both have bucketId=0.
> 1. Data loss
>  In non-acid tables, an operator's temp files are named after its task id. Because of the snippet below, the FileSinkOperator instead names temp files for compaction tables after their bucket id.
> {code:java}
> if (conf.isCompactionTable()) {
>   fsp.initializeBucketPaths(filesIdx,
>       AcidUtils.BUCKET_PREFIX + String.format(AcidUtils.BUCKET_DIGITS, bucketId),
>       isNativeTable(), isSkewedStoredAsSubDirectories);
> } else {
>   fsp.initializeBucketPaths(filesIdx, taskId, isNativeTable(), isSkewedStoredAsSubDirectories);
> }
> {code}
> So the two temp files containing the a=2 and a=6 data will both be named bucket_0, not 000000_0 and 000000_1 as they normally would be.
>  In FileSinkOperator.commit, when the bucket_0 file holding the a=2 data is moved from _task_tmp.-ext-10002 to _tmp.-ext-10002, it overwrites the file already there holding the a=6 data, because that file is also named bucket_0. The logs show this:
> {code:java}
>  WARN [LocalJobRunner Map Task Executor #0] exec.FileSinkOperator: Target path file:.../hive/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnNoBuckets-1591107230237/warehouse/testmajorcompaction/base_0000002_v0000013/.hive-staging_hive_2020-06-02_07-15-21_771_8551447285061957908-1/_tmp.-ext-10002/bucket_00000 with a size 610 exists. Trying to delete it.
> {code}
> 2. Results in one original file
>  OrcFileMergeOperator merges the results of the FSOp into 1 file named 
> 000000_0.
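
The overwrite described in problem 1 can be illustrated without Hive. What follows is a minimal sketch, not Hive code: the class `TempFileCollisionSketch`, its methods `commitAll` and `rowCount`, and the map standing in for the `_tmp.-ext-10002` directory are all hypothetical. The integers are only the `a` values of the rows from the example above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TempFileCollisionSketch {

    // Hypothetical model of a commit: each temp file is moved into the target
    // directory under its own name, replacing any file already there.
    static Map<String, List<Integer>> commitAll(List<String> names,
                                                List<List<Integer>> contents) {
        Map<String, List<Integer>> targetDir = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            targetDir.put(names.get(i), contents.get(i));
        }
        return targetDir;
    }

    // Total number of rows across all files left in the directory.
    static int rowCount(Map<String, List<Integer>> dir) {
        int rows = 0;
        for (List<Integer> file : dir.values()) {
            rows += file.size();
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two temp files: the a=6 rows are committed first, then the a=2 rows.
        List<List<Integer>> tempFiles = Arrays.asList(
                Arrays.asList(6, 6, 6),
                Arrays.asList(2, 2, 2));

        // Named by bucket id: both files are called bucket_0, so the second
        // commit replaces the first.
        Map<String, List<Integer>> byBucket =
                commitAll(Arrays.asList("bucket_0", "bucket_0"), tempFiles);

        // Named by task id: the names differ, so both files survive.
        Map<String, List<Integer>> byTask =
                commitAll(Arrays.asList("000000_0", "000000_1"), tempFiles);

        System.out.println("rows kept with bucket naming: " + rowCount(byBucket));
        System.out.println("rows kept with task naming:   " + rowCount(byTask));
    }
}
```

Under these assumptions the bucket-named run keeps 3 of the 6 rows and the task-named run keeps all 6, which mirrors the loss of the a=6 data reported in the log above.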
> h4. Fix
> 1. FSOp will store data as taskId/bucketId, e.g. 0_0/bucket_0.
> 2. OrcFileMergeOperator, instead of merging a bunch of files into one file named 000000_0, will merge all files named bucket_0 into one file named bucket_0, and so on.
> 3. MoveTask will strip the taskId directories, if present, and move only the bucket files inside them, in case OrcFileMergeOperator is not run.
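
Steps 1 and 2 of the fix can be sketched in the same simplified style. This is an illustration under stated assumptions, not the code in the pull request: the class `BucketMergeSketch` and its methods `bucketName` and `mergeByBucket` are hypothetical, the map stands in for the temp directory, and `1_0` is an assumed task id for a second task alongside the `0_0` from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BucketMergeSketch {

    // Returns the last path component, e.g. "0_0/bucket_0" -> "bucket_0".
    static String bucketName(String relativePath) {
        int slash = relativePath.lastIndexOf('/');
        return slash < 0 ? relativePath : relativePath.substring(slash + 1);
    }

    // Hypothetical merge step: concatenates all temp files that share a
    // bucket file name into one output file with that name.
    static Map<String, List<Integer>> mergeByBucket(Map<String, List<Integer>> tempFiles) {
        Map<String, List<Integer>> merged = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : tempFiles.entrySet()) {
            merged.computeIfAbsent(bucketName(e.getKey()), k -> new ArrayList<>())
                  .addAll(e.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        // Step 1: files are stored as taskId/bucketId, so two tasks writing
        // to bucket 0 no longer share a path.
        Map<String, List<Integer>> tempFiles = new LinkedHashMap<>();
        tempFiles.put("0_0/bucket_0", Arrays.asList(2, 2, 2));
        tempFiles.put("1_0/bucket_0", Arrays.asList(6, 6, 6));

        // Step 2: merge per bucket name instead of into a single 000000_0.
        System.out.println(mergeByBucket(tempFiles));
        // prints {bucket_0=[2, 2, 2, 6, 6, 6]}
    }
}
```

Keying the temp files by both task id and bucket id keeps them distinct until the merge, and grouping by the final path component is what lets the output keep a bucket file name rather than collapsing into one file.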



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
