[ 
https://issues.apache.org/jira/browse/HIVE-23703?focusedWorklogId=447149&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-447149
 ]

ASF GitHub Bot logged work on HIVE-23703:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 17/Jun/20 08:43
            Start Date: 17/Jun/20 08:43
    Worklog Time Spent: 10m 
      Work Description: pvary commented on a change in pull request #1134:
URL: https://github.com/apache/hive/pull/1134#discussion_r441382235



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/OrcFileMergeOperator.java
##########
@@ -113,7 +116,11 @@ private void processKeyValuePairs(Object key, Object value)
 
       // store the orc configuration from the first file. All other files should
       // match this configuration before merging else will not be merged
-      if (outWriter == null) {
+      int bucketId = 0;
+      if (conf.getIsCompactionTable()) {

Review comment:
       Can we do this outside the process loop? It might save some time, since 
this part of the code runs for every item and could cause a performance 
degradation (I guess).
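
       A minimal, self-contained sketch of that suggestion, assuming the flag could 
be cached once per operator instead of being read for every key/value pair (the 
class, field, and initialize() method below are illustrative, not the actual patch; 
only getIsCompactionTable() comes from the diff):

{code:java}
// Illustrative only: hoist the compaction-table check out of the per-record path.
class MergeOperatorSketch {
  static class Desc {
    private final boolean compactionTable;
    Desc(boolean compactionTable) { this.compactionTable = compactionTable; }
    boolean getIsCompactionTable() { return compactionTable; }
  }

  private final Desc conf;
  private boolean isCompactionTable;   // cached once, read per record

  MergeOperatorSketch(Desc conf) { this.conf = conf; }

  void initialize() {
    // evaluate the descriptor flag a single time, before any records arrive
    isCompactionTable = conf.getIsCompactionTable();
  }

  void processKeyValuePair(Object key, Object value) {
    int bucketId = 0;
    if (isCompactionTable) {           // cheap field read inside the hot loop
      // derive bucketId from the file being merged, etc.
    }
    // ... rest of the merge logic ...
  }
}
{code}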




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 447149)
    Time Spent: 20m  (was: 10m)

> Major QB compaction with multiple FileSinkOperators results in data loss and 
> one original file
> ----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-23703
>                 URL: https://issues.apache.org/jira/browse/HIVE-23703
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Karen Coppage
>            Assignee: Karen Coppage
>            Priority: Critical
>              Labels: compaction, pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> h4. Problems
> Example:
> {code:java}
> drop table if exists tbl2;
> create transactional table tbl2 (a int, b int) clustered by (a) into 4 buckets stored as ORC
> TBLPROPERTIES('transactional'='true','transactional_properties'='default');
> insert into tbl2 values(1,2),(1,3),(1,4),(2,2),(2,3),(2,4);
> insert into tbl2 values(3,2),(3,3),(3,4),(4,2),(4,3),(4,4);
> insert into tbl2 values(5,2),(5,3),(5,4),(6,2),(6,3),(6,4);{code}
> E.g. in the example above, bucketId=0 when a=2 and a=6.
> 1. Data loss
>  In non-acid tables, an operator's temp files are named with their task id. 
> Because of this snippet, temp files in the FileSinkOperator for compaction 
> tables are identified by their bucket_id.
> {code:java}
> if (conf.isCompactionTable()) {
>   fsp.initializeBucketPaths(filesIdx, AcidUtils.BUCKET_PREFIX +
>       String.format(AcidUtils.BUCKET_DIGITS, bucketId),
>       isNativeTable(), isSkewedStoredAsSubDirectories);
> } else {
>   fsp.initializeBucketPaths(filesIdx, taskId, isNativeTable(),
>       isSkewedStoredAsSubDirectories);
> }
> {code}
> So the 2 temp files containing a=2 and a=6 data will both be named bucket_0, 
> instead of 000000_0 and 000000_1 as they normally would be.
>  In FileSinkOperator.commit, when the file holding a=2 data (named bucket_0) is 
> moved from _task_tmp.-ext-10002 to _tmp.-ext-10002, it overwrites the file 
> already there holding a=6 data, because that file is also named bucket_0. You 
> can see this in the logs:
> {code:java}
>  WARN [LocalJobRunner Map Task Executor #0] exec.FileSinkOperator: Target path 
> file:.../hive/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnNoBuckets-1591107230237/warehouse/testmajorcompaction/base_0000002_v0000013/.hive-staging_hive_2020-06-02_07-15-21_771_8551447285061957908-1/_tmp.-ext-10002/bucket_00000 
> with a size 610 exists. Trying to delete it.
> {code}
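> A minimal, self-contained sketch of the collision using the Hadoop FileSystem 
> API (the paths and flow are illustrative, not the actual FileSinkOperator.commit 
> code): the second rename can only land after the existing target is deleted, 
> which is exactly the data loss the WARN line describes.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class CommitCollisionSketch {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.getLocal(new Configuration());
>     Path taskTmp = new Path("target/_task_tmp.-ext-10002");
>     Path tmp = new Path("target/_tmp.-ext-10002");
>     fs.mkdirs(tmp);
>
>     // two FSOp instances each produce a temp file with the same bucket-derived name
>     Path fromTask1 = new Path(taskTmp, "task1/bucket_00000"); // holds a=6 rows
>     Path fromTask2 = new Path(taskTmp, "task2/bucket_00000"); // holds a=2 rows
>     try (FSDataOutputStream o = fs.create(fromTask1)) { o.writeBytes("a=6 data"); }
>     try (FSDataOutputStream o = fs.create(fromTask2)) { o.writeBytes("a=2 data"); }
>
>     Path target = new Path(tmp, "bucket_00000");
>     fs.rename(fromTask1, target);        // first commit lands normally
>     if (fs.exists(target)) {
>       // mirrors the WARN: "Target path ... exists. Trying to delete it."
>       fs.delete(target, false);
>     }
>     fs.rename(fromTask2, target);        // the a=6 data is silently replaced
>   }
> }
> {code}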
> 2. Results in one original file
>  OrcFileMergeOperator merges the results of the FSOp into 1 file named 
> 000000_0.
> h4. Fix
> 1. FSOp will store data as: taskId/bucketId, e.g. 0_0/bucket_0
> 2. OrcFileMergeOperator, instead of merging a bunch of files into 1 file named 
> 000000_0, will merge all files named bucket_0 into one file named bucket_0, 
> and so on (sketched below).
> 3. MoveTask will get rid of the taskId directories if present and only move 
> the bucket files in them, in case OrcFileMergeOperator is not run.
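> A minimal sketch of the per-bucket grouping implied by step 2, assuming the 
> merge step sees the taskId/bucketFile paths produced in step 1 (the helper below 
> is hypothetical, not the actual patch):
> {code:java}
> import java.util.ArrayList;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> // Hypothetical helper: group FSOp outputs like "0_0/bucket_00000" by bucket file
> // name, so each group can be merged into a single output file with that name.
> public class BucketGroupingSketch {
>   static Map<String, List<String>> groupByBucket(List<String> fsopOutputs) {
>     Map<String, List<String>> byBucket = new HashMap<>();
>     for (String path : fsopOutputs) {
>       String bucketFile = path.substring(path.lastIndexOf('/') + 1); // e.g. bucket_00000
>       byBucket.computeIfAbsent(bucketFile, k -> new ArrayList<>()).add(path);
>     }
>     return byBucket;
>   }
>
>   public static void main(String[] args) {
>     List<String> outputs = List.of(
>         "0_0/bucket_00000", "0_1/bucket_00000",   // same bucket written by two tasks
>         "0_0/bucket_00001", "0_1/bucket_00002");
>     // each key would become one merged file named after the bucket, e.g. bucket_00000
>     groupByBucket(outputs).forEach((bucket, files) ->
>         System.out.println(bucket + " <- " + files));
>   }
> }
> {code}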



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
