[ 
https://issues.apache.org/jira/browse/HIVE-23763?focusedWorklogId=464938&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-464938
 ]

ASF GitHub Bot logged work on HIVE-23763:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 31/Jul/20 08:57
            Start Date: 31/Jul/20 08:57
    Worklog Time Spent: 10m 
      Work Description: pvary commented on a change in pull request #1327:
URL: https://github.com/apache/hive/pull/1327#discussion_r463490434



##########
File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorOnTezTest.java
##########
@@ -261,22 +326,77 @@ protected void insertMmTestData(String tblName, int iterations) throws Exception
     }
 
     List<String> getAllData(String tblName) throws Exception {
-      return getAllData(null, tblName);
+      return getAllData(null, tblName, false);
     }
 
-    List<String> getAllData(String dbName, String tblName) throws Exception {
+    List<String> getAllData(String tblName, boolean withRowId) throws Exception {
+      return getAllData(null, tblName, withRowId);
+    }
+
+    List<String> getAllData(String dbName, String tblName, boolean withRowId) throws Exception {
       if (dbName != null) {
         tblName = dbName + "." + tblName;
       }
-      List<String> result = executeStatementOnDriverAndReturnResults("select * from " + tblName, driver);
+      StringBuffer query = new StringBuffer();
+      query.append("select ");
+      if (withRowId) {
+        query.append("ROW__ID, ");
+      }
+      query.append("* from ");
+      query.append(tblName);
+      List<String> result = executeStatementOnDriverAndReturnResults(query.toString(), driver);
       Collections.sort(result);
       return result;
     }
 
+    List<String> getDataWithInputFileNames(String dbName, String tblName) throws Exception {
+      if (dbName != null) {
+        tblName = dbName + "." + tblName;
+      }
+      StringBuffer query = new StringBuffer();
+      query.append("select ");
+      query.append("INPUT__FILE__NAME, a from ");

Review comment:
       I have seen issues with queries using virtual columns (INPUT__FILE__NAME, ROW__ID): the row order in the results was different with and without the virtual columns. This is just a note; maybe no action is needed here.
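Sorting the result list, as `getAllData` already does with `Collections.sort`, is one way to make such comparisons independent of row order. A minimal standalone sketch of the idea (plain Java, no Hive dependencies; the method name and sample rows are illustrative, not from the PR):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedCompare {
    // Compare two query results while ignoring row order, e.g. results
    // produced with and without virtual columns in the select list.
    static boolean sameRows(List<String> a, List<String> b) {
        List<String> left = new ArrayList<>(a);
        List<String> right = new ArrayList<>(b);
        Collections.sort(left);
        Collections.sort(right);
        return left.equals(right);
    }

    public static void main(String[] args) {
        // Same rows, different order, as might happen with virtual columns:
        List<String> withVirtualCols = List.of("row2", "row1", "row3");
        List<String> withoutVirtualCols = List.of("row1", "row2", "row3");
        System.out.println(sameRows(withVirtualCols, withoutVirtualCols));
    }
}
```

Note this only hides ordering differences; if the virtual columns change which rows are returned, sorted comparison will still (correctly) fail.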




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 464938)
    Time Spent: 40m  (was: 0.5h)

> Query based minor compaction produces wrong files when rows with different 
> buckets Ids are processed by the same FileSinkOperator
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-23763
>                 URL: https://issues.apache.org/jira/browse/HIVE-23763
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 4.0.0
>            Reporter: Marta Kuczora
>            Assignee: Marta Kuczora
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> How to reproduce:
> - Create an unbucketed ACID table
> - Insert a larger amount of data into this table so that there are multiple
> bucket files in the table
> The files in the table should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00000_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00001_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00002_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00003_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00004_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00005_0
> - Do some delete on rows with different bucket Ids
> The files in a delete delta should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000002_0000002_0000/bucket_00000
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000006_0000006_0000/bucket_00003
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000006_0000006_0000/bucket_00001
> - Run the query-based minor compaction
> - After the compaction, the newly created delete delta contains only one
> bucket file. This file contains rows from all buckets, and the table becomes
> unusable:
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000001_0000007_v0000066/bucket_00000
> The issue happens only if rows with different bucket Ids are processed by the 
> same FileSinkOperator. 
> In the FileSinkOperator.process method, the files for the compaction table 
> are created like this:
> {noformat}
>     if (!bDynParts && !filesCreated) {
>       if (lbDirName != null) {
>         if (valToPaths.get(lbDirName) == null) {
>           createNewPaths(null, lbDirName);
>         }
>       } else {
>         if (conf.isCompactionTable()) {
>           int bucketProperty = getBucketProperty(row);
>           bucketId = 
> BucketCodec.determineVersion(bucketProperty).decodeWriterId(bucketProperty);
>         }
>         createBucketFiles(fsp);
>       }
>     }
> {noformat}
> When the first row is processed, the file is created and the filesCreated
> flag is set to true. When the subsequent rows are processed, the first if
> condition is false, so no new file is created; every row is written into the
> file created for the first row, regardless of its bucket Id.
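The guard described above can be modeled with a small standalone sketch (plain Java, not Hive code; the method and variable names are simplified stand-ins for the FileSinkOperator logic): once filesCreated is set, later rows with different bucket Ids are appended to the file created for the first row's bucket.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FilesCreatedSketch {
    // Simplified model of the guard in FileSinkOperator.process: the bucket
    // id is decoded only while filesCreated is false, i.e. for the first row.
    static Map<Integer, List<String>> writeRows(List<int[]> rows) {
        Map<Integer, List<String>> files = new LinkedHashMap<>();
        boolean filesCreated = false;
        int bucketId = -1;
        for (int[] row : rows) {            // row = {bucketId, value}
            if (!filesCreated) {
                bucketId = row[0];          // bucket id of the FIRST row only
                files.put(bucketId, new ArrayList<>());
                filesCreated = true;        // guard never re-evaluated
            }
            // Later rows reuse the first row's file, whatever their bucket:
            files.get(bucketId).add("bucket" + row[0] + ":" + row[1]);
        }
        return files;
    }

    public static void main(String[] args) {
        // Rows from three different buckets processed by one operator:
        List<int[]> rows = List.of(
            new int[]{0, 1}, new int[]{3, 2}, new int[]{1, 3});
        Map<Integer, List<String>> files = writeRows(rows);
        System.out.println(files.keySet());      // only bucket 0 gets a file
        System.out.println(files.get(0).size()); // all three rows land in it
    }
}
```

This mirrors the symptom in the reproduction: a single bucket_00000 file in the compacted delete delta holding rows from all buckets.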



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
