Rajesh Balamohan created HIVE-26540:
---------------------------------------

             Summary: Iceberg: Select queries after update/delete become expensive in reading contents
                 Key: HIVE-26540
                 URL: https://issues.apache.org/jira/browse/HIVE-26540
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan


- Create the basic date_dim table from TPC-DS and store it in Iceberg v2 format.
- Update a few thousand records a couple of times.
- Run a simple select query: {{select count ( * ) from date_dim_ice where d_qoy = 11 and d_dom=2 and d_fy_week_seq=3;}}

This takes 8-18 seconds on the Iceberg table, whereas the same query on an ACID table takes 1.5 seconds.

The basic issue is that it reads files multiple times (i.e., both data and delete files).

Lines of interest:

IcebergInputFormat.java

{noformat}
        InternalRecordWrapper wrapper = new InternalRecordWrapper(readSchema.asStruct());
        Evaluator filter = new Evaluator(readSchema.asStruct(), residual, caseSensitive);
        return CloseableIterable.filter(iter, record -> filter.eval(wrapper.wrap((StructLike) record)));
{noformat}



{noformat}
        case GENERIC:
          DeleteFilter deletes = new GenericDeleteFilter(table.io(), currentTask, table.schema(), readSchema);
          Schema requiredSchema = deletes.requiredSchema();
          return deletes.filter(openGeneric(currentTask, requiredSchema));
{noformat}

Both the residual filter and the delete filter get evaluated for each row in the data file, which causes the delay.
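
To make the per-row cost concrete, here is a minimal, self-contained sketch of the same layering: a delete check followed by a residual predicate, both applied to every record. It is a simplified model and not the Iceberg code. The names PerRowFilterSketch, Row, ScanStats, scan and demo are hypothetical, the deletes are modelled as an in-memory set of row positions, and the row counts are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public class PerRowFilterSketch {

    // Hypothetical stand-in for one row of date_dim_ice.
    static final class Row {
        final long pos;
        final int dQoy;
        final int dDom;
        final int dFyWeekSeq;

        Row(long pos, int dQoy, int dDom, int dFyWeekSeq) {
            this.pos = pos;
            this.dQoy = dQoy;
            this.dDom = dDom;
            this.dFyWeekSeq = dFyWeekSeq;
        }
    }

    // Counts how often each per-row step runs during one scan.
    static final class ScanStats {
        long deleteChecks;
        long residualChecks;
        long matched;
    }

    // Applies the delete check and then the residual predicate to every row,
    // mirroring the two layered filters quoted above (assumption: deletes
    // are already loaded into memory as a set of positions).
    static ScanStats scan(List<Row> dataFile, Set<Long> deletedPositions, Predicate<Row> residual) {
        ScanStats stats = new ScanStats();
        for (Row row : dataFile) {
            stats.deleteChecks++;
            if (deletedPositions.contains(row.pos)) {
                continue; // row was removed by an earlier update/delete
            }
            stats.residualChecks++;
            if (residual.test(row)) {
                stats.matched++;
            }
        }
        return stats;
    }

    // Builds a made-up data file of 1000 rows with the first 100 positions deleted.
    static ScanStats demo() {
        List<Row> dataFile = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            dataFile.add(new Row(i, 11, i % 28 + 1, 3));
        }
        Set<Long> deleted = new HashSet<>();
        for (long p = 0; p < 100; p++) {
            deleted.add(p);
        }
        Predicate<Row> residual = r -> r.dQoy == 11 && r.dDom == 2 && r.dFyWeekSeq == 3;
        return scan(dataFile, deleted, residual);
    }

    public static void main(String[] args) {
        ScanStats s = demo();
        System.out.println("delete checks:   " + s.deleteChecks);
        System.out.println("residual checks: " + s.residualChecks);
        System.out.println("matched rows:    " + s.matched);
    }
}
```

In this model every row pays for a delete lookup and every surviving row pays for a predicate evaluation, so the work grows with the size of the data file rather than with the number of matching rows.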



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
