Rajesh Balamohan created HIVE-27119: ---------------------------------------
Summary: Iceberg: Delete from table generates lot of files Key: HIVE-27119 URL: https://issues.apache.org/jira/browse/HIVE-27119 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan With "delete" it generates lot of files due to the way data is sent to the reducers. Files per partition is impacted by the number of reduce tasks. One way could be to explicitly control the number of reducers; Creating this ticket to have a long term fix. {noformat} explain delete from store_Sales where ss_customer_sk % 10 = 0; INFO : Compiling command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b): explain delete from store_Sales where ss_customer_sk % 10 = 0 INFO : No Stats for tpcds_1000_iceberg_mor_v4@store_sales, Columns: ss_sold_time_sk, ss_cdemo_sk, ss_promo_sk, ss_ext_discount_amt, ss_ext_sales_price, ss_net_profit, ss_addr_sk, ss_ticket_number, ss_wholesale_cost, ss_item_sk, ss_ext_list_price, ss_sold_date_sk, ss_store_sk, ss_coupon_amt, ss_quantity, ss_list_price, ss_sales_price, ss_customer_sk, ss_ext_wholesale_cost, ss_net_paid, ss_ext_tax, ss_hdemo_sk, ss_net_paid_inc_tax INFO : Semantic Analysis Completed (retrial = false) INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null) INFO : Completed compiling command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b); Time taken: 0.704 seconds INFO : Executing command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b): explain delete from store_Sales where ss_customer_sk % 10 = 0 INFO : Starting task [Stage-4:EXPLAIN] in serial mode INFO : Completed executing command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b); Time taken: 0.005 seconds INFO : OK Explain STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1 Stage-0 depends on stages: Stage-2 Stage-3 depends on stages: Stage-0 STAGE PLANS: Stage: Stage-1 Tez DagId: hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b:377 Edges: Reducer 2 <- Map 1 (SIMPLE_EDGE) DagName: hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b:377 Vertices: Map 1 Map Operator Tree: TableScan alias: store_sales filterExpr: ((ss_customer_sk % 10) = 0) (type: boolean) Statistics: Num rows: 2755519629 Data size: 3643899155232 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((ss_customer_sk % 10) = 0) (type: boolean) Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: PARTITION__SPEC__ID (type: int), PARTITION__HASH (type: bigint), FILE__PATH (type: string), ROW__POSITION (type: bigint), ss_sold_time_sk (type: int), ss_item_sk (type: int), ss_customer_sk (type: int), ss_cdemo_sk (type: int), ss_hdemo_sk (type: int), ss_addr_sk (type: int), ss_store_sk (type: int), ss_promo_sk (type: int), ss_ticket_number (type: bigint), ss_quantity (type: int), ss_wholesale_cost (type: decimal(7,2)), ss_list_price (type: decimal(7,2)), ss_sales_price (type: decimal(7,2)), ss_ext_discount_amt (type: decimal(7,2)), ss_ext_sales_price (type: decimal(7,2)), ss_ext_wholesale_cost (type: decimal(7,2)), ss_ext_list_price (type: decimal(7,2)), ss_ext_tax (type: decimal(7,2)), ss_coupon_amt (type: decimal(7,2)), ss_net_paid (type: decimal(7,2)), ss_net_paid_inc_tax (type: decimal(7,2)), ss_net_profit (type: decimal(7,2)), ss_sold_date_sk (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26 Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: int), _col1 (type: bigint), _col2 (type: string), _col3 (type: bigint) null sort order: zzzz sort order: ++++ Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE value expressions: _col4 (type: int), _col5 (type: int), _col6 (type: int), _col7 (type: int), _col8 (type: int), _col9 (type: int), _col10 (type: int), _col11 (type: int), _col12 (type: bigint), _col13 (type: int), _col14 (type: decimal(7,2)), _col15 (type: decimal(7,2)), _col16 (type: decimal(7,2)), _col17 (type: decimal(7,2)), _col18 (type: decimal(7,2)), _col19 (type: decimal(7,2)), _col20 (type: decimal(7,2)), _col21 (type: decimal(7,2)), _col22 (type: decimal(7,2)), _col23 (type: decimal(7,2)), _col24 (type: decimal(7,2)), _col25 (type: decimal(7,2)), _col26 (type: int) Execution mode: llap LLAP IO: no inputs Reducer 2 Execution mode: vectorized, llap Reduce Operator Tree: Select Operator expressions: KEY.reducesinkkey0 (type: int), KEY.reducesinkkey1 (type: bigint), KEY.reducesinkkey2 (type: string), KEY.reducesinkkey3 (type: bigint), VALUE._col0 (type: int), VALUE._col1 (type: int), VALUE._col2 (type: int), VALUE._col3 (type: int), VALUE._col4 (type: int), VALUE._col5 (type: int), VALUE._col6 (type: int), VALUE._col7 (type: int), VALUE._col8 (type: bigint), VALUE._col9 (type: int), VALUE._col10 (type: decimal(7,2)), VALUE._col11 (type: decimal(7,2)), VALUE._col12 (type: decimal(7,2)), VALUE._col13 (type: decimal(7,2)), VALUE._col14 (type: decimal(7,2)), VALUE._col15 (type: decimal(7,2)), VALUE._col16 (type: decimal(7,2)), VALUE._col17 (type: decimal(7,2)), VALUE._col18 (type: decimal(7,2)), VALUE._col19 (type: decimal(7,2)), VALUE._col20 (type: decimal(7,2)), VALUE._col21 (type: decimal(7,2)), VALUE._col22 (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26 Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1377759814 Data size: 1821949576954 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.iceberg.mr.hive.HiveIcebergInputFormat output format: org.apache.iceberg.mr.hive.HiveIcebergOutputFormat serde: org.apache.iceberg.mr.hive.HiveIcebergSerDe name: tpcds_1000_iceberg_mor_v4.store_sales Stage: Stage-2 Dependency Collection Stage: Stage-0 Move Operator tables: replace: false table: input format: org.apache.iceberg.mr.hive.HiveIcebergInputFormat output format: org.apache.iceberg.mr.hive.HiveIcebergOutputFormat serde: org.apache.iceberg.mr.hive.HiveIcebergSerDe name: tpcds_1000_iceberg_mor_v4.store_sales Stage: Stage-3 Stats Work Basic Stats Work: {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)