[jira] [Commented] (HIVE-27901) Hive's performance for querying the Iceberg table is very poor.

zhangbutao (Jira) Wed, 22 Nov 2023 23:53:04 -0800


    [ 
https://issues.apache.org/jira/browse/HIVE-27901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789013#comment-17789013
 ]


zhangbutao commented on HIVE-27901:
-----------------------------------

I think this ticket looks something like 
https://issues.apache.org/jira/browse/HIVE-27883 . Currently, some optimization 
properties  like merge/split data can not be used on Iceberg table as iceberg 
has its own optimization properties. 

 

For this ticket, it seems that orc table has more tasks than iceberg table, so 
the orc table can run faster. I think that maybe you can try to optimize the 
property _set read.split.target-size=67108864;_

[https://iceberg.apache.org/docs/latest/configuration/#read-properties]  
read.split.target-size is default 134217728.

But i am not sure if this is a good way to optimize your query, as i can not 
reproduce and delve into your problem.

> Hive's performance for querying the Iceberg table is very poor.
> ---------------------------------------------------------------
>
>                 Key: HIVE-27901
>                 URL: https://issues.apache.org/jira/browse/HIVE-27901
>             Project: Hive
>          Issue Type: Bug
>          Components: Iceberg integration
>    Affects Versions: 4.0.0-beta-1
>            Reporter: yongzhi.shao
>            Priority: Major
>         Attachments: image-2023-11-22-18-32-28-344.png, 
> image-2023-11-22-18-33-01-885.png, image-2023-11-22-18-33-32-915.png
>
>
> I am using HIVE4.0.0-BETA for testing.
> BTW,I found that the performance of HIVE reading ICEBERG table is still very 
> slow.
> How should I deal with this problem?
> I count a 7 billion table and compare the performance difference between HIVE 
> reading ICEBERG-ORC and ORC table respectively.
> We use ICEBERG 1.4.2, ICEBERG-ORC with ZSTD compression enabled.
> ORC with SNAPPY compression.
> HADOOP version 3.1.1 (native zstd not supported).
> {code:java}
> --spark3.4.1+iceberg 1.4.2
> CREATE TABLE datacenter.dwd.b_std_trade (
>   uni_order_id STRING,
>   data_from BIGINT,
>   partner STRING,
>   plat_code STRING,
>   order_id STRING,
>   uni_shop_id STRING,
>   uni_id STRING,
>   guide_id STRING,
>   shop_id STRING,
>   plat_account STRING,
>   total_fee DOUBLE,
>   item_discount_fee DOUBLE,
>   trade_discount_fee DOUBLE,
>   adjust_fee DOUBLE,
>   post_fee DOUBLE,
>   discount_rate DOUBLE,
>   payment_no_postfee DOUBLE,
>   payment DOUBLE,
>   pay_time STRING,
>   product_num BIGINT,
>   order_status STRING,
>   is_refund STRING,
>   refund_fee DOUBLE,
>   insert_time STRING,
>   created STRING,
>   endtime STRING,
>   modified STRING,
>   trade_type STRING,
>   receiver_name STRING,
>   receiver_country STRING,
>   receiver_state STRING,
>   receiver_city STRING,
>   receiver_district STRING,
>   receiver_town STRING,
>   receiver_address STRING,
>   receiver_mobile STRING,
>   trade_source STRING,
>   delivery_type STRING,
>   consign_time STRING,
>   orders_num BIGINT,
>   is_presale BIGINT,
>   presale_status STRING,
>   first_fee_paytime STRING,
>   last_fee_paytime STRING,
>   first_paid_fee DOUBLE,
>   tenant STRING,
>   tidb_modified STRING,
>   step_paid_fee DOUBLE,
>   seller_flag STRING,
>   is_used_store_card BIGINT,
>   store_card_used DOUBLE,
>   store_card_basic_used DOUBLE,
>   store_card_expand_used DOUBLE,
>   order_promotion_num BIGINT,
>   item_promotion_num BIGINT,
>   buyer_remark STRING,
>   seller_remark STRING,
>   trade_business_type STRING)
> USING iceberg
> PARTITIONED BY (uni_shop_id, truncate(4, created))
> LOCATION '/iceberg-catalog/warehouse/dwd/b_std_trade'
> TBLPROPERTIES (
>   'current-snapshot-id' = '7217819472703702905',
>   'format' = 'iceberg/orc',
>   'format-version' = '1',
>   'hive.stored-as' = 'iceberg',
>   'read.orc.vectorization.enabled' = 'true',
>   'sort-order' = 'uni_shop_id ASC NULLS FIRST, created ASC NULLS FIRST',
>   'write.distribution-mode' = 'hash',
>   'write.format.default' = 'orc',
>   'write.metadata.delete-after-commit.enabled' = 'true',
>   'write.metadata.previous-versions-max' = '3',
>   'write.orc.bloom.filter.columns' = 'order_id',
>   'write.orc.compression-codec' = 'zstd')
> --hive-iceberg
> CREATE EXTERNAL TABLE iceberg_dwd.b_std_trade 
> STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
> LOCATION 'hdfs://xxxx/iceberg-catalog/warehouse/dwd/b_std_trade'
> TBLPROPERTIES 
> ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true'); 
> --inner orc table( set hive default format = orc )
> set hive.default.fileformat=orc;
> set hive.default.fileformat.managed=orc;
> create table if not exists iceberg_dwd.orc_inner_table as select * from 
> iceberg_dwd.b_std_trade;{code}
>  
> !image-2023-11-22-18-32-28-344.png!
> !image-2023-11-22-18-33-01-885.png!
> Also, I have another question. The Submit Plan statistic is clearly 
> incorrect. Is this something that needs to be fixed?
> !image-2023-11-22-18-33-32-915.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (HIVE-27901) Hive's performance for querying the Iceberg table is very poor.

Reply via email to