[jira] [Created] (HIVE-27905) SPLIT throws ClassCastException
okumin created HIVE-27905: - Summary: SPLIT throws ClassCastException Key: HIVE-27905 URL: https://issues.apache.org/jira/browse/HIVE-27905 Project: Hive Issue Type: Improvement Affects Versions: 4.0.0-beta-1 Reporter: okumin Assignee: okumin GenericUDFSplit throws ClassCastException when a non-primitive type is given. {code:java} 0: jdbc:hive2://hive-hiveserver2:1/defaul> select split(array('a,b,c'), ','); Error: Error while compiling statement: FAILED: ClassCastException org.apache.hadoop.hive.serde2.objectinspector.StandardConstantListObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector (state=42000,code=4) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
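The usual fix pattern for this class of bug is to validate the argument's ObjectInspector category during initialization and fail with a descriptive error instead of casting blindly. A minimal self-contained sketch of that pattern (plain Java stand-ins, not Hive's actual `ObjectInspector` classes — `ArgumentCheckSketch`, `PrimitiveOI`, and `ListOI` are hypothetical names for illustration):

```java
// Hypothetical sketch: validate the argument category before casting, so a
// non-primitive argument produces a clear error instead of ClassCastException.
public class ArgumentCheckSketch {
    public interface ObjectInspector {}                       // stand-in for Hive's ObjectInspector
    public static class PrimitiveOI implements ObjectInspector {}
    public static class ListOI implements ObjectInspector {}  // e.g. what array('a,b,c') would produce

    public static PrimitiveOI requirePrimitive(ObjectInspector oi, int argPos) {
        if (!(oi instanceof PrimitiveOI)) {
            throw new IllegalArgumentException("Argument " + argPos
                + " must be a primitive type, got " + oi.getClass().getSimpleName());
        }
        return (PrimitiveOI) oi;                              // cast is safe after the check
    }

    public static void main(String[] args) {
        requirePrimitive(new PrimitiveOI(), 1);               // accepted
        try {
            requirePrimitive(new ListOI(), 1);                // rejected with a clear message
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In Hive itself the equivalent check would live in the UDF's initialize step and throw a UDF argument-type exception rather than `IllegalArgumentException`; the sketch only shows the validate-then-cast shape.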
[jira] [Commented] (HIVE-27901) Hive's performance for querying the Iceberg table is very poor.
[ https://issues.apache.org/jira/browse/HIVE-27901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789013#comment-17789013 ] zhangbutao commented on HIVE-27901: --- I think this ticket looks similar to https://issues.apache.org/jira/browse/HIVE-27883 . Currently, some optimization properties, such as merging/splitting data, cannot be used on Iceberg tables, because Iceberg has its own optimization properties. For this ticket, it seems that the ORC table gets more tasks than the Iceberg table, which is why the ORC table runs faster. You could try tuning the property _set read.split.target-size=67108864;_ [https://iceberg.apache.org/docs/latest/configuration/#read-properties] read.split.target-size defaults to 134217728 (128 MiB), so halving it should roughly double the number of read splits. But I am not sure whether this is a good way to optimize your query, as I cannot reproduce and delve into your problem. > Hive's performance for querying the Iceberg table is very poor. > --- > > Key: HIVE-27901 > URL: https://issues.apache.org/jira/browse/HIVE-27901 > Project: Hive > Issue Type: Bug > Components: Iceberg integration >Affects Versions: 4.0.0-beta-1 >Reporter: yongzhi.shao >Priority: Major > Attachments: image-2023-11-22-18-32-28-344.png, > image-2023-11-22-18-33-01-885.png, image-2023-11-22-18-33-32-915.png > > > I am using HIVE4.0.0-BETA for testing. > BTW,I found that the performance of HIVE reading ICEBERG table is still very > slow. > How should I deal with this problem? > I count a 7 billion table and compare the performance difference between HIVE > reading ICEBERG-ORC and ORC table respectively. > We use ICEBERG 1.4.2, ICEBERG-ORC with ZSTD compression enabled. > ORC with SNAPPY compression. > HADOOP version 3.1.1 (native zstd not supported). 
> {code:java} > --spark3.4.1+iceberg 1.4.2 > CREATE TABLE datacenter.dwd.b_std_trade ( > uni_order_id STRING, > data_from BIGINT, > partner STRING, > plat_code STRING, > order_id STRING, > uni_shop_id STRING, > uni_id STRING, > guide_id STRING, > shop_id STRING, > plat_account STRING, > total_fee DOUBLE, > item_discount_fee DOUBLE, > trade_discount_fee DOUBLE, > adjust_fee DOUBLE, > post_fee DOUBLE, > discount_rate DOUBLE, > payment_no_postfee DOUBLE, > payment DOUBLE, > pay_time STRING, > product_num BIGINT, > order_status STRING, > is_refund STRING, > refund_fee DOUBLE, > insert_time STRING, > created STRING, > endtime STRING, > modified STRING, > trade_type STRING, > receiver_name STRING, > receiver_country STRING, > receiver_state STRING, > receiver_city STRING, > receiver_district STRING, > receiver_town STRING, > receiver_address STRING, > receiver_mobile STRING, > trade_source STRING, > delivery_type STRING, > consign_time STRING, > orders_num BIGINT, > is_presale BIGINT, > presale_status STRING, > first_fee_paytime STRING, > last_fee_paytime STRING, > first_paid_fee DOUBLE, > tenant STRING, > tidb_modified STRING, > step_paid_fee DOUBLE, > seller_flag STRING, > is_used_store_card BIGINT, > store_card_used DOUBLE, > store_card_basic_used DOUBLE, > store_card_expand_used DOUBLE, > order_promotion_num BIGINT, > item_promotion_num BIGINT, > buyer_remark STRING, > seller_remark STRING, > trade_business_type STRING) > USING iceberg > PARTITIONED BY (uni_shop_id, truncate(4, created)) > LOCATION '/iceberg-catalog/warehouse/dwd/b_std_trade' > TBLPROPERTIES ( > 'current-snapshot-id' = '7217819472703702905', > 'format' = 'iceberg/orc', > 'format-version' = '1', > 'hive.stored-as' = 'iceberg', > 'read.orc.vectorization.enabled' = 'true', > 'sort-order' = 'uni_shop_id ASC NULLS FIRST, created ASC NULLS FIRST', > 'write.distribution-mode' = 'hash', > 'write.format.default' = 'orc', > 'write.metadata.delete-after-commit.enabled' = 'true', > 
'write.metadata.previous-versions-max' = '3', > 'write.orc.bloom.filter.columns' = 'order_id', > 'write.orc.compression-codec' = 'zstd') > --hive-iceberg > CREATE EXTERNAL TABLE iceberg_dwd.b_std_trade > STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' > LOCATION 'hdfs:///iceberg-catalog/warehouse/dwd/b_std_trade' > TBLPROPERTIES > ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true'); > --inner orc table( set hive default format = orc ) > set hive.default.fileformat=orc; > set hive.default.fileformat.managed=orc; > create table if not exists iceberg_dwd.orc_inner_table as select * from > iceberg_dwd.b_std_trade;{code} > > !image-2023-11-22-18-32-28-344.png! > !image-2023-11-22-18-33-01-885.png! > Also, I have another question. The Submit Plan statistic is clearly > incorrect. Is this something that needs to be fixed? > !image-2023-11-22-18-33-32-915.png! > -- This m
[jira] [Commented] (HIVE-27898) HIVE4 can't use ICEBERG table in subqueries
[ https://issues.apache.org/jira/browse/HIVE-27898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789008#comment-17789008 ] zhangbutao commented on HIVE-27898: --- Please provide a simpler test case to help others reproduce this issue. 1) Could you create a simpler table with only a few columns? The table *_datacenter.dwd.b_std_trade_* has too many columns. 2) Could you insert a few rows of sample data to help reproduce the issue? > HIVE4 can't use ICEBERG table in subqueries > --- > > Key: HIVE-27898 > URL: https://issues.apache.org/jira/browse/HIVE-27898 > Project: Hive > Issue Type: Bug > Components: Iceberg integration >Affects Versions: 4.0.0-beta-1 >Reporter: yongzhi.shao >Priority: Critical > > Currently, we found that when using HIVE4-BETA1 version, if we use ICEBERG > table in the subquery, we can't get any data in the end. > I have used HIVE3-TEZ for cross validation and HIVE3 does not have this > problem when querying ICEBERG. > {code:java} > --spark3.4.1+iceberg 1.4.2 > CREATE TABLE datacenter.dwd.b_std_trade ( > uni_order_id STRING, > data_from BIGINT, > partner STRING, > plat_code STRING, > order_id STRING, > uni_shop_id STRING, > uni_id STRING, > guide_id STRING, > shop_id STRING, > plat_account STRING, > total_fee DOUBLE, > item_discount_fee DOUBLE, > trade_discount_fee DOUBLE, > adjust_fee DOUBLE, > post_fee DOUBLE, > discount_rate DOUBLE, > payment_no_postfee DOUBLE, > payment DOUBLE, > pay_time STRING, > product_num BIGINT, > order_status STRING, > is_refund STRING, > refund_fee DOUBLE, > insert_time STRING, > created STRING, > endtime STRING, > modified STRING, > trade_type STRING, > receiver_name STRING, > receiver_country STRING, > receiver_state STRING, > receiver_city STRING, > receiver_district STRING, > receiver_town STRING, > receiver_address STRING, > receiver_mobile STRING, > trade_source STRING, > delivery_type STRING, > consign_time STRING, > orders_num BIGINT, > is_presale BIGINT, > presale_status STRING, > 
first_fee_paytime STRING, > last_fee_paytime STRING, > first_paid_fee DOUBLE, > tenant STRING, > tidb_modified STRING, > step_paid_fee DOUBLE, > seller_flag STRING, > is_used_store_card BIGINT, > store_card_used DOUBLE, > store_card_basic_used DOUBLE, > store_card_expand_used DOUBLE, > order_promotion_num BIGINT, > item_promotion_num BIGINT, > buyer_remark STRING, > seller_remark STRING, > trade_business_type STRING) > USING iceberg > PARTITIONED BY (uni_shop_id, truncate(4, created)) > LOCATION '/iceberg-catalog/warehouse/dwd/b_std_trade' > TBLPROPERTIES ( > 'current-snapshot-id' = '7217819472703702905', > 'format' = 'iceberg/orc', > 'format-version' = '1', > 'hive.stored-as' = 'iceberg', > 'read.orc.vectorization.enabled' = 'true', > 'sort-order' = 'uni_shop_id ASC NULLS FIRST, created ASC NULLS FIRST', > 'write.distribution-mode' = 'hash', > 'write.format.default' = 'orc', > 'write.metadata.delete-after-commit.enabled' = 'true', > 'write.metadata.previous-versions-max' = '3', > 'write.orc.bloom.filter.columns' = 'order_id', > 'write.orc.compression-codec' = 'zstd') > --hive-iceberg > CREATE EXTERNAL TABLE iceberg_dwd.b_std_trade > STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' > LOCATION 'hdfs:///iceberg-catalog/warehouse/dwd/b_std_trade' > TBLPROPERTIES > ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true'); > select * from iceberg_dwd.b_std_trade > where uni_shop_id = 'TEST|1' limit 10 --10 rows > select * > from ( > select * from iceberg_dwd.b_std_trade > where uni_shop_id = 'TEST|1' limit 10 > ) t1; --10 rows > select uni_shop_id > from ( > select * from iceberg_dwd.b_std_trade > where uni_shop_id = 'TEST|1' limit 10 > ) t1; --0 rows > select uni_shop_id > from ( > select uni_shop_id as uni_shop_id from iceberg_dwd.b_std_trade > where uni_shop_id = 'TEST|1' limit 10 > ) t1; --0 rows > --hive-orc > select uni_shop_id > from ( > select * from iceberg_dwd.trade_test > where uni_shop_id = 'TEST|1' limit 10 > ) t1;--10 
ROWS{code}
[jira] [Commented] (HIVE-27900) hive can not read iceberg-parquet table
[ https://issues.apache.org/jira/browse/HIVE-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789007#comment-17789007 ] zhangbutao commented on HIVE-27900: --- I cannot reproduce this issue on master code. My env is: 1) Hive master branch: you can compile the Hive code using the cmd: {code:java} mvn clean install -DskipTests -Piceberg -Pdist{code} 2) Tez 0.10.2: I recommend you test with 0.10.2, as 0.10.3 is not released and we cannot be sure that 0.10.3 works well with Hive. 3) Hadoop 3.3.1 BTW, if the table _*local.test.b_qqd_shop_rfm_parquet_snappy*_ is empty, without data, does the issue still occur in your env? > hive can not read iceberg-parquet table > --- > > Key: HIVE-27900 > URL: https://issues.apache.org/jira/browse/HIVE-27900 > Project: Hive > Issue Type: Bug > Components: Iceberg integration >Affects Versions: 4.0.0-beta-1 >Reporter: yongzhi.shao >Priority: Major > > We found that using HIVE4-BETA version, we could not query the > Iceberg-Parquet table with vectorised execution turned on. 
> {code:java} > --spark-sql(3.4.1+iceberg 1.4.2) > CREATE TABLE local.test.b_qqd_shop_rfm_parquet_snappy ( > a string,b string,c string) > USING iceberg > LOCATION '/iceberg-catalog/warehouse/test/b_qqd_shop_rfm_parquet_snappy' > TBLPROPERTIES ( > 'current-snapshot-id' = '5138351937447353683', > 'format' = 'iceberg/parquet', > 'format-version' = '2', > 'read.orc.vectorization.enabled' = 'true', > 'write.format.default' = 'parquet', > 'write.metadata.delete-after-commit.enabled' = 'true', > 'write.metadata.previous-versions-max' = '3', > 'write.parquet.compression-codec' = 'snappy'); > --hive-sql > CREATE EXTERNAL TABLE iceberg_dwd.b_qqd_shop_rfm_parquet_snappy > STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' > LOCATION > 'hdfs://xxx/iceberg-catalog/warehouse/test/b_qqd_shop_rfm_parquet_snappy/' > TBLPROPERTIES > ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true'); > set hive.default.fileformat=orc; > set hive.default.fileformat.managed=orc; > create table test_parquet_as_orc as select * from > b_qqd_shop_rfm_parquet_snappy limit 100; > , TaskAttempt 2 failed, info=[Error: Node: /xxx..xx.xx: Error while > running task ( failure ) : > attempt_1696729618575_69586_1_00_00_2:java.lang.RuntimeException: > java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: > Hive Runtime Error while processing row > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:348) > at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:276) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:381) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:82) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:69) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:69) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:39) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at > com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131) > at > com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:76) > at > com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > Caused by: java.lang.RuntimeException: > org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while > processing row > at > org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:110) > at > org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:83) > at > org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:414) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:293) > ... 16 more > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime > Error while processing row > at > org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:993) > at > org.apache.hadoop.hive.ql.exec.tez.MapRecordSourc
[jira] [Commented] (HIVE-27899) Killed speculative execution task attempt should not commit file
[ https://issues.apache.org/jira/browse/HIVE-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788971#comment-17788971 ] Sungwoo Park commented on HIVE-27899: - Calling canCommit() may not be a complete solution. For example, can we have a bad scenario like this? TaskAttempt#1 calls canCommit(), writes output, and then fails for some reason. Later TaskAttempt#2 calls canCommit(), writes output, and then completes successfully. > Killed speculative execution task attempt should not commit file > > > Key: HIVE-27899 > URL: https://issues.apache.org/jira/browse/HIVE-27899 > Project: Hive > Issue Type: Bug > Components: Tez >Reporter: Chenyu Zheng >Assignee: Chenyu Zheng >Priority: Major > Attachments: reproduce_bug.md > > > As I mentioned in HIVE-25561, when tez turns on speculative execution, the > data file produced by hive may be duplicated. I mentioned in HIVE-25561 that > if the speculatively executed task is killed, some data may be submitted > unexpectedly. However, after HIVE-25561, there is still a situation that has > not been solved. If two task attempts commit file at the same time, the > problem of duplicate data files may also occur. Although the probability of > this happening is very, very low, it does happen. > > Why? > There are two key steps: > (1)FileSinkOperator::closeOp > TezProcessor::initializeAndRunProcessor --> ... --> FileSinkOperator::closeOp > --> fsp.commit > When the OP is closed, the process of closing the OP will be triggered, and > eventually the call to fsp.commit will be triggered. > (2)removeTempOrDuplicateFiles > (2.a)Firstly, listStatus the files in the temporary directory. > (2.b)Secondly check whether there are multiple incorrect commit, and finally > move the correct results to the final directory. > When speculative execution is enabled, when one attempt of a Task is > completed, other attempts will be killed. 
However, AM only sends the kill > event and does not ensure that all cleanup actions are completed, that is, > closeOp may be executed between 2.a and 2.b. Therefore, > removeTempOrDuplicateFiles will not delete the file generated by the killed > attempt. > How? > The problem is that both speculatively executed tasks commit the file. This > will not happen in the Tez examples because they will try canCommit, which > can guarantee that one and only one task attempt commits successfully. If one > task attempt executes canCommit successfully, the other one will be blocked by > canCommit until it receives a kill signal. > detail see: > [https://github.com/apache/tez/blob/51d6f53967110e2b91b6d90b46f8e16bdc062091/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/processor/SimpleMRProcessor.java#L70]
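The canCommit gate discussed above can be sketched as a compare-and-set on a winning attempt id. This is a hypothetical stand-in for the AM-side bookkeeping in Tez, not Tez code; `CanCommitSketch` and its field names are invented for illustration, and it deliberately omits the revocation step a real AM would need when the winning attempt later fails (the scenario Sungwoo Park raises):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the canCommit() contract: among all attempts of one
// task, exactly one attempt wins the right to commit; the losers must not move
// their output to the final directory.
public class CanCommitSketch {
    private static final int NONE = -1;
    private final AtomicInteger committer = new AtomicInteger(NONE);

    /** Returns true for exactly one attempt id per task (idempotent for the winner). */
    public boolean canCommit(int attemptId) {
        return committer.compareAndSet(NONE, attemptId) || committer.get() == attemptId;
    }

    public static void main(String[] args) {
        CanCommitSketch task = new CanCommitSketch();
        System.out.println(task.canCommit(1)); // true  - attempt #1 wins the CAS
        System.out.println(task.canCommit(2)); // false - attempt #2 must not commit
        System.out.println(task.canCommit(1)); // true  - repeat call by the winner still succeeds
    }
}
```

The design point is that the decision is made atomically in one place (the AM) before any attempt moves files, rather than after the fact by removeTempOrDuplicateFiles scanning the temporary directory, which is what leaves the 2.a/2.b race open.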
[jira] [Created] (HIVE-27904) Add fixed-length column serde
Jiamin Wang created HIVE-27904: -- Summary: Add fixed-length column serde Key: HIVE-27904 URL: https://issues.apache.org/jira/browse/HIVE-27904 Project: Hive Issue Type: New Feature Components: Serializers/Deserializers Reporter: Jiamin Wang Hive does not support setting the column delimiter to a space, and some systems require storing files in a fixed-length format. I am thinking that maybe we can add this feature. I can submit the code; tables would be created like this. ```sql CREATE TABLE fixed_length_table ( column1 STRING, column2 STRING, column3 STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.fixed.FixedLengthTextSerDe' WITH SERDEPROPERTIES ( "field.lengths"="10,5,8" ) STORED AS TEXTFILE; ```
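The row-splitting the proposed serde implies can be sketched in a few lines. This is a minimal illustration, not the proposed `FixedLengthTextSerDe` implementation: it assumes `field.lengths` gives character widths in column order and that fields are space-padded; the `FixedWidthParser` class and its `parse` method are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of fixed-width deserialization: cut each line into
// substrings of the widths given by "field.lengths"="10,5,8" and strip padding.
public class FixedWidthParser {
    public static List<String> parse(String line, int[] widths) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        for (int w : widths) {
            int end = Math.min(pos + w, line.length());   // tolerate a short last field
            fields.add(line.substring(pos, end).trim());  // strip the space padding
            pos = end;
        }
        return fields;
    }

    public static void main(String[] args) {
        // widths 10,5,8 as in the example DDL above
        System.out.println(parse("col1value ab   20231123", new int[]{10, 5, 8}));
        // → [col1value, ab, 20231123]
    }
}
```

A real serde would additionally need to handle multi-byte encodings (byte widths vs. character widths), NULL representation, and the serialization direction, so the `field.lengths` semantics would have to be pinned down in the design.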
[jira] [Updated] (HIVE-27901) Hive's performance for querying the Iceberg table is very poor.
[ https://issues.apache.org/jira/browse/HIVE-27901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yongzhi.shao updated HIVE-27901: Description: I am using HIVE4.0.0-BETA for testing. BTW,I found that the performance of HIVE reading ICEBERG table is still very slow. How should I deal with this problem? I count a 7 billion table and compare the performance difference between HIVE reading ICEBERG-ORC and ORC table respectively. We use ICEBERG 1.4.2, ICEBERG-ORC with ZSTD compression enabled. ORC with SNAPPY compression. HADOOP version 3.1.1 (native zstd not supported). {code:java} --spark3.4.1+iceberg 1.4.2 CREATE TABLE datacenter.dwd.b_std_trade ( uni_order_id STRING, data_from BIGINT, partner STRING, plat_code STRING, order_id STRING, uni_shop_id STRING, uni_id STRING, guide_id STRING, shop_id STRING, plat_account STRING, total_fee DOUBLE, item_discount_fee DOUBLE, trade_discount_fee DOUBLE, adjust_fee DOUBLE, post_fee DOUBLE, discount_rate DOUBLE, payment_no_postfee DOUBLE, payment DOUBLE, pay_time STRING, product_num BIGINT, order_status STRING, is_refund STRING, refund_fee DOUBLE, insert_time STRING, created STRING, endtime STRING, modified STRING, trade_type STRING, receiver_name STRING, receiver_country STRING, receiver_state STRING, receiver_city STRING, receiver_district STRING, receiver_town STRING, receiver_address STRING, receiver_mobile STRING, trade_source STRING, delivery_type STRING, consign_time STRING, orders_num BIGINT, is_presale BIGINT, presale_status STRING, first_fee_paytime STRING, last_fee_paytime STRING, first_paid_fee DOUBLE, tenant STRING, tidb_modified STRING, step_paid_fee DOUBLE, seller_flag STRING, is_used_store_card BIGINT, store_card_used DOUBLE, store_card_basic_used DOUBLE, store_card_expand_used DOUBLE, order_promotion_num BIGINT, item_promotion_num BIGINT, buyer_remark STRING, seller_remark STRING, trade_business_type STRING) USING iceberg PARTITIONED BY (uni_shop_id, truncate(4, created)) LOCATION 
'/iceberg-catalog/warehouse/dwd/b_std_trade' TBLPROPERTIES ( 'current-snapshot-id' = '7217819472703702905', 'format' = 'iceberg/orc', 'format-version' = '1', 'hive.stored-as' = 'iceberg', 'read.orc.vectorization.enabled' = 'true', 'sort-order' = 'uni_shop_id ASC NULLS FIRST, created ASC NULLS FIRST', 'write.distribution-mode' = 'hash', 'write.format.default' = 'orc', 'write.metadata.delete-after-commit.enabled' = 'true', 'write.metadata.previous-versions-max' = '3', 'write.orc.bloom.filter.columns' = 'order_id', 'write.orc.compression-codec' = 'zstd') --hive-iceberg CREATE EXTERNAL TABLE iceberg_dwd.b_std_trade STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' LOCATION 'hdfs:///iceberg-catalog/warehouse/dwd/b_std_trade' TBLPROPERTIES ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true'); --inner orc table( set hive default format = orc ) set hive.default.fileformat=orc; set hive.default.fileformat.managed=orc; create table if not exists iceberg_dwd.orc_inner_table as select * from iceberg_dwd.b_std_trade;{code} !image-2023-11-22-18-32-28-344.png! !image-2023-11-22-18-33-01-885.png! Also, I have another question. The Submit Plan statistic is clearly incorrect. Is this something that needs to be fixed? !image-2023-11-22-18-33-32-915.png! was: I am using HIVE4.0.0-BETA for testing. BTW,I found that the performance of HIVE reading ICEBERG table is still very slow. How should I deal with this problem? I count a 7 billion table and compare the performance difference between HIVE reading ICEBERG-ORC and ORC table respectively. We use ICEBERG 1.4.2, ICEBERG-ORC with ZSTD compression enabled. ORC with SNAPPY compression. HADOOP version 3.1.1 (native zstd not supported). 
{code:java} --spark3.4.1+iceberg 1.4.2 CREATE TABLE datacenter.dwd.b_std_trade ( uni_order_id STRING, data_from BIGINT, partner STRING, plat_code STRING, order_id STRING, uni_shop_id STRING, uni_id STRING, guide_id STRING, shop_id STRING, plat_account STRING, total_fee DOUBLE, item_discount_fee DOUBLE, trade_discount_fee DOUBLE, adjust_fee DOUBLE, post_fee DOUBLE, discount_rate DOUBLE, payment_no_postfee DOUBLE, payment DOUBLE, pay_time STRING, product_num BIGINT, order_status STRING, is_refund STRING, refund_fee DOUBLE, insert_time STRING, created STRING, endtime STRING, modified STRING, trade_type STRING, receiver_name STRING, receiver_country STRING, receiver_state STRING, receiver_city STRING, receiver_district STRING, receiver_town STRING, receiver_address STRING, receiver_mobile STRING, trade_source STRING, delivery_type STRING, consign_time STRING, orders_num BIGINT, is_presale BIGINT, presale_status STRING, first_fee_paytime STRING, last_fee_paytim
[jira] [Updated] (HIVE-27901) Hive's performance for querying the Iceberg table is very poor.
[ https://issues.apache.org/jira/browse/HIVE-27901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yongzhi.shao updated HIVE-27901: Description: I am using HIVE4.0.0-BETA for testing. BTW,I found that the performance of HIVE reading ICEBERG table is still very slow. How should I deal with this problem? I count a 7 billion table and compare the performance difference between HIVE reading ICEBERG-ORC and ORC table respectively. We use ICEBERG 1.4.2, ICEBERG-ORC with ZSTD compression enabled. ORC with SNAPPY compression. HADOOP version 3.1.1 (native zstd not supported). {code:java} --spark3.4.1+iceberg 1.4.2 CREATE TABLE datacenter.dwd.b_std_trade ( uni_order_id STRING, data_from BIGINT, partner STRING, plat_code STRING, order_id STRING, uni_shop_id STRING, uni_id STRING, guide_id STRING, shop_id STRING, plat_account STRING, total_fee DOUBLE, item_discount_fee DOUBLE, trade_discount_fee DOUBLE, adjust_fee DOUBLE, post_fee DOUBLE, discount_rate DOUBLE, payment_no_postfee DOUBLE, payment DOUBLE, pay_time STRING, product_num BIGINT, order_status STRING, is_refund STRING, refund_fee DOUBLE, insert_time STRING, created STRING, endtime STRING, modified STRING, trade_type STRING, receiver_name STRING, receiver_country STRING, receiver_state STRING, receiver_city STRING, receiver_district STRING, receiver_town STRING, receiver_address STRING, receiver_mobile STRING, trade_source STRING, delivery_type STRING, consign_time STRING, orders_num BIGINT, is_presale BIGINT, presale_status STRING, first_fee_paytime STRING, last_fee_paytime STRING, first_paid_fee DOUBLE, tenant STRING, tidb_modified STRING, step_paid_fee DOUBLE, seller_flag STRING, is_used_store_card BIGINT, store_card_used DOUBLE, store_card_basic_used DOUBLE, store_card_expand_used DOUBLE, order_promotion_num BIGINT, item_promotion_num BIGINT, buyer_remark STRING, seller_remark STRING, trade_business_type STRING) USING iceberg PARTITIONED BY (uni_shop_id, truncate(4, created)) LOCATION 
'/iceberg-catalog/warehouse/dwd/b_std_trade' TBLPROPERTIES ( 'current-snapshot-id' = '7217819472703702905', 'format' = 'iceberg/orc', 'format-version' = '1', 'hive.stored-as' = 'iceberg', 'read.orc.vectorization.enabled' = 'true', 'sort-order' = 'uni_shop_id ASC NULLS FIRST, created ASC NULLS FIRST', 'write.distribution-mode' = 'hash', 'write.format.default' = 'orc', 'write.metadata.delete-after-commit.enabled' = 'true', 'write.metadata.previous-versions-max' = '3', 'write.orc.bloom.filter.columns' = 'order_id', 'write.orc.compression-codec' = 'zstd') --hive-iceberg CREATE EXTERNAL TABLE iceberg_dwd.b_std_trade STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' LOCATION 'hdfs:///iceberg-catalog/warehouse/dwd/b_std_trade' TBLPROPERTIES ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true'); --inner orc table( set hive default format = orc ) create table if not exists iceberg_dwd.orc_inner_table as select * from iceberg_dwd.b_std_trade;{code} !image-2023-11-22-18-32-28-344.png! !image-2023-11-22-18-33-01-885.png! Also, I have another question. The Submit Plan statistic is clearly incorrect. Is this something that needs to be fixed? !image-2023-11-22-18-33-32-915.png! was: I am using HIVE4.0.0-BETA for testing. BTW,I found that the performance of HIVE reading ICEBERG table is still very slow. How should I deal with this problem? I count a 7 billion table and compare the performance difference between HIVE reading ICEBERG-ORC and ORC table respectively. We use ICEBERG 1.4.2, ICEBERG-ORC with ZSTD compression enabled. ORC with SNAPPY compression. HADOOP version 3.1.1 (native zstd not supported). !image-2023-11-22-18-32-28-344.png! !image-2023-11-22-18-33-01-885.png! Also, I have another question. The Submit Plan statistic is clearly incorrect. Is this something that needs to be fixed? !image-2023-11-22-18-33-32-915.png! > Hive's performance for querying the Iceberg table is very poor. 
> --- > > Key: HIVE-27901 > URL: https://issues.apache.org/jira/browse/HIVE-27901 > Project: Hive > Issue Type: Bug > Components: Iceberg integration >Affects Versions: 4.0.0-beta-1 >Reporter: yongzhi.shao >Priority: Major > Attachments: image-2023-11-22-18-32-28-344.png, > image-2023-11-22-18-33-01-885.png, image-2023-11-22-18-33-32-915.png > > > I am using HIVE4.0.0-BETA for testing. > BTW,I found that the performance of HIVE reading ICEBERG table is still very > slow. > How should I deal with this problem? > I count a 7 billion table and compare the performance difference between HIVE > reading ICEBERG-ORC and ORC table respectively. > We use ICEBERG 1.4.
[jira] [Updated] (HIVE-27898) HIVE4 can't use ICEBERG table in subqueries
[ https://issues.apache.org/jira/browse/HIVE-27898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yongzhi.shao updated HIVE-27898: Description: Currently, we found that when using HIVE4-BETA1 version, if we use ICEBERG table in the subquery, we can't get any data in the end. I have used HIVE3-TEZ for cross validation and HIVE3 does not have this problem when querying ICEBERG. {code:java} --spark3.4.1+iceberg 1.4.2 CREATE TABLE datacenter.dwd.b_std_trade ( uni_order_id STRING, data_from BIGINT, partner STRING, plat_code STRING, order_id STRING, uni_shop_id STRING, uni_id STRING, guide_id STRING, shop_id STRING, plat_account STRING, total_fee DOUBLE, item_discount_fee DOUBLE, trade_discount_fee DOUBLE, adjust_fee DOUBLE, post_fee DOUBLE, discount_rate DOUBLE, payment_no_postfee DOUBLE, payment DOUBLE, pay_time STRING, product_num BIGINT, order_status STRING, is_refund STRING, refund_fee DOUBLE, insert_time STRING, created STRING, endtime STRING, modified STRING, trade_type STRING, receiver_name STRING, receiver_country STRING, receiver_state STRING, receiver_city STRING, receiver_district STRING, receiver_town STRING, receiver_address STRING, receiver_mobile STRING, trade_source STRING, delivery_type STRING, consign_time STRING, orders_num BIGINT, is_presale BIGINT, presale_status STRING, first_fee_paytime STRING, last_fee_paytime STRING, first_paid_fee DOUBLE, tenant STRING, tidb_modified STRING, step_paid_fee DOUBLE, seller_flag STRING, is_used_store_card BIGINT, store_card_used DOUBLE, store_card_basic_used DOUBLE, store_card_expand_used DOUBLE, order_promotion_num BIGINT, item_promotion_num BIGINT, buyer_remark STRING, seller_remark STRING, trade_business_type STRING) USING iceberg PARTITIONED BY (uni_shop_id, truncate(4, created)) LOCATION '/iceberg-catalog/warehouse/dwd/b_std_trade' TBLPROPERTIES ( 'current-snapshot-id' = '7217819472703702905', 'format' = 'iceberg/orc', 'format-version' = '1', 'hive.stored-as' = 'iceberg', 
'read.orc.vectorization.enabled' = 'true', 'sort-order' = 'uni_shop_id ASC NULLS FIRST, created ASC NULLS FIRST', 'write.distribution-mode' = 'hash', 'write.format.default' = 'orc', 'write.metadata.delete-after-commit.enabled' = 'true', 'write.metadata.previous-versions-max' = '3', 'write.orc.bloom.filter.columns' = 'order_id', 'write.orc.compression-codec' = 'zstd') --hive-iceberg CREATE EXTERNAL TABLE iceberg_dwd.b_std_trade STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' LOCATION 'hdfs:///iceberg-catalog/warehouse/dwd/b_std_trade' TBLPROPERTIES ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true'); select * from iceberg_dwd.b_std_trade where uni_shop_id = 'TEST|1' limit 10 --10 rows select * from ( select * from iceberg_dwd.b_std_trade where uni_shop_id = 'TEST|1' limit 10 ) t1; --10 rows select uni_shop_id from ( select * from iceberg_dwd.b_std_trade where uni_shop_id = 'TEST|1' limit 10 ) t1; --0 rows select uni_shop_id from ( select uni_shop_id as uni_shop_id from iceberg_dwd.b_std_trade where uni_shop_id = 'TEST|1' limit 10 ) t1; --0 rows --orc select uni_shop_id from ( select * from iceberg_dwd.trade_test where uni_shop_id = 'TEST|1' limit 10 ) t1;--10 ROWS{code} was: Currently, we found that when using HIVE4-BETA1 version, if we use ICEBERG table in the subquery, we can't get any data in the end. I have used HIVE3-TEZ for cross validation and HIVE3 does not have this problem when querying ICEBERG. 
{code:java} --iceberg select * from iceberg_dwd.b_std_trade where uni_shop_id = 'TEST|1' limit 10 --10 rows select * from ( select * from iceberg_dwd.b_std_trade where uni_shop_id = 'TEST|1' limit 10 ) t1; --10 rows select uni_shop_id from ( select * from iceberg_dwd.b_std_trade where uni_shop_id = 'TEST|1' limit 10 ) t1; --0 rows select uni_shop_id from ( select uni_shop_id as uni_shop_id from iceberg_dwd.b_std_trade where uni_shop_id = 'TEST|1' limit 10 ) t1; --0 rows --orc select uni_shop_id from ( select * from iceberg_dwd.trade_test where uni_shop_id = 'TEST|1' limit 10 ) t1;--10 ROWS{code} > HIVE4 can't use ICEBERG table in subqueries > --- > > Key: HIVE-27898 > URL: https://issues.apache.org/jira/browse/HIVE-27898 > Project: Hive > Issue Type: Bug > Components: Iceberg integration >Affects Versions: 4.0.0-beta-1 >Reporter: yongzhi.shao >Priority: Critical > > Currently, we found that when using HIVE4-BETA1 version, if we use ICEBERG > table in the subquery, we can't get any data in the end. > I have used HIVE3-TEZ for cross validation and HIVE3 does not have this > problem when querying
[jira] [Updated] (HIVE-27898) HIVE4 can't use ICEBERG table in subqueries
[ https://issues.apache.org/jira/browse/HIVE-27898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yongzhi.shao updated HIVE-27898: Description: Currently, we found that when using HIVE4-BETA1 version, if we use ICEBERG table in the subquery, we can't get any data in the end. I have used HIVE3-TEZ for cross validation and HIVE3 does not have this problem when querying ICEBERG. {code:java} --spark3.4.1+iceberg 1.4.2 CREATE TABLE datacenter.dwd.b_std_trade ( uni_order_id STRING, data_from BIGINT, partner STRING, plat_code STRING, order_id STRING, uni_shop_id STRING, uni_id STRING, guide_id STRING, shop_id STRING, plat_account STRING, total_fee DOUBLE, item_discount_fee DOUBLE, trade_discount_fee DOUBLE, adjust_fee DOUBLE, post_fee DOUBLE, discount_rate DOUBLE, payment_no_postfee DOUBLE, payment DOUBLE, pay_time STRING, product_num BIGINT, order_status STRING, is_refund STRING, refund_fee DOUBLE, insert_time STRING, created STRING, endtime STRING, modified STRING, trade_type STRING, receiver_name STRING, receiver_country STRING, receiver_state STRING, receiver_city STRING, receiver_district STRING, receiver_town STRING, receiver_address STRING, receiver_mobile STRING, trade_source STRING, delivery_type STRING, consign_time STRING, orders_num BIGINT, is_presale BIGINT, presale_status STRING, first_fee_paytime STRING, last_fee_paytime STRING, first_paid_fee DOUBLE, tenant STRING, tidb_modified STRING, step_paid_fee DOUBLE, seller_flag STRING, is_used_store_card BIGINT, store_card_used DOUBLE, store_card_basic_used DOUBLE, store_card_expand_used DOUBLE, order_promotion_num BIGINT, item_promotion_num BIGINT, buyer_remark STRING, seller_remark STRING, trade_business_type STRING) USING iceberg PARTITIONED BY (uni_shop_id, truncate(4, created)) LOCATION '/iceberg-catalog/warehouse/dwd/b_std_trade' TBLPROPERTIES ( 'current-snapshot-id' = '7217819472703702905', 'format' = 'iceberg/orc', 'format-version' = '1', 'hive.stored-as' = 'iceberg', 
'read.orc.vectorization.enabled' = 'true', 'sort-order' = 'uni_shop_id ASC NULLS FIRST, created ASC NULLS FIRST', 'write.distribution-mode' = 'hash', 'write.format.default' = 'orc', 'write.metadata.delete-after-commit.enabled' = 'true', 'write.metadata.previous-versions-max' = '3', 'write.orc.bloom.filter.columns' = 'order_id', 'write.orc.compression-codec' = 'zstd') --hive-iceberg CREATE EXTERNAL TABLE iceberg_dwd.b_std_trade STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' LOCATION 'hdfs:///iceberg-catalog/warehouse/dwd/b_std_trade' TBLPROPERTIES ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true'); select * from iceberg_dwd.b_std_trade where uni_shop_id = 'TEST|1' limit 10 --10 rows select * from ( select * from iceberg_dwd.b_std_trade where uni_shop_id = 'TEST|1' limit 10 ) t1; --10 rows select uni_shop_id from ( select * from iceberg_dwd.b_std_trade where uni_shop_id = 'TEST|1' limit 10 ) t1; --0 rows select uni_shop_id from ( select uni_shop_id as uni_shop_id from iceberg_dwd.b_std_trade where uni_shop_id = 'TEST|1' limit 10 ) t1; --0 rows --hive-orc select uni_shop_id from ( select * from iceberg_dwd.trade_test where uni_shop_id = 'TEST|1' limit 10 ) t1;--10 ROWS{code} was: Currently, we found that when using HIVE4-BETA1 version, if we use ICEBERG table in the subquery, we can't get any data in the end. I have used HIVE3-TEZ for cross validation and HIVE3 does not have this problem when querying ICEBERG. 
{code:java} --spark3.4.1+iceberg 1.4.2 CREATE TABLE datacenter.dwd.b_std_trade ( uni_order_id STRING, data_from BIGINT, partner STRING, plat_code STRING, order_id STRING, uni_shop_id STRING, uni_id STRING, guide_id STRING, shop_id STRING, plat_account STRING, total_fee DOUBLE, item_discount_fee DOUBLE, trade_discount_fee DOUBLE, adjust_fee DOUBLE, post_fee DOUBLE, discount_rate DOUBLE, payment_no_postfee DOUBLE, payment DOUBLE, pay_time STRING, product_num BIGINT, order_status STRING, is_refund STRING, refund_fee DOUBLE, insert_time STRING, created STRING, endtime STRING, modified STRING, trade_type STRING, receiver_name STRING, receiver_country STRING, receiver_state STRING, receiver_city STRING, receiver_district STRING, receiver_town STRING, receiver_address STRING, receiver_mobile STRING, trade_source STRING, delivery_type STRING, consign_time STRING, orders_num BIGINT, is_presale BIGINT, presale_status STRING, first_fee_paytime STRING, last_fee_paytime STRING, first_paid_fee DOUBLE, tenant STRING, tidb_modified STRING, step_paid_fee DOUBLE, seller_flag STRING, is_used_store_card BIGINT, store_card_used DOUBLE, store_card_basic_used DOUBLE, store_card_expand_
[jira] [Updated] (HIVE-27900) hive can not read iceberg-parquet table
[ https://issues.apache.org/jira/browse/HIVE-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yongzhi.shao updated HIVE-27900: Description: We found that using HIVE4-BETA version, we could not query the Iceberg-Parquet table with vectorised execution turned on. {code:java} --spark-sql(3.4.1+iceberg 1.4.2) CREATE TABLE local.test.b_qqd_shop_rfm_parquet_snappy ( a string,b string,c string) USING iceberg LOCATION '/iceberg-catalog/warehouse/test/b_qqd_shop_rfm_parquet_snappy' TBLPROPERTIES ( 'current-snapshot-id' = '5138351937447353683', 'format' = 'iceberg/parquet', 'format-version' = '2', 'read.orc.vectorization.enabled' = 'true', 'write.format.default' = 'parquet', 'write.metadata.delete-after-commit.enabled' = 'true', 'write.metadata.previous-versions-max' = '3', 'write.parquet.compression-codec' = 'snappy'); --hive-sql CREATE EXTERNAL TABLE iceberg_dwd.b_qqd_shop_rfm_parquet_snappy STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' LOCATION 'hdfs://xxx/iceberg-catalog/warehouse/test/b_qqd_shop_rfm_parquet_snappy/' TBLPROPERTIES ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true'); set hive.default.fileformat=orc; set hive.default.fileformat.managed=orc; create table test_parquet_as_orc as select * from b_qqd_shop_rfm_parquet_snappy limit 100; , TaskAttempt 2 failed, info=[Error: Node: /xxx..xx.xx: Error while running task ( failure ) : attempt_1696729618575_69586_1_00_00_2:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:348) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:276) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:381) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:82) at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:69) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:69) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:39) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:76) at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:110) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:83) at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:414) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:293) ... 16 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:993) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:101) ... 
19 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.vector.reducesink.VectorReduceSinkEmptyKeyOperator.process(VectorReduceSinkEmptyKeyOperator.java:137) at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919) at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:158) at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919) at org.apache.hadoop.hive.ql.exec.vector.VectorLimitOperator.process(VectorLimitOperator.java:108) at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919) at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:171) at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.deliverVectorizedRowBatch(VectorMapOperator.java:809) at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:878) ... 20 more Caused by: java.lang.NullPointerEx
[jira] [Commented] (HIVE-23354) Remove file size sanity checking from compareTempOrDuplicateFiles
[ https://issues.apache.org/jira/browse/HIVE-23354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788956#comment-17788956 ] Chenyu Zheng commented on HIVE-23354: - [~jfs] [~pvary] [~kuczoram] [~nareshpr] Hi, I know we disabled speculative execution for removeTempOrDuplicateFiles in this issue. I found some problems with "Hive on Tez" when Tez speculative execution is enabled. While debugging them I filed HIVE-25561 and HIVE-27899, and I explain the root causes in those two issues. _"That might be problematic when there are speculative execution on the way, and the original execution is finished, but the newest/speculative execution is still running"_ I think that after HIVE-25561 and HIVE-27899 this is no longer a problem, at least on Tez. After HIVE-25561, if the original execution is killed, its output file is never committed and remains a temp file, which removeTempOrDuplicateFiles deletes. If the original execution finishes successfully, its file is committed, while the speculative task attempt blocks until it receives a kill signal and never commits, so the file generated by the speculative task stays a temp file and is likewise deleted by removeTempOrDuplicateFiles. What do you think? Can we enable speculative execution? After all, speculative execution is crucial in large-scale production environments. Looking forward to your reply! 
> Remove file size sanity checking from compareTempOrDuplicateFiles > - > > Key: HIVE-23354 > URL: https://issues.apache.org/jira/browse/HIVE-23354 > Project: Hive > Issue Type: Bug > Components: HiveServer2 >Reporter: John Sherman >Assignee: John Sherman >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0-alpha-1 > > Attachments: HIVE-23354.1.patch, HIVE-23354.2.patch, > HIVE-23354.3.patch, HIVE-23354.4.patch, HIVE-23354.5.patch, > HIVE-23354.6.patch, HIVE-23354.7.patch > > Time Spent: 20m > Remaining Estimate: 0h > > [https://github.com/apache/hive/blob/cdd55aa319a3440963a886ebfff11cd2a240781d/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1952-L2010] > compareTempOrDuplicateFiles uses a combination of attemptId and fileSize to > determine which file(s) to keep. > I've seen instances where this function throws an exception due to the fact > that the newer attemptId file size is less than the older attemptId (thus > failing the query). > I think this assumption is faulty, due to various factors such as file > compression and the order in which values are written. It may be prudent to > trust that the newest attemptId is in fact the best choice. -- This message was sent by Atlassian Jira (v8.20.10#820010)
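The selection rule the ticket argues for, trusting the newest attemptId rather than comparing file sizes, can be sketched in plain Java. This is a hypothetical illustration, not Hive's actual Utilities code; the file-name shape `<taskId>_<attemptId>` and the helper names are assumptions for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateFilePicker {
    // Hypothetical helper: extract the attempt id from a task output
    // file name of the assumed form "<taskId>_<attemptId>", e.g. "000000_1" -> 1.
    static int attemptId(String fileName) {
        return Integer.parseInt(fileName.substring(fileName.lastIndexOf('_') + 1));
    }

    // Keep, per task, the file with the highest attempt id, trusting the
    // newest attempt instead of sanity-checking file sizes (the behavior
    // this ticket argues for).
    static Map<String, String> pickNewest(List<String> files) {
        Map<String, String> best = new HashMap<>();
        for (String f : files) {
            String task = f.substring(0, f.lastIndexOf('_'));
            String cur = best.get(task);
            if (cur == null || attemptId(f) > attemptId(cur)) {
                best.put(task, f);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> kept =
            pickNewest(Arrays.asList("000000_0", "000000_1", "000001_0"));
        System.out.println(kept.get("000000")); // the newest attempt wins
    }
}
```

Under this rule a smaller-but-newer attempt file no longer fails the query; it is simply preferred.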
[jira] [Commented] (HIVE-27903) TBLPROPERTIES('history.expire.max-snapshot-age-ms') doesn't work
[ https://issues.apache.org/jira/browse/HIVE-27903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788895#comment-17788895 ] Ayush Saxena commented on HIVE-27903: - Should work post HIVE-27789. If you specify the timestamp via the command, the property won't be used. The official Iceberg documentation says as much: ``if {{older_than}} and {{retain_last}} are omitted, the table’s [expiration properties|https://iceberg.apache.org/docs/latest/configuration/#table-behavior-properties] will be used.`` So try with RETAIN LAST and this config should take effect. It isn't supposed to work when you specify an older_than timestamp, and in your example you specified older_than as ('2200-10-10') > TBLPROPERTIES('history.expire.max-snapshot-age-ms') doesn't work > > > Key: HIVE-27903 > URL: https://issues.apache.org/jira/browse/HIVE-27903 > Project: Hive > Issue Type: Improvement > Components: Hive >Affects Versions: 4.0.0-alpha-2 >Reporter: JK Pasimuthu >Priority: Major > > [https://github.com/apache/iceberg/issues/9123] > The 'history.expire.max-snapshot-age-ms' option doesn't have any effect while > expiring snapshots. 
> # > CREATE TABLE IF NOT EXISTS test5d78b6 ( > id INT, random1 STRING > ) > PARTITIONED BY (random2 STRING) > STORED BY ICEBERG > TBLPROPERTIES ( > 'write.format.default'='orc', > 'format-version'='2', > 'write.orc.compression-codec'='none' > ) > # INSERT INTO test5d78b6 SELECT if(isnull(MAX(id)) ,0 , MAX(id) ) +1, > uuid(), uuid() FROM test5d78b6 > # INSERT INTO test5d78b6 SELECT if(isnull(MAX(id)) ,0 , MAX(id) ) +1, > uuid(), uuid() FROM test5d78b6 > # SLEEP for 30 seconds > # INSERT INTO test5d78b6 SELECT if(isnull(MAX(id)) ,0 , MAX(id) ) +1, > uuid(), uuid() FROM test5d78b6 > # INSERT INTO test5d78b6 SELECT if(isnull(MAX(id)) ,0 , MAX(id) ) +1, > uuid(), uuid() FROM test5d78b6 > # SELECT (UNIX_TIMESTAMP(CURRENT_TIMESTAMP) - UNIX_TIMESTAMP('2023-10-09 > 13:23:54.455')) * 1000; > # ALTER TABLE test5d78b6 SET > tblproperties('history.expire.max-snapshot-age-ms'='54000'); - the elapsed > time in ms from the second insert and current time > # ALTER TABLE test5d78b6 EXECUTE expire_snapshots('2200-10-10'); > # SELECT COUNT FROM default.test5d78b6.snapshots; > output: 1. It should be 2 rows. The default 1 is retained and all snapshots > are expired as usual, so setting the property has no effect. > Additional Info: the default value for 'history.expire.max-snapshot-age-ms' > is 5 days per this link: > [https://iceberg.apache.org/docs/1.3.1/configuration/] > Now while writing the tests and running them, the expiring of snapshots just > worked fine within a few seconds of the snapshots being created. > So, I'm assuming that this option doesn't have any effect right now. Having > said that, I'm thinking about the implications this fix would have on end users. > An end user may not know about this option at all and will have a tough time > figuring out why the snapshots are not getting expired. One option could be > to set the default to 0 ms. -- This message was sent by Atlassian Jira (v8.20.10#820010)
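The precedence Ayush describes, that an explicit older_than timestamp overrides the table property, can be sketched as a small cutoff computation. This is a hypothetical sketch of the decision, not Hive's or Iceberg's actual code; the method name and parameters are invented for the example. The 5-day default matches the Iceberg documentation linked above.

```java
public class ExpirePolicy {
    // Iceberg's documented default for history.expire.max-snapshot-age-ms: 5 days.
    static final long DEFAULT_MAX_AGE_MS = 5L * 24 * 60 * 60 * 1000;

    // Snapshots committed before the returned cutoff are eligible for expiry.
    // An explicit older_than timestamp wins; only when it is omitted does the
    // table's history.expire.max-snapshot-age-ms property decide the cutoff.
    static long cutoffMillis(Long olderThanMs, Long tablePropMaxAgeMs, long nowMs) {
        if (olderThanMs != null) {
            return olderThanMs; // explicit timestamp overrides the property
        }
        long maxAge = (tablePropMaxAgeMs != null) ? tablePropMaxAgeMs : DEFAULT_MAX_AGE_MS;
        return nowMs - maxAge;
    }

    public static void main(String[] args) {
        long now = 1_700_000_000_000L;
        // Property ignored when an older_than timestamp is given (the report's case):
        System.out.println(cutoffMillis(7_258_118_400_000L, 54_000L, now));
        // Property used when older_than is omitted:
        System.out.println(cutoffMillis(null, 54_000L, now));
    }
}
```

Passing '2200-10-10' as older_than makes every snapshot older than the cutoff, which is why the property appeared to have no effect in the reproduction above.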
[jira] [Created] (HIVE-27903) TBLPROPERTIES('history.expire.max-snapshot-age-ms') doesn't work
JK Pasimuthu created HIVE-27903: --- Summary: TBLPROPERTIES('history.expire.max-snapshot-age-ms') doesn't work Key: HIVE-27903 URL: https://issues.apache.org/jira/browse/HIVE-27903 Project: Hive Issue Type: Improvement Components: Hive Affects Versions: 4.0.0-alpha-2 Reporter: JK Pasimuthu [https://github.com/apache/iceberg/issues/9123] The 'history.expire.max-snapshot-age-ms' option doesn't have any effect while expiring snapshots. # CREATE TABLE IF NOT EXISTS test5d78b6 ( id INT, random1 STRING ) PARTITIONED BY (random2 STRING) STORED BY ICEBERG TBLPROPERTIES ( 'write.format.default'='orc', 'format-version'='2', 'write.orc.compression-codec'='none' ) # INSERT INTO test5d78b6 SELECT if(isnull(MAX(id)) ,0 , MAX(id) ) +1, uuid(), uuid() FROM test5d78b6 # INSERT INTO test5d78b6 SELECT if(isnull(MAX(id)) ,0 , MAX(id) ) +1, uuid(), uuid() FROM test5d78b6 # SLEEP for 30 seconds # INSERT INTO test5d78b6 SELECT if(isnull(MAX(id)) ,0 , MAX(id) ) +1, uuid(), uuid() FROM test5d78b6 # INSERT INTO test5d78b6 SELECT if(isnull(MAX(id)) ,0 , MAX(id) ) +1, uuid(), uuid() FROM test5d78b6 # SELECT (UNIX_TIMESTAMP(CURRENT_TIMESTAMP) - UNIX_TIMESTAMP('2023-10-09 13:23:54.455')) * 1000; # ALTER TABLE test5d78b6 SET tblproperties('history.expire.max-snapshot-age-ms'='54000'); - the elapsed time in ms from the second insert and current time # ALTER TABLE test5d78b6 EXECUTE expire_snapshots('2200-10-10'); # SELECT COUNT FROM default.test5d78b6.snapshots; output: 1. It should be 2 rows. The default 1 is retained and all snapshots are expired as usual, so setting the property has no effect. Additional Info: the default value for 'history.expire.max-snapshot-age-ms' is 5 days per this link: [https://iceberg.apache.org/docs/1.3.1/configuration/] Now while writing the tests and running them, the expiring of snapshots just worked fine within a few seconds of the snapshots being created. So, I'm assuming that this option doesn't have any effect right now. 
Having said that, I'm thinking about the implications this fix would have on end users. An end user may not know about this option at all and will have a tough time figuring out why the snapshots are not getting expired. One option could be to set the default to 0 ms. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-27902) Rewrite Update with empty Where clause to IOW
[ https://issues.apache.org/jira/browse/HIVE-27902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-27902: -- Labels: ACID iceberg (was: ) > Rewrite Update with empty Where clause to IOW > - > > Key: HIVE-27902 > URL: https://issues.apache.org/jira/browse/HIVE-27902 > Project: Hive > Issue Type: Improvement >Affects Versions: 4.0.0-beta-1 >Reporter: Denys Kuzmenko >Priority: Major > Labels: ACID, iceberg > > rewrite > {code} > update table mytbl set a = a+5 > {code} > with > {code}insert overwrite table mytbl as select a+5 from mytbl > {code} > note: in case of Iceberg tables it should take care of partition evolution > and overwrite all -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-27902) Rewrite Update with empty Where clause to IOW
[ https://issues.apache.org/jira/browse/HIVE-27902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-27902: -- Description: rewrite {code} update table mytbl set a = a+5 {code} with {code}insert overwrite table mytbl as select a+5 from mytbl {code} note: in case of Iceberg tables it should take care of partition evolution and overwrite all was: rewrite {code} update table mytbl set a = a+5 {code} with {code}insert overwrite table mytbl as select a+5 from mytbl {code} > Rewrite Update with empty Where clause to IOW > - > > Key: HIVE-27902 > URL: https://issues.apache.org/jira/browse/HIVE-27902 > Project: Hive > Issue Type: Improvement >Reporter: Denys Kuzmenko >Priority: Major > > rewrite > {code} > update table mytbl set a = a+5 > {code} > with > {code}insert overwrite table mytbl as select a+5 from mytbl > {code} > note: in case of Iceberg tables it should take care of partition evolution > and overwrite all -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-27902) Rewrite Update with empty Where clause to IOW
[ https://issues.apache.org/jira/browse/HIVE-27902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-27902: -- Affects Version/s: 4.0.0-beta-1 > Rewrite Update with empty Where clause to IOW > - > > Key: HIVE-27902 > URL: https://issues.apache.org/jira/browse/HIVE-27902 > Project: Hive > Issue Type: Improvement >Affects Versions: 4.0.0-beta-1 >Reporter: Denys Kuzmenko >Priority: Major > > rewrite > {code} > update table mytbl set a = a+5 > {code} > with > {code}insert overwrite table mytbl as select a+5 from mytbl > {code} > note: in case of Iceberg tables it should take care of partition evolution > and overwrite all -- This message was sent by Atlassian Jira (v8.20.10#820010)
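The rewrite proposed in HIVE-27902 is purely syntactic for the single-assignment shape shown in the issue. A minimal sketch of that string transformation (hypothetical names; not the Hive planner's actual rewrite logic, which operates on the AST and, for Iceberg, must also handle partition evolution):

```java
public class UpdateRewriter {
    // Rewrite "UPDATE <tbl> SET <col> = <expr>" with no WHERE clause into an
    // insert-overwrite over the same table, as this ticket proposes. Only the
    // single-assignment, no-WHERE shape from the issue description is handled.
    static String rewrite(String table, String setExpr) {
        return "insert overwrite table " + table
             + " select " + setExpr + " from " + table;
    }

    public static void main(String[] args) {
        // The issue's example: update table mytbl set a = a+5
        System.out.println(rewrite("mytbl", "a+5"));
        // prints: insert overwrite table mytbl select a+5 from mytbl
    }
}
```

A real implementation would also have to project the untouched columns through the SELECT list; the sketch shows only the one-column example from the issue.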
[jira] [Resolved] (HIVE-27687) Logger variable should be static final as its creation takes more time in query compilation
[ https://issues.apache.org/jira/browse/HIVE-27687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramesh Kumar Thangarajan resolved HIVE-27687. - Resolution: Fixed > Logger variable should be static final as its creation takes more time in > query compilation > --- > > Key: HIVE-27687 > URL: https://issues.apache.org/jira/browse/HIVE-27687 > Project: Hive > Issue Type: Task > Components: Hive >Reporter: Ramesh Kumar Thangarajan >Assignee: Ramesh Kumar Thangarajan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: Screenshot 2023-09-12 at 5.03.31 PM.png > > > In query compilation, > LoggerFactory.getLogger() seems to take up more time. Some of the serde > classes use non static variable for Logger that forces the getLogger() call > for each of the class creation. > Making Logger variable static final will avoid this code path for every serde > class construction. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HIVE-27687) Logger variable should be static final as its creation takes more time in query compilation
[ https://issues.apache.org/jira/browse/HIVE-27687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788845#comment-17788845 ] Ramesh Kumar Thangarajan commented on HIVE-27687: - [~zabetak] Thanks, marked it. > Logger variable should be static final as its creation takes more time in > query compilation > --- > > Key: HIVE-27687 > URL: https://issues.apache.org/jira/browse/HIVE-27687 > Project: Hive > Issue Type: Task > Components: Hive >Reporter: Ramesh Kumar Thangarajan >Assignee: Ramesh Kumar Thangarajan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: Screenshot 2023-09-12 at 5.03.31 PM.png > > > In query compilation, > LoggerFactory.getLogger() seems to take up more time. Some of the serde > classes use non static variable for Logger that forces the getLogger() call > for each of the class creation. > Making Logger variable static final will avoid this code path for every serde > class construction. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HIVE-27687) Logger variable should be static final as its creation takes more time in query compilation
[ https://issues.apache.org/jira/browse/HIVE-27687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramesh Kumar Thangarajan closed HIVE-27687. --- > Logger variable should be static final as its creation takes more time in > query compilation > --- > > Key: HIVE-27687 > URL: https://issues.apache.org/jira/browse/HIVE-27687 > Project: Hive > Issue Type: Task > Components: Hive >Reporter: Ramesh Kumar Thangarajan >Assignee: Ramesh Kumar Thangarajan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: Screenshot 2023-09-12 at 5.03.31 PM.png > > > In query compilation, > LoggerFactory.getLogger() seems to take up more time. Some of the serde > classes use non static variable for Logger that forces the getLogger() call > for each of the class creation. > Making Logger variable static final will avoid this code path for every serde > class construction. -- This message was sent by Atlassian Jira (v8.20.10#820010)
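The pattern HIVE-27687 fixes is easy to show in isolation. Hive's serde classes use SLF4J's LoggerFactory; this sketch uses the stdlib java.util.logging as a stand-in so it stays self-contained, but the static-vs-instance distinction is the same: a static final field invokes getLogger() once at class-load time, while a per-instance field re-invokes it on every construction.

```java
import java.util.logging.Logger;

public class SerdeExample {
    // Recommended: one logger per class, created once at class-load time.
    private static final Logger LOG = Logger.getLogger(SerdeExample.class.getName());

    // Anti-pattern the ticket removes: a non-static field like
    //   private final Logger log = Logger.getLogger(SerdeExample.class.getName());
    // calls getLogger() again for every object constructed, which showed up
    // in query-compilation profiles.

    static String loggerName() {
        return LOG.getName();
    }

    public static void main(String[] args) {
        System.out.println(loggerName());
    }
}
```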
[jira] [Updated] (HIVE-27902) Rewrite Update with empty Where clause to IOW
[ https://issues.apache.org/jira/browse/HIVE-27902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-27902: -- Description: rewrite {code} update table mytbl set a = a+5 {code} with {code}insert overwrite table mytbl as select a+5 from mytbl {code} was: rewrite {code} update table mytbl set a = a+5 {code} with {code}insert overwrite table mytbl select a+5 from mytbl {code} > Rewrite Update with empty Where clause to IOW > - > > Key: HIVE-27902 > URL: https://issues.apache.org/jira/browse/HIVE-27902 > Project: Hive > Issue Type: Improvement >Reporter: Denys Kuzmenko >Priority: Major > > rewrite > {code} > update table mytbl set a = a+5 > {code} > with > {code}insert overwrite table mytbl as select a+5 from mytbl > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-27902) Rewrite Update with empty Where clause to IOW
[ https://issues.apache.org/jira/browse/HIVE-27902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-27902: -- Description: rewrite {code} update table mytbl set mytbl.a = a+5 {code} with {code}insert overwrite table mytbl select a+5 from mytbl {code} was: rewrite {code} update table mytbl set mytbl.a+5 {code} with {code}insert overwrite table mytbl select a+5 from mytbl {code} > Rewrite Update with empty Where clause to IOW > - > > Key: HIVE-27902 > URL: https://issues.apache.org/jira/browse/HIVE-27902 > Project: Hive > Issue Type: Improvement >Reporter: Denys Kuzmenko >Priority: Major > > rewrite > {code} > update table mytbl set mytbl.a = a+5 > {code} > with > {code}insert overwrite table mytbl select a+5 from mytbl > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-27902) Rewrite Update with empty Where clause to IOW
[ https://issues.apache.org/jira/browse/HIVE-27902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-27902: -- Description: rewrite {code} update table mytbl set a = a+5 {code} with {code}insert overwrite table mytbl select a+5 from mytbl {code} was: rewrite {code} update table mytbl set mytbl.a = a+5 {code} with {code}insert overwrite table mytbl select a+5 from mytbl {code} > Rewrite Update with empty Where clause to IOW > - > > Key: HIVE-27902 > URL: https://issues.apache.org/jira/browse/HIVE-27902 > Project: Hive > Issue Type: Improvement >Reporter: Denys Kuzmenko >Priority: Major > > rewrite > {code} > update table mytbl set a = a+5 > {code} > with > {code}insert overwrite table mytbl select a+5 from mytbl > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-27902) Rewrite Update with empty Where clause to IOW
[ https://issues.apache.org/jira/browse/HIVE-27902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-27902: -- Description: rewrite {code} update table mytbl set mytbl.a+5 {code} with {code}insert overwrite table mytbl select a+5 from mytbl {code} > Rewrite Update with empty Where clause to IOW > - > > Key: HIVE-27902 > URL: https://issues.apache.org/jira/browse/HIVE-27902 > Project: Hive > Issue Type: Improvement >Reporter: Denys Kuzmenko >Priority: Major > > rewrite > {code} > update table mytbl set mytbl.a+5 > {code} > with > {code}insert overwrite table mytbl select a+5 from mytbl > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-27687) Logger variable should be static final as its creation takes more time in query compilation
[ https://issues.apache.org/jira/browse/HIVE-27687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramesh Kumar Thangarajan updated HIVE-27687: Fix Version/s: 4.0.0 > Logger variable should be static final as its creation takes more time in > query compilation > --- > > Key: HIVE-27687 > URL: https://issues.apache.org/jira/browse/HIVE-27687 > Project: Hive > Issue Type: Task > Components: Hive >Reporter: Ramesh Kumar Thangarajan >Assignee: Ramesh Kumar Thangarajan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: Screenshot 2023-09-12 at 5.03.31 PM.png > > > In query compilation, > LoggerFactory.getLogger() seems to take up more time. Some of the serde > classes use non static variable for Logger that forces the getLogger() call > for each of the class creation. > Making Logger variable static final will avoid this code path for every serde > class construction. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27902) Rewrite Update with empty Where clause to IOW
Denys Kuzmenko created HIVE-27902: - Summary: Rewrite Update with empty Where clause to IOW Key: HIVE-27902 URL: https://issues.apache.org/jira/browse/HIVE-27902 Project: Hive Issue Type: Improvement Reporter: Denys Kuzmenko -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (HIVE-27687) Logger variable should be static final as its creation takes more time in query compilation
[ https://issues.apache.org/jira/browse/HIVE-27687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ramesh Kumar Thangarajan reopened HIVE-27687: - > Logger variable should be static final as its creation takes more time in > query compilation > --- > > Key: HIVE-27687 > URL: https://issues.apache.org/jira/browse/HIVE-27687 > Project: Hive > Issue Type: Task > Components: Hive >Reporter: Ramesh Kumar Thangarajan >Assignee: Ramesh Kumar Thangarajan >Priority: Major > Labels: pull-request-available > Attachments: Screenshot 2023-09-12 at 5.03.31 PM.png > > > In query compilation, > LoggerFactory.getLogger() seems to take up more time. Some of the serde > classes use non static variable for Logger that forces the getLogger() call > for each of the class creation. > Making Logger variable static final will avoid this code path for every serde > class construction. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HIVE-27885) Cast decimal from string with space without digits before dot returns NULL
[ https://issues.apache.org/jira/browse/HIVE-27885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naveen Gangam resolved HIVE-27885. -- Fix Version/s: 4.0.0 Resolution: Fixed Fix has been merged to master. Thank you [~nareshpr] for the patch. > Cast decimal from string with space without digits before dot returns NULL > -- > > Key: HIVE-27885 > URL: https://issues.apache.org/jira/browse/HIVE-27885 > Project: Hive > Issue Type: Bug >Reporter: Naresh P R >Assignee: Naresh P R >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > eg., > select cast(". " as decimal(8,4)) > {code:java} > – Expected output > 0. > – Actual output > NULL > {code} > select cast("0. " as decimal(8,4)) > {code:java} > – Actual output > 0. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
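The normalization HIVE-27885 implies can be sketched with java.math.BigDecimal. This is a hypothetical illustration of the intended semantics (trim surrounding spaces, tolerate a missing digit before or after the dot), not Naresh's actual patch:

```java
import java.math.BigDecimal;

public class DecimalCast {
    // Sketch of the intended cast semantics: ". " and "0. " should both
    // parse to 0 rather than NULL. Returns null for genuinely malformed
    // input, mirroring Hive's NULL-on-bad-cast behavior.
    static BigDecimal castToDecimal(String s) {
        String t = s.trim();
        if (t.startsWith(".")) {
            t = "0" + t; // ".5" -> "0.5", "." -> "0."
        }
        if (t.endsWith(".")) {
            t = t + "0"; // "0." -> "0.0"
        }
        try {
            return new BigDecimal(t);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(castToDecimal(". "));  // prints 0.0
        System.out.println(castToDecimal("0. ")); // prints 0.0
    }
}
```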
[jira] [Updated] (HIVE-27900) hive can not read iceberg-parquet table
[ https://issues.apache.org/jira/browse/HIVE-27900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yongzhi.shao updated HIVE-27900: Description: We found that using HIVE4-BETA version, we could not query the Iceberg-Parquet table with vectorised execution turned on. {code:java} CREATE EXTERNAL TABLE iceberg_dwd.b_qqd_shop_rfm_parquet_snappy STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' LOCATION 'hdfs://xxx/iceberg-catalog/warehouse/test/b_qqd_shop_rfm_parquet_snappy/' TBLPROPERTIES ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true'); set hive.default.fileformat=orc; set hive.default.fileformat.managed=orc; create table test_parquet_as_orc as select * from b_qqd_shop_rfm_parquet_snappy limit 100; , TaskAttempt 2 failed, info=[Error: Node: /xxx..xx.xx: Error while running task ( failure ) : attempt_1696729618575_69586_1_00_00_2:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:348) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:276) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:381) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:82) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:69) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:69) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:39) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at 
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:76) at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:110) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:83) at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:414) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:293) ... 16 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:993) at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:101) ... 
19 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.vector.reducesink.VectorReduceSinkEmptyKeyOperator.process(VectorReduceSinkEmptyKeyOperator.java:137) at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919) at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:158) at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919) at org.apache.hadoop.hive.ql.exec.vector.VectorLimitOperator.process(VectorLimitOperator.java:108) at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919) at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:171) at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.deliverVectorizedRowBatch(VectorMapOperator.java:809) at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:878) ... 20 more Caused by: java.lang.NullPointerException at org.apache.hadoop.hive.common.io.NonSyncByteArrayOutputStream.write(NonSyncByteArrayOutputStream.java:110) at org.apache.hadoop.hive.serde2.lazybinary.fast.LazyBinarySerializeWrite.writeString(LazyBinarySerializeWrite.java:280) at org.apache.hadoop.hive.ql.exec.vector.VectorSerializeRow$VectorSerializeStringWriter.serialize(VectorSerializeRow.java:532) at org.apache.hadoop.hive.ql.exec.vector.VectorSerializeRow.serializeWrite(VectorSerializeRow.java:316) at org.apache.hadoop.hive.ql.exec.vector.VectorSerializeRow.serializeWrite(VectorSerializeRow.java:297)
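The root cause in the stack above is a NullPointerException inside NonSyncByteArrayOutputStream.write while serializing a string column, i.e. the vectorized Parquet reader appears to hand the serializer a row whose byte buffer was never populated. A minimal stand-alone sketch of that failure mode (plain Java with a hypothetical column-vector array, not Hive's actual classes):

```java
import java.io.ByteArrayOutputStream;

public class NullBufferRepro {
    // Hypothetical stand-in for a vectorized string column: the vector
    // holds one byte buffer per row, which the reader must populate.
    static byte[][] columnVector = new byte[1][]; // row 0 never filled -> null

    static void serializeRow(ByteArrayOutputStream out, int row) {
        byte[] buf = columnVector[row];
        // Mirrors the write(byte[], int, int) call in the stack trace:
        // a null buffer throws NPE before any bounds check can help.
        out.write(buf, 0, buf.length);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            serializeRow(out, 0);
            System.out.println("serialized " + out.size() + " bytes");
        } catch (NullPointerException e) {
            System.out.println("NPE: unpopulated string buffer");
        }
    }
}
```

If this blocks testing, disabling vectorization for the session (set hive.vectorized.execution.enabled=false;) may be a workaround worth trying, since the non-vectorized row path does not go through VectorSerializeRow.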
[jira] [Updated] (HIVE-27512) CalciteSemanticException.UnsupportedFeature enum to capital
[ https://issues.apache.org/jira/browse/HIVE-27512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahesh Raju Somalaraju updated HIVE-27512: -- Status: Patch Available (was: In Progress) > CalciteSemanticException.UnsupportedFeature enum to capital > --- > > Key: HIVE-27512 > URL: https://issues.apache.org/jira/browse/HIVE-27512 > Project: Hive > Issue Type: Improvement >Reporter: László Bodor >Assignee: Mahesh Raju Somalaraju >Priority: Major > Labels: newbie, pull-request-available > > https://github.com/apache/hive/blob/3bc62cbc2d42c22dfd55f78ad7b41ec84a71380f/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/CalciteSemanticException.java#L32-L39 > {code} > public enum UnsupportedFeature { > Distinct_without_an_aggreggation, Duplicates_in_RR, > Filter_expression_with_non_boolean_return_type, > Having_clause_without_any_groupby, Invalid_column_reference, > Invalid_decimal, > Less_than_equal_greater_than, Others, Same_name_in_multiple_expressions, > Schema_less_table, Select_alias_in_having_clause, Select_transform, > Subquery, > Table_sample_clauses, UDTF, Union_type, Unique_join, > HighPrecissionTimestamp // CALCITE-1690 > }; > {code} > this just hurts my eyes, I expect it as DISTINCT_WITHOUT_AN_AGGREGATION ... -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-27512) CalciteSemanticException.UnsupportedFeature enum to capital
[ https://issues.apache.org/jira/browse/HIVE-27512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-27512: -- Labels: newbie pull-request-available (was: newbie) > CalciteSemanticException.UnsupportedFeature enum to capital > --- > > Key: HIVE-27512 > URL: https://issues.apache.org/jira/browse/HIVE-27512 > Project: Hive > Issue Type: Improvement >Reporter: László Bodor >Assignee: Mahesh Raju Somalaraju >Priority: Major > Labels: newbie, pull-request-available > > https://github.com/apache/hive/blob/3bc62cbc2d42c22dfd55f78ad7b41ec84a71380f/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/CalciteSemanticException.java#L32-L39 > {code} > public enum UnsupportedFeature { > Distinct_without_an_aggreggation, Duplicates_in_RR, > Filter_expression_with_non_boolean_return_type, > Having_clause_without_any_groupby, Invalid_column_reference, > Invalid_decimal, > Less_than_equal_greater_than, Others, Same_name_in_multiple_expressions, > Schema_less_table, Select_alias_in_having_clause, Select_transform, > Subquery, > Table_sample_clauses, UDTF, Union_type, Unique_join, > HighPrecissionTimestamp // CALCITE-1690 > }; > {code} > this just hurts my eyes, I expect it as DISTINCT_WITHOUT_AN_AGGREGATION ... -- This message was sent by Atlassian Jira (v8.20.10#820010)
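The requested change is mechanical: rename every constant to the conventional SCREAMING_SNAKE_CASE. An illustrative sketch with a subset of the constants (the upper-case names below are the expected forms, not necessarily the merged patch):

```java
public class EnumRenameSketch {
    // Illustrative subset of the renamed enum; HIGH_PRECISION_TIMESTAMP
    // assumes the rename also fixes the original "Precission" typo.
    public enum UnsupportedFeature {
        DISTINCT_WITHOUT_AN_AGGREGATION,
        DUPLICATES_IN_RR,
        FILTER_EXPRESSION_WITH_NON_BOOLEAN_RETURN_TYPE,
        HIGH_PRECISION_TIMESTAMP // CALCITE-1690
    }

    public static void main(String[] args) {
        // valueOf/name() round-trips the same way after the rename
        UnsupportedFeature f = UnsupportedFeature.valueOf("DISTINCT_WITHOUT_AN_AGGREGATION");
        System.out.println(f.name());
    }
}
```

One compatibility caveat: any call site doing valueOf with the old mixed-case names, or anything that persisted enum names as strings, breaks silently at runtime, so those usages need the same rename.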
[jira] [Resolved] (HIVE-26618) Add setting to turn on/off removing sections of a query plan known never produces rows
[ https://issues.apache.org/jira/browse/HIVE-26618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Kasa resolved HIVE-26618. --- Resolution: Won't Fix > Add setting to turn on/off removing sections of a query plan known never > produces rows > -- > > Key: HIVE-26618 > URL: https://issues.apache.org/jira/browse/HIVE-26618 > Project: Hive > Issue Type: Improvement > Components: CBO >Reporter: Krisztian Kasa >Assignee: Krisztian Kasa >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > HIVE-26524 introduced an optimization to remove sections of query plan known > never produces rows. > Add a setting into hive conf to turn on/off this optimization. > When the optimization is turned off restore the legacy behavior: > * represent empty result operator with {{HiveSortLimit}} 0 > * disable {{HiveRemoveEmptySingleRules}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-27899) Killed speculative execution task attempt should not commit file
[ https://issues.apache.org/jira/browse/HIVE-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenyu Zheng updated HIVE-27899: Description: As I mentioned in HIVE-25561, when tez turns on speculative execution, the data file produced by hive may be duplicated. I mentioned in HIVE-25561 that if the speculatively executed task is killed, some data may be committed unexpectedly. However, after HIVE-25561, there is still a situation that has not been solved. If two task attempts commit files at the same time, duplicate data files may also occur. Although the probability of this happening is very, very low, it does happen. Why? There are two key steps: (1) FileSinkOperator::closeOp TezProcessor::initializeAndRunProcessor --> ... --> FileSinkOperator::closeOp --> fsp.commit When the OP is closed, the close path eventually triggers the call to fsp.commit. (2) removeTempOrDuplicateFiles (2.a) First, listStatus the files in the temporary directory. (2.b) Second, check whether there are multiple incorrect commits, and finally move the correct results to the final directory. When speculative execution is enabled and one attempt of a Task completes, the other attempts are killed. However, the AM only sends the kill event and does not ensure that all cleanup actions have completed; that is, closeOp may be executed between 2.a and 2.b. Therefore, removeTempOrDuplicateFiles will not delete the file generated by the killed attempt. How? The problem is that both speculatively executed tasks commit the file. This will not happen in the Tez examples because they call canCommit, which guarantees that one and only one task attempt commits successfully. If one task attempt passes canCommit, the other one is blocked in canCommit until it receives a kill signal. 
detail see: [https://github.com/apache/tez/blob/51d6f53967110e2b91b6d90b46f8e16bdc062091/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/processor/SimpleMRProcessor.java#L70] was: As I mentioned in HIVE-25561, when tez turns on speculative execution, the data file produced by hive may be duplicated. I mentioned in HIVE-25561 that if the speculatively executed task is killed, some data may be submitted unexpectedly. However, after HIVE-25561, there is still a situation that has not been solved. If two task attempts commit file at the same time, the problem of duplicate data files may also occur. Although the probability of this happening is very, very low, it does happen. Why? There are two key steps: (1)FileSinkOperator::closeOp TezProcessor::initializeAndRunProcessor --> ... --> FileSinkOperator::closeOp --> fsp.commit When the OP is closed, the process of closing the OP will be triggered, and eventually the call to fsp.commit will be triggered. (2)removeTempOrDuplicateFiles (2.a)Firstly, listStatus the files in the temporary directory. (2.b)Secondly check whether there are multiple incorrect commit, and finally move the correct results to the final directory. When speculative execution is enabled, when one attempt of a Task is completed, other attempts will be killed. However, AM only sends the kill event and does not ensure that all cleanup actions are completed, that is, closeOp may be executed between 2.a and 2.b. Therefore, removeTempOrDuplicateFiles will not delete the file generated by the kill attempt. How? The problem is that both speculatively executed tasks commit the file. This will not happen in the Tez examples because they will try canCommit, which can guarantee that one and only one task attempt commit successfully. If one task attempt executes canCommti successfully, the other one will be stuck by canCommti until it receives a kill signal. 
detail see: https://github.com/apache/tez/blob/51d6f53967110e2b91b6d90b46f8e16bdc062091/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/processor/SimpleMRProcessor.java#L70 > Killed speculative execution task attempt should not commit file > > > Key: HIVE-27899 > URL: https://issues.apache.org/jira/browse/HIVE-27899 > Project: Hive > Issue Type: Bug > Components: Tez >Reporter: Chenyu Zheng >Assignee: Chenyu Zheng >Priority: Major > Attachments: reproduce_bug.md > > > As I mentioned in HIVE-25561, when tez turns on speculative execution, the > data file produced by hive may be duplicated. I mentioned in HIVE-25561 that > if the speculatively executed task is killed, some data may be submitted > unexpectedly. However, after HIVE-25561, there is still a situation that has > not been solved. If two task attempts commit file at the same time, the > problem of duplicate data files may also occur. Although the probability of > this
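The canCommit protocol referenced above boils down to a single atomic claim: the first attempt to ask wins, and every other attempt is refused (and eventually killed). A simplified sketch of that mutual exclusion with a plain AtomicInteger, not the actual Tez API:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class CanCommitSketch {
    // -1 means "no attempt has committed yet"; otherwise the winner's id.
    static final AtomicInteger committer = new AtomicInteger(-1);
    static final AtomicInteger commits = new AtomicInteger(0);

    // Approximates Tez's canCommit: only the first caller is allowed through.
    static boolean canCommit(int attemptId) {
        return committer.compareAndSet(-1, attemptId) || committer.get() == attemptId;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(2);
        for (int id = 0; id < 2; id++) {        // two speculative attempts
            final int attempt = id;
            new Thread(() -> {
                if (canCommit(attempt)) {
                    commits.incrementAndGet();  // commit the output file
                }                               // else: wait for the kill signal
                done.countDown();
            }).start();
        }
        done.await();
        System.out.println("commits=" + commits.get());
    }
}
```

Because compareAndSet makes the claim atomic, even if both attempts race into the commit path at the same instant, exactly one proceeds; that is precisely the guarantee the Hive commit path described above lacks.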
[jira] [Commented] (HIVE-27899) Killed speculative execution task attempt should not commit file
[ https://issues.apache.org/jira/browse/HIVE-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788725#comment-17788725 ] Chenyu Zheng commented on HIVE-27899: - The probability of this bug recurring is very very low, and some special code must be added to increase the probability of this bug recurring. Only then can the correctness of the repair code be guaranteed. I added a sleep to the relevant code to simulate the stuck. This greatly increases the probability of the problem recurring. I've added the relevant details in the attachment `reproduce_bug.md`. > Killed speculative execution task attempt should not commit file > > > Key: HIVE-27899 > URL: https://issues.apache.org/jira/browse/HIVE-27899 > Project: Hive > Issue Type: Bug > Components: Tez >Reporter: Chenyu Zheng >Assignee: Chenyu Zheng >Priority: Major > Attachments: reproduce_bug.md > > > As I mentioned in HIVE-25561, when tez turns on speculative execution, the > data file produced by hive may be duplicated. I mentioned in HIVE-25561 that > if the speculatively executed task is killed, some data may be submitted > unexpectedly. However, after HIVE-25561, there is still a situation that has > not been solved. If two task attempts commit file at the same time, the > problem of duplicate data files may also occur. Although the probability of > this happening is very, very low, it does happen. > > Why? > There are two key steps: > (1)FileSinkOperator::closeOp > TezProcessor::initializeAndRunProcessor --> ... --> FileSinkOperator::closeOp > --> fsp.commit > When the OP is closed, the process of closing the OP will be triggered, and > eventually the call to fsp.commit will be triggered. > (2)removeTempOrDuplicateFiles > (2.a)Firstly, listStatus the files in the temporary directory. > (2.b)Secondly check whether there are multiple incorrect commit, and finally > move the correct results to the final directory. 
> When speculative execution is enabled, when one attempt of a Task is > completed, other attempts will be killed. However, the AM only sends the kill > event and does not ensure that all cleanup actions are completed, that is, > closeOp may be executed between 2.a and 2.b. Therefore, > removeTempOrDuplicateFiles will not delete the file generated by the killed > attempt. > How? > The problem is that both speculatively executed tasks commit the file. This > will not happen in the Tez examples because they call canCommit, which > guarantees that one and only one task attempt commits successfully. If one > task attempt passes canCommit, the other one is blocked in > canCommit until it receives a kill signal. > detail see: > https://github.com/apache/tez/blob/51d6f53967110e2b91b6d90b46f8e16bdc062091/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/processor/SimpleMRProcessor.java#L70 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-27899) Killed speculative execution task attempt should not commit file
[ https://issues.apache.org/jira/browse/HIVE-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenyu Zheng updated HIVE-27899: Attachment: reproduce_bug.md > Killed speculative execution task attempt should not commit file > > > Key: HIVE-27899 > URL: https://issues.apache.org/jira/browse/HIVE-27899 > Project: Hive > Issue Type: Bug > Components: Tez >Reporter: Chenyu Zheng >Assignee: Chenyu Zheng >Priority: Major > Attachments: reproduce_bug.md > > > As I mentioned in HIVE-25561, when tez turns on speculative execution, the > data file produced by hive may be duplicated. I mentioned in HIVE-25561 that > if the speculatively executed task is killed, some data may be submitted > unexpectedly. However, after HIVE-25561, there is still a situation that has > not been solved. If two task attempts commit file at the same time, the > problem of duplicate data files may also occur. Although the probability of > this happening is very, very low, it does happen. > > Why? > There are two key steps: > (1)FileSinkOperator::closeOp > TezProcessor::initializeAndRunProcessor --> ... --> FileSinkOperator::closeOp > --> fsp.commit > When the OP is closed, the process of closing the OP will be triggered, and > eventually the call to fsp.commit will be triggered. > (2)removeTempOrDuplicateFiles > (2.a)Firstly, listStatus the files in the temporary directory. > (2.b)Secondly check whether there are multiple incorrect commit, and finally > move the correct results to the final directory. > When speculative execution is enabled, when one attempt of a Task is > completed, other attempts will be killed. However, AM only sends the kill > event and does not ensure that all cleanup actions are completed, that is, > closeOp may be executed between 2.a and 2.b. Therefore, > removeTempOrDuplicateFiles will not delete the file generated by the kill > attempt. > How? > The problem is that both speculatively executed tasks commit the file. 
This > will not happen in the Tez examples because they call canCommit, which > guarantees that one and only one task attempt commits successfully. If one > task attempt passes canCommit, the other one is blocked in > canCommit until it receives a kill signal. > detail see: > https://github.com/apache/tez/blob/51d6f53967110e2b91b6d90b46f8e16bdc062091/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/processor/SimpleMRProcessor.java#L70 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-27892) Hive "insert overwrite table" for multiple partition table issue
[ https://issues.apache.org/jira/browse/HIVE-27892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-27892: -- Labels: pull-request-available (was: ) > Hive "insert overwrite table" for multiple partition table issue > > > Key: HIVE-27892 > URL: https://issues.apache.org/jira/browse/HIVE-27892 > Project: Hive > Issue Type: Bug >Reporter: Mayank Kunwar >Assignee: Mayank Kunwar >Priority: Major > Labels: pull-request-available > > Authorization is not working for Hive "insert overwrite table" for multiple > partition table. > Steps to reproduce the issue: > 1) CREATE EXTERNAL TABLE Part (eid int, name int) > PARTITIONED BY (position int, dept int); > 2) SET hive.exec.dynamic.partition.mode=nonstrict; > 3) INSERT INTO TABLE PART PARTITION (position,DEPT) > SELECT 1,1,1,1; > 4) select * from part; > create a test user test123, and grant test123 only Select permission for db > default, table Part and column * . > 1) insert overwrite table part partition(position=2,DEPT=2) select 2,2; > This will failed as expected. > 2) insert overwrite table part partition(position,DEPT) select 2,2,2,2; > This will failed as expected. > 3) insert overwrite table part partition(position=2,DEPT) select 2,2,2; > But this will succeed and no audit in Ranger, which means no authorization > happened when this query was executed. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-27899) Killed speculative execution task attempt should not commit file
[ https://issues.apache.org/jira/browse/HIVE-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenyu Zheng updated HIVE-27899: Description: As I mentioned in HIVE-25561, when tez turns on speculative execution, the data file produced by hive may be duplicated. I mentioned in HIVE-25561 that if the speculatively executed task is killed, some data may be submitted unexpectedly. However, after HIVE-25561, there is still a situation that has not been solved. If two task attempts commit file at the same time, the problem of duplicate data files may also occur. Although the probability of this happening is very, very low, it does happen. Why? There are two key steps: (1)FileSinkOperator::closeOp TezProcessor::initializeAndRunProcessor --> ... --> FileSinkOperator::closeOp --> fsp.commit When the OP is closed, the process of closing the OP will be triggered, and eventually the call to fsp.commit will be triggered. (2)removeTempOrDuplicateFiles (2.a)Firstly, listStatus the files in the temporary directory. (2.b)Secondly check whether there are multiple incorrect commit, and finally move the correct results to the final directory. When speculative execution is enabled, when one attempt of a Task is completed, other attempts will be killed. However, AM only sends the kill event and does not ensure that all cleanup actions are completed, that is, closeOp may be executed between 2.a and 2.b. Therefore, removeTempOrDuplicateFiles will not delete the file generated by the kill attempt. How? The problem is that both speculatively executed tasks commit the file. This will not happen in the Tez examples because they will try canCommit, which can guarantee that one and only one task attempt commit successfully. If one task attempt executes canCommti successfully, the other one will be stuck by canCommti until it receives a kill signal. 
detail see: https://github.com/apache/tez/blob/51d6f53967110e2b91b6d90b46f8e16bdc062091/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/processor/SimpleMRProcessor.java#L70 was:As I mentioned in HIVE-25561, when tez turns on speculative execution, the data file produced by hive may be duplicated. I mentioned in HIVE-25561 that if the speculatively executed task is killed, some data may be submitted unexpectedly. However, after HIVE-25561, there is still a situation that has not been solved. If two task attempts commit file at the same time, the problem of duplicate data files may also occur. Although the probability of this happening is very, very low, it does happen. > Killed speculative execution task attempt should not commit file > > > Key: HIVE-27899 > URL: https://issues.apache.org/jira/browse/HIVE-27899 > Project: Hive > Issue Type: Bug > Components: Tez >Reporter: Chenyu Zheng >Assignee: Chenyu Zheng >Priority: Major > > As I mentioned in HIVE-25561, when tez turns on speculative execution, the > data file produced by hive may be duplicated. I mentioned in HIVE-25561 that > if the speculatively executed task is killed, some data may be submitted > unexpectedly. However, after HIVE-25561, there is still a situation that has > not been solved. If two task attempts commit file at the same time, the > problem of duplicate data files may also occur. Although the probability of > this happening is very, very low, it does happen. > > Why? > There are two key steps: > (1)FileSinkOperator::closeOp > TezProcessor::initializeAndRunProcessor --> ... --> FileSinkOperator::closeOp > --> fsp.commit > When the OP is closed, the process of closing the OP will be triggered, and > eventually the call to fsp.commit will be triggered. > (2)removeTempOrDuplicateFiles > (2.a)Firstly, listStatus the files in the temporary directory. > (2.b)Secondly check whether there are multiple incorrect commit, and finally > move the correct results to the final directory. 
> When speculative execution is enabled, when one attempt of a Task is > completed, other attempts will be killed. However, AM only sends the kill > event and does not ensure that all cleanup actions are completed, that is, > closeOp may be executed between 2.a and 2.b. Therefore, > removeTempOrDuplicateFiles will not delete the file generated by the kill > attempt. > How? > The problem is that both speculatively executed tasks commit the file. This > will not happen in the Tez examples because they will try canCommit, which > can guarantee that one and only one task attempt commit successfully. If one > task attempt executes canCommti successfully, the other one will be stuck by > canCommti until it receives a kill signal. > detail see: > https://github.com/apache/tez/blob/51d6f53967110e2b91b6d90b46f8e16bdc062091/tez-mapreduce/src/main/java/org/apache/tez
[jira] [Updated] (HIVE-27899) Killed speculative execution task attempt should not commit file
[ https://issues.apache.org/jira/browse/HIVE-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenyu Zheng updated HIVE-27899: Description: As I mentioned in HIVE-25561, when tez turns on speculative execution, the data file produced by hive may be duplicated. I mentioned in HIVE-25561 that if the speculatively executed task is killed, some data may be submitted unexpectedly. However, after HIVE-25561, there is still a situation that has not been solved. If two task attempts commit file at the same time, the problem of duplicate data files may also occur. Although the probability of this happening is very, very low, it does happen. > Killed speculative execution task attempt should not commit file > > > Key: HIVE-27899 > URL: https://issues.apache.org/jira/browse/HIVE-27899 > Project: Hive > Issue Type: Bug > Components: Tez >Reporter: Chenyu Zheng >Assignee: Chenyu Zheng >Priority: Major > > As I mentioned in HIVE-25561, when tez turns on speculative execution, the > data file produced by hive may be duplicated. I mentioned in HIVE-25561 that > if the speculatively executed task is killed, some data may be submitted > unexpectedly. However, after HIVE-25561, there is still a situation that has > not been solved. If two task attempts commit file at the same time, the > problem of duplicate data files may also occur. Although the probability of > this happening is very, very low, it does happen. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-27899) Killed speculative execution task attempt should not commit file
[ https://issues.apache.org/jira/browse/HIVE-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenyu Zheng updated HIVE-27899: Summary: Killed speculative execution task attempt should not commit file (was: Speculative execution task which will be killed should not commit file) > Killed speculative execution task attempt should not commit file > > > Key: HIVE-27899 > URL: https://issues.apache.org/jira/browse/HIVE-27899 > Project: Hive > Issue Type: Bug > Components: Tez >Reporter: Chenyu Zheng >Assignee: Chenyu Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-27901) Hive's performance for querying the Iceberg table is very poor.
[ https://issues.apache.org/jira/browse/HIVE-27901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yongzhi.shao updated HIVE-27901: Description: I am using HIVE4.0.0-BETA for testing. BTW,I found that the performance of HIVE reading ICEBERG table is still very slow. How should I deal with this problem? I count a 7 billion table and compare the performance difference between HIVE reading ICEBERG-ORC and ORC table respectively. We use ICEBERG 1.4.2, ICEBERG-ORC with ZSTD compression enabled. ORC with SNAPPY compression. HADOOP version 3.1.1 (native zstd not supported). !image-2023-11-22-18-32-28-344.png! !image-2023-11-22-18-33-01-885.png! Also, I have another question. The Submit Plan statistic is clearly incorrect. Is this something that needs to be fixed? !image-2023-11-22-18-33-32-915.png! was: I am using HIVE-4.0.0-BETA for testing. BTW,I found that the performance of HIVE reading ICEBERG table is still very slow. How should I deal with this problem? I count a 7 billion table and compare the performance difference between HIVE reading ICEBERG-ORC and ORC table respectively. We use ICEBERG 1.4.2, ICEBERG-ORC with ZSTD compression enabled. ORC with SNAPPY compression. HADOOP version 3.1.1 (native zstd not supported). !image-2023-11-22-18-32-28-344.png! !image-2023-11-22-18-33-01-885.png! Also, I have another question. The Submit Plan statistic is clearly incorrect. Is this something that needs to be fixed? !image-2023-11-22-18-33-32-915.png! > Hive's performance for querying the Iceberg table is very poor. > --- > > Key: HIVE-27901 > URL: https://issues.apache.org/jira/browse/HIVE-27901 > Project: Hive > Issue Type: Bug > Components: Iceberg integration >Affects Versions: 4.0.0-beta-1 >Reporter: yongzhi.shao >Priority: Major > Attachments: image-2023-11-22-18-32-28-344.png, > image-2023-11-22-18-33-01-885.png, image-2023-11-22-18-33-32-915.png > > > I am using HIVE4.0.0-BETA for testing. 
> BTW,I found that the performance of HIVE reading ICEBERG table is still very > slow. > How should I deal with this problem? > I count a 7 billion table and compare the performance difference between HIVE > reading ICEBERG-ORC and ORC table respectively. > We use ICEBERG 1.4.2, ICEBERG-ORC with ZSTD compression enabled. > ORC with SNAPPY compression. > HADOOP version 3.1.1 (native zstd not supported). > !image-2023-11-22-18-32-28-344.png! > !image-2023-11-22-18-33-01-885.png! > Also, I have another question. The Submit Plan statistic is clearly > incorrect. Is this something that needs to be fixed? > !image-2023-11-22-18-33-32-915.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
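For the task-count gap between the ORC and Iceberg-ORC scans, one relevant knob is Iceberg's read.split.target-size (default 134217728 bytes): halving it roughly doubles the number of splits, and hence the number of tasks. A back-of-the-envelope sketch (the 100 GB scan size is hypothetical):

```java
public class SplitCountSketch {
    // Rough estimate only: ignores file boundaries, stripe/row-group
    // alignment, and read.split.open-file-cost.
    static long estimateSplits(long scanBytes, long targetSplitSize) {
        return (scanBytes + targetSplitSize - 1) / targetSplitSize; // ceil
    }

    public static void main(String[] args) {
        long scanBytes = 100L * 1024 * 1024 * 1024; // hypothetical 100 GB scan
        long dflt = 134217728L;                     // Iceberg default (128 MB)
        long tuned = 67108864L;                     // 64 MB
        System.out.println("default: " + estimateSplits(scanBytes, dflt) + " splits");
        System.out.println("tuned:   " + estimateSplits(scanBytes, tuned) + " splits");
    }
}
```

The property can be set on the table via TBLPROPERTIES or per query (set read.split.target-size=67108864;). Whether more tasks actually close the gap depends on the cluster, so treat this as an experiment rather than a guaranteed fix.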
[jira] [Created] (HIVE-27901) Hive's performance for querying the Iceberg table is very poor.
yongzhi.shao created HIVE-27901: --- Summary: Hive's performance for querying the Iceberg table is very poor. Key: HIVE-27901 URL: https://issues.apache.org/jira/browse/HIVE-27901 Project: Hive Issue Type: Bug Components: Iceberg integration Affects Versions: 4.0.0-beta-1 Reporter: yongzhi.shao Attachments: image-2023-11-22-18-32-28-344.png, image-2023-11-22-18-33-01-885.png, image-2023-11-22-18-33-32-915.png I am using HIVE-4.0.0-BETA for testing. BTW,I found that the performance of HIVE reading ICEBERG table is still very slow. How should I deal with this problem? I count a 7 billion table and compare the performance difference between HIVE reading ICEBERG-ORC and ORC table respectively. We use ICEBERG 1.4.2, ICEBERG-ORC with ZSTD compression enabled. ORC with SNAPPY compression. HADOOP version 3.1.1 (native zstd not supported). !image-2023-11-22-18-32-28-344.png! !image-2023-11-22-18-33-01-885.png! Also, I have another question. The Submit Plan statistic is clearly incorrect. Is this something that needs to be fixed? !image-2023-11-22-18-33-32-915.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27900) hive can not read iceberg-parquet table
yongzhi.shao created HIVE-27900:
---
Summary: hive can not read iceberg-parquet table
Key: HIVE-27900
URL: https://issues.apache.org/jira/browse/HIVE-27900
Project: Hive
Issue Type: Bug
Components: Iceberg integration
Affects Versions: 4.0.0-beta-1
Reporter: yongzhi.shao

We found that, using the HIVE4-BETA version, we could not query an Iceberg-Parquet table with vectorised execution turned on.

{code:java}
CREATE EXTERNAL TABLE iceberg_dwd.b_qqd_shop_rfm_parquet_snappy
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION 'hdfs://xxx/iceberg-catalog/warehouse/test/b_qqd_shop_rfm_parquet_snappy/'
TBLPROPERTIES ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true');

set hive.default.fileformat=orc;
set hive.default.fileformat.managed=orc;

create table test_parquet_as_orc as select * from b_qqd_shop_rfm_parquet_snappy limit 100;

TaskAttempt 2 failed, info=[Error: Node: /xxx..xx.xx: Error while running task ( failure ) : attempt_1696729618575_69586_1_00_00_2:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
  at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:348)
  at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:276)
  at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:381)
  at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:82)
  at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:69)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
  at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:69)
  at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:39)
  at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
  at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
  at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:76)
  at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
  at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:110)
  at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:83)
  at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:414)
  at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:293)
  ... 16 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
  at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:993)
  at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:101)
  ... 19 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
  at org.apache.hadoop.hive.ql.exec.vector.reducesink.VectorReduceSinkEmptyKeyOperator.process(VectorReduceSinkEmptyKeyOperator.java:137)
  at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919)
  at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:158)
  at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919)
  at org.apache.hadoop.hive.ql.exec.vector.VectorLimitOperator.process(VectorLimitOperator.java:108)
  at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919)
  at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:171)
  at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.deliverVectorizedRowBatch(VectorMapOperator.java:809)
  at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:878)
  ... 20 more
Caused by: java.lang.NullPointerException
  at org.apache.hadoop.hive.common.io.NonSyncByteArrayOutputStream.write(NonSyncByteArrayOutputStream.java:110)
  at org.apache.hadoop.hive.serde2.lazybinary.fast.LazyBinarySerializeWrite.writeString(LazyBinarySerializeWrite.java:280)
  at org.apache.hadoop.hive.ql.exec.vector.VectorSerializeRow$VectorSerializeStringWriter.serialize(VectorSerializeRow.java:532)
  at org.apache.hado
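Since the NullPointerException above originates in the vectorized operator pipeline, one possible workaround until the bug is fixed (an untested suggestion, not part of the original report) is to disable vectorized execution for the session so the row-mode reader is used instead:

{code:sql}
-- Hedged workaround sketch: fall back to non-vectorized execution.
-- hive.vectorized.execution.enabled is a standard Hive property;
-- whether this avoids the NPE for this table has not been verified here.
set hive.vectorized.execution.enabled=false;

create table test_parquet_as_orc as
select * from b_qqd_shop_rfm_parquet_snappy limit 100;
{code}

This trades read throughput for stability, so it is only worth keeping until the vectorized Iceberg-Parquet path is repaired.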
[jira] [Updated] (HIVE-27898) HIVE4 can't use ICEBERG table in subqueries
[ https://issues.apache.org/jira/browse/HIVE-27898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yongzhi.shao updated HIVE-27898:
Description:
Currently, we found that when using the HIVE4-BETA1 version, if we use an ICEBERG table in a subquery, we get no data back.
I used HIVE3-TEZ for cross validation, and HIVE3 does not have this problem when querying ICEBERG.

{code:java}
--iceberg
select * from iceberg_dwd.b_std_trade
where uni_shop_id = 'TEST|1' limit 10 --10 rows

select *
from (
  select * from iceberg_dwd.b_std_trade
  where uni_shop_id = 'TEST|1' limit 10
) t1; --10 rows

select uni_shop_id
from (
  select * from iceberg_dwd.b_std_trade
  where uni_shop_id = 'TEST|1' limit 10
) t1; --0 rows

select uni_shop_id
from (
  select uni_shop_id as uni_shop_id from iceberg_dwd.b_std_trade
  where uni_shop_id = 'TEST|1' limit 10
) t1; --0 rows

--orc
select uni_shop_id
from (
  select * from iceberg_dwd.trade_test
  where uni_shop_id = 'TEST|1' limit 10
) t1; --10 ROWS{code}

was:
Currently, we found that when using the HIVE4-BETA1 version, if we use an ICEBERG table in a subquery, we get no data back.
I used HIVE3 for cross validation, and HIVE3 does not have this problem when querying ICEBERG.

{code:java}
--iceberg
select * from iceberg_dwd.b_std_trade
where uni_shop_id = 'TEST|1' limit 10 --10 rows

select *
from (
  select * from iceberg_dwd.b_std_trade
  where uni_shop_id = 'TEST|1' limit 10
) t1; --10 rows

select uni_shop_id
from (
  select * from iceberg_dwd.b_std_trade
  where uni_shop_id = 'TEST|1' limit 10
) t1; --0 rows

select uni_shop_id
from (
  select uni_shop_id as uni_shop_id from iceberg_dwd.b_std_trade
  where uni_shop_id = 'TEST|1' limit 10
) t1; --0 rows

--orc
select uni_shop_id
from (
  select * from iceberg_dwd.trade_test
  where uni_shop_id = 'TEST|1' limit 10
) t1; --10 ROWS{code}

> HIVE4 can't use ICEBERG table in subqueries
> ---
>
> Key: HIVE-27898
> URL: https://issues.apache.org/jira/browse/HIVE-27898
> Project: Hive
> Issue Type: Bug
> Components: Iceberg integration
> Affects Versions: 4.0.0-beta-1
> Reporter: yongzhi.shao
> Priority: Critical
>
> Currently, we found that when using the HIVE4-BETA1 version, if we use an
> ICEBERG table in a subquery, we get no data back.
> I used HIVE3-TEZ for cross validation, and HIVE3 does not have this
> problem when querying ICEBERG.
> {code:java}
> --iceberg
> select * from iceberg_dwd.b_std_trade
> where uni_shop_id = 'TEST|1' limit 10 --10 rows
> select *
> from (
>   select * from iceberg_dwd.b_std_trade
>   where uni_shop_id = 'TEST|1' limit 10
> ) t1; --10 rows
> select uni_shop_id
> from (
>   select * from iceberg_dwd.b_std_trade
>   where uni_shop_id = 'TEST|1' limit 10
> ) t1; --0 rows
> select uni_shop_id
> from (
>   select uni_shop_id as uni_shop_id from iceberg_dwd.b_std_trade
>   where uni_shop_id = 'TEST|1' limit 10
> ) t1; --0 rows
> --orc
> select uni_shop_id
> from (
>   select * from iceberg_dwd.trade_test
>   where uni_shop_id = 'TEST|1' limit 10
> ) t1; --10 ROWS{code}
>

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HIVE-27899) Speculative execution task which will be killed should not commit file
[ https://issues.apache.org/jira/browse/HIVE-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chenyu Zheng reassigned HIVE-27899:
---
Assignee: Chenyu Zheng
Issue Type: Bug (was: Improvement)

> Speculative execution task which will be killed should not commit file
> --
>
> Key: HIVE-27899
> URL: https://issues.apache.org/jira/browse/HIVE-27899
> Project: Hive
> Issue Type: Bug
> Components: Tez
> Reporter: Chenyu Zheng
> Assignee: Chenyu Zheng
> Priority: Major
>

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27899) Speculative execution task which will be killed should not commit file
Chenyu Zheng created HIVE-27899:
---
Summary: Speculative execution task which will be killed should not commit file
Key: HIVE-27899
URL: https://issues.apache.org/jira/browse/HIVE-27899
Project: Hive
Issue Type: Improvement
Components: Tez
Reporter: Chenyu Zheng

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-27898) HIVE4 can't use ICEBERG table in subqueries
yongzhi.shao created HIVE-27898:
---
Summary: HIVE4 can't use ICEBERG table in subqueries
Key: HIVE-27898
URL: https://issues.apache.org/jira/browse/HIVE-27898
Project: Hive
Issue Type: Bug
Components: Iceberg integration
Affects Versions: 4.0.0-beta-1
Reporter: yongzhi.shao

Currently, we found that when using the HIVE4-BETA1 version, if we use an ICEBERG table in a subquery, we get no data back.
I used HIVE3 for cross validation, and HIVE3 does not have this problem when querying ICEBERG.

{code:java}
--iceberg
select * from iceberg_dwd.b_std_trade
where uni_shop_id = 'TEST|1' limit 10 --10 rows

select *
from (
  select * from iceberg_dwd.b_std_trade
  where uni_shop_id = 'TEST|1' limit 10
) t1; --10 rows

select uni_shop_id
from (
  select * from iceberg_dwd.b_std_trade
  where uni_shop_id = 'TEST|1' limit 10
) t1; --0 rows

select uni_shop_id
from (
  select uni_shop_id as uni_shop_id from iceberg_dwd.b_std_trade
  where uni_shop_id = 'TEST|1' limit 10
) t1; --0 rows

--orc
select uni_shop_id
from (
  select * from iceberg_dwd.trade_test
  where uni_shop_id = 'TEST|1' limit 10
) t1; --10 ROWS{code}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HIVE-27687) Logger variable should be static final as its creation takes more time in query compilation
[ https://issues.apache.org/jira/browse/HIVE-27687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788686#comment-17788686 ]

Stamatis Zampetakis commented on HIVE-27687:

[~rameshkumar] Please fill in the "Fix Version" field; otherwise this entry will never make it into the release notes.

> Logger variable should be static final as its creation takes more time in
> query compilation
> ---
>
> Key: HIVE-27687
> URL: https://issues.apache.org/jira/browse/HIVE-27687
> Project: Hive
> Issue Type: Task
> Components: Hive
> Reporter: Ramesh Kumar Thangarajan
> Assignee: Ramesh Kumar Thangarajan
> Priority: Major
> Labels: pull-request-available
> Attachments: Screenshot 2023-09-12 at 5.03.31 PM.png
>
> During query compilation, LoggerFactory.getLogger() takes up noticeable time.
> Some of the serde classes use a non-static field for the Logger, which forces
> a getLogger() call on every instance creation.
> Making the Logger field static final avoids this code path on every serde
> class construction.
>

-- This message was sent by Atlassian Jira (v8.20.10#820010)
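The pattern the ticket describes can be sketched as follows. This is an illustrative example using java.util.logging rather than Hive's actual SLF4J-based serde classes; the class names are hypothetical:

{code:java}
import java.util.logging.Logger;

// Anti-pattern: an instance field runs a logger-registry lookup
// every time a serde object is constructed.
class SlowSerde {
    final Logger log = Logger.getLogger(SlowSerde.class.getName());
}

// Preferred: static final resolves the logger once per class load,
// and every instance shares it.
class FastSerde {
    static final Logger LOG = Logger.getLogger(FastSerde.class.getName());
}

public class Main {
    public static void main(String[] args) {
        SlowSerde a = new SlowSerde();
        SlowSerde b = new SlowSerde();
        // getLogger caches by name, so both lookups return the same object,
        // but each construction still paid for the lookup.
        System.out.println(a.log == b.log);        // true
        System.out.println(FastSerde.LOG != null); // true: resolved once
    }
}
{code}

The saving per construction is small, but serde objects are created for every table/column during compilation, which is why the cost shows up in aggregate.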
[jira] [Created] (HIVE-27897) Backport of HIVE-22373, HIVE-25553, HIVE-23561, HIVE-24321, HIVE-22856, HIVE-22973, HIVE-21729
Aman Raj created HIVE-27897:
---
Summary: Backport of HIVE-22373, HIVE-25553, HIVE-23561, HIVE-24321, HIVE-22856, HIVE-22973, HIVE-21729
Key: HIVE-27897
URL: https://issues.apache.org/jira/browse/HIVE-27897
Project: Hive
Issue Type: Sub-task
Affects Versions: 3.2.0
Reporter: Aman Raj
Assignee: Aman Raj

-- This message was sent by Atlassian Jira (v8.20.10#820010)