[jira] [Commented] (IMPALA-11265) Iceberg tables have a large memory footprint in catalog cache
[ https://issues.apache.org/jira/browse/IMPALA-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17869374#comment-17869374 ] Gabor Kaszab commented on IMPALA-11265: --- I did the experiment myself too. For me the functional_parquet.iceberg_partitioned table has a size of 2.8M - 3.1M (not always the same, for some reason). Could the difference be caused by a potential Iceberg version bump since your measurements? Anyway, I checked the size of the [BaseTable|https://github.com/apache/iceberg/blob/1.3.x/core/src/main/java/org/apache/iceberg/BaseTable.java] object, and it seems that the TableOperations object takes up almost all of the memory, while the sizes of the other members of this class are negligible.
{code:java}
if (value instanceof BaseTable) {
  BaseTable bt = (BaseTable) value;
  // Measure the big members of BaseTable separately.
  long size1 = SIZEOF.deepSizeOf(bt.operations());
  long size2 = SIZEOF.deepSizeOf(bt.name());
  long size3 = SIZEOF.deepSizeOf(LoggingMetricsReporter.instance());
  // Never fires; just keeps the sizes observable at a breakpoint.
  if (size1 < 0 || size2 < 0 || size3 < 0) throw new RuntimeException("something");
}
{code}
With the above code snippet: size1=3145000, size2=184, size3=16. Note that the MetricsReporter is not exposed from BaseTable in Iceberg 1.3, only in newer versions, so I simply measured LoggingMetricsReporter, as that is what BaseTable uses anyway. So the next step here is to dig one level deeper and check what is consuming that amount of memory in HadoopTableOperations. At first glance it seems that a lot of string configs are stored there, but I will keep investigating. > Iceberg tables have a large memory footprint in catalog cache > - > > Key: IMPALA-11265 > URL: https://issues.apache.org/jira/browse/IMPALA-11265 > Project: IMPALA > Issue Type: Improvement > Components: Catalog >Reporter: Quanlong Huang >Priority: Major > Labels: impala-iceberg > > During the investigation of IMPALA-11260, I found the cache item size of a > (IcebergApiTableCacheKey, org.apache.iceberg.BaseTable) pair could be 30MB. 
> For instance, here are the cache items of the iceberg table > {{{}functional_parquet.iceberg_partitioned{}}}: > {code:java} > weigh=3792, keyClass=class > org.apache.impala.catalog.local.CatalogdMetaProvider$TableCacheKey, > valueClass=class > org.apache.impala.catalog.local.CatalogdMetaProvider$TableMetaRefImpl > weigh=14960, keyClass=class > org.apache.impala.catalog.local.CatalogdMetaProvider$IcebergMetaCacheKey, > valueClass=class org.apache.impala.thrift.TPartialTableInfo > weigh=30546992, keyClass=class > org.apache.impala.catalog.local.CatalogdMetaProvider$IcebergApiTableCacheKey, > valueClass=class org.apache.iceberg.BaseTable > weigh=496, keyClass=class > org.apache.impala.catalog.local.CatalogdMetaProvider$ColStatsCacheKey, > valueClass=class org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj > weigh=496, keyClass=class > org.apache.impala.catalog.local.CatalogdMetaProvider$ColStatsCacheKey, > valueClass=class org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj > weigh=496, keyClass=class > org.apache.impala.catalog.local.CatalogdMetaProvider$ColStatsCacheKey, > valueClass=class org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj > weigh=512, keyClass=class > org.apache.impala.catalog.local.CatalogdMetaProvider$ColStatsCacheKey, > valueClass=class org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj > weigh=472, keyClass=class > org.apache.impala.catalog.local.CatalogdMetaProvider$PartitionListCacheKey, > valueClass=class java.util.ArrayList > weigh=10328, keyClass=class > org.apache.impala.catalog.local.CatalogdMetaProvider$PartitionCacheKey, > valueClass=class > org.apache.impala.catalog.local.CatalogdMetaProvider$PartitionMetadataImpl{code} > Note that this table just has 20 rows. The total memory footprint size is > 30MB. > For a normal partitioned parquet table, the memory footprint is not that > large. 
For instance, here are the cache items for > {{{}functional_parquet.alltypes{}}}: > {code:java} > weigh=4216, keyClass=class > org.apache.impala.catalog.local.CatalogdMetaProvider$TableCacheKey, > valueClass=class > org.apache.impala.catalog.local.CatalogdMetaProvider$TableMetaRefImpl > weigh=480, keyClass=class > org.apache.impala.catalog.local.CatalogdMetaProvider$ColStatsCacheKey, > valueClass=class org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj > weigh=472, keyClass=class > org.apache.impala.catalog.local.CatalogdMetaProvider$ColStatsCacheKey, > valueClass=class org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj > weigh=488, keyClass=class > org.apache.impala.catalog.local.CatalogdMetaProvider$ColStatsCacheKey, > valueClass=class org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj > weigh=488,
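For readers who want to reproduce the measurement above without the SIZEOF utility (whose implementation isn't shown in the comment), a crude reflective walker can approximate deep sizes. Everything below (the class name, the 16-byte header and 8-byte slot estimates) is a hypothetical sketch, not Impala code; accurate numbers need java.lang.instrument.Instrumentation or a dedicated library.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.Set;

public class ApproxDeepSize {
    // Rough stand-in for SIZEOF.deepSizeOf(): walks the object graph
    // reflectively and sums a crude per-object estimate (16-byte header
    // plus 8 bytes per instance field). Arrays and primitives are only
    // counted shallowly, so treat results as relative, not exact.
    static long deepSizeOf(Object root) {
        Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Object> stack = new ArrayDeque<>();
        if (root != null) stack.push(root);
        long size = 0;
        while (!stack.isEmpty()) {
            Object obj = stack.pop();
            if (!seen.add(obj)) continue;  // count each object once
            size += 16;  // object header estimate
            for (Class<?> c = obj.getClass(); c != null; c = c.getSuperclass()) {
                for (Field f : c.getDeclaredFields()) {
                    if (Modifier.isStatic(f.getModifiers())) continue;
                    size += 8;  // per-slot estimate
                    if (f.getType().isPrimitive()) continue;
                    try {
                        f.setAccessible(true);
                        Object child = f.get(obj);
                        if (child != null) stack.push(child);
                    } catch (ReflectiveOperationException | RuntimeException e) {
                        // Field not openable (e.g. JDK-internal module): skip traversal.
                    }
                }
            }
        }
        return size;
    }

    public static void main(String[] args) {
        System.out.println(deepSizeOf(new Object()));        // 16 with these crude estimates
        System.out.println(deepSizeOf(new StringBuilder()));
    }
}
```

Pointing such a walker at the cached BaseTable values would let you break the 30MB weight down member by member, as the comment above does.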
[jira] [Commented] (IMPALA-13244) Timestamp partition error in catalogd when insert data into iceberg table
[ https://issues.apache.org/jira/browse/IMPALA-13244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867242#comment-17867242 ] Gabor Kaszab commented on IMPALA-13244: --- Ok, so I guess the issue is with the order of the partition cols. You defined the table cols in this order: 'xxx', 'code', 'updatetime', while the partition cols are defined in the opposite order. So when you insert and also provide the column list, the columns follow the table's column order rather than the order in which you defined the partition cols. I'm not sure this is a real issue here. > Timestamp partition error in catalogd when insert data into iceberg table > -- > > Key: IMPALA-13244 > URL: https://issues.apache.org/jira/browse/IMPALA-13244 > Project: IMPALA > Issue Type: Bug > Components: Catalog >Affects Versions: Impala 4.4.0 > Environment: centos7.9 >Reporter: Pain Sun >Priority: Major > > create table sql like this: > > CREATE TABLE test111.table1 ( > xxx STRING, > code STRING, > updatetime TIMESTAMP > ) PARTITIONED BY spec( > month(updatetime), > bucket(10, code), > bucket(10, xxx) > ) STORED AS ICEBERG TBLPROPERTIES( > 'iceberg.catalog' = 'hadoop.catalog', > 'iceberg.catalog_location' = '/impalatable', > 'iceberg.table_identifier' = 'middle.table1', > 'write.metadata.previous-versions-max' = '3', > 'write.metadata.delete-after-commit.enabled' = 'true', > 'commit.manifest.min-count-to-merge' = '3', > 'commit.manifest-merge.enabled' = 'true', > 'format-version' = '1' > ); > > > > then insert data into this table like this: > insert into > test111.table1 ( > xxx, > code, > updatetime > ) > select > 'm1' as xxx, > 'c1' as code, > '2024-07-17 13:44:01' as updatetime; > Catalogd error like this : > E0719 09:50:57.458815 126128 JniUtil.java:183] > 964d388b63170b6b:7c6e06c2] Error in Update catalog for > test111.table1. 
Time spent: 6ms > I0719 09:50:57.459015 126128 jni-util.cc:302] > 964d388b63170b6b:7c6e06c2] java.lang.IllegalStateException > at > com.google.common.base.Preconditions.checkState(Preconditions.java:496) > at > org.apache.impala.util.IcebergUtil.parseMonthToTransformMonth(IcebergUtil.java:882) > at > org.apache.impala.util.IcebergUtil.getPartitionValue(IcebergUtil.java:826) > at > org.apache.impala.util.IcebergUtil.partitionDataFromDataFile(IcebergUtil.java:800) > at > org.apache.impala.service.IcebergCatalogOpExecutor.createDataFile(IcebergCatalogOpExecutor.java:445) > at > org.apache.impala.service.IcebergCatalogOpExecutor.appendFiles(IcebergCatalogOpExecutor.java:487) > at > org.apache.impala.service.IcebergCatalogOpExecutor.execute(IcebergCatalogOpExecutor.java:366) > at > org.apache.impala.service.CatalogOpExecutor.updateCatalogImpl(CatalogOpExecutor.java:7443) > at > org.apache.impala.service.CatalogOpExecutor.updateCatalog(CatalogOpExecutor.java:7180) > at > org.apache.impala.service.JniCatalog.lambda$updateCatalog$15(JniCatalog.java:504) > at > org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90) > at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:100) > at > org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:245) > at > org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:259) > at > org.apache.impala.service.JniCatalog.updateCatalog(JniCatalog.java:503) > I0719 09:50:57.459033 126128 status.cc:129] > 964d388b63170b6b:7c6e06c2] IllegalStateException: null > @ 0x10546b4 > @ 0x1b94d34 > @ 0x10040ab > @ 0xfa1c27 > @ 0xf61f84 > @ 0xf4acc3 > @ 0xf5278b > @ 0x14486aa > @ 0x143b0fa > @ 0x1c78d39 > @ 0x1c79fd1 > @ 0x256da47 > @ 0x7fabd2eb8ea5 > @ 0x7fabcfe939fd > E0719 09:50:57.459059 126128 
catalog-server.cc:324] > 964d388b63170b6b:7c6e06c2] IllegalStateException: null > > but the same insert succeeds from Spark. > > versions: > impala: 4.4.0 > jar in impala: iceberg-api-1.3.1.7.2.18.0-369.jar > spark: 3.3.4 > iceberg: apache 1.3.1 > iceberg-spark jar: iceberg-spark-runtime-3.3_2.12-1.3.1.jar > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
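For context on the failing check in the stack trace above: Iceberg's month() transform represents a timestamp partition value as the number of months since 1970-01, and Impala's parseMonthToTransformMonth presumably converts the human-readable 'YYYY-MM' form found in the partition path back to that integer, which is where a misplaced value (like the 'xxx_bucket=2024-07' seen later in this thread) would trip the Preconditions.checkState. A standalone sketch of that conversion (my own reimplementation, not Impala's code):

```java
import java.time.YearMonth;
import java.time.temporal.ChronoUnit;

public class MonthTransform {
    // Iceberg's month() transform stores the partition value as an int:
    // the number of months since 1970-01. A "YYYY-MM" string that doesn't
    // parse as a year-month (e.g. a bucket number landing in this slot)
    // cannot be converted and would fail validation.
    static int toTransformMonth(String yearMonth) {
        return (int) YearMonth.of(1970, 1)
                .until(YearMonth.parse(yearMonth), ChronoUnit.MONTHS);
    }

    public static void main(String[] args) {
        // "updatetime_month=2024-07" in a partition path corresponds to:
        System.out.println(toTransformMonth("2024-07")); // 654
    }
}
```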
[jira] [Commented] (IMPALA-13244) Timestamp partition error in catalogd when insert data into iceberg table
[ https://issues.apache.org/jira/browse/IMPALA-13244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867240#comment-17867240 ] Gabor Kaszab commented on IMPALA-13244: --- Note, if you rewrite the query a bit then it works as expected: {code:java} insert into table1 select 'm1' as xxx, 'c1' as code, '2024-07-17 13:44:01' as updatetime; {code} This succeeds and the file created is the following: {code:java} select file_path from default.table1.`files`; hdfs://localhost:20500/test-warehouse/table1/data/updatetime_month=2024-07/code_bucket=9/xxx_bucket=4/ed41924564367199-298e3f73_462647000_data.0.parq {code} > Timestamp partition error in catalogd when insert data into iceberg table > -- > > Key: IMPALA-13244 > URL: https://issues.apache.org/jira/browse/IMPALA-13244 > Project: IMPALA > Issue Type: Bug > Components: Catalog >Affects Versions: Impala 4.4.0 > Environment: centos7.9 >Reporter: Pain Sun >Priority: Major > > create table sql like this: > > CREATE TABLE test111.table1 ( > xxx STRING, > code STRING, > updatetime TIMESTAMP > ) PARTITIONED BY spec( > month(updatetime), > bucket(10, code), > bucket(10, xxx) > ) STORED AS ICEBERG TBLPROPERTIES( > 'iceberg.catalog' = 'hadoop.catalog', > 'iceberg.catalog_location' = '/impalatable', > 'iceberg.table_identifier' = 'middle.table1', > 'write.metadata.previous-versions-max' = '3', > 'write.metadata.delete-after-commit.enabled' = 'true', > 'commit.manifest.min-count-to-merge' = '3', > 'commit.manifest-merge.enabled' = 'true', > 'format-version' = '1' > ); > > > > then insert data into this table like this: > insert into > test111.table1 ( > xxx, > code, > updatetime > ) > select > 'm1' as xxx, > 'c1' as code, > '2024-07-17 13:44:01' as updatetime; > Catalogd error like this : > E0719 09:50:57.458815 126128 JniUtil.java:183] > 964d388b63170b6b:7c6e06c2] Error in Update catalog for > test111.table1. 
Time spent: 6ms > I0719 09:50:57.459015 126128 jni-util.cc:302] > 964d388b63170b6b:7c6e06c2] java.lang.IllegalStateException > at > com.google.common.base.Preconditions.checkState(Preconditions.java:496) > at > org.apache.impala.util.IcebergUtil.parseMonthToTransformMonth(IcebergUtil.java:882) > at > org.apache.impala.util.IcebergUtil.getPartitionValue(IcebergUtil.java:826) > at > org.apache.impala.util.IcebergUtil.partitionDataFromDataFile(IcebergUtil.java:800) > at > org.apache.impala.service.IcebergCatalogOpExecutor.createDataFile(IcebergCatalogOpExecutor.java:445) > at > org.apache.impala.service.IcebergCatalogOpExecutor.appendFiles(IcebergCatalogOpExecutor.java:487) > at > org.apache.impala.service.IcebergCatalogOpExecutor.execute(IcebergCatalogOpExecutor.java:366) > at > org.apache.impala.service.CatalogOpExecutor.updateCatalogImpl(CatalogOpExecutor.java:7443) > at > org.apache.impala.service.CatalogOpExecutor.updateCatalog(CatalogOpExecutor.java:7180) > at > org.apache.impala.service.JniCatalog.lambda$updateCatalog$15(JniCatalog.java:504) > at > org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90) > at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:100) > at > org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:245) > at > org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:259) > at > org.apache.impala.service.JniCatalog.updateCatalog(JniCatalog.java:503) > I0719 09:50:57.459033 126128 status.cc:129] > 964d388b63170b6b:7c6e06c2] IllegalStateException: null > @ 0x10546b4 > @ 0x1b94d34 > @ 0x10040ab > @ 0xfa1c27 > @ 0xf61f84 > @ 0xf4acc3 > @ 0xf5278b > @ 0x14486aa > @ 0x143b0fa > @ 0x1c78d39 > @ 0x1c79fd1 > @ 0x256da47 > @ 0x7fabd2eb8ea5 > @ 0x7fabcfe939fd > E0719 09:50:57.459059 126128 
catalog-server.cc:324] > 964d388b63170b6b:7c6e06c2] IllegalStateException: null > > but the same insert succeeds from Spark. > > versions: > impala: 4.4.0 > jar in impala: iceberg-api-1.3.1.7.2.18.0-369.jar > spark: 3.3.4 > iceberg: apache 1.3.1 > iceberg-spark jar: iceberg-spark-runtime-3.3_2.12-1.3.1.jar > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (IMPALA-13244) Timestamp partition error in catalogd when insert data into iceberg table
[ https://issues.apache.org/jira/browse/IMPALA-13244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867239#comment-17867239 ] Gabor Kaszab commented on IMPALA-13244: --- Thanks for raising this, [~MadBeeDo]! I tried the repro steps and I also see this issue. What seems completely off is the 'updated_partitions' field of the TUpdateCatalogRequest that updateCatalog received: {code:java} updatetime_month=4/code_bucket=9/xxx_bucket=2024-07 -> {TUpdatedPartition@7973} "TUpdatedPartition(files:[hdfs://localhost:20500/test-warehouse/table1/data/updatetime_month=4/code_bucket=9/xxx_bucket=2024-07/3044d83c3c9b17d3-ca410be0_227283434_data.0.parq])" {code} Apparently, none of the partition cols has the desired value. > Timestamp partition error in catalogd when insert data into iceberg table > -- > > Key: IMPALA-13244 > URL: https://issues.apache.org/jira/browse/IMPALA-13244 > Project: IMPALA > Issue Type: Bug > Components: Catalog >Affects Versions: Impala 4.4.0 > Environment: centos7.9 >Reporter: Pain Sun >Priority: Major > > create table sql like this: > > CREATE TABLE test111.table1 ( > xxx STRING, > code STRING, > updatetime TIMESTAMP > ) PARTITIONED BY spec( > month(updatetime), > bucket(10, code), > bucket(10, xxx) > ) STORED AS ICEBERG TBLPROPERTIES( > 'iceberg.catalog' = 'hadoop.catalog', > 'iceberg.catalog_location' = '/impalatable', > 'iceberg.table_identifier' = 'middle.table1', > 'write.metadata.previous-versions-max' = '3', > 'write.metadata.delete-after-commit.enabled' = 'true', > 'commit.manifest.min-count-to-merge' = '3', > 'commit.manifest-merge.enabled' = 'true', > 'format-version' = '1' > ); > > > > then insert data into this table like this: > insert into > test111.table1 ( > xxx, > code, > updatetime > ) > select > 'm1' as xxx, > 'c1' as code, > '2024-07-17 13:44:01' as updatetime; > Catalogd error like this : > E0719 09:50:57.458815 126128 JniUtil.java:183] > 964d388b63170b6b:7c6e06c2] Error in Update 
catalog for > test111.table1. Time spent: 6ms > I0719 09:50:57.459015 126128 jni-util.cc:302] > 964d388b63170b6b:7c6e06c2] java.lang.IllegalStateException > at > com.google.common.base.Preconditions.checkState(Preconditions.java:496) > at > org.apache.impala.util.IcebergUtil.parseMonthToTransformMonth(IcebergUtil.java:882) > at > org.apache.impala.util.IcebergUtil.getPartitionValue(IcebergUtil.java:826) > at > org.apache.impala.util.IcebergUtil.partitionDataFromDataFile(IcebergUtil.java:800) > at > org.apache.impala.service.IcebergCatalogOpExecutor.createDataFile(IcebergCatalogOpExecutor.java:445) > at > org.apache.impala.service.IcebergCatalogOpExecutor.appendFiles(IcebergCatalogOpExecutor.java:487) > at > org.apache.impala.service.IcebergCatalogOpExecutor.execute(IcebergCatalogOpExecutor.java:366) > at > org.apache.impala.service.CatalogOpExecutor.updateCatalogImpl(CatalogOpExecutor.java:7443) > at > org.apache.impala.service.CatalogOpExecutor.updateCatalog(CatalogOpExecutor.java:7180) > at > org.apache.impala.service.JniCatalog.lambda$updateCatalog$15(JniCatalog.java:504) > at > org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90) > at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:100) > at > org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:245) > at > org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:259) > at > org.apache.impala.service.JniCatalog.updateCatalog(JniCatalog.java:503) > I0719 09:50:57.459033 126128 status.cc:129] > 964d388b63170b6b:7c6e06c2] IllegalStateException: null > @ 0x10546b4 > @ 0x1b94d34 > @ 0x10040ab > @ 0xfa1c27 > @ 0xf61f84 > @ 0xf4acc3 > @ 0xf5278b > @ 0x14486aa > @ 0x143b0fa > @ 0x1c78d39 > @ 0x1c79fd1 > @ 0x256da47 > @ 0x7fabd2eb8ea5 > @ 0x7fabcfe939fd > E0719 
09:50:57.459059 126128 catalog-server.cc:324] > 964d388b63170b6b:7c6e06c2] IllegalStateException: null > > but the same insert succeeds from Spark. > > versions: > impala: 4.4.0 > jar in impala: iceberg-api-1.3.1.7.2.18.0-369.jar > spark: 3.3.4 > iceberg: apache 1.3.1 > iceberg-spark jar:
[jira] [Commented] (IMPALA-13242) DROP PARTITION can't drop partitions before a partition evolution if the partition transform was changed
[ https://issues.apache.org/jira/browse/IMPALA-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867002#comment-17867002 ] Gabor Kaszab commented on IMPALA-13242: --- Actually, we don't need a partition transform change to repro this: {code:java} create table part_evol_tbl (i int, j int) partitioned by spec (i) stored as iceberg; insert into part_evol_tbl values (1, 11), (2, 22); alter table part_evol_tbl set partition spec (j); insert into part_evol_tbl values (4, 44); alter table part_evol_tbl drop partition (i=1); Query: alter table part_evol_tbl drop partition (i=1) ERROR: AnalysisException: Partition exprs cannot contain non-partition column(s): i {code} If a column used to be a partition column but isn't one anymore, we can't drop the partitions that involve that column. > DROP PARTITION can't drop partitions before a partition evolution if the > partition transform was changed > > > Key: IMPALA-13242 > URL: https://issues.apache.org/jira/browse/IMPALA-13242 > Project: IMPALA > Issue Type: Bug > Components: Frontend >Affects Versions: Impala 4.4.0 >Reporter: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > > Steps to set up the repro table: > {code:java} > create table year_part_tbl (i int, d date) partitioned by spec (year(d)) > stored as iceberg; > insert into year_part_tbl values (1, "2024-07-17"), (2, "2024-07-16"); > alter table year_part_tbl set partition spec (month(d)); > insert into year_part_tbl values (3, "2024-07-18"); > {code} > After the partition evolution we can't drop the partitions with year() > {code:java} > alter table year_part_tbl drop partition (year(d)=2024); > Query: alter table year_part_tbl drop partition (year(d)=2024) > ERROR: AnalysisException: Can't filter column 'd' with transform type: 'YEAR' > {code} > I guess the issue here is that we compare the filter expression against the > latest partition spec and there the transform on the column is month() > instead of year(). 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13242) DROP PARTITION can't drop partitions before a partition evolution if the partition transform was changed
Gabor Kaszab created IMPALA-13242: - Summary: DROP PARTITION can't drop partitions before a partition evolution if the partition transform was changed Key: IMPALA-13242 URL: https://issues.apache.org/jira/browse/IMPALA-13242 Project: IMPALA Issue Type: Bug Components: Frontend Affects Versions: Impala 4.4.0 Reporter: Gabor Kaszab Steps to set up the repro table: {code:java} create table year_part_tbl (i int, d date) partitioned by spec (year(d)) stored as iceberg; insert into year_part_tbl values (1, "2024-07-17"), (2, "2024-07-16"); alter table year_part_tbl set partition spec (month(d)); insert into year_part_tbl values (3, "2024-07-18"); {code} After the partition evolution we can't drop the partitions with year() {code:java} alter table year_part_tbl drop partition (year(d)=2024); Query: alter table year_part_tbl drop partition (year(d)=2024) ERROR: AnalysisException: Can't filter column 'd' with transform type: 'YEAR' {code} I guess the issue here is that we compare the filter expression against the latest partition spec and there the transform on the column is month() instead of year(). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
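The analysis above points at the DROP PARTITION filter being validated only against the latest partition spec. A fix would presumably need to consult every historical spec (Iceberg's Table.specs() exposes them all). Below is a simplified, self-contained model of that check, where each spec is reduced to a column-to-transform map; the names and structure are illustrative, not Impala's actual types.

```java
import java.util.List;
import java.util.Map;

public class SpecHistoryCheck {
    // Illustrative model only: a partition spec reduced to column -> transform.
    // Real code would iterate org.apache.iceberg.Table.specs().values() and
    // inspect each spec's PartitionFields.
    static boolean transformInAnySpec(List<Map<String, String>> specHistory,
                                      String column, String transform) {
        for (Map<String, String> spec : specHistory) {
            if (transform.equals(spec.get(column))) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<Map<String, String>> history = List.of(
                Map.of("d", "YEAR"),    // original spec: year(d)
                Map.of("d", "MONTH"));  // after evolution: month(d)
        // Checking only the last entry reproduces the reported rejection of
        // year(d)=2024; checking the whole history accepts it.
        System.out.println(transformInAnySpec(history, "d", "YEAR"));  // true
    }
}
```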
[jira] [Closed] (IMPALA-12388) Strip file/pos information from tuples once they are not needed
[ https://issues.apache.org/jira/browse/IMPALA-12388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab closed IMPALA-12388. - Fix Version/s: Not Applicable Resolution: Won't Fix I explored some possible implementations for this; the simplest one unconditionally set the relevant null indicators to true for the position-delete related slots. This adds the least overhead on top of the existing logic in terms of performance. I then started perf verifications on both TPCDS and TPCH, but apparently for some queries this brings an actual perf degradation. In the worst case (a select-only query) it results in a 5% increase in runtime. There were some queries where I observed improvements of around 2-3%, but the overall results weren't convincing enough to progress. Closing this as Won't Fix as the initial results aren't good enough to proceed. > Strip file/pos information from tuples once they are not needed > --- > > Key: IMPALA-12388 > URL: https://issues.apache.org/jira/browse/IMPALA-12388 > Project: IMPALA > Issue Type: Bug > Components: Backend, Frontend >Reporter: Zoltán Borók-Nagy >Assignee: Gabor Kaszab >Priority: Major > Labels: Performance, impala-iceberg, performance > Fix For: Not Applicable > > > When Impala processes Iceberg V2 tables that have position delete files it > needs to add extra slots to the input tuples (required by the ANTI JOIN > between data files and delete files): > * STRING file path > * BIGINT position > This makes the row-size larger by 20 bytes. Please note that this 20 bytes is > only the increase in the tuple memory (12 byte STRING slot plus 8 byte BIGINT > slot), the file path actually points to a potentially large string (100-200 > bytes) stored in a heap buffer. 
> In the plan fragments of the SCANs we only create a string object per file > for the file path (and set it in the template tuple), so the situation is not > that bad, but once we send the rows over the network the STRINGs are getting > duplicated per record, which can add substantial network and serialization > overhead. > One way to resolve this is to re-materialize the tuples after the Iceberg V2 > scan is done, and only store the interesting slots. This mechanism also saves > us the 20 bytes per tuple overhead, but the re-materialization cost can be > high. > Another, easier solution is to just NULL-out the file path and position slots > once they are not needed anymore. > Of course if the user SELECTs the virtual column {{INPUT_FILE_NAME / > FILE_POSITION}} we cannot re-materialize / NULL out. > Given the following plan: > {noformat} > UNION ALL > /\ >/ \ > SCAN V2 ANTI JOIN > data files / \ > without /\ > deletes SCAN SCAN > data files delete files > with deletes > {noformat} > In the "SCAN data files without deletes" we shouldn't even fill the file > path / position slots. The latter also saves some computational cost. > In our V2 ANTI JOIN operator (IcebergDeleteNode) we can NULL out the file > path / pos slots once the data records are processed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
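The "NULL out the file path / pos slots" idea explored in the resolution comment can be illustrated with a toy null-indicator bitmap. Impala's actual tuple layout differs, so treat the class name and the one-byte layout here as assumptions for illustration only.

```java
public class NullIndicator {
    // Hypothetical simplified tuple layout: one byte of null-indicator bits,
    // one bit per slot. Setting a slot's bit marks it NULL so downstream
    // operators need not serialize or copy its (potentially large) value.
    static byte setNull(byte bits, int slotIdx) {
        return (byte) (bits | (1 << slotIdx));
    }

    static boolean isNull(byte bits, int slotIdx) {
        return (bits & (1 << slotIdx)) != 0;
    }

    public static void main(String[] args) {
        byte bits = 0;
        // Pretend slot 2 holds the file-path STRING; NULL it out once the
        // anti-join no longer needs it.
        bits = setNull(bits, 2);
        System.out.println(isNull(bits, 2)); // true
    }
}
```

The appeal of this approach, as the comment notes, is that flipping a bit is far cheaper than re-materializing tuples, though the measured end-to-end gains were not enough to justify it.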
[jira] [Assigned] (IMPALA-12388) Strip file/pos information from tuples once they are not needed
[ https://issues.apache.org/jira/browse/IMPALA-12388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab reassigned IMPALA-12388: - Assignee: Gabor Kaszab > Strip file/pos information from tuples once they are not needed > --- > > Key: IMPALA-12388 > URL: https://issues.apache.org/jira/browse/IMPALA-12388 > Project: IMPALA > Issue Type: Bug > Components: Backend, Frontend >Reporter: Zoltán Borók-Nagy >Assignee: Gabor Kaszab >Priority: Major > Labels: Performance, impala-iceberg, performance > > When Impala processes Iceberg V2 tables that have position delete files it > needs to add extra slots to the input tuples (required by the ANTI JOIN > between data files and delete files): > * STRING file path > * BIGINT position > This makes the row-size larger by 20 bytes. Please note that this 20 bytes is > only the increase in the tuple memory (12 byte STRING slot plus 8 byte BIGINT > slot), the file path actually points to a potentially large string (100-200 > bytes) stored in a heap buffer. > In the plan fragments of the SCANs we only create a string object per file > for the file path (and set it in the template tuple), so the situation is not > that bad, but once we send the rows over the network the STRINGs are getting > duplicated per record, which can add substantial network and serialization > overhead. > One way to resolve this is to re-materialize the tuples after the Iceberg V2 > scan is done, and only store the interesting slots. This mechanism also saves > us the 20 bytes per tuple overhead, but the re-materialization cost can be > high. > Another, easier solution is to just NULL-out the file path and position slots > once they are not needed anymore. > Of course if the user SELECTs the virtual column {{INPUT_FILE_NAME / > FILE_POSITION}} we cannot re-materialize / NULL out. 
> Given the following plan: > {noformat} > UNION ALL > /\ >/ \ > SCAN V2 ANTI JOIN > data files / \ > without /\ > deletes SCAN SCAN > data files delete files > with deletes > {noformat} > In the "SCAN data files without deletes" we shouldn't even fill the file > path / position slots. The latter also saves some computational cost. > In our V2 ANTI JOIN operator (IcebergDeleteNode) we can NULL out the file > path / pos slots once the data records are processed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Assigned] (IMPALA-11752) Handle s3:// paths in Iceberg tables
[ https://issues.apache.org/jira/browse/IMPALA-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab reassigned IMPALA-11752: - Assignee: Gabor Kaszab > Handle s3:// paths in Iceberg tables > > > Key: IMPALA-11752 > URL: https://issues.apache.org/jira/browse/IMPALA-11752 > Project: IMPALA > Issue Type: Bug > Components: Backend, Frontend >Reporter: Zoltán Borók-Nagy >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > > Components using > [S3FileIO|https://iceberg.apache.org/docs/latest/aws/#s3-fileio] might write > out file paths starting with 's3://' instead of 's3a://'. The latter is used > by > [HadoopFileIO|https://iceberg.apache.org/docs/latest/aws/#hadoop-s3a-filesystem], > which Impala uses. > By default, HadoopFileIO doesn't interpret paths starting with 's3://'. > (Probably this could be resolved by setting "fs.s3.impl" to > "org.apache.hadoop.fs.s3a.S3AFileSystem" so that an s3a fs instance is > created) > [FeIcebergTable.Utils.FeIcebergTable()|https://github.com/apache/impala/blob/2733d039ad4a830a1ea34c1a75d2b666788e39a9/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java#L671-L689] > depends on the file paths returned by recursive file listing matching the file > paths in Iceberg metadata files. But the recursive listing returns s3a:// > paths, while metadata contains s3:// paths, which means we'll load files > one-by-one as we won't find the files in the hash map 'hdfsFileDescMap'. > Moreover, position delete file processing is also based on exact matches > of the file URIs, therefore entries with s3:// paths won't have the > desired effect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
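Besides setting "fs.s3.impl" as the description suggests, another possible direction is to normalize the scheme before populating or probing 'hdfsFileDescMap'. A minimal sketch, assuming the scheme prefix is the only difference between the two URI forms (the class and method names are hypothetical, not Impala code):

```java
public class S3PathNormalizer {
    // Hypothetical helper: rewrite s3:// URIs from Iceberg metadata to the
    // s3a:// form that HadoopFileIO / recursive file listing produces, so
    // hash-map lookups by path match. Assumes the scheme is the only
    // difference between the two path forms.
    static String toS3a(String path) {
        return path.startsWith("s3://")
                ? "s3a://" + path.substring("s3://".length())
                : path;
    }

    public static void main(String[] args) {
        System.out.println(toS3a("s3://bucket/warehouse/tbl/data/f.parq"));
        // -> s3a://bucket/warehouse/tbl/data/f.parq
    }
}
```

For position delete handling, the same normalization would have to be applied on both sides of the comparison, since matching there is also by exact URI.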
[jira] [Commented] (IMPALA-12190) Renaming table will cause losing privileges for non-admin users
[ https://issues.apache.org/jira/browse/IMPALA-12190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848515#comment-17848515 ] Gabor Kaszab commented on IMPALA-12190: --- I don't think this can be trivially implemented from the Impala side. I recall we also opened a Ranger ticket after analyzing this issue, and agreed that Ranger should first provide an API that clients can use when resources are renamed. > Renaming table will cause losing privileges for non-admin users > --- > > Key: IMPALA-12190 > URL: https://issues.apache.org/jira/browse/IMPALA-12190 > Project: IMPALA > Issue Type: Bug > Components: Catalog >Reporter: Gabor Kaszab >Assignee: Sai Hemanth Gantasala >Priority: Critical > Labels: alter-table, authorization, ranger > > Let's say user 'a' gets some privileges on table 't'. When this table gets > renamed (even by user 'a') then user 'a' loses its privileges on that table. > > Repro steps: > # Start impala with Ranger > # start impala-shell as admin (-u admin) > # create table tmp (i int, s string) stored as parquet; > # grant all on table tmp to user ; > # grant all on table tmp to user ; > {code:java} > Query: show grant user on table tmp > +++--+---++-+--+-+-+---+--+-+ > | principal_type | principal_name | database | table | column | uri | > storage_type | storage_uri | udf | privilege | grant_option | create_time | > +++--+---++-+--+-+-+---+--+-+ > | USER | | default | tmp | * | | > | | | all | false | NULL | > +++--+---++-+--+-+-+---+--+-+ > Fetched 1 row(s) in 0.01s {code} > # alter table tmp rename to tmp_1234; > # show grant user on table tmp_1234; > {code:java} > Query: show grant user on table tmp_1234 > Fetched 0 row(s) in 0.17s{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Reopened] (IMPALA-13067) Some regex make the tests unconditionally pass
[ https://issues.apache.org/jira/browse/IMPALA-13067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab reopened IMPALA-13067: --- Accidentally closed this one > Some regex make the tests unconditionally pass > -- > > Key: IMPALA-13067 > URL: https://issues.apache.org/jira/browse/IMPALA-13067 > Project: IMPALA > Issue Type: Bug > Components: Infrastructure >Reporter: Gabor Kaszab >Priority: Major > Labels: test-framework > Fix For: Impala 4.5.0 > > > This issue came out in the Iceberg metadata table tests where this regex was > used: > [1-9]\d*|0 > > The "|0" part somehow confused the test framework, so the tests passed > regardless of what you provided as the expected result. One > workaround was to put the regex expression between parentheses, or simply use > "\d+". https://issues.apache.org/jira/browse/IMPALA-13055 applied this second > workaround on the tests. > Some analysis of why the test framework behaves this way would be great, > and if it's indeed an issue in the framework, we should fix it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
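The confusing behavior described above is consistent with how an ungrouped '|' works in a regex: alternation has the lowest precedence, so it splits the entire pattern, and a branch like '0' can then match on its own, independent of everything before the '|'. A standalone java.util.regex illustration (the 'row_count:' prefix is invented for the example, and the actual test framework's matching logic may differ):

```java
import java.util.regex.Pattern;

public class RegexAlternation {
    public static void main(String[] args) {
        // Without grouping, '|' splits the WHOLE pattern:
        // "row_count: [1-9]\d*|0" means ("row_count: [1-9]\d*") OR ("0"),
        // so the bare string "0" is a full match of the pattern.
        String ungrouped = "row_count: [1-9]\\d*|0";
        // Parentheses confine the alternation to the number part.
        String grouped = "row_count: ([1-9]\\d*|0)";

        System.out.println(Pattern.matches(ungrouped, "0"));          // true (!)
        System.out.println(Pattern.matches(grouped, "0"));            // false
        System.out.println(Pattern.matches(grouped, "row_count: 0")); // true
    }
}
```

If the framework additionally searches for the pattern as a substring rather than full-matching, a stray '0' branch would match almost any output line, which would explain tests passing unconditionally.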
[jira] [Closed] (IMPALA-13055) Some Iceberg metadata table tests don't assert
[ https://issues.apache.org/jira/browse/IMPALA-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab closed IMPALA-13055. - Fix Version/s: Impala 4.5.0 Resolution: Fixed > Some Iceberg metadata table tests don't assert > > > Key: IMPALA-13055 > URL: https://issues.apache.org/jira/browse/IMPALA-13055 > Project: IMPALA > Issue Type: Test >Reporter: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > Fix For: Impala 4.5.0 > > > Some tests in the Iceberg metadata table suite use the following regex to > verify numbers in the output: [1-9]\d*|0 > However, when this pattern is given, the test unconditionally passes. One could > put the pattern within parentheses, or simply verify \d+ -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Closed] (IMPALA-13067) Some regex make the tests unconditionally pass
[ https://issues.apache.org/jira/browse/IMPALA-13067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab closed IMPALA-13067. - Fix Version/s: Impala 4.5.0 Resolution: Fixed > Some regex make the tests unconditionally pass > -- > > Key: IMPALA-13067 > URL: https://issues.apache.org/jira/browse/IMPALA-13067 > Project: IMPALA > Issue Type: Bug > Components: Infrastructure >Reporter: Gabor Kaszab >Priority: Major > Labels: test-framework > Fix For: Impala 4.5.0 > > > This issue came out in the Iceberg metadata table tests where this regex was > used: > [1-9]\d*|0 > > The "|0" part confused the test framework, and then the tests passed > regardless of what you provided as an expected result. One > workaround was to put the regex expression between parentheses. Or simply use > "\d+". https://issues.apache.org/jira/browse/IMPALA-13055 applied this second > workaround to the tests. > Some analysis of why the test framework behaves this way would be great, > and if it's indeed an issue of the framework, we should fix it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12266) Sporadic failure after migrating a table to Iceberg
[ https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847169#comment-17847169 ] Gabor Kaszab commented on IMPALA-12266: --- [~laszlog] I see you increased the priority of this. Note, there is another Jira for the root cause: https://issues.apache.org/jira/browse/IMPALA-12712 If that's fixed this would be gone too. > Sporadic failure after migrating a table to Iceberg > --- > > Key: IMPALA-12266 > URL: https://issues.apache.org/jira/browse/IMPALA-12266 > Project: IMPALA > Issue Type: Bug > Components: fe >Affects Versions: Impala 4.2.0 >Reporter: Tamas Mate >Assignee: Gabor Kaszab >Priority: Critical > Labels: impala-iceberg > Attachments: > catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, > impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1 > > > TestIcebergTable.test_convert_table test failed in a recent verify job's > dockerised tests: > https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629 > {code:none} > E ImpalaBeeswaxException: ImpalaBeeswaxException: > EINNER EXCEPTION: > EMESSAGE: AnalysisException: Failed to load metadata for table: > 'parquet_nopartitioned' > E CAUSED BY: TableLoadingException: Could not load table > test_convert_table_cdba7383.parquet_nopartitioned from catalog > E CAUSED BY: TException: > TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, > error_msgs:[NullPointerException: null]), lookup_status:OK) > {code} > {code:none} > E0704 19:09:22.980131 833 JniUtil.java:183] > 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of > TABLE:test_convert_table_cdba7383.parquet_nopartitioned. 
Time spent: 49ms > I0704 19:09:22.980309 833 jni-util.cc:288] > 7145c21173f2c47b:2579db55] java.lang.NullPointerException > at > org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357) > at > org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300) > at > org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587) > at > org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513) > at > org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480) > at > org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397) > at > org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90) > at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89) > at > org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109) > at > org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238) > at > org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396) > I0704 19:09:22.980324 833 status.cc:129] 7145c21173f2c47b:2579db55] > NullPointerException: null > @ 0x1012f9f impala::Status::Status() > @ 0x187f964 impala::JniUtil::GetJniExceptionMsg() > @ 0xfee920 impala::JniCall::Call<>() > @ 0xfccd0f impala::Catalog::GetPartialCatalogObject() > @ 0xfb55a5 > impala::CatalogServiceThriftIf::GetPartialCatalogObject() > @ 0xf7a691 > impala::CatalogServiceProcessorT<>::process_GetPartialCatalogObject() > @ 0xf82151 impala::CatalogServiceProcessorT<>::dispatchCall() > @ 0xee330f apache::thrift::TDispatchProcessor::process() > @ 0x1329246 > apache::thrift::server::TAcceptQueueServer::Task::run() > @ 0x1315a89 
impala::ThriftThread::RunRunnable() > @ 0x131773d > boost::detail::function::void_function_obj_invoker0<>::invoke() > @ 0x195ba8c impala::Thread::SuperviseThread() > @ 0x195c895 boost::detail::thread_data<>::run() > @ 0x23a03a7 thread_proxy > @ 0x7faaad2a66ba start_thread > @ 0x7f2c151d clone > E0704 19:09:23.006968 833 catalog-server.cc:278] > 7145c21173f2c47b:2579db55] NullPointerException: null > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13067) Some regex make the tests unconditionally pass
Gabor Kaszab created IMPALA-13067: - Summary: Some regex make the tests unconditionally pass Key: IMPALA-13067 URL: https://issues.apache.org/jira/browse/IMPALA-13067 Project: IMPALA Issue Type: Bug Components: Infrastructure Reporter: Gabor Kaszab This issue came out in the Iceberg metadata table tests where this regex was used: [1-9]\d*|0 The "|0" part confused the test framework, and then the tests passed regardless of what you provided as an expected result. One workaround was to put the regex expression between parentheses. Or simply use "\d+". https://issues.apache.org/jira/browse/IMPALA-13055 applied this second workaround to the tests. Some analysis of why the test framework behaves this way would be great, and if it's indeed an issue of the framework, we should fix it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
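Why the "|0" branch can defeat the assertion: in a regex, alternation (|) has the lowest precedence, so if the framework concatenates the user pattern into a larger expression, everything after the | detaches from the intended context. The sketch below is hypothetical — the frameworkMatches helper and the "row_regex: " prefix are illustrative stand-ins, not the real test framework's code:

```java
import java.util.regex.Pattern;

public class RegexPitfall {
    // Hypothetical stand-in for a test framework that embeds a user-supplied
    // pattern into a larger regex. Because '|' binds with the lowest
    // precedence, "row_regex: [1-9]\d*|0" parses as "(row_regex: [1-9]\d*)|(0)":
    // the "|0" branch matches any line containing a '0', no prefix required.
    static boolean frameworkMatches(String userPattern, String actualLine) {
        Pattern p = Pattern.compile("row_regex: " + userPattern);
        return p.matcher(actualLine).find();
    }

    public static void main(String[] args) {
        // Unparenthesized alternation: matches a line it should reject.
        System.out.println(frameworkMatches("[1-9]\\d*|0", "unexpected value 0"));   // true
        // Parenthesized (the workaround from the ticket): behaves as intended.
        System.out.println(frameworkMatches("([1-9]\\d*|0)", "unexpected value 0")); // false
        System.out.println(frameworkMatches("([1-9]\\d*|0)", "row_regex: 42"));      // true
    }
}
```

Wrapping the alternation in a group, or using \d+, keeps the pattern self-contained, which matches the workaround applied in IMPALA-13055.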
[jira] [Created] (IMPALA-13055) Some Iceberg metadata table tests don't assert
Gabor Kaszab created IMPALA-13055: - Summary: Some Iceberg metadata table tests don't assert Key: IMPALA-13055 URL: https://issues.apache.org/jira/browse/IMPALA-13055 Project: IMPALA Issue Type: Test Reporter: Gabor Kaszab Some tests in the Iceberg metadata table suite use the following regex to verify numbers in the output: [1-9]\d*|0 However, when this pattern is given, the test unconditionally passes. One could put the pattern within parentheses, or simply verify \d+ -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-13055) Some Iceberg metadata table tests don't assert
[ https://issues.apache.org/jira/browse/IMPALA-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab updated IMPALA-13055: -- Labels: impala-iceberg (was: ) > Some Iceberg metadata table tests don't assert > > > Key: IMPALA-13055 > URL: https://issues.apache.org/jira/browse/IMPALA-13055 > Project: IMPALA > Issue Type: Test >Reporter: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > > Some tests in the Iceberg metadata table suite use the following regex to > verify numbers in the output: [1-9]\d*|0 > However, when this pattern is given, the test unconditionally passes. One could > put the pattern within parentheses, or simply verify \d+ -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Work started] (IMPALA-13029) Add test for equality deletes with different file format
[ https://issues.apache.org/jira/browse/IMPALA-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-13029 started by Gabor Kaszab. - > Add test for equality deletes with different file format > > > Key: IMPALA-13029 > URL: https://issues.apache.org/jira/browse/IMPALA-13029 > Project: IMPALA > Issue Type: Test > Components: Frontend >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > > We should test equality deletes in Parquet, ORC and Avro similarly to the > tests we have for position delete file formats. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13029) Add test for equality deletes with different file format
Gabor Kaszab created IMPALA-13029: - Summary: Add test for equality deletes with different file format Key: IMPALA-13029 URL: https://issues.apache.org/jira/browse/IMPALA-13029 Project: IMPALA Issue Type: Test Components: Frontend Reporter: Gabor Kaszab We should test equality deletes in Parquet, ORC and Avro similarly to the tests we have for position delete file formats. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Resolved] (IMPALA-12970) Test failure at test_read_equality_deletes in test_iceberg in exhaustive build
[ https://issues.apache.org/jira/browse/IMPALA-12970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab resolved IMPALA-12970. --- Fix Version/s: Impala 4.4.0 Resolution: Fixed > Test failure at test_read_equality_deletes in test_iceberg in exhaustive build > -- > > Key: IMPALA-12970 > URL: https://issues.apache.org/jira/browse/IMPALA-12970 > Project: IMPALA > Issue Type: Bug > Components: Backend >Reporter: Yida Wu >Assignee: Gabor Kaszab >Priority: Major > Labels: broken-build > Fix For: Impala 4.4.0 > > > An error is observed in the data-cache exhaustive build in > test_read_equality_deletes with following message: > {code:java} > query_test.test_iceberg.TestIcebergV2Table.test_read_equality_deletes[protocol: > beeswax | table_format: parquet/none | exec_option: {'test_replan': 1, > 'disable_optimized_iceberg_v2_read': 1, 'batch_size': 0, 'num_nodes': 0, > 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, > 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0}] (from pytest) > {code} > *Error Message* > {code:java} > query_test/test_iceberg.py:1456: in test_read_equality_deletes > self.run_test_case('QueryTest/iceberg-v2-read-equality-deletes', vector) > common/impala_test_suite.py:725: in run_test_case result = exec_fn(query, > user=test_section.get('USER', '').strip() or None) > common/impala_test_suite.py:660: in __exec_in_impala result = > self.__execute_query(target_impalad_client, query, user=user) > common/impala_test_suite.py:1013: in __execute_query return > impalad_client.execute(query, user=user) common/impala_connection.py:215: in > execute fetch_profile_after_close=fetch_profile_after_close) > beeswax/impala_beeswax.py:191: in execute handle = > self.__execute_query(query_string.strip(), user=user) > beeswax/impala_beeswax.py:382: in __execute_query handle = > self.execute_query_async(query_string, user=user) > beeswax/impala_beeswax.py:376: in execute_query_async handle = > self.__do_rpc(lambda: 
self.imp_service.query(query,)) > beeswax/impala_beeswax.py:539: in __do_rpc raise > ImpalaBeeswaxException(self.__build_error_message(b), b) E > ImpalaBeeswaxException: ImpalaBeeswaxException: EINNER EXCEPTION: 'beeswaxd.ttypes.BeeswaxException'> EMESSAGE: > ConcurrentModificationException: null > {code} > *Stacktrace* > {code:java} > query_test/test_iceberg.py:1456: in test_read_equality_deletes > self.run_test_case('QueryTest/iceberg-v2-read-equality-deletes', vector) > common/impala_test_suite.py:725: in run_test_case > result = exec_fn(query, user=test_section.get('USER', '').strip() or None) > common/impala_test_suite.py:660: in __exec_in_impala > result = self.__execute_query(target_impalad_client, query, user=user) > common/impala_test_suite.py:1013: in __execute_query > return impalad_client.execute(query, user=user) > common/impala_connection.py:215: in execute > fetch_profile_after_close=fetch_profile_after_close) > beeswax/impala_beeswax.py:191: in execute > handle = self.__execute_query(query_string.strip(), user=user) > beeswax/impala_beeswax.py:382: in __execute_query > handle = self.execute_query_async(query_string, user=user) > beeswax/impala_beeswax.py:376: in execute_query_async > handle = self.__do_rpc(lambda: self.imp_service.query(query,)) > beeswax/impala_beeswax.py:539: in __do_rpc > raise ImpalaBeeswaxException(self.__build_error_message(b), b) > E ImpalaBeeswaxException: ImpalaBeeswaxException: > EINNER EXCEPTION: > EMESSAGE: ConcurrentModificationException: null > {code} > *Standard Error* > {code:java} > SET > client_identifier=query_test/test_iceberg.py::TestIcebergV2Table::()::test_read_equality_deletes[protocol:beeswax|table_format:parquet/none|exec_option:{'test_replan':1;'disable_optimized_iceberg_v2_read':1;'batch_size':0;'num_nodes':0;'disable_codegen_rows_threshold':0;'d; > -- connecting to: localhost:21000 > -- 2024-04-03 07:04:53,469 INFO MainThread: Could not connect to ('::1', > 21000, 0, 0) > Traceback (most recent call 
last): > File > "/data/jenkins/workspace/impala-asf-master-exhaustive-data-cache/repos/Impala/infra/python/env-gcc10.4.0/lib/python2.7/site-packages/thrift/transport/TSocket.py", > line 137, in open > handle.connect(sockaddr) > File > "/data/jenkins/workspace/impala-asf-master-exhaustive-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py", > line 228, in meth > return getattr(self._sock,name)(*args) > error: [Errno 111] Connection refused > -- connecting to localhost:21050 with impyla > -- 2024-04-03 07:04:53,469 INFO MainThread: Could not connect to ('::1', >
[jira] [Assigned] (IMPALA-8809) Refresh a subset of partitions for ACID tables
[ https://issues.apache.org/jira/browse/IMPALA-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab reassigned IMPALA-8809: Assignee: (was: Gabor Kaszab) > Refresh a subset of partitions for ACID tables > -- > > Key: IMPALA-8809 > URL: https://issues.apache.org/jira/browse/IMPALA-8809 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Affects Versions: Impala 3.3.0 >Reporter: Gabor Kaszab >Priority: Critical > Labels: impala-acid > > Enhancing the REFRESH logic to handle ACID tables was covered by this change: > https://issues.apache.org/jira/browse/IMPALA-8600 > Basically, each user-initiated REFRESH PARTITION is rejected, while the > REFRESH_PARTITION events in the event processor actually do a full table > load for ACID tables. > There is room for improvement: when a full table refresh is being executed on > an ACID table, we can have 2 scenarios: > - If there were some schema changes, then reload the full table. Identifying such > a scenario should be possible by checking the table-level writeId. However, > there is a bug in Hive where it doesn't update that field for partitioned > tables (https://issues.apache.org/jira/browse/HIVE-22062). This would be the > desired way, but it could also be worked around by checking other fields like > lastDdlChanged. > - If a full table refresh is not needed, then we should fetch the > partition-level writeIds and reload only the ones that are out-of-date > locally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
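The two scenarios described in the ticket can be sketched roughly as follows. This is a hypothetical outline only — the class, method, and parameter names (AcidRefreshSketch, partitionsToReload, cachedTableWriteId, etc.) are illustrative and not part of Impala's actual catalog code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class AcidRefreshSketch {
    // Decide what to reload when refreshing an ACID table.
    // Returns null to signal that a full table reload is required,
    // otherwise the list of partition names whose cached state is stale.
    static List<String> partitionsToReload(
            long cachedTableWriteId, long currentTableWriteId,
            Map<String, Long> cachedPartWriteIds,
            Map<String, Long> currentPartWriteIds) {
        // Scenario 1: the table-level writeId advanced (e.g. schema change)
        // -> full reload. (Per HIVE-22062 this field may not be updated for
        // partitioned tables; lastDdlChanged could serve as a workaround.)
        if (currentTableWriteId != cachedTableWriteId) return null;
        // Scenario 2: reload only the partitions whose writeId is out of date
        // locally (missing from the cache or behind the current writeId).
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> e : currentPartWriteIds.entrySet()) {
            Long cached = cachedPartWriteIds.get(e.getKey());
            if (cached == null || cached < e.getValue()) stale.add(e.getKey());
        }
        return stale;
    }
}
```

The point of the split is that the cheap partition-level comparison runs only when the table-level check says no schema-affecting change happened, so the common case avoids a full table load.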
[jira] [Work stopped] (IMPALA-8809) Refresh a subset of partitions for ACID tables
[ https://issues.apache.org/jira/browse/IMPALA-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-8809 stopped by Gabor Kaszab. > Refresh a subset of partitions for ACID tables > -- > > Key: IMPALA-8809 > URL: https://issues.apache.org/jira/browse/IMPALA-8809 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Affects Versions: Impala 3.3.0 >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Critical > Labels: impala-acid > > Enhancing the REFRESH logic to handle ACID tables was covered by this change: > https://issues.apache.org/jira/browse/IMPALA-8600 > Basically, each user-initiated REFRESH PARTITION is rejected, while the > REFRESH_PARTITION events in the event processor actually do a full table > load for ACID tables. > There is room for improvement: when a full table refresh is being executed on > an ACID table, we can have 2 scenarios: > - If there were some schema changes, then reload the full table. Identifying such > a scenario should be possible by checking the table-level writeId. However, > there is a bug in Hive where it doesn't update that field for partitioned > tables (https://issues.apache.org/jira/browse/HIVE-22062). This would be the > desired way, but it could also be worked around by checking other fields like > lastDdlChanged. > - If a full table refresh is not needed, then we should fetch the > partition-level writeIds and reload only the ones that are out-of-date > locally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Resolved] (IMPALA-12729) Allow creating primary keys for Iceberg tables
[ https://issues.apache.org/jira/browse/IMPALA-12729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab resolved IMPALA-12729. --- Fix Version/s: Impala 4.4.0 Resolution: Fixed > Allow creating primary keys for Iceberg tables > -- > > Key: IMPALA-12729 > URL: https://issues.apache.org/jira/browse/IMPALA-12729 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > Fix For: Impala 4.4.0 > > > Some writer engines require primary keys on a table so that they can use them > for writing equality deletes (only the PK cols are written to the eq-delete > files). > Impala currently doesn't reject setting PKs for Iceberg tables; however, it > seems to omit them. This succeeds: > {code:java} > create table ice_pk (i int, j int, primary key(i)) stored as iceberg; > {code} > However, DESCRIBE EXTENDED doesn't show 'identifier-field-ids' in the > 'current-schema'. > On the other hand, for a table created by Flink these fields are there: > {code:java} > current-schema | > {\"type\":\"struct\",\"schema-id\":0,\"identifier-field-ids\":[1],\"fields\":[{\"id\":1,\"name\":\"i\",\"required\":true,\"type\":\"int\"},{\"id\":2,\"name\":\"s\",\"required\":false,\"type\":\"string\"}]} > {code} > Part 2: > SHOW CREATE TABLE should also correctly print the primary key part of the > field list. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Resolved] (IMPALA-11387) Add virtual column ICEBERG__SEQUENCE__NUMBER
[ https://issues.apache.org/jira/browse/IMPALA-11387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab resolved IMPALA-11387. --- Fix Version/s: Impala 4.3.0 Resolution: Fixed > Add virtual column ICEBERG__SEQUENCE__NUMBER > > > Key: IMPALA-11387 > URL: https://issues.apache.org/jira/browse/IMPALA-11387 > Project: IMPALA > Issue Type: New Feature > Components: Backend, Frontend >Reporter: Zoltán Borók-Nagy >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > Fix For: Impala 4.3.0 > > > A virtual column ICEBERG__SEQUENCE__NUMBER is needed to handle row-level > updates. > See details at: > https://iceberg.apache.org/spec/#scan-planning > This could be written in the template tuple, similarly to INPUT__FILE__NAME. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Resolved] (IMPALA-12694) Test equality delete support with data from NiFi
[ https://issues.apache.org/jira/browse/IMPALA-12694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab resolved IMPALA-12694. --- Fix Version/s: Not Applicable Resolution: Fixed > Test equality delete support with data from NiFi > > > Key: IMPALA-12694 > URL: https://issues.apache.org/jira/browse/IMPALA-12694 > Project: IMPALA > Issue Type: Improvement > Components: Backend, Frontend >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > Fix For: Not Applicable > > > Iceberg equality delete support in Impala is a subset of what the Iceberg > spec allows for equality deletes. Currently, we have a sufficient > implementation to read eq-deletes created by Flink. As a next step, let's > examine whether this implementation is sufficient for eq-deletes created by NiFi. > In theory, NiFi uses Flink's eq-delete implementation, so Impala should be > fine reading such data. However, at least some manual tests are needed for > verification, and if it turns out that there are some uncovered edge cases, > we should fill these holes in the implementation (probably in separate jiras). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Resolved] (IMPALA-12600) Support equality deletes when table has partition or schema evolution
[ https://issues.apache.org/jira/browse/IMPALA-12600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab resolved IMPALA-12600. --- Fix Version/s: Impala 4.4.0 Resolution: Fixed > Support equality deletes when table has partition or schema evolution > - > > Key: IMPALA-12600 > URL: https://issues.apache.org/jira/browse/IMPALA-12600 > Project: IMPALA > Issue Type: Sub-task >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Major > Fix For: Impala 4.4.0 > > > With adding the basic equality delete read support, we reject queries for > Iceberg tables that has equality delete files and has partition or schema > evolution. This ticket is to enhance this functionality. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12970) Test failure at test_read_equality_deletes in test_iceberg in exhaustive build
[ https://issues.apache.org/jira/browse/IMPALA-12970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834852#comment-17834852 ] Gabor Kaszab commented on IMPALA-12970: --- I kept running test_read_position_deletes locally and once in a while I ran into this ConcurrentModificationException. So far I saw 2 different stack traces for the error. Both time the test_read_position_deletes_orc failed: {code:java} select * from functional_parquet.iceberg_v2_partitioned_position_deletes_orc a, functional_parquet.iceberg_partitioned_orc_external b where a.action = b.action and b.id=3; at java.util.ArrayList.sort(ArrayList.java:1466) at java.util.Collections.sort(Collections.java:143) at org.apache.impala.planner.IcebergScanNode.(IcebergScanNode.java:105) at org.apache.impala.planner.IcebergScanNode.(IcebergScanNode.java:86) at org.apache.impala.planner.IcebergScanPlanner.createIcebergScanPlanImpl(IcebergScanPlanner.java:199) at org.apache.impala.planner.IcebergScanPlanner.createIcebergScanPlan(IcebergScanPlanner.java:157) at org.apache.impala.planner.SingleNodePlanner.createScanNode(SingleNodePlanner.java:1884) {code} {code:java} SELECT action, count(*) from iceberg_v2_partitioned_position_deletes_orc group by action; at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:911) at java.util.ArrayList$Itr.next(ArrayList.java:861) at org.apache.impala.planner.HdfsScanNode.computeScanRangeLocations(HdfsScanNode.java:1281) at org.apache.impala.planner.HdfsScanNode.init(HdfsScanNode.java:447) at org.apache.impala.planner.IcebergScanPlanner.createPositionJoinNode(IcebergScanPlanner.java:259) at org.apache.impala.planner.IcebergScanPlanner.createIcebergScanPlanImpl(IcebergScanPlanner.java:205) at org.apache.impala.planner.IcebergScanPlanner.createIcebergScanPlan(IcebergScanPlanner.java:157) at org.apache.impala.planner.SingleNodePlanner.createScanNode(SingleNodePlanner.java:1884) {code} > Test failure at test_read_equality_deletes in test_iceberg in 
exhaustive build > -- > > Key: IMPALA-12970 > URL: https://issues.apache.org/jira/browse/IMPALA-12970 > Project: IMPALA > Issue Type: Bug > Components: Backend >Reporter: Yida Wu >Assignee: Gabor Kaszab >Priority: Major > Labels: broken-build > > An error is observed in the data-cache exhaustive build in > test_read_equality_deletes with following message: > {code:java} > query_test.test_iceberg.TestIcebergV2Table.test_read_equality_deletes[protocol: > beeswax | table_format: parquet/none | exec_option: {'test_replan': 1, > 'disable_optimized_iceberg_v2_read': 1, 'batch_size': 0, 'num_nodes': 0, > 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, > 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0}] (from pytest) > {code} > *Error Message* > {code:java} > query_test/test_iceberg.py:1456: in test_read_equality_deletes > self.run_test_case('QueryTest/iceberg-v2-read-equality-deletes', vector) > common/impala_test_suite.py:725: in run_test_case result = exec_fn(query, > user=test_section.get('USER', '').strip() or None) > common/impala_test_suite.py:660: in __exec_in_impala result = > self.__execute_query(target_impalad_client, query, user=user) > common/impala_test_suite.py:1013: in __execute_query return > impalad_client.execute(query, user=user) common/impala_connection.py:215: in > execute fetch_profile_after_close=fetch_profile_after_close) > beeswax/impala_beeswax.py:191: in execute handle = > self.__execute_query(query_string.strip(), user=user) > beeswax/impala_beeswax.py:382: in __execute_query handle = > self.execute_query_async(query_string, user=user) > beeswax/impala_beeswax.py:376: in execute_query_async handle = > self.__do_rpc(lambda: self.imp_service.query(query,)) > beeswax/impala_beeswax.py:539: in __do_rpc raise > ImpalaBeeswaxException(self.__build_error_message(b), b) E > ImpalaBeeswaxException: ImpalaBeeswaxException: EINNER EXCEPTION: 'beeswaxd.ttypes.BeeswaxException'> EMESSAGE: > ConcurrentModificationException: null > 
{code} > *Stacktrace* > {code:java} > query_test/test_iceberg.py:1456: in test_read_equality_deletes > self.run_test_case('QueryTest/iceberg-v2-read-equality-deletes', vector) > common/impala_test_suite.py:725: in run_test_case > result = exec_fn(query, user=test_section.get('USER', '').strip() or None) > common/impala_test_suite.py:660: in __exec_in_impala > result = self.__execute_query(target_impalad_client, query, user=user) > common/impala_test_suite.py:1013: in __execute_query > return impalad_client.execute(query, user=user) > common/impala_connection.py:215: in execute > fetch_profile_after_close=fetch_profile_after_close) >
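Both stack traces in the comment above fail while sorting or iterating a list (Collections.sort in IcebergScanNode, list iteration in HdfsScanNode.computeScanRangeLocations), the classic symptom of a collection being mutated while another piece of code walks it. A minimal, self-contained sketch of the failure mode and the usual defensive-copy remedy — plain Java, not Impala's planner code:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.ConcurrentModificationException;
import java.util.List;

public class ConcurrentSortSketch {
    // Simulates a mutation landing mid-sort: the comparator adds to the list
    // being sorted, standing in for a concurrent metadata reload. ArrayList
    // checks its modCount after sorting and throws
    // ConcurrentModificationException if it changed.
    static boolean sortInPlaceFails() {
        List<Integer> shared = new ArrayList<>(List.of(3, 1, 2));
        try {
            shared.sort((a, b) -> {
                shared.add(99); // mutation during the sort
                return Integer.compare(a, b);
            });
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // One common remedy: snapshot the shared list first, then sort the copy.
    static List<Integer> sortSnapshot() {
        List<Integer> shared = new ArrayList<>(List.of(3, 1, 2));
        List<Integer> snapshot = new ArrayList<>(shared);
        snapshot.sort(Comparator.naturalOrder());
        return snapshot;
    }
}
```

In the planner, an analogous defensive copy of the file-descriptor list before sorting or iterating would avoid the exception, though the underlying race would still need its own fix.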
[jira] [Commented] (IMPALA-12970) Test failure at test_read_equality_deletes in test_iceberg in exhaustive build
[ https://issues.apache.org/jira/browse/IMPALA-12970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834246#comment-17834246 ] Gabor Kaszab commented on IMPALA-12970: --- Hey [~baggio000] , I don't think this is related to the equality delete tests, I occasionally get this error when running other Iceberg tests such as test_read_positional_deletes > Test failure at test_read_equality_deletes in test_iceberg in exhaustive build > -- > > Key: IMPALA-12970 > URL: https://issues.apache.org/jira/browse/IMPALA-12970 > Project: IMPALA > Issue Type: Bug > Components: Backend >Reporter: Yida Wu >Assignee: Gabor Kaszab >Priority: Major > Labels: broken-build > > An error is observed in the data-cache exhaustive build in > test_read_equality_deletes with following message: > {code:java} > query_test.test_iceberg.TestIcebergV2Table.test_read_equality_deletes[protocol: > beeswax | table_format: parquet/none | exec_option: {'test_replan': 1, > 'disable_optimized_iceberg_v2_read': 1, 'batch_size': 0, 'num_nodes': 0, > 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, > 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0}] (from pytest) > {code} > *Error Message* > {code:java} > query_test/test_iceberg.py:1456: in test_read_equality_deletes > self.run_test_case('QueryTest/iceberg-v2-read-equality-deletes', vector) > common/impala_test_suite.py:725: in run_test_case result = exec_fn(query, > user=test_section.get('USER', '').strip() or None) > common/impala_test_suite.py:660: in __exec_in_impala result = > self.__execute_query(target_impalad_client, query, user=user) > common/impala_test_suite.py:1013: in __execute_query return > impalad_client.execute(query, user=user) common/impala_connection.py:215: in > execute fetch_profile_after_close=fetch_profile_after_close) > beeswax/impala_beeswax.py:191: in execute handle = > self.__execute_query(query_string.strip(), user=user) > beeswax/impala_beeswax.py:382: in __execute_query handle = > 
self.execute_query_async(query_string, user=user) > beeswax/impala_beeswax.py:376: in execute_query_async handle = > self.__do_rpc(lambda: self.imp_service.query(query,)) > beeswax/impala_beeswax.py:539: in __do_rpc raise > ImpalaBeeswaxException(self.__build_error_message(b), b) E > ImpalaBeeswaxException: ImpalaBeeswaxException: EINNER EXCEPTION: 'beeswaxd.ttypes.BeeswaxException'> EMESSAGE: > ConcurrentModificationException: null > {code} > *Stacktrace* > {code:java} > query_test/test_iceberg.py:1456: in test_read_equality_deletes > self.run_test_case('QueryTest/iceberg-v2-read-equality-deletes', vector) > common/impala_test_suite.py:725: in run_test_case > result = exec_fn(query, user=test_section.get('USER', '').strip() or None) > common/impala_test_suite.py:660: in __exec_in_impala > result = self.__execute_query(target_impalad_client, query, user=user) > common/impala_test_suite.py:1013: in __execute_query > return impalad_client.execute(query, user=user) > common/impala_connection.py:215: in execute > fetch_profile_after_close=fetch_profile_after_close) > beeswax/impala_beeswax.py:191: in execute > handle = self.__execute_query(query_string.strip(), user=user) > beeswax/impala_beeswax.py:382: in __execute_query > handle = self.execute_query_async(query_string, user=user) > beeswax/impala_beeswax.py:376: in execute_query_async > handle = self.__do_rpc(lambda: self.imp_service.query(query,)) > beeswax/impala_beeswax.py:539: in __do_rpc > raise ImpalaBeeswaxException(self.__build_error_message(b), b) > E ImpalaBeeswaxException: ImpalaBeeswaxException: > EINNER EXCEPTION: > EMESSAGE: ConcurrentModificationException: null > {code} > *Standard Error* > {code:java} > SET > client_identifier=query_test/test_iceberg.py::TestIcebergV2Table::()::test_read_equality_deletes[protocol:beeswax|table_format:parquet/none|exec_option:{'test_replan':1;'disable_optimized_iceberg_v2_read':1;'batch_size':0;'num_nodes':0;'disable_codegen_rows_threshold':0;'d; > -- connecting to: 
localhost:21000 > -- 2024-04-03 07:04:53,469 INFO MainThread: Could not connect to ('::1', > 21000, 0, 0) > Traceback (most recent call last): > File > "/data/jenkins/workspace/impala-asf-master-exhaustive-data-cache/repos/Impala/infra/python/env-gcc10.4.0/lib/python2.7/site-packages/thrift/transport/TSocket.py", > line 137, in open > handle.connect(sockaddr) > File > "/data/jenkins/workspace/impala-asf-master-exhaustive-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py", > line 228, in meth > return getattr(self._sock,name)(*args) > error: [Errno 111] Connection refused > --
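For context on the error above: ConcurrentModificationException comes from Java's fail-fast collection iterators. A minimal illustration (plain Java, not Impala code) is structurally modifying an ArrayList while iterating it, which trips the iterator's modCount check:

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

public class CmeDemo {
    // Returns true if structurally modifying the list while iterating it
    // triggers a ConcurrentModificationException, which the fail-fast
    // iterator's modCount check is designed to do.
    public static boolean triggersCme() {
        List<Integer> xs = new ArrayList<>(List.of(1, 2, 3));
        try {
            for (Integer x : xs) {
                if (x == 1) {
                    xs.remove(x); // structural modification behind the iterator's back
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(triggersCme()); // prints "true"
    }
}
```

In the flaky test the trigger is presumably concurrent access to shared state across threads rather than a single-threaded loop like this, but the exception's mechanics are the same, which is consistent with the failure not being tied to the equality delete tests specifically.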
[jira] [Updated] (IMPALA-12894) Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files
[ https://issues.apache.org/jira/browse/IMPALA-12894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab updated IMPALA-12894: -- Attachment: count_star_correctness_repro.tar.gz > Optimized count(*) for Iceberg gives wrong results after a Spark > rewrite_data_files > --- > > Key: IMPALA-12894 > URL: https://issues.apache.org/jira/browse/IMPALA-12894 > Project: IMPALA > Issue Type: Bug > Components: Frontend >Affects Versions: Impala 4.3.0 >Reporter: Gabor Kaszab >Priority: Critical > Labels: correctness, impala-iceberg > Attachments: count_star_correctness_repro.tar.gz > > > Issue was introduced by https://issues.apache.org/jira/browse/IMPALA-11802 > that implemented an optimized way to get results for count(*). However, if > the table was compacted by Spark this optimization can give incorrect results. > The reason is that Spark can[ skip dropping delete > files|https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_position_delete_files] > that are pointing to compacted data files, as a result there might be delete > files after compaction that are no longer applied to any data files. > Repro: > With Impala > {code:java} > create table default.iceberg_testing (id int, j bigint) STORED AS ICEBERG > TBLPROPERTIES('iceberg.catalog'='hadoop.catalog', > 'iceberg.catalog_location'='/tmp/spark_iceberg_catalog/', > 'iceberg.table_identifier'='iceberg_testing', > 'format-version'='2'); > insert into iceberg_testing values > (1, 1), (2, 4), (3, 9), (4, 16), (5, 25); > update iceberg_testing set j = -100 where id = 4; > delete from iceberg_testing where id = 4;{code} > Count * returns 4 at this point. > Run compaction in Spark: > {code:java} > spark.sql(s"CALL local.system.rewrite_data_files(table => > 'default.iceberg_testing', options => map('min-input-files','2') )").show() > {code} > Now count * in Impala returns 8 (might require an IM if in HadoopCatalog). > Hive returns correct results. Also a SELECT * returns correct results. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
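To make the failure mode concrete, here is a hedged sketch (illustrative names only, not Impala's actual code) of why a metadata-only count(*) breaks once delete files dangle, and the conservative guard of falling back to a scan:

```java
public class CountStarSketch {
    // Illustrative only: the optimized count(*) derives the row count from
    // snapshot metadata instead of scanning. A simple form is
    //   total-data-records - total-position-delete-records
    // which assumes every delete record still applies to a live data file.
    public static long metadataCount(long dataRecords, long positionDeleteRecords) {
        return dataRecords - positionDeleteRecords;
    }

    // After a Spark rewrite_data_files, the compacted data files are gone
    // but their delete files may survive, so the subtraction no longer
    // matches the true row count; the safe move is to fall back to a scan.
    public static long safeCount(long dataRecords, long positionDeleteRecords,
                                 boolean mayHaveDanglingDeletes, long scanCount) {
        return mayHaveDanglingDeletes ? scanCount
                                      : metadataCount(dataRecords, positionDeleteRecords);
    }
}
```

The guard above is a sketch of one conservative direction (never trust the metadata shortcut while any delete files are present), not necessarily the fix Impala will adopt.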
[jira] [Created] (IMPALA-12894) Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files
Gabor Kaszab created IMPALA-12894: - Summary: Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files Key: IMPALA-12894 URL: https://issues.apache.org/jira/browse/IMPALA-12894 Project: IMPALA Issue Type: Bug Components: Frontend Affects Versions: Impala 4.3.0 Reporter: Gabor Kaszab Issue was introduced by https://issues.apache.org/jira/browse/IMPALA-11802 that implemented an optimized way to get results for count(*). However, if the table was compacted by Spark this optimization can give incorrect results. The reason is that Spark can[ skip dropping delete files|https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_position_delete_files] that are pointing to compacted data files, as a result there might be delete files after compaction that are no longer applied to any data files. Repro: With Impala {code:java} create table default.iceberg_testing (id int, j bigint) STORED AS ICEBERG TBLPROPERTIES('iceberg.catalog'='hadoop.catalog', 'iceberg.catalog_location'='/tmp/spark_iceberg_catalog/', 'iceberg.table_identifier'='iceberg_testing', 'format-version'='2'); insert into iceberg_testing values (1, 1), (2, 4), (3, 9), (4, 16), (5, 25); update iceberg_testing set j = -100 where id = 4; delete from iceberg_testing where id = 4;{code} Count * returns 4 at this point. Run compaction in Spark: {code:java} spark.sql(s"CALL local.system.rewrite_data_files(table => 'default.iceberg_testing', options => map('min-input-files','2') )").show() {code} Now count * in Impala returns 8 (might require an IM if in HadoopCatalog). Hive returns correct results. Also a SELECT * returns correct results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12866) Add table type to the SCAN node's explain output
Gabor Kaszab created IMPALA-12866: - Summary: Add table type to the SCAN node's explain output Key: IMPALA-12866 URL: https://issues.apache.org/jira/browse/IMPALA-12866 Project: IMPALA Issue Type: Improvement Components: Frontend Reporter: Gabor Kaszab It would be nice if the explain output of a SCAN node could show the type of the table it reads, Iceberg or Hive. This would help with debugging.
[jira] [Updated] (IMPALA-12861) File formats are confused when Iceberg tables have mixed formats
[ https://issues.apache.org/jira/browse/IMPALA-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab updated IMPALA-12861: -- Labels: impala-iceberg (was: ) > File formats are confused when Iceberg tables has mixed formats > --- > > Key: IMPALA-12861 > URL: https://issues.apache.org/jira/browse/IMPALA-12861 > Project: IMPALA > Issue Type: Bug > Components: Frontend >Affects Versions: Impala 4.3.0 >Reporter: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > Attachments: multi_file_table_crash > > > *Repro steps:* > create table mixed_ice (i int, year int) partitioned by spec (year) stored as > iceberg tblproperties('format-version'='2'); > > 1) populate one partition with Impala (parquet) > insert into mixed_ice values (1, 2024), (2, 2024); > > 2) change the write format: > alter table mixed_ice set tblproperties ('write.format.default'='orc'); > > 3) populate another partition with Hive (orc) > insert into mixed_ice values (1, 2025), (2, 2025), (3, 2025); > > 4) then query just the parquet partition: > explain select * from mixed_ice where year = 2024; > {code:java} > | F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1 > | > | Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB > thread-reservation=1 | > | PLAN-ROOT SINK > | > | | output exprs: default.mixed_ice.i, default.mixed_ice.year > | > | | mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB > thread-reservation=0 | > | | > | > | 01:EXCHANGE [UNPARTITIONED] > | > | mem-estimate=16.00KB mem-reservation=0B thread-reservation=0 > | > | tuple-ids=0 row-size=8B cardinality=2 > | > | in pipelines: 00(GETNEXT) > | > | > | > | F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1 > | > | Per-Host Resources: mem-estimate=64.05MB mem-reservation=32.00KB > thread-reservation=2 | > | DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED] > | > | | mem-estimate=48.00KB mem-reservation=0B thread-reservation=0 > | > | 00:SCAN HDFS [default.mixed_ice, RANDOM] > 
| > | HDFS partitions=1/1 files=1 size=602B > | > | Iceberg snapshot id: 4964066258730898133 > | > | skipped Iceberg predicates: `year` = CAST(2024 AS INT) > | > | stored statistics: > | > | table: rows=5 size=945B > | > | columns: unavailable > | > | extrapolated-rows=disabled max-scan-range-rows=5 > | > | file formats: [ORC, PARQUET] > | > | mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1 > | > | tuple-ids=0 row-size=8B cardinality=2 > | > | in pipelines: 00(GETNEXT) > | > +--+ > {code} > Note, the file formats: [ORC, PARQUET] part even though this query only > reads a parquet files. > > *Some analyis:* > When IcebergScanNode [is > created|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java#L129] > it holds the correct information about file formats (Parquet). > Later on the parent class, HdfsScanNode also tries to populate the file > formats [here|#L513].] > > It uses what > [getSampledOrRawPartitions()|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L431] > returns. In this use case the 'sampledPartitions_' is null, so will return > 'partitions_' > > Apparently, this 'partitions_' member holds the partition with the ORC file > so it adds ORC to the fileFormats_.
[jira] [Commented] (IMPALA-12861) File formats are confused when Iceberg tables have mixed formats
[ https://issues.apache.org/jira/browse/IMPALA-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822616#comment-17822616 ] Gabor Kaszab commented on IMPALA-12861: --- Additionally, there is an intermittent crash when running the select query in the description without explain. Attaching the resolved minidump. [^multi_file_table_crash] > File formats are confused when Iceberg tables has mixed formats > --- > > Key: IMPALA-12861 > URL: https://issues.apache.org/jira/browse/IMPALA-12861 > Project: IMPALA > Issue Type: Bug > Components: Frontend >Affects Versions: Impala 4.3.0 >Reporter: Gabor Kaszab >Priority: Major > Attachments: multi_file_table_crash > > > *Repro steps:* > create table mixed_ice (i int, year int) partitioned by spec (year) stored as > iceberg tblproperties('format-version'='2'); > > 1) populate one partition with Impala (parquet) > insert into mixed_ice values (1, 2024), (2, 2024); > > 2) change the write format: > alter table mixed_ice set tblproperties ('write.format.default'='orc'); > > 3) populate another partition with Hive (orc) > insert into mixed_ice values (1, 2025), (2, 2025), (3, 2025); > > 4) then query just the parquet partition: > explain select * from mixed_ice where year = 2024; > {code:java} > | F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1 > | > | Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB > thread-reservation=1 | > | PLAN-ROOT SINK > | > | | output exprs: default.mixed_ice.i, default.mixed_ice.year > | > | | mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB > thread-reservation=0 | > | | > | > | 01:EXCHANGE [UNPARTITIONED] > | > | mem-estimate=16.00KB mem-reservation=0B thread-reservation=0 > | > | tuple-ids=0 row-size=8B cardinality=2 > | > | in pipelines: 00(GETNEXT) > | > | > | > | F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1 > | > | Per-Host Resources: mem-estimate=64.05MB mem-reservation=32.00KB > thread-reservation=2 | > | DATASTREAM SINK [FRAGMENT=F01, 
EXCHANGE=01, UNPARTITIONED] > | > | | mem-estimate=48.00KB mem-reservation=0B thread-reservation=0 > | > | 00:SCAN HDFS [default.mixed_ice, RANDOM] > | > | HDFS partitions=1/1 files=1 size=602B > | > | Iceberg snapshot id: 4964066258730898133 > | > | skipped Iceberg predicates: `year` = CAST(2024 AS INT) > | > | stored statistics: > | > | table: rows=5 size=945B > | > | columns: unavailable > | > | extrapolated-rows=disabled max-scan-range-rows=5 > | > | file formats: [ORC, PARQUET] > | > | mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1 > | > | tuple-ids=0 row-size=8B cardinality=2 > | > | in pipelines: 00(GETNEXT) > | > +--+ > {code} > Note, the file formats: [ORC, PARQUET] part even though this query only > reads a parquet files. > > *Some analyis:* > When IcebergScanNode [is > created|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java#L129] > it holds the correct information about file formats (Parquet). > Later on the parent class, HdfsScanNode also tries to populate the file > formats [here|#L513].] > > It uses what > [getSampledOrRawPartitions()|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L431] > returns. In this use case the 'sampledPartitions_' is null, so will return >
[jira] [Updated] (IMPALA-12861) File formats are confused when Iceberg tables have mixed formats
[ https://issues.apache.org/jira/browse/IMPALA-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab updated IMPALA-12861: -- Attachment: multi_file_table_crash > File formats are confused when Iceberg tables has mixed formats > --- > > Key: IMPALA-12861 > URL: https://issues.apache.org/jira/browse/IMPALA-12861 > Project: IMPALA > Issue Type: Bug > Components: Frontend >Affects Versions: Impala 4.3.0 >Reporter: Gabor Kaszab >Priority: Major > Attachments: multi_file_table_crash > > > *Repro steps:* > create table mixed_ice (i int, year int) partitioned by spec (year) stored as > iceberg tblproperties('format-version'='2'); > > 1) populate one partition with Impala (parquet) > insert into mixed_ice values (1, 2024), (2, 2024); > > 2) change the write format: > alter table mixed_ice set tblproperties ('write.format.default'='orc'); > > 3) populate another partition with Hive (orc) > insert into mixed_ice values (1, 2025), (2, 2025), (3, 2025); > > 4) then query just the parquet partition: > explain select * from mixed_ice where year = 2024; > {code:java} > | F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1 > | > | Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB > thread-reservation=1 | > | PLAN-ROOT SINK > | > | | output exprs: default.mixed_ice.i, default.mixed_ice.year > | > | | mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB > thread-reservation=0 | > | | > | > | 01:EXCHANGE [UNPARTITIONED] > | > | mem-estimate=16.00KB mem-reservation=0B thread-reservation=0 > | > | tuple-ids=0 row-size=8B cardinality=2 > | > | in pipelines: 00(GETNEXT) > | > | > | > | F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1 > | > | Per-Host Resources: mem-estimate=64.05MB mem-reservation=32.00KB > thread-reservation=2 | > | DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED] > | > | | mem-estimate=48.00KB mem-reservation=0B thread-reservation=0 > | > | 00:SCAN HDFS [default.mixed_ice, RANDOM] > | > | HDFS 
partitions=1/1 files=1 size=602B > | > | Iceberg snapshot id: 4964066258730898133 > | > | skipped Iceberg predicates: `year` = CAST(2024 AS INT) > | > | stored statistics: > | > | table: rows=5 size=945B > | > | columns: unavailable > | > | extrapolated-rows=disabled max-scan-range-rows=5 > | > | file formats: [ORC, PARQUET] > | > | mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1 > | > | tuple-ids=0 row-size=8B cardinality=2 > | > | in pipelines: 00(GETNEXT) > | > +--+ > {code} > Note, the file formats: [ORC, PARQUET] part even though this query only > reads a parquet files. > > *Some analyis:* > When IcebergScanNode [is > created|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java#L129] > it holds the correct information about file formats (Parquet). > Later on the parent class, HdfsScanNode also tries to populate the file > formats [here|#L513].] > > It uses what > [getSampledOrRawPartitions()|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L431] > returns. In this use case the 'sampledPartitions_' is null, so will return > 'partitions_' > > Apparently, this 'partitions_' member holds the partition with the ORC file > so it adds ORC to the fileFormats_. Unfortunately, this >
[jira] [Updated] (IMPALA-12862) Expose Iceberg position delete records via metadata table
[ https://issues.apache.org/jira/browse/IMPALA-12862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab updated IMPALA-12862: -- Issue Type: Improvement (was: Bug) > Expose Iceberg position delete records via metadata table > - > > Key: IMPALA-12862 > URL: https://issues.apache.org/jira/browse/IMPALA-12862 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Reporter: Zoltán Borók-Nagy >Priority: Major > Labels: impala-iceberg > > To debug issues with position delete files, or to detect table corruption, we could expose the delete records via the metadata table syntax, e.g.: > {noformat} > SELECT INPUT__FILE__NAME, file_path, pos > FROM db.ice_t.position_delete_records;{noformat} > Adding the virtual column INPUT__FILE__NAME is useful because it can tell which delete file contains the records. > We should re-use IcebergPositionDeleteTable for this: > [https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/IcebergPositionDeleteTable.java] >
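As background for the proposed metadata table: an Iceberg position delete record identifies a deleted row by data file path and row position within that file. A minimal sketch (plain Java, illustrative names, not Impala or Iceberg code) of how such records are applied to a data file's rows:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PositionDeletes {
    // An Iceberg position delete record identifies one deleted row by the
    // data file's path and the row's ordinal position in that file.
    public record DeleteRecord(String filePath, long pos) {}

    // Keep the rows of `filePath` whose position is not covered by a
    // delete record targeting that file.
    public static List<String> liveRows(String filePath, List<String> rows,
                                        List<DeleteRecord> deletes) {
        Set<Long> deadPositions = new HashSet<>();
        for (DeleteRecord d : deletes) {
            if (d.filePath().equals(filePath)) {
                deadPositions.add(d.pos());
            }
        }
        List<String> out = new ArrayList<>();
        for (int pos = 0; pos < rows.size(); pos++) {
            if (!deadPositions.contains((long) pos)) {
                out.add(rows.get(pos));
            }
        }
        return out;
    }
}
```

Exposing the raw (file_path, pos) pairs plus INPUT__FILE__NAME would let a user cross-check exactly this mapping when hunting for delete records that point at nonexistent files or positions.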
[jira] [Updated] (IMPALA-12861) File formats are confused when Iceberg tables have mixed formats
[ https://issues.apache.org/jira/browse/IMPALA-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab updated IMPALA-12861: -- Description: *Repro steps:* create table mixed_ice (i int, year int) partitioned by spec (year) stored as iceberg tblproperties('format-version'='2'); 1) populate one partition with Impala (parquet) insert into mixed_ice values (1, 2024), (2, 2024); 2) change the write format: alter table mixed_ice set tblproperties ('write.format.default'='orc'); 3) populate another partition with Hive (orc) insert into mixed_ice values (1, 2025), (2, 2025), (3, 2025); 4) then query just the parquet partition: explain select * from mixed_ice where year = 2024; {code:java} | F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1 | | Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB thread-reservation=1 | | PLAN-ROOT SINK | | | output exprs: default.mixed_ice.i, default.mixed_ice.year | | | mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB thread-reservation=0 | | | | | 01:EXCHANGE [UNPARTITIONED] | | mem-estimate=16.00KB mem-reservation=0B thread-reservation=0 | | tuple-ids=0 row-size=8B cardinality=2 | | in pipelines: 00(GETNEXT) | | | | F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1 | | Per-Host Resources: mem-estimate=64.05MB mem-reservation=32.00KB thread-reservation=2 | | DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED] | | | mem-estimate=48.00KB mem-reservation=0B thread-reservation=0 | | 00:SCAN HDFS [default.mixed_ice, RANDOM] | | HDFS partitions=1/1 files=1 size=602B | | Iceberg snapshot id: 4964066258730898133 | | skipped Iceberg predicates: `year` = CAST(2024 AS INT) | | stored statistics: | | table: rows=5 size=945B | | columns: unavailable | | extrapolated-rows=disabled max-scan-range-rows=5 | | file formats: [ORC, PARQUET] | | mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1 | | tuple-ids=0 row-size=8B cardinality=2 | | in pipelines: 00(GETNEXT) | +--+ {code} 
Note the file formats: [ORC, PARQUET] part, even though this query only reads a parquet file. *Some analysis:* When IcebergScanNode [is created|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java#L129] it holds the correct information about file formats (Parquet). Later on the parent class, HdfsScanNode, also tries to populate the file formats [here|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L513]. It uses what [getSampledOrRawPartitions()|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L431] returns. In this use case 'sampledPartitions_' is null, so it will return 'partitions_'. Apparently, this 'partitions_' member holds the partition with the ORC file, so it adds ORC to the fileFormats_. Unfortunately, getSampledOrRawPartitions() is called in multiple locations within HdfsScanNode, returning the wrong partitions. *Next steps:* Check what other issues getSampledOrRawPartitions() can cause with multi-file-format tables. Also check if we can populate 'partitions_' properly. was: *Repro steps:* create table mixed_ice (i int, year int) partitioned by spec (year) stored as iceberg tblproperties('format-version'='2'); 1) populate one partition with Impala (parquet) insert into mixed_ice values (1, 2024), (2, 2024); 2) change the write format: alter table mixed_ice set tblproperties ('write.format.default'='orc'); 3) populate another partition with Hive (orc) insert into mixed_ice
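The fallback described in the analysis can be sketched as follows (member names mirror the JIRA text, but this is illustrative code, not Impala's actual implementation): when no sampling is in effect, the raw partition list is used, so a partition that the Iceberg planner already pruned can still contribute its file format.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ScanNodeSketch {
    // Each entry stands in for a partition, represented here only by its
    // file format, e.g. one PARQUET and one ORC partition.
    private final List<String> partitions;
    private List<String> sampledPartitions = null; // null when no sampling

    public ScanNodeSketch(List<String> partitionFormats) {
        this.partitions = partitionFormats;
    }

    // The fallback from the analysis: sampled partitions when present,
    // otherwise the raw partition list.
    List<String> getSampledOrRawPartitions() {
        return sampledPartitions != null ? sampledPartitions : partitions;
    }

    // Aggregates formats over whatever getSampledOrRawPartitions() returns,
    // which is why the explain output can show [ORC, PARQUET] even though
    // only the Parquet partition is actually read.
    public Set<String> fileFormats() {
        return new LinkedHashSet<>(getSampledOrRawPartitions());
    }
}
```

Under this reading, the fix would be to make whatever backs 'partitions_' reflect the pruned Iceberg scan rather than special-casing the format aggregation.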
[jira] [Created] (IMPALA-12861) File formats are confused when Iceberg tables have mixed formats
Gabor Kaszab created IMPALA-12861: - Summary: File formats are confused when Iceberg tables has mixed formats Key: IMPALA-12861 URL: https://issues.apache.org/jira/browse/IMPALA-12861 Project: IMPALA Issue Type: Bug Components: Frontend Affects Versions: Impala 4.3.0 Reporter: Gabor Kaszab *Repro steps:* create table mixed_ice (i int, year int) partitioned by spec (year) stored as iceberg tblproperties('format-version'='2'); 1) populate one partition with Impala (parquet) insert into mixed_ice values (1, 2024), (2, 2024); 2) change the write format: alter table mixed_ice set tblproperties ('write.format.default'='orc'); 3) populate another partition with Hive (orc) insert into mixed_ice values (1, 2025), (2, 2025), (3, 2025); 4) then query just the parquet partition: explain select * from mixed_ice where year = 2024; {code:java} | F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1 | | Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB thread-reservation=1 | | PLAN-ROOT SINK | | | output exprs: default.mixed_ice.i, default.mixed_ice.year | | | mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB thread-reservation=0 | | | | | 01:EXCHANGE [UNPARTITIONED] | | mem-estimate=16.00KB mem-reservation=0B thread-reservation=0 | | tuple-ids=0 row-size=8B cardinality=2 | | in pipelines: 00(GETNEXT) | | | | F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1 | | Per-Host Resources: mem-estimate=64.05MB mem-reservation=32.00KB thread-reservation=2 | | DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED] | | | mem-estimate=48.00KB mem-reservation=0B thread-reservation=0 | | 00:SCAN HDFS [default.mixed_ice, RANDOM] | | HDFS partitions=1/1 files=1 size=602B | | Iceberg snapshot id: 4964066258730898133 | | skipped Iceberg predicates: `year` = CAST(2024 AS INT) | | stored statistics: | | table: rows=5 size=945B | | columns: unavailable | | extrapolated-rows=disabled max-scan-range-rows=5 | | file formats: [ORC, PARQUET] | | mem-estimate=64.00MB 
mem-reservation=32.00KB thread-reservation=1 | | tuple-ids=0 row-size=8B cardinality=2 | | in pipelines: 00(GETNEXT) | +--+ {code} Note the file formats: [ORC, PARQUET] part, even though this query only reads a parquet file. *Some analysis:* When IcebergScanNode [is created|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java#L129] it holds the correct information about file formats (Parquet). Later on the parent class, HdfsScanNode, also tries to populate the file formats [here|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L513]. It uses what [getSampledOrRawPartitions()|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L431] returns. In this use case 'sampledPartitions_' is null, so it will return 'partitions_'. Apparently, this 'partitions_' member holds the partition with the ORC file, so it adds ORC to the fileFormats_.
[jira] [Resolved] (IMPALA-12598) Add support for multiple equality field ID list
[ https://issues.apache.org/jira/browse/IMPALA-12598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab resolved IMPALA-12598. --- Resolution: Fixed > Add support for multiple equality field ID list > --- > > Key: IMPALA-12598 > URL: https://issues.apache.org/jira/browse/IMPALA-12598 > Project: IMPALA > Issue Type: Sub-task > Components: Frontend >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > > Iceberg metadata holds an equality field ID list for the equality-delete files. It's possible to have a different equality field ID list for different equality-delete files, for instance one file deletes by columnA while another file deletes by columnB. > When you have such a table, you need multiple layers of ANTI JOINs, one join for each equality field ID list.
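The layered ANTI JOIN idea can be sketched in plain Java (illustrative names, not Impala's planner code): group the delete keys by their equality field ID list and run one anti-join pass per list, so a row survives only if no list's deletes match it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class EqualityDeletes {
    // A two-column row; a and b stand in for columnA and columnB.
    public record Row(int a, int b) {}

    // One anti-join layer: keep only rows whose key (the row projected
    // onto one equality field ID list) does not appear among the delete
    // keys written for that list.
    static List<Row> antiJoin(List<Row> rows, List<Object> deleteKeys,
                              Function<Row, Object> keyOfList) {
        List<Row> out = new ArrayList<>();
        for (Row r : rows) {
            if (!deleteKeys.contains(keyOfList.apply(r))) {
                out.add(r);
            }
        }
        return out;
    }

    // Stack one anti-join layer per distinct equality field ID list.
    public static List<Row> applyAll(List<Row> rows,
                                     Map<Function<Row, Object>, List<Object>> deletesByList) {
        List<Row> current = rows;
        for (Map.Entry<Function<Row, Object>, List<Object>> e : deletesByList.entrySet()) {
            current = antiJoin(current, e.getValue(), e.getKey());
        }
        return current;
    }
}
```

For example, one delete file keyed on columnA deleting value 2 and another keyed on columnB deleting value 30 need two passes; collapsing them into a single join key would conflate the two delete semantics.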
[jira] [Work started] (IMPALA-12729) Allow creating primary keys for Iceberg tables
[ https://issues.apache.org/jira/browse/IMPALA-12729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-12729 started by Gabor Kaszab. - > Allow creating primary keys for Iceberg tables > -- > > Key: IMPALA-12729 > URL: https://issues.apache.org/jira/browse/IMPALA-12729 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > > Some writer engines require primary keys on a table so that they can use them for writing equality deletes (only the PK cols are written to the eq-delete files). > Impala currently doesn't reject setting PKs for Iceberg tables, however it seems to omit them. This succeeds: > {code:java} > create table ice_pk (i int, j int, primary key(i)) stored as iceberg; > {code} > However, DESCRIBE EXTENDED doesn't show 'identifier-field-ids' in the 'current-schema'. > On the other hand, for a table created by Flink, these fields are there: > {code:java} > current-schema | > {\"type\":\"struct\",\"schema-id\":0,\"identifier-field-ids\":[1],\"fields\":[{\"id\":1,\"name\":\"i\",\"required\":true,\"type\":\"int\"},{\"id\":2,\"name\":\"s\",\"required\":false,\"type\":\"string\"}]} > {code} > Part2: > SHOW CREATE TABLE should also correctly print the primary key part of the field list.
[jira] [Comment Edited] (IMPALA-12836) Aggregation over a STRUCT throws IllegalStateException
[ https://issues.apache.org/jira/browse/IMPALA-12836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819638#comment-17819638 ] Gabor Kaszab edited comment on IMPALA-12836 at 2/22/24 12:42 PM: - There was this example query [on a conference|https://www.apachecon.com/acna2022/slides/02_Ho_Icebergs_Best_Secret.pdf] for Iceberg metadata tables to check the size of each partition: {code:java} SELECT `partition`, sum(file_size_in_bytes) AS partition_size FROM db.table.`files` GROUP BY `partition` {code} Note, in `files` metadata table the `partition` column is a struct that holds one member for each partition column. So I believe this works in Spark and would be a nice addition for us too for table analysis purposes. This query could be re-worked so that we can run it, but then for each table you'd have to write a separate query for getting these stats: {code:java} SELECT `partition`.col1, .. `partition`.colN, sum(file_size_in_bytes) AS partition_size FROM db.table.files GROUP BY `partition`.col1, .. , `partition`.colN; {code} was (Author: gaborkaszab): There was this example query [on a conference|https://www.apachecon.com/acna2022/slides/02_Ho_Icebergs_Best_Secret.pdf] for Iceberg metadata tables to check the size of each partition: {code:java} SELECT partition, sum(file_size_in_bytes) AS partition_size, FROM db.table.files GROUP BY partition {code} Note, in `files` metadata table the `partition` column is a struct that holds one member for each partition column. So I believe this works in Spark and would be a nice addition for us too for table analysis purposes. This query could be re-worked so that we can run it, but then for each table you'd have to write a separate query for getting these stats: {code:java} SELECT `partition`.col1, .. `partition`.colN, sum(file_size_in_bytes) AS partition_size, FROM db.table.files GROUP BY `partition`.col1, .. 
, `partition`.colN; {code} > Aggregation over a STRUCT throws IllegalStateException > -- > > Key: IMPALA-12836 > URL: https://issues.apache.org/jira/browse/IMPALA-12836 > Project: IMPALA > Issue Type: Bug > Components: Frontend >Affects Versions: Impala 4.4.0 >Reporter: Tamas Mate >Priority: Major > > A Preconditions check will fail when trying to aggregate over a struct. > Repro query: > {code} > Query: select int_struct_col, sum(id) from functional_parquet.allcomplextypes > group by int_struct_col > Query submitted at: 2024-02-22 13:08:20 (Coordinator: > http://tmate-desktop:25000) > ERROR: IllegalStateException: null > {code} > {code:java} > I0222 13:05:21.762225 10675 jni-util.cc:302] > 3c44b4fafbbcb6b5:eee03297] java.lang.IllegalStateException > at > com.google.common.base.Preconditions.checkState(Preconditions.java:486) > at > org.apache.impala.analysis.SlotRef.addStructChildrenAsSlotRefs(SlotRef.java:268) > at org.apache.impala.analysis.SlotRef.(SlotRef.java:93) > at > org.apache.impala.analysis.AggregateInfoBase.createTupleDesc(AggregateInfoBase.java:135) > at > org.apache.impala.analysis.AggregateInfoBase.createTupleDescs(AggregateInfoBase.java:101) > at > org.apache.impala.analysis.AggregateInfo.create(AggregateInfo.java:150) > at > org.apache.impala.analysis.AggregateInfo.create(AggregateInfo.java:171) > at > org.apache.impala.analysis.MultiAggregateInfo.analyze(MultiAggregateInfo.java:301) > at > org.apache.impala.analysis.SelectStmt$SelectAnalyzer.buildAggregateExprs(SelectStmt.java:1149) > at > org.apache.impala.analysis.SelectStmt$SelectAnalyzer.analyze(SelectStmt.java:355) > at > org.apache.impala.analysis.SelectStmt$SelectAnalyzer.access$100(SelectStmt.java:282) > at org.apache.impala.analysis.SelectStmt.analyze(SelectStmt.java:274) > at > org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:545) > at > org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:492) > at > 
org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2364) > at > org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:2110) > at > org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1883) > at > org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:169) > {code}
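Conceptually, grouping by a struct means using the whole composite value as the aggregation key, as in this minimal Java sketch (illustrative names, not Impala code) of the partition-size query from the comment, where a record plays the role of the `partition` struct from the `files` metadata table:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StructGroupBy {
    // Stands in for the `partition` struct: one member per partition column.
    public record PartitionKey(int year, String region) {}
    // Stands in for one row of the `files` metadata table.
    public record DataFile(PartitionKey partition, long fileSizeInBytes) {}

    // SELECT `partition`, sum(file_size_in_bytes) ... GROUP BY `partition`:
    // sum file sizes per distinct struct value, relying on the record's
    // value-based equals/hashCode as the group key.
    public static Map<PartitionKey, Long> partitionSizes(List<DataFile> files) {
        Map<PartitionKey, Long> sizes = new LinkedHashMap<>();
        for (DataFile f : files) {
            sizes.merge(f.partition(), f.fileSizeInBytes(), Long::sum);
        }
        return sizes;
    }
}
```

This is exactly the semantics the failing query asks of the planner: the struct column must be usable both as a grouping key and as an output expression, instead of tripping the SlotRef precondition.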
[jira] [Comment Edited] (IMPALA-12836) Aggregation over a STRUCT throws IllegalStateException
[ https://issues.apache.org/jira/browse/IMPALA-12836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819638#comment-17819638 ] Gabor Kaszab edited comment on IMPALA-12836 at 2/22/24 12:41 PM: - There was this example query [on a conference|https://www.apachecon.com/acna2022/slides/02_Ho_Icebergs_Best_Secret.pdf] for Iceberg metadata tables to check the size of each partition: {code:java} SELECT partition, sum(file_size_in_bytes) AS partition_size, FROM db.table.files GROUP BY partition {code} Note, in `files` metadata table the `partition` column is a struct that holds one member for each partition column. So I believe this works in Spark and would be a nice addition for us too for table analysis purposes. This query could be re-worked so that we can run it, but then for each table you'd have to write a separate query for getting these stats: {code:java} SELECT `partition`.col1, .. `partition`.colN, sum(file_size_in_bytes) AS partition_size, FROM db.table.files GROUP BY `partition`.col1, .. , `partition`.colN; {code} was (Author: gaborkaszab): There was this example query [on a conference|https://www.apachecon.com/acna2022/slides/02_Ho_Icebergs_Best_Secret.pdf] for Iceberg metadata tables to check the size of each partition: {code:java} SELECT partition, sum(file_size_in_bytes) AS partition_size, FROM db.table.files GROUP BY partition {code} Note, in `files` metadata table the `partition` column is a struct that holds one member for each partition column. So I believe this works in Spark and would be a nice addition for us too for table analysis purposes. This query could be re-worked so that we can run it, but then for each table so'd have to write a separate query for getting these stats: {code:java} SELECT `partition`.col1, .. `partition`.colN, sum(file_size_in_bytes) AS partition_size, FROM db.table.files GROUP BY `partition`.col1, .. 
, `partition`.colN; {code} > Aggregation over a STRUCT throws IllegalStateException > -- > > Key: IMPALA-12836 > URL: https://issues.apache.org/jira/browse/IMPALA-12836 > Project: IMPALA > Issue Type: Bug > Components: Frontend >Affects Versions: Impala 4.4.0 >Reporter: Tamas Mate >Priority: Major > > A Preconditions check will fail when trying to aggregate over a struct. > Repro query: > {code} > Query: select int_struct_col, sum(id) from functional_parquet.allcomplextypes > group by int_struct_col > Query submitted at: 2024-02-22 13:08:20 (Coordinator: > http://tmate-desktop:25000) > ERROR: IllegalStateException: null > {code} > {code:java} > I0222 13:05:21.762225 10675 jni-util.cc:302] > 3c44b4fafbbcb6b5:eee03297] java.lang.IllegalStateException > at > com.google.common.base.Preconditions.checkState(Preconditions.java:486) > at > org.apache.impala.analysis.SlotRef.addStructChildrenAsSlotRefs(SlotRef.java:268) > at org.apache.impala.analysis.SlotRef.(SlotRef.java:93) > at > org.apache.impala.analysis.AggregateInfoBase.createTupleDesc(AggregateInfoBase.java:135) > at > org.apache.impala.analysis.AggregateInfoBase.createTupleDescs(AggregateInfoBase.java:101) > at > org.apache.impala.analysis.AggregateInfo.create(AggregateInfo.java:150) > at > org.apache.impala.analysis.AggregateInfo.create(AggregateInfo.java:171) > at > org.apache.impala.analysis.MultiAggregateInfo.analyze(MultiAggregateInfo.java:301) > at > org.apache.impala.analysis.SelectStmt$SelectAnalyzer.buildAggregateExprs(SelectStmt.java:1149) > at > org.apache.impala.analysis.SelectStmt$SelectAnalyzer.analyze(SelectStmt.java:355) > at > org.apache.impala.analysis.SelectStmt$SelectAnalyzer.access$100(SelectStmt.java:282) > at org.apache.impala.analysis.SelectStmt.analyze(SelectStmt.java:274) > at > org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:545) > at > org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:492) > at > 
org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2364) > at > org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:2110) > at > org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1883) > at > org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:169) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12836) Aggregation over a STRUCT throws IllegalStateException
[ https://issues.apache.org/jira/browse/IMPALA-12836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819638#comment-17819638 ] Gabor Kaszab commented on IMPALA-12836: --- There was this example query [at a conference|https://www.apachecon.com/acna2022/slides/02_Ho_Icebergs_Best_Secret.pdf] for Iceberg metadata tables to check the size of each partition: {code:java} SELECT partition, sum(file_size_in_bytes) AS partition_size FROM db.table.files GROUP BY partition {code} Note: in the `files` metadata table the `partition` column is a struct that holds one member for each partition column. So I believe this works in Spark and would be a nice addition for us too for table analysis purposes. This query could be re-worked so that we can run it, but then for each table you'd have to write a separate query for getting these stats: {code:java} SELECT `partition`.col1, .. `partition`.colN, sum(file_size_in_bytes) AS partition_size FROM db.table.files GROUP BY `partition`.col1, .. , `partition`.colN; {code} > Aggregation over a STRUCT throws IllegalStateException > -- > > Key: IMPALA-12836 > URL: https://issues.apache.org/jira/browse/IMPALA-12836 > Project: IMPALA > Issue Type: Bug > Components: Frontend >Affects Versions: Impala 4.4.0 >Reporter: Tamas Mate >Priority: Major > > A Preconditions check will fail when trying to aggregate over a struct.
> Repro query: > {code} > Query: select int_struct_col, sum(id) from functional_parquet.allcomplextypes > group by int_struct_col > Query submitted at: 2024-02-22 13:08:20 (Coordinator: > http://tmate-desktop:25000) > ERROR: IllegalStateException: null > {code} > {code:java} > I0222 13:05:21.762225 10675 jni-util.cc:302] > 3c44b4fafbbcb6b5:eee03297] java.lang.IllegalStateException > at > com.google.common.base.Preconditions.checkState(Preconditions.java:486) > at > org.apache.impala.analysis.SlotRef.addStructChildrenAsSlotRefs(SlotRef.java:268) > at org.apache.impala.analysis.SlotRef.(SlotRef.java:93) > at > org.apache.impala.analysis.AggregateInfoBase.createTupleDesc(AggregateInfoBase.java:135) > at > org.apache.impala.analysis.AggregateInfoBase.createTupleDescs(AggregateInfoBase.java:101) > at > org.apache.impala.analysis.AggregateInfo.create(AggregateInfo.java:150) > at > org.apache.impala.analysis.AggregateInfo.create(AggregateInfo.java:171) > at > org.apache.impala.analysis.MultiAggregateInfo.analyze(MultiAggregateInfo.java:301) > at > org.apache.impala.analysis.SelectStmt$SelectAnalyzer.buildAggregateExprs(SelectStmt.java:1149) > at > org.apache.impala.analysis.SelectStmt$SelectAnalyzer.analyze(SelectStmt.java:355) > at > org.apache.impala.analysis.SelectStmt$SelectAnalyzer.access$100(SelectStmt.java:282) > at org.apache.impala.analysis.SelectStmt.analyze(SelectStmt.java:274) > at > org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:545) > at > org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:492) > at > org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2364) > at > org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:2110) > at > org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1883) > at > org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:169) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, 
e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12826) Add better cardinality estimation for Iceberg V2 tables with equality deletes
Gabor Kaszab created IMPALA-12826: - Summary: Add better cardinality estimation for Iceberg V2 tables with equality deletes Key: IMPALA-12826 URL: https://issues.apache.org/jira/browse/IMPALA-12826 Project: IMPALA Issue Type: Sub-task Components: Frontend Reporter: Gabor Kaszab There is a similar ticket for positional deletes: https://issues.apache.org/jira/browse/IMPALA-12371
[jira] [Work started] (IMPALA-12600) Support equality deletes when table has partition or schema evolution
[ https://issues.apache.org/jira/browse/IMPALA-12600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-12600 started by Gabor Kaszab. - > Support equality deletes when table has partition or schema evolution > - > > Key: IMPALA-12600 > URL: https://issues.apache.org/jira/browse/IMPALA-12600 > Project: IMPALA > Issue Type: Sub-task >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Major > > With the addition of basic equality delete read support, we reject queries for > Iceberg tables that have equality delete files and have partition or schema > evolution. This ticket is to lift this restriction.
[jira] [Created] (IMPALA-12729) Allow creating primary keys for Iceberg tables
Gabor Kaszab created IMPALA-12729: - Summary: Allow creating primary keys for Iceberg tables Key: IMPALA-12729 URL: https://issues.apache.org/jira/browse/IMPALA-12729 Project: IMPALA Issue Type: Improvement Components: Frontend Reporter: Gabor Kaszab Some writer engines require primary keys on a table so that they can use them for writing equality deletes (only the PK cols are written to the eq-delete files). Impala currently doesn't reject setting PKs for Iceberg tables; however, it seems to omit them. This succeeds: {code:java} create table ice_pk (i int, j int, primary key(i)) stored as iceberg; {code} However, DESCRIBE EXTENDED doesn't show 'identifier-field-ids' in the 'current-schema'. On the other hand, for a table created by Flink these fields are present: {code:java} current-schema | {\"type\":\"struct\",\"schema-id\":0,\"identifier-field-ids\":[1],\"fields\":[{\"id\":1,\"name\":\"i\",\"required\":true,\"type\":\"int\"},{\"id\":2,\"name\":\"s\",\"required\":false,\"type\":\"string\"}]} {code} Part 2: SHOW CREATE TABLE should also correctly print the primary key part of the field list.
[jira] [Assigned] (IMPALA-12598) Add support for multiple equality field ID list
[ https://issues.apache.org/jira/browse/IMPALA-12598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab reassigned IMPALA-12598: - Assignee: Gabor Kaszab > Add support for multiple equality field ID list > --- > > Key: IMPALA-12598 > URL: https://issues.apache.org/jira/browse/IMPALA-12598 > Project: IMPALA > Issue Type: Sub-task > Components: Frontend >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > > Iceberg metadata holds an equality field ID list for the equality-delete > files. It's possible to have a different equality field ID list for different > equality-delete files, for instance one file deletes by columnA while another > file deletes by columnB. > When you have such a table you should have multiple layers of ANTI JOINs, one > join for each equality field ID list. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Work started] (IMPALA-12598) Add support for multiple equality field ID list
[ https://issues.apache.org/jira/browse/IMPALA-12598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-12598 started by Gabor Kaszab. - > Add support for multiple equality field ID list > --- > > Key: IMPALA-12598 > URL: https://issues.apache.org/jira/browse/IMPALA-12598 > Project: IMPALA > Issue Type: Sub-task > Components: Frontend >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > > Iceberg metadata holds an equality field ID list for the equality-delete > files. It's possible to have a different equality field ID list for different > equality-delete files, for instance one file deletes by columnA while another > file deletes by columnB. > When you have such a table you should have multiple layers of ANTI JOINs, one > join for each equality field ID list. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Work started] (IMPALA-12694) Test equality delete support with data from NiFi
[ https://issues.apache.org/jira/browse/IMPALA-12694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-12694 started by Gabor Kaszab. - > Test equality delete support with data from NiFi > > > Key: IMPALA-12694 > URL: https://issues.apache.org/jira/browse/IMPALA-12694 > Project: IMPALA > Issue Type: Improvement > Components: Backend, Frontend >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > > Iceberg equality delete support in Impala is a subset of what the Iceberg > spec allows for equality deletes. Currently, we have sufficient > implementation to use eq-deletes created by Flink. As a next step, let's > examine whether this implementation is sufficient for eq-deletes created by NiFi. > In theory, NiFi uses Flink's eq-delete implementation, so Impala should be > fine reading such data. However, at least some manual tests are needed for > verification, and if it turns out that there are some uncovered edge cases, > we should fill these holes in the implementation (probably in separate jiras).
[jira] [Created] (IMPALA-12694) Test equality delete support with data from NiFi
Gabor Kaszab created IMPALA-12694: - Summary: Test equality delete support with data from NiFi Key: IMPALA-12694 URL: https://issues.apache.org/jira/browse/IMPALA-12694 Project: IMPALA Issue Type: Improvement Components: Backend, Frontend Reporter: Gabor Kaszab Assignee: Gabor Kaszab Iceberg equality delete support in Impala is a subset of what the Iceberg spec allows for equality deletes. Currently, we have sufficient implementation to use eq-deletes created by Flink. As a next step, let's examine whether this implementation is sufficient for eq-deletes created by NiFi. In theory, NiFi uses Flink's eq-delete implementation, so Impala should be fine reading such data. However, at least some manual tests are needed for verification, and if it turns out that there are some uncovered edge cases, we should fill these holes in the implementation (probably in separate jiras).
[jira] [Commented] (IMPALA-9821) Rewrite ds_hll_sketch() and ds_hll_union() and other datasketch generating functions to return Binary
[ https://issues.apache.org/jira/browse/IMPALA-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804253#comment-17804253 ] Gabor Kaszab commented on IMPALA-9821: -- Made the title of this ticket more generic to cover all the other sketch types too. This is a breaking change, so it needs a new major Impala version. > Rewrite ds_hll_sketch() and ds_hll_union() and other datasketch generating > functions to return Binary > - > > Key: IMPALA-9821 > URL: https://issues.apache.org/jira/browse/IMPALA-9821 > Project: IMPALA > Issue Type: New Feature > Components: Backend >Reporter: Gabor Kaszab >Priority: Major > > While the Binary implementation is ongoing, ds_hll_sketch() and ds_hll_union() > functions return serialized sketches in String format. Once Binary is > available in Impala, these can return the serialized sketches in Binary format. > Currently, when sketches are written by Hive as BINARY to an ORC table and this > table is loaded into Impala where the sketch columns are STRINGs, we get an > error > {code:java} > ERROR: Type mismatch: table column STRING is map to column binary in ORC file > {code} > Interestingly, this works with the Parquet format. > Once we have binary support, make sure to add coverage for an ORC table where the > table is created and populated by Hive and read for estimating by Impala.
[jira] [Updated] (IMPALA-9821) Rewrite ds_hll_sketch() and ds_hll_union() and other datasketch generating functions to return Binary
[ https://issues.apache.org/jira/browse/IMPALA-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab updated IMPALA-9821: - Summary: Rewrite ds_hll_sketch() and ds_hll_union() and other datasketch generating functions to return Binary (was: Rewrite ds_hll_sketch() and ds_hll_union() functions to return Binary) > Rewrite ds_hll_sketch() and ds_hll_union() and other datasketch generating > functions to return Binary > - > > Key: IMPALA-9821 > URL: https://issues.apache.org/jira/browse/IMPALA-9821 > Project: IMPALA > Issue Type: New Feature > Components: Backend >Reporter: Gabor Kaszab >Priority: Major > > While the Binary implementation is ongoing, ds_hll_sketch() and ds_hll_union() > functions return serialized sketches in String format. Once Binary is > available in Impala, these can return the serialized sketches in Binary format. > Currently, when sketches are written by Hive as BINARY to an ORC table and this > table is loaded into Impala where the sketch columns are STRINGs, we get an > error > {code:java} > ERROR: Type mismatch: table column STRING is map to column binary in ORC file > {code} > Interestingly, this works with the Parquet format. > Once we have binary support, make sure to add coverage for an ORC table where the > table is created and populated by Hive and read for estimating by Impala.
[jira] [Resolved] (IMPALA-12673) Iceberg table migration fails for '/' in partition values
[ https://issues.apache.org/jira/browse/IMPALA-12673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab resolved IMPALA-12673. --- Resolution: Fixed > Iceberg table migration fails for '/' in partition values > > > Key: IMPALA-12673 > URL: https://issues.apache.org/jira/browse/IMPALA-12673 > Project: IMPALA > Issue Type: Bug > Components: Frontend >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > Fix For: Impala 4.4.0 > > > Due to a bug in Iceberg, we don't allow migrating tables to Iceberg when the table > has a partition value containing a '/' character. Now that the fix for this > Iceberg bug has been picked up by Impala, we can allow migrating such tables.
[jira] [Created] (IMPALA-12673) Iceberg table migration fails for '/' in partition values
Gabor Kaszab created IMPALA-12673: - Summary: Iceberg table migration fails for '/' in partition values Key: IMPALA-12673 URL: https://issues.apache.org/jira/browse/IMPALA-12673 Project: IMPALA Issue Type: Bug Components: Frontend Reporter: Gabor Kaszab Fix For: Impala 4.4.0 Due to a bug in Iceberg, we don't allow migrating tables to Iceberg when the table has a partition value containing a '/' character. Now that the fix for this Iceberg bug has been picked up by Impala, we can allow migrating such tables.
[jira] [Assigned] (IMPALA-12673) Iceberg table migration fails for '/' in partition values
[ https://issues.apache.org/jira/browse/IMPALA-12673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab reassigned IMPALA-12673: - Assignee: Gabor Kaszab > Iceberg table migration fails for '/' in partition values > > > Key: IMPALA-12673 > URL: https://issues.apache.org/jira/browse/IMPALA-12673 > Project: IMPALA > Issue Type: Bug > Components: Frontend >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > Fix For: Impala 4.4.0 > > > Due to a bug in Iceberg, we don't allow migrating tables to Iceberg when the table > has a partition value containing a '/' character. Now that the fix for this > Iceberg bug has been picked up by Impala, we can allow migrating such tables.
[jira] [Resolved] (IMPALA-12597) Basic equality delete support
[ https://issues.apache.org/jira/browse/IMPALA-12597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab resolved IMPALA-12597. --- Fix Version/s: Impala 4.4.0 Resolution: Fixed > Basic equality delete support > - > > Key: IMPALA-12597 > URL: https://issues.apache.org/jira/browse/IMPALA-12597 > Project: IMPALA > Issue Type: Sub-task > Components: Backend, Frontend >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > Fix For: Impala 4.4.0 > > > To split up the Equality-delete read support task, let's deliver a patch for > some initial support first. The idea here is that apparently Flink (one of > the engines that can write equality delete files) can write only a subset of > the possible equality delete use cases that are allowed by the Iceberg spec. > So as a first step let's deliver the functionality that is required to read > the EQ-deletes written by Flink. The use case when Flink writes EQ-deletes > is for tables in upsert mode (a primary key is a must in this case): in order to > guarantee the uniqueness of the primary key fields, for each insert (that is > in fact an upsert) Flink writes one delete file to remove the previous row > with the given PK (even if there hasn't been any), and then writes data files > with the new data. > How we can narrow down the functionality to be implemented on the Impala side: > * The set of PK columns is not alterable, so we don't have to implement the case when > different EQ-delete files have different equality field ID lists. > * Flink's ALTER TABLE for Iceberg tables doesn't allow partition and schema > evolution. We can reject queries on eq-delete tables where there was > partition or schema evolution. > * As eq-deletes are written for NOT NULL PKs, we could omit the case where > there are NULLs in the eq-delete file. (Update: this seemed easy to solve, so it > will be part of this patch.) > * For partitioned tables Flink requires the partition columns to be part of > the PK.
As a result, each EQ-delete file will have the partition values too, so > there is no need to add extra logic to check if the partition spec ID and the > partition values match between the data and delete files.
[jira] [Created] (IMPALA-12649) Use max(data_sequence_number) for joining equality delete rows
Gabor Kaszab created IMPALA-12649: - Summary: Use max(data_sequence_number) for joining equality delete rows Key: IMPALA-12649 URL: https://issues.apache.org/jira/browse/IMPALA-12649 Project: IMPALA Issue Type: Sub-task Components: Frontend Reporter: Gabor Kaszab Improvement idea for the future: If Flink always writes EQ-delete files, and uses the same primary key a lot, we will have the same entry in the HashMap with multiple data sequence numbers. Then during probing, for each hash table lookup we need to loop over all the sequence numbers and check them. Actually, we only need the largest data sequence number; the lower sequence numbers with the same primary keys don't add any value. So we could add an Aggregation node to the right side of the join, like "PK1, PK2, ..., max(data_sequence_number), group by PK1, PK2, ...". Now, we would need to decide when to add this node to the plan, and when we shouldn't. We should also avoid having an EXCHANGE between the aggregation node and the JOIN node, as it would be redundant since they would use the same partition key expressions (the primary keys). If we had "hash teams" in Impala, we could always add this aggregator operator, as it would be in the same "hash team" as the JOIN operator, i.e. we wouldn't need to build the hash table twice. Microsoft's paper about hash joins and hash teams: [https://citeseerx.ist.psu.edu/document?repid=rep1=pdf=fc1c78cbef5062cf49fdb309b1935af08b759d2d]
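The aggregation proposed above — keeping only max(data_sequence_number) per primary key on the build side — can be sketched outside of any planner as a plain hash aggregation. This is a toy illustration with hypothetical names, not Impala code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: collapse equality-delete entries so only the highest
// data_sequence_number survives per primary-key value, which is what the
// proposed Aggregation node below the ANTI JOIN build side would produce.
public class MaxSeqPerKey {
    // Keep only max(data_sequence_number) for each primary-key value.
    public static Map<String, Long> collapse(String[] keys, long[] seqs) {
        Map<String, Long> maxSeq = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            maxSeq.merge(keys[i], seqs[i], Math::max);
        }
        return maxSeq;
    }

    public static void main(String[] args) {
        // The same PK deleted at sequence numbers 3, 7 and 5: only 7 matters.
        Map<String, Long> m = collapse(new String[]{"pk1", "pk1", "pk1", "pk2"},
                                       new long[]{3, 7, 5, 4});
        if (m.get("pk1") != 7L || m.get("pk2") != 4L) throw new AssertionError();
    }
}
```

Collapsing the build side this way leaves exactly one sequence number per PK, so the probe no longer has to loop over duplicate entries.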
[jira] [Created] (IMPALA-12620) Missing field ID in the eq-delete file could filter out rows with null values
Gabor Kaszab created IMPALA-12620: - Summary: Missing field ID in the eq-delete file could filter out rows with null values Key: IMPALA-12620 URL: https://issues.apache.org/jira/browse/IMPALA-12620 Project: IMPALA Issue Type: Sub-task Reporter: Gabor Kaszab If a malformed equality delete file doesn't have some of the equality field IDs, then the Parquet schema resolver will identify these as missing fields but won't fail the query; missing fields are instead filled with NULL values. But when some of the columns in the equality delete tuples are NULL, then when anti-joining them with the data rows they will match the NULL values from the data rows. As a result, a malformed equality delete file could cause rows to be omitted from the result where the data row contains NULL in the column whose field ID is missing from the equality delete file. E.g. the test data is (i, s): (1, "str1"), (NULL, "str2"), and the equality field ID is 1 (corresponding to column i). When an equality delete file doesn't have column i and doesn't have field ID 1, it will make the second row go missing from the result.
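The NULL-matching hazard described above can be shown with a toy key comparison: if the missing field is materialized as NULL and join keys are compared null-tolerantly, the malformed delete tuple matches data rows that legitimately contain NULL. This is a hypothetical sketch, not Impala's join code:

```java
import java.util.Arrays;

// Toy illustration of why a missing equality field is dangerous: materializing
// the missing field as NULL and comparing keys null-tolerantly makes the
// delete tuple match data rows that legitimately contain NULL.
public class NullKeyMatch {
    // Null-tolerant row-key equality, as a naive hash join build/probe would do it.
    public static boolean keysMatch(Object[] dataKey, Object[] deleteKey) {
        return Arrays.deepEquals(dataKey, deleteKey);
    }

    public static void main(String[] args) {
        Object[] dataRow = {null};          // data row with i = NULL
        Object[] malformedDelete = {null};  // missing field ID filled with NULL
        // The anti join would wrongly drop this data row:
        if (!keysMatch(dataRow, malformedDelete)) throw new AssertionError();
        // A non-NULL key does not match the NULL-filled delete tuple:
        if (keysMatch(new Object[]{1}, malformedDelete)) throw new AssertionError();
    }
}
```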
[jira] [Commented] (IMPALA-12266) Sporadic failure after migrating a table to Iceberg
[ https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795262#comment-17795262 ] Gabor Kaszab commented on IMPALA-12266: --- Hey [~stigahuang], I'm not actively working on this due to lack of bandwidth just monitoring the situation. I did manage to repro the issue locally, see my comment from August (already that long ago?? :) ) but wasn't able to progress from that. For me this seems a timing issue, where a query right after a CONVERT TABLE might not see the converted table, but if you re-run it, it will succeed. I was wondering if a SYNC_DDL would help, but didn't have the time to try it out. I'd be really grateful if you could take a look! > Sporadic failure after migrating a table to Iceberg > --- > > Key: IMPALA-12266 > URL: https://issues.apache.org/jira/browse/IMPALA-12266 > Project: IMPALA > Issue Type: Bug > Components: fe >Affects Versions: Impala 4.2.0 >Reporter: Tamas Mate >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > Attachments: > catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, > impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1 > > > TestIcebergTable.test_convert_table test failed in a recent verify job's > dockerised tests: > https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629 > {code:none} > E ImpalaBeeswaxException: ImpalaBeeswaxException: > EINNER EXCEPTION: > EMESSAGE: AnalysisException: Failed to load metadata for table: > 'parquet_nopartitioned' > E CAUSED BY: TableLoadingException: Could not load table > test_convert_table_cdba7383.parquet_nopartitioned from catalog > E CAUSED BY: TException: > TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, > error_msgs:[NullPointerException: null]), lookup_status:OK) > {code} > {code:none} > E0704 19:09:22.980131 833 JniUtil.java:183] > 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of > 
TABLE:test_convert_table_cdba7383.parquet_nopartitioned. Time spent: 49ms > I0704 19:09:22.980309 833 jni-util.cc:288] > 7145c21173f2c47b:2579db55] java.lang.NullPointerException > at > org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357) > at > org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300) > at > org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587) > at > org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513) > at > org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480) > at > org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397) > at > org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90) > at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89) > at > org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109) > at > org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238) > at > org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396) > I0704 19:09:22.980324 833 status.cc:129] 7145c21173f2c47b:2579db55] > NullPointerException: null > @ 0x1012f9f impala::Status::Status() > @ 0x187f964 impala::JniUtil::GetJniExceptionMsg() > @ 0xfee920 impala::JniCall::Call<>() > @ 0xfccd0f impala::Catalog::GetPartialCatalogObject() > @ 0xfb55a5 > impala::CatalogServiceThriftIf::GetPartialCatalogObject() > @ 0xf7a691 > impala::CatalogServiceProcessorT<>::process_GetPartialCatalogObject() > @ 0xf82151 impala::CatalogServiceProcessorT<>::dispatchCall() > @ 0xee330f apache::thrift::TDispatchProcessor::process() > @ 0x1329246 > 
apache::thrift::server::TAcceptQueueServer::Task::run() > @ 0x1315a89 impala::ThriftThread::RunRunnable() > @ 0x131773d > boost::detail::function::void_function_obj_invoker0<>::invoke() > @ 0x195ba8c impala::Thread::SuperviseThread() > @ 0x195c895 boost::detail::thread_data<>::run() > @ 0x23a03a7 thread_proxy > @ 0x7faaad2a66ba start_thread > @ 0x7f2c151d clone > E0704 19:09:23.006968 833 catalog-server.cc:278] > 7145c21173f2c47b:2579db55] NullPointerException: null > {code}
[jira] [Created] (IMPALA-12608) Push down conjuncts to the equality delete scanner
Gabor Kaszab created IMPALA-12608: - Summary: Push down conjuncts to the equality delete scanner Key: IMPALA-12608 URL: https://issues.apache.org/jira/browse/IMPALA-12608 Project: IMPALA Issue Type: Sub-task Components: Frontend Reporter: Gabor Kaszab In the initial implementation, when we create the scan node for the Iceberg equality delete files we don't push down any conjuncts to it. However, for better performance we could filter the conjuncts that are relevant for the equality delete scanner and push them down.
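One possible shape of that filtering step: keep only the conjuncts whose referenced columns are all present in the equality-delete tuple. Column-name sets stand in for Impala's slot/tuple descriptors here; all names are hypothetical, not the actual planner API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a conjunct is pushable to the equality-delete scanner
// only if every column it references is part of the delete file's tuple,
// i.e. one of the equality field columns.
public class ConjunctPushdown {
    public static List<String> pushable(Map<String, Set<String>> conjunctCols,
                                        Set<String> eqDeleteCols) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : conjunctCols.entrySet()) {
            if (eqDeleteCols.containsAll(e.getValue())) result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> conjuncts = Map.of(
            "i < 10", Set.of("i"),    // references only an equality field column
            "s = 'x'", Set.of("s"));  // references a non-key column: not pushable
        List<String> p = pushable(conjuncts, Set.of("i"));
        if (!p.equals(List.of("i < 10"))) throw new AssertionError();
    }
}
```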
[jira] [Commented] (IMPALA-11388) Add support for equality-based deletes
[ https://issues.apache.org/jira/browse/IMPALA-11388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793344#comment-17793344 ] Gabor Kaszab commented on IMPALA-11388: --- FYI, I decided to make this an EPIC and split the work up into multiple items so that we can deliver functionality gradually. > Add support for equality-based deletes > -- > > Key: IMPALA-11388 > URL: https://issues.apache.org/jira/browse/IMPALA-11388 > Project: IMPALA > Issue Type: Epic > Components: Frontend >Reporter: Zoltán Borók-Nagy >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > > Iceberg V2 adds support for row-level modifications. > One way to implement this is via equality based delete files: > https://iceberg.apache.org/spec/#equality-delete-files > https://iceberg.apache.org/spec/#scan-planning > We could implement this by doing an ANTI HASH JOIN between data and delete > files, similarly to what we do for Hive full ACID tables: > https://github.com/apache/impala/blob/f5fc08573352d0a1943296209791a4db17268086/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L1729-L1735 > The complexity comes when different delete files use different sets of > columns. In that case we will need multiple ANTI HASH JOINs on top of each > other.
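The layering idea from the description can be sketched with plain collections: apply one anti-join-style filter per equality field ID list, each removing data rows whose projection onto that layer's key columns appears in the layer's delete set. A hypothetical sketch, not the Impala implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of stacked anti joins: one filter layer per equality
// field ID list. Each layer removes data rows whose key-column projection
// appears in that layer's set of deleted keys.
public class StackedAntiJoin {
    // A row is a column-name -> value map; a layer is (key columns, deleted keys).
    public static List<Map<String, Object>> apply(
            List<Map<String, Object>> rows,
            List<String> keyCols, Set<List<Object>> deletedKeys) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            List<Object> key = new ArrayList<>();
            for (String c : keyCols) key.add(row.get(c));
            if (!deletedKeys.contains(key)) out.add(row);  // anti-join semantics
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = new ArrayList<>(List.of(
            Map.<String, Object>of("a", 1, "b", 10),
            Map.<String, Object>of("a", 2, "b", 20)));
        // Layer 1 deletes by column a, layer 2 by column b: two stacked filters.
        rows = apply(rows, List.of("a"), Set.of(List.of((Object) 1)));
        rows = apply(rows, List.of("b"), Set.of(List.of((Object) 30)));
        if (rows.size() != 1 || !rows.get(0).get("a").equals(2)) throw new AssertionError();
    }
}
```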
[jira] [Work started] (IMPALA-12597) Basic equality delete support
[ https://issues.apache.org/jira/browse/IMPALA-12597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-12597 started by Gabor Kaszab. - > Basic equality delete support > - > > Key: IMPALA-12597 > URL: https://issues.apache.org/jira/browse/IMPALA-12597 > Project: IMPALA > Issue Type: Sub-task > Components: Backend, Frontend >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > > To split up the Equality-delete read support task, let's deliver a patch for > some initial support first. The idea here is that apparently Flink (one of > the engines that can write equality delete files) can write only a subset of > the possible equality delete use cases that are allowed by the Iceberg spec. > So as a first step let's deliver the functionality that is required to read > the EQ-deletes written by Flink. The use case: when Flink writes EQ-deletes > is for tables in upsert mode (primary key is a must in this case) in order to > guarantee the uniqueness of the primary key fields, for each insert (that is > in fact an upsert) Flink writes one delete file to remove the previous row > with the given PK (even if there hasn't been any) and then writes data files > with the new data. > How we can narrow down the functionality to be implemented on Impala side: > * The set of PK columns is not alterable, so we don't have to implement when > different EQ-delete files have different equality field ID lists. > * Flink's ALTER TABLE for Iceberg tables doesn't allow partition and schema > evolution. We can reject queries on eq-delete tables where there was > partition or schema evolution. > * As eq-deletes are written to NOT NULL PK's we could omit the case where > there are NULLs in the eq-delete file. (Update, this seemed easy to solve, so > will be part of this patch) > * For partitioned tables Flink requires the partition columns to be part of > the PK. 
As a result each EQ-delete file will have the partition values too so > no need to add extra logic to check if the partition spec ID and the > partition values match between the data and delete files.
[jira] [Created] (IMPALA-12600) Support equality deletes when table has partition or schema evolution
Gabor Kaszab created IMPALA-12600: - Summary: Support equality deletes when table has partition or schema evolution Key: IMPALA-12600 URL: https://issues.apache.org/jira/browse/IMPALA-12600 Project: IMPALA Issue Type: Sub-task Reporter: Gabor Kaszab With the basic equality delete read support, we reject queries on Iceberg tables that have equality delete files and have undergone partition or schema evolution. This ticket is to extend that functionality.
[jira] [Created] (IMPALA-12599) Support equality delete files that don't contain the partition values
Gabor Kaszab created IMPALA-12599: - Summary: Support equality delete files that don't contain the partition values Key: IMPALA-12599 URL: https://issues.apache.org/jira/browse/IMPALA-12599 Project: IMPALA Issue Type: Sub-task Components: Frontend Reporter: Gabor Kaszab When you write equality delete files with Flink the partition columns also have to be part of the primary key. As a result the partition values are written into the equality delete files. However, the Iceberg spec is more flexible than that: it's also a valid case when the partition values aren't written into the eq-delete files. To be able to read such tables Impala should also check whether the partition spec and the partition values match between the data and delete files when applying the deleted rows. This could be achieved by adding virtual columns and conjuncts for the partition spec IDs and also for the partition values. These virtual columns already exist, but they have to be added to the scan nodes, and the conjuncts have to be created for the ANTI JOIN node.
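The virtual-column conjuncts proposed in IMPALA-12599 would effectively evaluate a check like the one below. This is a hypothetical sketch with illustrative names, not Impala's implementation: it only shows the matching condition between a data file's and a delete file's partition metadata.

```java
import java.util.List;

// Hypothetical sketch: when an equality delete row carries no partition
// values, it must only be applied to data rows whose partition spec ID and
// partition value tuple match. Names are illustrative, not Impala's.
public class PartitionMatch {
    public static boolean deleteApplies(
            int dataSpecId, List<String> dataPartValues,
            int deleteSpecId, List<String> deletePartValues) {
        // Matching spec ID and identical partition values: the delete row may
        // affect this data file; otherwise the join conjuncts filter it out.
        return dataSpecId == deleteSpecId && dataPartValues.equals(deletePartValues);
    }
}
```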
[jira] [Created] (IMPALA-12598) Add support for multiple equality field ID list
Gabor Kaszab created IMPALA-12598: - Summary: Add support for multiple equality field ID list Key: IMPALA-12598 URL: https://issues.apache.org/jira/browse/IMPALA-12598 Project: IMPALA Issue Type: Sub-task Components: Frontend Reporter: Gabor Kaszab Iceberg metadata holds an equality field ID list for each equality-delete file. Different equality-delete files can have different equality field ID lists, for instance one file deletes by columnA while another file deletes by columnB. When you have such a table you need multiple layers of ANTI JOINs, one join for each equality field ID list.
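The "one ANTI JOIN per equality field ID list" idea can be sketched as a grouping step. This is a hypothetical illustration (class name and map shapes are made up, not Impala planner types): delete files keyed by path are bucketed by their equality field ID list, and the number of buckets is the number of join layers the plan would need.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not Impala's actual planner code): delete files are
// grouped by their equality field ID list; each distinct list corresponds to
// one ANTI JOIN layer stacked in the query plan.
public class EqFieldIdGrouping {
    public static Map<List<Integer>, List<String>> groupByFieldIds(
            Map<String, List<Integer>> deleteFileToFieldIds) {
        Map<List<Integer>, List<String>> groups = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : deleteFileToFieldIds.entrySet()) {
            // Files sharing the same equality field ID list can be handled by
            // a single ANTI JOIN; the map key identifies that join layer.
            groups.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }
}
```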
[jira] [Created] (IMPALA-12597) Basic equality delete support
Gabor Kaszab created IMPALA-12597: - Summary: Basic equality delete support Key: IMPALA-12597 URL: https://issues.apache.org/jira/browse/IMPALA-12597 Project: IMPALA Issue Type: Sub-task Components: Backend, Frontend Reporter: Gabor Kaszab To split up the equality-delete read support task, let's deliver a patch for some initial support first. The idea here is that Flink (one of the engines that can write equality delete files) apparently writes only a subset of the equality delete use cases allowed by the Iceberg spec. So as a first step let's deliver the functionality required to read the EQ-deletes written by Flink. The use case where Flink writes EQ-deletes is tables in upsert mode (a primary key is a must in this case): to guarantee the uniqueness of the primary key fields, for each insert (that is in fact an upsert) Flink writes one delete file to remove the previous row with the given PK (even if there hasn't been any) and then writes data files with the new data. How we can narrow down the functionality to be implemented on the Impala side: * The set of PK columns is not alterable, so we don't have to handle the case when different EQ-delete files have different equality field ID lists. * Flink's ALTER TABLE for Iceberg tables doesn't allow partition and schema evolution. We can reject queries on eq-delete tables where there was partition or schema evolution. * As eq-deletes are written for NOT NULL PKs we could omit the case where there are NULLs in the eq-delete file. (Update: this seemed easy to solve, so it will be part of this patch.) * For partitioned tables Flink requires the partition columns to be part of the PK. As a result each EQ-delete file will have the partition values too, so there's no need to add extra logic to check if the partition spec ID and the partition values match between the data and delete files.
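The ANTI JOIN semantics underlying the basic equality delete support can be sketched as follows. This is a hypothetical, heavily simplified illustration (string-tuple keys, made-up class name): data rows whose primary-key tuple appears among the delete rows are dropped, while the real operator builds a hash table from the delete rows and probes it per row batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the LEFT ANTI JOIN between data and equality delete
// rows: a data row survives only if no delete row matches its PK tuple.
public class EqDeleteAntiJoin {
    public static List<List<String>> apply(
            List<List<String>> dataKeys, Set<List<String>> deletedKeys) {
        List<List<String>> survivors = new ArrayList<>();
        for (List<String> key : dataKeys) {
            // Keep the row only when its key is absent from the delete side.
            if (!deletedKeys.contains(key)) survivors.add(key);
        }
        return survivors;
    }
}
```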
[jira] [Assigned] (IMPALA-12597) Basic equality delete support
[ https://issues.apache.org/jira/browse/IMPALA-12597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab reassigned IMPALA-12597: - Assignee: Gabor Kaszab > Basic equality delete support > - > > Key: IMPALA-12597 > URL: https://issues.apache.org/jira/browse/IMPALA-12597 > Project: IMPALA > Issue Type: Sub-task > Components: Backend, Frontend >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > > To split up the Equality-delete read support task, let's deliver a patch for > some initial support first. The idea here is that apparently Flink (one of > the engines that can write equality delete files) can write only a subset of > the possible equality delete use cases that are allowed by the Iceberg spec. > So as a first step let's deliver the functionality that is required to read > the EQ-deletes written by Flink. The use case: when Flink writes EQ-deletes > is for tables in upsert mode (primary key is a must in this case) in order to > guarantee the uniqueness of the primary key fields, for each insert (that is > in fact an upsert) Flink writes one delete file to remove the previous row > with the given PK (even if there hasn't been any) and then writes data files > with the new data. > How we can narrow down the functionality to be implemented on Impala side: > * The set of PK columns is not alterable, so we don't have to implement when > different EQ-delete files have different equality field ID lists. > * Flink's ALTER TABLE for Iceberg tables doesn't allow partition and schema > evolution. We can reject queries on eq-delete tables where there was > partition or schema evolution. > * As eq-deletes are written to NOT NULL PK's we could omit the case where > there are NULLs in the eq-delete file. (Update, this seemed easy to solve, so > will be part of this patch) > * For partitioned tables Flink requires the partition columns to be part of > the PK. 
As a result each EQ-delete file will have the partition values too so > no need to add extra logic to check if the partition spec ID and the > partition values match between the data and delete files.
[jira] [Updated] (IMPALA-11388) Add support for equality-based deletes
[ https://issues.apache.org/jira/browse/IMPALA-11388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab updated IMPALA-11388: -- Issue Type: Epic (was: New Feature) > Add support for equality-based deletes > -- > > Key: IMPALA-11388 > URL: https://issues.apache.org/jira/browse/IMPALA-11388 > Project: IMPALA > Issue Type: Epic > Components: Frontend >Reporter: Zoltán Borók-Nagy >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > > Iceberg V2 adds support for row-level modifications. > One way to implement this is via equality based delete files: > https://iceberg.apache.org/spec/#equality-delete-files > https://iceberg.apache.org/spec/#scan-planning > We could implement this via doing ANTI HASH JOIN between data and delete > files. Similarly to what we do for Hive full ACID tables: > https://github.com/apache/impala/blob/f5fc08573352d0a1943296209791a4db17268086/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L1729-L1735 > The complexity comes when different delete files use different set of > columns. In that case we will need multiple ANTI HASH JOINs on top of each > other.
[jira] [Updated] (IMPALA-11388) Add support for equality-based deletes
[ https://issues.apache.org/jira/browse/IMPALA-11388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab updated IMPALA-11388: -- Epic Link: (was: IMPALA-11386) > Add support for equality-based deletes > -- > > Key: IMPALA-11388 > URL: https://issues.apache.org/jira/browse/IMPALA-11388 > Project: IMPALA > Issue Type: Epic > Components: Frontend >Reporter: Zoltán Borók-Nagy >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > > Iceberg V2 adds support for row-level modifications. > One way to implement this is via equality based delete files: > https://iceberg.apache.org/spec/#equality-delete-files > https://iceberg.apache.org/spec/#scan-planning > We could implement this via doing ANTI HASH JOIN between data and delete > files. Similarly to what we do for Hive full ACID tables: > https://github.com/apache/impala/blob/f5fc08573352d0a1943296209791a4db17268086/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L1729-L1735 > The complexity comes when different delete files use different set of > columns. In that case we will need multiple ANTI HASH JOINs on top of each > other.
[jira] [Resolved] (IMPALA-12308) Implement DIRECTED distribution mode for Iceberg tables
[ https://issues.apache.org/jira/browse/IMPALA-12308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab resolved IMPALA-12308. --- Fix Version/s: Impala 4.4.0 Resolution: Fixed > Implement DIRECTED distribution mode for Iceberg tables > --- > > Key: IMPALA-12308 > URL: https://issues.apache.org/jira/browse/IMPALA-12308 > Project: IMPALA > Issue Type: Improvement > Components: Backend, Frontend >Reporter: Zoltán Borók-Nagy >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg, performance > Fix For: Impala 4.4.0 > > > Currently there are two distribution modes for JOIN operators: > * BROADCAST: the RHS is delivered to all executors of the LHS > * PARTITIONED: both LHS and RHS are shuffled across executors > We implement reading of an Iceberg V2 table (with position delete files) via > an ANTI JOIN operator. The LHS is the SCAN operator of the data records, the RHS is > the SCAN operator of the delete records. The delete records contain > (file_path, pos) information about the deleted rows. > This means we can introduce another distribution mode, just for Iceberg V2 > tables with position deletes: DIRECTED distribution mode. > At scheduling we must save information about the data SCAN operators, i.e. on > which nodes they are going to be executed. The LHS doesn't need to be shuffled > over the network. > The RHS can use the scheduling information to transfer > delete records to the hosts that process the corresponding data files. > This minimizes network communication.
> We can also add further optimizations to the Iceberg V2 operator > (IcebergDeleteNode): > * Compare the pointers of the file paths instead of doing a string compare > * Each tuple in a row batch belongs to the same file, and positions are in > ascending order > ** Only one lookup is needed from the hash table > ** We can add fast paths to skip testing the whole row batch (when the row > batch's position range is outside of the delete position range)
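The row-batch fast path mentioned in the last bullet reduces to a range-overlap test. A minimal sketch, with an illustrative class name that is not Impala's: since positions within a row batch are ascending, the whole batch can be skipped when its position range doesn't overlap the delete file's position range.

```java
// Hypothetical sketch of the fast-path check: probe the hash table of delete
// positions only when the batch's [min, max] position range intersects the
// delete file's [min, max] position range.
public class PositionRangeFastPath {
    public static boolean batchNeedsProbing(
            long batchMinPos, long batchMaxPos, long deleteMinPos, long deleteMaxPos) {
        // Standard interval-overlap test on inclusive ranges.
        return batchMaxPos >= deleteMinPos && batchMinPos <= deleteMaxPos;
    }
}
```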
[jira] [Commented] (IMPALA-12543) test_iceberg_self_events failed in JDK11 build
[ https://issues.apache.org/jira/browse/IMPALA-12543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783138#comment-17783138 ] Gabor Kaszab commented on IMPALA-12543: --- Hey [~rizaon], does this test fail consistently? IMPALA-11387 seems pretty unrelated to me. Isn't it possible that this test is simply flaky? > test_iceberg_self_events failed in JDK11 build > -- > > Key: IMPALA-12543 > URL: https://issues.apache.org/jira/browse/IMPALA-12543 > Project: IMPALA > Issue Type: Bug >Reporter: Riza Suminto >Assignee: Gabor Kaszab >Priority: Major > Labels: broken-build > > test_iceberg_self_events failed in JDK11 build with the following error. > > {code:java} > Error Message > assert 0 == 1 > Stacktrace > custom_cluster/test_events_custom_configs.py:637: in test_iceberg_self_events > check_self_events("ALTER TABLE {0} ADD COLUMN j INT".format(tbl_name)) > custom_cluster/test_events_custom_configs.py:624: in check_self_events > assert tbls_refreshed_before == tbls_refreshed_after > E assert 0 == 1 {code} > This test still passed before IMPALA-11387 was merged. >
[jira] [Updated] (IMPALA-12457) Conversion from non-supported column types for Iceberg tables
[ https://issues.apache.org/jira/browse/IMPALA-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab updated IMPALA-12457: -- Issue Type: Improvement (was: New Feature) > Conversion from non-supported column types for Iceberg tables > - > > Key: IMPALA-12457 > URL: https://issues.apache.org/jira/browse/IMPALA-12457 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Reporter: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > > Assume you have a Hive table with one VARCHAR(N) column. The following now > fails: > CREATE TABLE ice_tbl STORED AS ICEBERG AS SELECT * FROM hive_tbl; > Fails because varchar(N) is not a supported Iceberg column type. Note, simple > varchar works because it's just a string under the hood. > I think this behaviour is just fine, Hive also gives an error for the above, > however, Hive has a switch called 'iceberg.mr.schema.auto.conversion' that > when you turn on Hive would do a conversion into string for varchar(N) > automatically. Also smallint and tinyint could be converted into int. > Would be nice to have something similar in Impala.
[jira] [Created] (IMPALA-12457) Conversion from non-supported column types for Iceberg tables
Gabor Kaszab created IMPALA-12457: - Summary: Conversion from non-supported column types for Iceberg tables Key: IMPALA-12457 URL: https://issues.apache.org/jira/browse/IMPALA-12457 Project: IMPALA Issue Type: New Feature Components: Frontend Reporter: Gabor Kaszab Assume you have a Hive table with one VARCHAR(N) column. The following now fails: CREATE TABLE ice_tbl STORED AS ICEBERG AS SELECT * FROM hive_tbl; It fails because varchar(N) is not a supported Iceberg column type. Note, plain varchar works because it's just a string under the hood. I think this behaviour is just fine, and Hive also gives an error for the above; however, Hive has a switch called 'iceberg.mr.schema.auto.conversion' that, when turned on, makes Hive automatically convert varchar(N) into string. Also smallint and tinyint could be converted into int. It would be nice to have something similar in Impala.
[jira] [Updated] (IMPALA-12409) Don't allow EXTERNAL Iceberg tables to point another Iceberg table in Hive catalog
[ https://issues.apache.org/jira/browse/IMPALA-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab updated IMPALA-12409: -- Description: We shouldn't allow users creating an EXTERNAL Iceberg table that points to another Iceberg table. I.e. the following should be forbidden: {noformat} CREATE EXTERNAL TABLE ice_ext STORED BY ICEBERG TBLPROPERTIES ('iceberg.table_identifier'='db.tbl');{noformat} was: We shouldn't allow users creating an EXTERNAL Iceberg table that points to another Iceberg catalog. I.e. the following should be forbidden: {noformat} CREATE EXTERNAL TABLE ice_ext STORED BY ICEBERG TBLPROPERTIES ('iceberg.table_identifier'='db.tbl');{noformat} > Don't allow EXTERNAL Iceberg tables to point another Iceberg table in Hive > catalog > -- > > Key: IMPALA-12409 > URL: https://issues.apache.org/jira/browse/IMPALA-12409 > Project: IMPALA > Issue Type: Bug > Components: Catalog, Frontend >Reporter: Zoltán Borók-Nagy >Assignee: Zoltán Borók-Nagy >Priority: Major > Labels: impala-iceberg > > We shouldn't allow users creating an EXTERNAL Iceberg table that points to > another Iceberg table. I.e. the following should be forbidden: > {noformat} > CREATE EXTERNAL TABLE ice_ext > STORED BY ICEBERG > TBLPROPERTIES ('iceberg.table_identifier'='db.tbl');{noformat}
[jira] [Commented] (IMPALA-12410) Impala's CONVERT TO ICEBERG statement does not retain table properties
[ https://issues.apache.org/jira/browse/IMPALA-12410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760756#comment-17760756 ] Gabor Kaszab commented on IMPALA-12410: --- Note, we have to be careful with migrating the table properties. E.g. if a user had set 'iceberg.table_identifier' then we don't want to migrate that property as it could point to another Iceberg table, which is in fact restricted by https://issues.apache.org/jira/browse/IMPALA-12409 I think the properties with the 'iceberg.' prefix shouldn't be kept during migration. Not sure about 'name', but we might want to drop that as well. > Impala's CONVERT TO ICEBERG statement does not retain table properties > -- > > Key: IMPALA-12410 > URL: https://issues.apache.org/jira/browse/IMPALA-12410 > Project: IMPALA > Issue Type: Bug > Components: Frontend >Reporter: Zoltán Borók-Nagy >Priority: Major > Labels: impala-iceberg > > Impala's CONVERT TO ICEBERG statement does not retain table properties. > Table properties should be retained except the ones used by Iceberg, e.g.: > * metadata_location > * iceberg.table_identifier > * name > > iceberg.mr.table.identifier
[jira] [Commented] (IMPALA-12266) Sporadic failure after migrating a table to Iceberg
[ https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758001#comment-17758001 ] Gabor Kaszab commented on IMPALA-12266: --- With the repro steps I found 3 different issues that can randomly occur: 1) the one mentioned in the description, 2) Could not resolve path, 3) Table does not exist. I believe that all 3 have the same root cause. > Sporadic failure after migrating a table to Iceberg > --- > > Key: IMPALA-12266 > URL: https://issues.apache.org/jira/browse/IMPALA-12266 > Project: IMPALA > Issue Type: Bug > Components: fe >Affects Versions: Impala 4.2.0 >Reporter: Tamas Mate >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > Attachments: > catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, > impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1 > > > TestIcebergTable.test_convert_table test failed in a recent verify job's > dockerised tests: > https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629 > {code:none} > E ImpalaBeeswaxException: ImpalaBeeswaxException: > EINNER EXCEPTION: > EMESSAGE: AnalysisException: Failed to load metadata for table: > 'parquet_nopartitioned' > E CAUSED BY: TableLoadingException: Could not load table > test_convert_table_cdba7383.parquet_nopartitioned from catalog > E CAUSED BY: TException: > TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, > error_msgs:[NullPointerException: null]), lookup_status:OK) > {code} > {code:none} > E0704 19:09:22.980131 833 JniUtil.java:183] > 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of > TABLE:test_convert_table_cdba7383.parquet_nopartitioned.
Time spent: 49ms > I0704 19:09:22.980309 833 jni-util.cc:288] > 7145c21173f2c47b:2579db55] java.lang.NullPointerException > at > org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357) > at > org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300) > at > org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587) > at > org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513) > at > org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480) > at > org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397) > at > org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90) > at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89) > at > org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109) > at > org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238) > at > org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396) > I0704 19:09:22.980324 833 status.cc:129] 7145c21173f2c47b:2579db55] > NullPointerException: null > @ 0x1012f9f impala::Status::Status() > @ 0x187f964 impala::JniUtil::GetJniExceptionMsg() > @ 0xfee920 impala::JniCall::Call<>() > @ 0xfccd0f impala::Catalog::GetPartialCatalogObject() > @ 0xfb55a5 > impala::CatalogServiceThriftIf::GetPartialCatalogObject() > @ 0xf7a691 > impala::CatalogServiceProcessorT<>::process_GetPartialCatalogObject() > @ 0xf82151 impala::CatalogServiceProcessorT<>::dispatchCall() > @ 0xee330f apache::thrift::TDispatchProcessor::process() > @ 0x1329246 > apache::thrift::server::TAcceptQueueServer::Task::run() > @ 0x1315a89 
impala::ThriftThread::RunRunnable() > @ 0x131773d > boost::detail::function::void_function_obj_invoker0<>::invoke() > @ 0x195ba8c impala::Thread::SuperviseThread() > @ 0x195c895 boost::detail::thread_data<>::run() > @ 0x23a03a7 thread_proxy > @ 0x7faaad2a66ba start_thread > @ 0x7f2c151d clone > E0704 19:09:23.006968 833 catalog-server.cc:278] > 7145c21173f2c47b:2579db55] NullPointerException: null > {code}
[jira] [Updated] (IMPALA-12266) Sporadic failure after migrating a table to Iceberg
[ https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab updated IMPALA-12266: -- Labels: impala-iceberg (was: ) > Sporadic failure after migrating a table to Iceberg > --- > > Key: IMPALA-12266 > URL: https://issues.apache.org/jira/browse/IMPALA-12266 > Project: IMPALA > Issue Type: Bug > Components: fe >Affects Versions: Impala 4.2.0 >Reporter: Tamas Mate >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > Attachments: > catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, > impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1 > > > TestIcebergTable.test_convert_table test failed in a recent verify job's > dockerised tests: > https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629 > {code:none} > E ImpalaBeeswaxException: ImpalaBeeswaxException: > EINNER EXCEPTION: > EMESSAGE: AnalysisException: Failed to load metadata for table: > 'parquet_nopartitioned' > E CAUSED BY: TableLoadingException: Could not load table > test_convert_table_cdba7383.parquet_nopartitioned from catalog > E CAUSED BY: TException: > TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, > error_msgs:[NullPointerException: null]), lookup_status:OK) > {code} > {code:none} > E0704 19:09:22.980131 833 JniUtil.java:183] > 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of > TABLE:test_convert_table_cdba7383.parquet_nopartitioned. 
Time spent: 49ms > I0704 19:09:22.980309 833 jni-util.cc:288] > 7145c21173f2c47b:2579db55] java.lang.NullPointerException > at > org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357) > at > org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300) > at > org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587) > at > org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513) > at > org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480) > at > org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397) > at > org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90) > at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89) > at > org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109) > at > org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238) > at > org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396) > I0704 19:09:22.980324 833 status.cc:129] 7145c21173f2c47b:2579db55] > NullPointerException: null > @ 0x1012f9f impala::Status::Status() > @ 0x187f964 impala::JniUtil::GetJniExceptionMsg() > @ 0xfee920 impala::JniCall::Call<>() > @ 0xfccd0f impala::Catalog::GetPartialCatalogObject() > @ 0xfb55a5 > impala::CatalogServiceThriftIf::GetPartialCatalogObject() > @ 0xf7a691 > impala::CatalogServiceProcessorT<>::process_GetPartialCatalogObject() > @ 0xf82151 impala::CatalogServiceProcessorT<>::dispatchCall() > @ 0xee330f apache::thrift::TDispatchProcessor::process() > @ 0x1329246 > apache::thrift::server::TAcceptQueueServer::Task::run() > @ 0x1315a89 
impala::ThriftThread::RunRunnable() > @ 0x131773d > boost::detail::function::void_function_obj_invoker0<>::invoke() > @ 0x195ba8c impala::Thread::SuperviseThread() > @ 0x195c895 boost::detail::thread_data<>::run() > @ 0x23a03a7 thread_proxy > @ 0x7faaad2a66ba start_thread > @ 0x7f2c151d clone > E0704 19:09:23.006968 833 catalog-server.cc:278] > 7145c21173f2c47b:2579db55] NullPointerException: null > {code}
[jira] [Commented] (IMPALA-12266) Sporadic failure after migrating a table to Iceberg
[ https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757380#comment-17757380 ] Gabor Kaszab commented on IMPALA-12266: --- I managed to repro this by running the following SQL in a loop: {code:java} create table tmp_conv_tbl (i int) stored as parquet; insert into tmp_conv_tbl values (1), (2), (3); alter table tmp_conv_tbl convert to iceberg; alter table tmp_conv_tbl set tblproperties ('format-version'='2'); drop table tmp_conv_tbl; {code} For me the DROP TABLE statement failed with a "Table does not exist" error. I guess it depends on which command is run on a different coordinator after the table conversion. Note that this repro was in local catalog mode; however, I wouldn't be surprised if it reproduced in normal catalog mode too. This is how I enabled local catalog mode: {code:java} bin/start-impala-cluster.py --impalad_args='--use_local_catalog=true' --catalogd_args='--catalog_topic_mode=minimal' {code} > Sporadic failure after migrating a table to Iceberg > --- > > Key: IMPALA-12266 > URL: https://issues.apache.org/jira/browse/IMPALA-12266 > Project: IMPALA > Issue Type: Bug > Components: fe >Affects Versions: Impala 4.2.0 >Reporter: Tamas Mate >Assignee: Gabor Kaszab >Priority: Major > Attachments: > catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, > impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1 > > > TestIcebergTable.test_convert_table test failed in a recent verify job's > dockerised tests: > https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629 > {code:none} > E ImpalaBeeswaxException: ImpalaBeeswaxException: > EINNER EXCEPTION: > EMESSAGE: AnalysisException: Failed to load metadata for table: > 'parquet_nopartitioned' > E CAUSED BY: TableLoadingException: Could not load table > test_convert_table_cdba7383.parquet_nopartitioned from catalog > E CAUSED BY: TException: > TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, >
error_msgs:[NullPointerException: null]), lookup_status:OK) > {code} > {code:none} > E0704 19:09:22.980131 833 JniUtil.java:183] > 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of > TABLE:test_convert_table_cdba7383.parquet_nopartitioned. Time spent: 49ms > I0704 19:09:22.980309 833 jni-util.cc:288] > 7145c21173f2c47b:2579db55] java.lang.NullPointerException > at > org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357) > at > org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300) > at > org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587) > at > org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513) > at > org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480) > at > org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397) > at > org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90) > at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89) > at > org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109) > at > org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238) > at > org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396) > I0704 19:09:22.980324 833 status.cc:129] 7145c21173f2c47b:2579db55] > NullPointerException: null > @ 0x1012f9f impala::Status::Status() > @ 0x187f964 impala::JniUtil::GetJniExceptionMsg() > @ 0xfee920 impala::JniCall::Call<>() > @ 0xfccd0f impala::Catalog::GetPartialCatalogObject() > @ 0xfb55a5 > impala::CatalogServiceThriftIf::GetPartialCatalogObject() > @ 0xf7a691 > 
impala::CatalogServiceProcessorT<>::process_GetPartialCatalogObject() > @ 0xf82151 impala::CatalogServiceProcessorT<>::dispatchCall() > @ 0xee330f apache::thrift::TDispatchProcessor::process() > @ 0x1329246 > apache::thrift::server::TAcceptQueueServer::Task::run() > @ 0x1315a89 impala::ThriftThread::RunRunnable() > @ 0x131773d > boost::detail::function::void_function_obj_invoker0<>::invoke() > @ 0x195ba8c impala::Thread::SuperviseThread() > @ 0x195c895 boost::detail::thread_data<>::run() > @
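The repro from the comment above can be scripted. The sketch below is a minimal harness, assuming a caller-supplied `run_sql` callable (how the statements reach Impala — impala-shell, impyla, etc. — is deliberately left open); the statement sequence itself is taken verbatim from the comment.

```python
# The SQL sequence from the comment, run repeatedly until the sporadic
# "Table does not exist" failure shows up on the DROP TABLE.
REPRO_STATEMENTS = [
    "create table tmp_conv_tbl (i int) stored as parquet",
    "insert into tmp_conv_tbl values (1), (2), (3)",
    "alter table tmp_conv_tbl convert to iceberg",
    "alter table tmp_conv_tbl set tblproperties ('format-version'='2')",
    "drop table tmp_conv_tbl",
]

def run_repro_loop(run_sql, iterations=100):
    """Run the statement sequence `iterations` times; return the number of
    fully completed iterations before `run_sql` raised (i.e. before the
    sporadic failure was hit). `run_sql` is a hypothetical hook that must
    execute one SQL string against the cluster and raise on error."""
    for i in range(iterations):
        for stmt in REPRO_STATEMENTS:
            try:
                run_sql(stmt)
            except Exception:
                return i
    return iterations
```

In practice `run_sql` would shell out to something like `impala-shell -q "<stmt>"`; with multiple coordinators, alternating the target coordinator between statements should make the race more likely to trigger.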
[jira] [Updated] (IMPALA-12266) Sporadic failure after migrating a table to Iceberg
[ https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab updated IMPALA-12266: -- Summary: Sporadic failure after migrating a table to Iceberg (was: Flaky TestIcebergTable.test_convert_table NPE) > Sporadic failure after migrating a table to Iceberg > --- > > Key: IMPALA-12266 > URL: https://issues.apache.org/jira/browse/IMPALA-12266 > Project: IMPALA > Issue Type: Bug > Components: fe >Affects Versions: Impala 4.2.0 >Reporter: Tamas Mate >Assignee: Gabor Kaszab >Priority: Major > Attachments: > catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, > impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1 > > > TestIcebergTable.test_convert_table test failed in a recent verify job's > dockerised tests: > https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629 > {code:none} > E ImpalaBeeswaxException: ImpalaBeeswaxException: > EINNER EXCEPTION: > EMESSAGE: AnalysisException: Failed to load metadata for table: > 'parquet_nopartitioned' > E CAUSED BY: TableLoadingException: Could not load table > test_convert_table_cdba7383.parquet_nopartitioned from catalog > E CAUSED BY: TException: > TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, > error_msgs:[NullPointerException: null]), lookup_status:OK) > {code} > {code:none} > E0704 19:09:22.980131 833 JniUtil.java:183] > 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of > TABLE:test_convert_table_cdba7383.parquet_nopartitioned. 
Time spent: 49ms > I0704 19:09:22.980309 833 jni-util.cc:288] > 7145c21173f2c47b:2579db55] java.lang.NullPointerException > at > org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357) > at > org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300) > at > org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587) > at > org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513) > at > org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480) > at > org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397) > at > org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90) > at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89) > at > org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109) > at > org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238) > at > org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396) > I0704 19:09:22.980324 833 status.cc:129] 7145c21173f2c47b:2579db55] > NullPointerException: null > @ 0x1012f9f impala::Status::Status() > @ 0x187f964 impala::JniUtil::GetJniExceptionMsg() > @ 0xfee920 impala::JniCall::Call<>() > @ 0xfccd0f impala::Catalog::GetPartialCatalogObject() > @ 0xfb55a5 > impala::CatalogServiceThriftIf::GetPartialCatalogObject() > @ 0xf7a691 > impala::CatalogServiceProcessorT<>::process_GetPartialCatalogObject() > @ 0xf82151 impala::CatalogServiceProcessorT<>::dispatchCall() > @ 0xee330f apache::thrift::TDispatchProcessor::process() > @ 0x1329246 > apache::thrift::server::TAcceptQueueServer::Task::run() > @ 0x1315a89 
impala::ThriftThread::RunRunnable() > @ 0x131773d > boost::detail::function::void_function_obj_invoker0<>::invoke() > @ 0x195ba8c impala::Thread::SuperviseThread() > @ 0x195c895 boost::detail::thread_data<>::run() > @ 0x23a03a7 thread_proxy > @ 0x7faaad2a66ba start_thread > @ 0x7f2c151d clone > E0704 19:09:23.006968 833 catalog-server.cc:278] > 7145c21173f2c47b:2579db55] NullPointerException: null > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12266) Flaky TestIcebergTable.test_convert_table NPE
[ https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757376#comment-17757376 ] Gabor Kaszab commented on IMPALA-12266: --- This is actually more than just a flaky test, as it comes up in various scenarios. I'll rename the ticket to reflect this.
[jira] [Updated] (IMPALA-12308) Implement DIRECTED distribution mode for Iceberg tables
[ https://issues.apache.org/jira/browse/IMPALA-12308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab updated IMPALA-12308: -- Issue Type: Improvement (was: Bug) > Implement DIRECTED distribution mode for Iceberg tables > --- > > Key: IMPALA-12308 > URL: https://issues.apache.org/jira/browse/IMPALA-12308 > Project: IMPALA > Issue Type: Improvement > Components: Backend, Frontend >Reporter: Zoltán Borók-Nagy >Priority: Major > Labels: impala-iceberg, performance > > Currently there are two distribution modes for JOIN operators: > * BROADCAST: the RHS is delivered to all executors of the LHS > * PARTITIONED: both the LHS and the RHS are shuffled across executors > We implement reading of an Iceberg V2 table (with position delete files) via > an ANTI JOIN operator. The LHS is the SCAN operator of the data records, the RHS is > the SCAN operator of the delete records. The delete records contain > (file_path, pos) information about the deleted rows. > This means we can introduce another distribution mode, just for Iceberg V2 > tables with position deletes: DIRECTED distribution mode. > At scheduling time we must save the information about the data SCAN operators, i.e. on > which nodes they are going to be executed. The LHS doesn't need to be shuffled > over the network. > The RHS can use the scheduling information to transfer the > delete records to the hosts that process the corresponding data files. > This minimizes network communication. > We can also add further optimizations to the Iceberg V2 operator > (IcebergDeleteNode): > * Compare the pointers of the file paths instead of doing a string compare > * Each tuple in a row batch belongs to the same file, and positions are in > ascending order > ** Only one lookup is needed from the hash table > ** We can add fast paths to skip testing the whole row batch (when the row > batch's position range is outside of the delete position range)
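The routing idea behind DIRECTED mode can be sketched as follows. This is an illustration only — the function name, the schedule map, and the record format are made up for the sketch, not Impala's actual scheduler or backend internals: given the mapping of data files to the hosts scheduled to scan them, each delete record (file_path, pos) is sent only to the one host that processes the matching data file.

```python
from collections import defaultdict

def route_delete_records(schedule, delete_records):
    """schedule: dict mapping data file path -> host scheduled to scan it.
    delete_records: iterable of (file_path, pos) pairs from position
    delete files. Returns {host: [(file_path, pos), ...]} so each delete
    record travels to exactly one host instead of being broadcast or
    hash-partitioned across all of them."""
    per_host = defaultdict(list)
    for file_path, pos in delete_records:
        host = schedule.get(file_path)
        # A delete referring to a data file that is not scheduled (e.g. the
        # file was pruned) is irrelevant to this query and can be dropped.
        if host is not None:
            per_host[host].append((file_path, pos))
    return dict(per_host)
```

Network cost is then proportional to the number of relevant delete records, rather than (BROADCAST) the number of delete records times the number of hosts.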
[jira] [Assigned] (IMPALA-12308) Implement DIRECTED distribution mode for Iceberg tables
[ https://issues.apache.org/jira/browse/IMPALA-12308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab reassigned IMPALA-12308: - Assignee: Gabor Kaszab
[jira] [Commented] (IMPALA-12266) Flaky TestIcebergTable.test_convert_table NPE
[ https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17743796#comment-17743796 ] Gabor Kaszab commented on IMPALA-12266: --- I checked the logs the other day and it seems to me that the table migration to Iceberg was successful, but the first query on the converted table failed; I recall it was a simple select count(*). It's a bit strange that this is flaky. I suspect that the issue only shows up in the GVO build, most probably with local catalog mode turned on. So there might be some timing issue where we have converted the table but some of the coordinators don't see it under the original name.
[jira] [Resolved] (IMPALA-11013) Support migrating external tables to Iceberg tables
[ https://issues.apache.org/jira/browse/IMPALA-11013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab resolved IMPALA-11013. --- Fix Version/s: Impala 4.3.0 Resolution: Fixed > Support migrating external tables to Iceberg tables > --- > > Key: IMPALA-11013 > URL: https://issues.apache.org/jira/browse/IMPALA-11013 > Project: IMPALA > Issue Type: Bug > Components: Catalog, Frontend >Reporter: Zoltán Borók-Nagy >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > Fix For: Impala 4.3.0 > > > E.g. Hive supports migrating external tables to Iceberg tables via the > following command: > {noformat} > ALTER TABLE t SET TBLPROPERTIES > ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'); > {noformat} > Maybe we could support table migration with the same command.
[jira] [Created] (IMPALA-12251) Table migration to run on multiple partitions in parallel
Gabor Kaszab created IMPALA-12251: - Summary: Table migration to run on multiple partitions in parallel Key: IMPALA-12251 URL: https://issues.apache.org/jira/browse/IMPALA-12251 Project: IMPALA Issue Type: New Feature Components: Frontend Reporter: Gabor Kaszab https://issues.apache.org/jira/browse/IMPALA-11013 introduces table migration from legacy Hive tables to Iceberg tables. The parallelization in that patch is based on the files within a partition, but if there are a lot of partitions with only a few files in each, this approach is not performant. Instead, as an improvement, we could implement parallelization based on partitions, and then decide which strategy to use based on the ratio of the number of partitions to the average number of files per partition.
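The ratio-based decision proposed above could look roughly like the sketch below. The function name and the threshold value are made up for illustration; the ticket does not specify either.

```python
def choose_parallelism(num_partitions, total_files, min_files_per_partition=10):
    """Hypothetical heuristic for the migration job: parallelize over
    partitions when there are many partitions with few files each,
    otherwise parallelize over the files within a partition (the
    strategy IMPALA-11013 already implements)."""
    if num_partitions == 0:
        # Unpartitioned table: only file-level parallelism makes sense.
        return "by_files"
    avg_files = total_files / num_partitions
    return "by_partitions" if avg_files < min_files_per_partition else "by_files"
```

So a table with 1000 partitions of 2 files each would migrate partition-by-partition in parallel, while a 2-partition table with 1000 files per partition keeps the existing file-level parallelism.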
[jira] [Assigned] (IMPALA-11013) Support migrating external tables to Iceberg tables
[ https://issues.apache.org/jira/browse/IMPALA-11013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab reassigned IMPALA-11013: - Assignee: Gabor Kaszab (was: Andrew Sherman)
[jira] [Assigned] (IMPALA-12190) Renaming table will cause losing privileges for non-admin users
[ https://issues.apache.org/jira/browse/IMPALA-12190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab reassigned IMPALA-12190: - Assignee: (was: Gabor Kaszab) > Renaming table will cause losing privileges for non-admin users > --- > > Key: IMPALA-12190 > URL: https://issues.apache.org/jira/browse/IMPALA-12190 > Project: IMPALA > Issue Type: Bug > Components: Catalog >Reporter: Gabor Kaszab >Priority: Critical > Labels: alter-table, authorization, ranger > > Let's say user 'a' gets some privileges on table 't'. When this table gets > renamed (even by user 'a') then user 'a' loses its privileges on that table. > > Repro steps: > # Start impala with Ranger > # start impala-shell as admin (-u admin) > # create table tmp (i int, s string) stored as parquet; > # grant all on table tmp to user ; > # grant all on table tmp to user ; > {code:java} > Query: show grant user on table tmp > +++--+---++-+--+-+-+---+--+-+ > | principal_type | principal_name | database | table | column | uri | > storage_type | storage_uri | udf | privilege | grant_option | create_time | > +++--+---++-+--+-+-+---+--+-+ > | USER | | default | tmp | * | | > | | | all | false | NULL | > +++--+---++-+--+-+-+---+--+-+ > Fetched 1 row(s) in 0.01s {code} > # alter table tmp rename to tmp_1234; > # show grant user on table tmp_1234; > {code:java} > Query: show grant user on table tmp_1234 > Fetched 0 row(s) in 0.17s{code}
[jira] [Updated] (IMPALA-12209) format-version is not present in DESCRIBE FORMATTED and SHOW CREATE TABLE outputs
[ https://issues.apache.org/jira/browse/IMPALA-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab updated IMPALA-12209: -- Description: Repro: {code:java} create table tmp (i int, s string) stored as iceberg tblproperties ('format-version'='2'); describe extended/formatted tmp; show create table tmp; {code} Current behaviour: Neither of the two commands contains 'format-version' in its output. Additionally, if you run what is returned from SHOW CREATE TABLE then you end up creating a V1 table instead of V2. The reason might be that format-version is not stored within the table properties in the metadata.json but one level above: {code:java} hdfs dfs -cat hdfs://localhost:20500/test-warehouse/tmp/metadata/0-55bcfe84-1819-4fb7-ade8-9c132b117880.metadata.json { "format-version" : 2, "table-uuid" : "9f11c0c4-02c7-4688-823c-fe95dbe3ff72", "location" : "hdfs://localhost:20500/test-warehouse/tmp", "last-sequence-number" : 0, "last-updated-ms" : 1686640775184, "last-column-id" : 2, "current-schema-id" : 0, "schemas" : [ { "type" : "struct", "schema-id" : 0, "fields" : [ { "id" : 1, "name" : "i", "required" : false, "type" : "int" }, { "id" : 2, "name" : "s", "required" : false, "type" : "string" } ] } ], "default-spec-id" : 0, "partition-specs" : [ { "spec-id" : 0, "fields" : [ ] } ], "last-partition-id" : 999, "default-sort-order-id" : 0, "sort-orders" : [ { "order-id" : 0, "fields" : [ ] } ], "properties" : { "engine.hive.enabled" : "true", "external.table.purge" : "TRUE", "write.merge.mode" : "merge-on-read", "write.format.default" : "parquet", "write.delete.mode" : "merge-on-read", "OBJCAPABILITIES" : "EXTREAD,EXTWRITE", "write.update.mode" : "merge-on-read", "storage_handler" : "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler" }, "current-snapshot-id" : -1, "refs" : { }, "snapshots" : [ ], "statistics" : [ ], "snapshot-log" : [ ], "metadata-log" : [ ] {code} > format-version is not present in DESCRIBE FORMATTED and SHOW CREATE TABLE > outputs > - > > Key: IMPALA-12209 > URL: https://issues.apache.org/jira/browse/IMPALA-12209 > Project: IMPALA > Issue Type: Bug > Components: from >Reporter: Gabor Kaszab >Priority: Major > Labels: impala-iceberg 
> > Repro: > > {code:java} > create table tmp (i int, s string) stored as iceberg tblproperties > ('format-version'='2'); > describe extended/formatted tmp; > show create table tmp; > {code} > Current behaviour: > None of the following 2 commands contain 'format-version' in the output. > Additionally, if you run what is returned from SHOW CREATE TABLE then you end > up creating a V1 table instead of V2.
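The layout described above can be demonstrated on a heavily reduced metadata.json: 'format-version' is a top-level field, one level above the 'properties' map, so any code that only surfaces 'properties' will miss it. A minimal sketch (the JSON is trimmed to the two relevant fields):

```python
import json

# Trimmed-down Iceberg metadata.json: only the fields relevant here.
metadata = json.loads("""
{
  "format-version": 2,
  "table-uuid": "9f11c0c4-02c7-4688-823c-fe95dbe3ff72",
  "properties": {
    "engine.hive.enabled": "true",
    "write.format.default": "parquet"
  }
}
""")

# 'format-version' is a top-level key, not a table property, which is
# consistent with DESCRIBE FORMATTED / SHOW CREATE TABLE not showing it.
fv = metadata["format-version"]
in_props = "format-version" in metadata["properties"]
```

Surfacing the version in DESCRIBE FORMATTED / SHOW CREATE TABLE would therefore need an explicit read of the top-level field rather than a pass over the properties map.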
[jira] [Commented] (IMPALA-11710) Table properties are not updated in Iceberg metadata files
[ https://issues.apache.org/jira/browse/IMPALA-11710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17731984#comment-17731984 ] Gabor Kaszab commented on IMPALA-11710: --- One addition: when creating an Iceberg table without providing tblproperties then 'external.table.purge' is defaulted to true and we can alter this property to false later on. This is also persisted in the metadata.json file. However, once it's false it can't be changed back to true again. > Table properties are not updated in Iceberg metadata files > -- > > Key: IMPALA-11710 > URL: https://issues.apache.org/jira/browse/IMPALA-11710 > Project: IMPALA > Issue Type: Bug >Reporter: Noemi Pap-Takacs >Priority: Major > Labels: impala-iceberg > > This issue occurs in true external Hive Catalog tables. > Iceberg stores the default file format in a table property called > 'write.format.default'. HMS also stores this value loaded from the Iceberg > metadata json file. > However, when this table property is altered through Impala, it is only > changed in HMS, but does not update the Iceberg snapshot. When the table data > is reloaded from Iceberg metadata, the old value will appear in HMS and the > change is lost. > This bug does not affect table properties that are not stored in Iceberg, > because they will not be reloaded.
[jira] [Created] (IMPALA-12209) format-version is not present in DESCRIBE FORMATTED and SHOW CREATE TABLE outputs
Gabor Kaszab created IMPALA-12209: - Summary: format-version is not present in DESCRIBE FORMATTED and SHOW CREATE TABLE outputs Key: IMPALA-12209 URL: https://issues.apache.org/jira/browse/IMPALA-12209 Project: IMPALA Issue Type: Bug Components: from Reporter: Gabor Kaszab Repro: {code:java} create table tmp (i int, s string) stored as iceberg tblproperties ('format-version'='2'); describe extended/formatted tmp; show create table tmp; {code} Current behaviour: Neither of the two commands contains 'format-version' in its output. Additionally, if you run what is returned from SHOW CREATE TABLE then you end up creating a V1 table instead of V2. The reason might be that format-version is not stored within the table properties in the metadata.json but one level above: {code:java} hdfs dfs -cat hdfs://localhost:20500/test-warehouse/tmp/metadata/0-55bcfe84-1819-4fb7-ade8-9c132b117880.metadata.json { "format-version" : 2, "table-uuid" : "9f11c0c4-02c7-4688-823c-fe95dbe3ff72", "location" : "hdfs://localhost:20500/test-warehouse/tmp", "last-sequence-number" : 0, "last-updated-ms" : 1686640775184, "last-column-id" : 2, "current-schema-id" : 0, "schemas" : [ { "type" : "struct", "schema-id" : 0, "fields" : [ { "id" : 1, "name" : "i", "required" : false, "type" : "int" }, { "id" : 2, "name" : "s", "required" : false, "type" : "string" } ] } ], "default-spec-id" : 0, "partition-specs" : [ { "spec-id" : 0, "fields" : [ ] } ], "last-partition-id" : 999, "default-sort-order-id" : 0, "sort-orders" : [ { "order-id" : 0, "fields" : [ ] } ], "properties" : { "engine.hive.enabled" : "true", "external.table.purge" : "TRUE", "write.merge.mode" : "merge-on-read", "write.format.default" : "parquet", "write.delete.mode" : "merge-on-read", "OBJCAPABILITIES" : "EXTREAD,EXTWRITE", "write.update.mode" : "merge-on-read", "storage_handler" : "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler" }, "current-snapshot-id" : -1, "refs" : { }, "snapshots" : [ ], "statistics" : [ ], "snapshot-log" : [ ], "metadata-log" : [ ] {code}
[jira] [Commented] (IMPALA-11552) Support migrating Iceberg v1 tables to Iceberg v2
[ https://issues.apache.org/jira/browse/IMPALA-11552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17731937#comment-17731937 ] Gabor Kaszab commented on IMPALA-11552: --- can we close this? > Support migrating Iceberg v1 tables to Iceberg v2 > - > > Key: IMPALA-11552 > URL: https://issues.apache.org/jira/browse/IMPALA-11552 > Project: IMPALA > Issue Type: Improvement >Reporter: Manish Maheshwari >Priority: Major > Labels: impala-iceberg > > Support migrating Iceberg v1 tables to Iceberg v2
[jira] [Commented] (IMPALA-11710) Table properties are not updated in Iceberg metadata files
[ https://issues.apache.org/jira/browse/IMPALA-11710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730952#comment-17730952 ] Gabor Kaszab commented on IMPALA-11710: --- Ran into the same issue with a different table property. For the record, the repro steps: create table tmp_ice (i int, s string) stored as iceberg tblproperties ('external.table.purge'='false'); alter table tmp_ice set tblproperties('external.table.purge'='true'); Now if you call {{describe formatted tmp_ice;}}, the property is set to true as expected. insert into tmp_ice values (1, "str1"); But after inserting a row into the table, it is set back to false (checked with {{describe formatted}} and also {{show create table}}). > Table properties are not updated in Iceberg metadata files > -- > > Key: IMPALA-11710 > URL: https://issues.apache.org/jira/browse/IMPALA-11710 > Project: IMPALA > Issue Type: Bug >Reporter: Noemi Pap-Takacs >Priority: Major > Labels: impala-iceberg > > This issue occurs in true external Hive Catalog tables. > Iceberg stores the default file format in a table property called > 'write.format.default'. HMS also stores this value loaded from the Iceberg > metadata json file. > However, when this table property is altered through Impala, it is only > changed in HMS, but does not update the Iceberg snapshot. When the table data > is reloaded from Iceberg metadata, the old value will appear in HMS and the > change is lost. > This bug does not affect table properties that are not stored in Iceberg, > because they will not be reloaded.
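The mechanism in the issue description can be modeled with a small, self-contained sketch: plain Java maps stand in for HMS and the Iceberg metadata (names are illustrative, not Impala or Iceberg APIs). The ALTER only touches the HMS copy, so the next reload from the Iceberg metadata json silently reverts it.

```java
import java.util.HashMap;
import java.util.Map;

public class PropertyDriftSketch {
    // Model of the bug: the property lives in two places, only one is updated.
    static String purgeValueAfterReload() {
        Map<String, String> icebergMetadata = new HashMap<>();
        icebergMetadata.put("external.table.purge", "false");

        // HMS initially mirrors the Iceberg metadata.
        Map<String, String> hms = new HashMap<>(icebergMetadata);

        // ALTER TABLE ... SET TBLPROPERTIES updates HMS only, not the Iceberg snapshot.
        hms.put("external.table.purge", "true");

        // A later write reloads HMS from the Iceberg metadata json, losing the change.
        hms = new HashMap<>(icebergMetadata);
        return hms.get("external.table.purge");
    }

    public static void main(String[] args) {
        System.out.println(purgeValueAfterReload()); // prints false
    }
}
```

A fix along these lines would have to commit the property change to the Iceberg table metadata as well, so the reload reproduces it.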
[jira] [Assigned] (IMPALA-12190) Renaming table will cause losing privileges for non-admin users
[ https://issues.apache.org/jira/browse/IMPALA-12190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab reassigned IMPALA-12190: - Assignee: Gabor Kaszab > Renaming table will cause losing privileges for non-admin users > --- > > Key: IMPALA-12190 > URL: https://issues.apache.org/jira/browse/IMPALA-12190 > Project: IMPALA > Issue Type: Bug > Components: Catalog >Reporter: Gabor Kaszab >Assignee: Gabor Kaszab >Priority: Critical > Labels: alter-table, authorization, ranger > > Let's say user 'a' gets some privileges on table 't'. When this table gets > renamed (even by user 'a') then user 'a' loses its privileges on that table. > > Repro steps: > # Start impala with Ranger > # start impala-shell as admin (-u admin) > # create table tmp (i int, s string) stored as parquet; > # grant all on table tmp to user ; > # grant all on table tmp to user ;
> {code:java}
> Query: show grant user on table tmp
> +----------------+----------------+----------+-------+--------+-----+--------------+-------------+-----+-----------+--------------+-------------+
> | principal_type | principal_name | database | table | column | uri | storage_type | storage_uri | udf | privilege | grant_option | create_time |
> +----------------+----------------+----------+-------+--------+-----+--------------+-------------+-----+-----------+--------------+-------------+
> | USER           |                | default  | tmp   | *      |     |              |             |     | all       | false        | NULL        |
> +----------------+----------------+----------+-------+--------+-----+--------------+-------------+-----+-----------+--------------+-------------+
> Fetched 1 row(s) in 0.01s
> {code}
> # alter table tmp rename to tmp_1234; > # show grant user on table tmp_1234;
> {code:java}
> Query: show grant user on table tmp_1234
> Fetched 0 row(s) in 0.17s
> {code}
[jira] [Updated] (IMPALA-12190) Renaming table will cause losing privileges for non-admin users
[ https://issues.apache.org/jira/browse/IMPALA-12190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab updated IMPALA-12190: -- Priority: Critical (was: Major) > Renaming table will cause losing privileges for non-admin users > --- > > Key: IMPALA-12190 > URL: https://issues.apache.org/jira/browse/IMPALA-12190 > Project: IMPALA > Issue Type: Bug > Components: Catalog >Reporter: Gabor Kaszab >Priority: Critical > Labels: alter-table, authorization, ranger
[jira] [Created] (IMPALA-12190) Renaming table will cause losing privileges for non-admin users
Gabor Kaszab created IMPALA-12190: - Summary: Renaming table will cause losing privileges for non-admin users Key: IMPALA-12190 URL: https://issues.apache.org/jira/browse/IMPALA-12190 Project: IMPALA Issue Type: Bug Components: Catalog Reporter: Gabor Kaszab Let's say user 'a' gets some privileges on table 't'. When this table gets renamed (even by user 'a') then user 'a' loses its privileges on that table. Repro steps: # Start impala with Ranger # start impala-shell as admin (-u admin) # create table tmp (i int, s string) stored as parquet; # grant all on table tmp to user ; # grant all on table tmp to user ;
{code:java}
Query: show grant user on table tmp
+----------------+----------------+----------+-------+--------+-----+--------------+-------------+-----+-----------+--------------+-------------+
| principal_type | principal_name | database | table | column | uri | storage_type | storage_uri | udf | privilege | grant_option | create_time |
+----------------+----------------+----------+-------+--------+-----+--------------+-------------+-----+-----------+--------------+-------------+
| USER           |                | default  | tmp   | *      |     |              |             |     | all       | false        | NULL        |
+----------------+----------------+----------+-------+--------+-----+--------------+-------------+-----+-----------+--------------+-------------+
Fetched 1 row(s) in 0.01s
{code}
# alter table tmp rename to tmp_1234; # show grant user on table tmp_1234;
{code:java}
Query: show grant user on table tmp_1234
Fetched 0 row(s) in 0.17s
{code}
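The repro above is consistent with authorization policies being keyed by the fully qualified table name, so a rename leaves the old key orphaned unless the policy store is re-keyed as part of the rename. A self-contained model of that idea; the class, map, and method names are hypothetical, not Ranger's actual API.

```java
import java.util.HashMap;
import java.util.Map;

public class RenamePolicySketch {
    // Hypothetical policy store keyed by fully qualified table name.
    static Map<String, String> policies = new HashMap<>();

    static String grantsFor(String qualifiedName) {
        return policies.get(qualifiedName);
    }

    // A rename that also re-keys the policy entry, which is what a fix needs to do.
    static void renameWithPolicyMigration(String from, String to) {
        String grants = policies.remove(from);
        if (grants != null) policies.put(to, grants);
    }

    public static void main(String[] args) {
        policies.put("default.tmp", "user_a: ALL");

        // Rename WITHOUT migrating the policy: the grant is effectively lost.
        System.out.println(grantsFor("default.tmp_1234")); // null, i.e. "Fetched 0 row(s)"

        renameWithPolicyMigration("default.tmp", "default.tmp_1234");
        System.out.println(grantsFor("default.tmp_1234")); // user_a: ALL
    }
}
```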
[jira] [Resolved] (IMPALA-12153) Parquet STRUCT reader doesn't fill position slots
[ https://issues.apache.org/jira/browse/IMPALA-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab resolved IMPALA-12153. --- Fix Version/s: Impala 4.3.0 Resolution: Fixed > Parquet STRUCT reader doesn't fill position slots > - > > Key: IMPALA-12153 > URL: https://issues.apache.org/jira/browse/IMPALA-12153 > Project: IMPALA > Issue Type: Bug > Components: Backend >Reporter: Zoltán Borók-Nagy >Assignee: Zoltán Borók-Nagy >Priority: Major > Fix For: Impala 4.3.0 > > > The Parquet STRUCT reader fills neither the collection position slot nor > the file position slot. > E.g.: > {noformat} > select id, file__position, pos, item > from complextypestbl c, c.nested_struct.c.d.item; > SET expand_complex_types=True; > select file__position, * from complextypestbl;{noformat}
[jira] [Commented] (IMPALA-11701) Skip pushing down Iceberg predicates to Impala scanner if not needed
[ https://issues.apache.org/jira/browse/IMPALA-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17720443#comment-17720443 ] Gabor Kaszab commented on IMPALA-11701: --- [~lipenglin] Frankly, I haven't looked into the residual() code to see what stats it takes into account. > Skip pushing down Iceberg predicates to Impala scanner if not needed > > > Key: IMPALA-11701 > URL: https://issues.apache.org/jira/browse/IMPALA-11701 > Project: IMPALA > Issue Type: Sub-task > Components: Backend, Frontend >Reporter: Qizhu Chan >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > Fix For: Impala 4.3.0 > > Attachments: image-2022-11-03-17-37-14-712.png, > profile_cf446a1ab3a5e852_1b1005de.txt > > > I use impala to query iceberg table, but the query efficiency is not ideal, > compared with querying the hive format table of the same data, the > time-consuming increase is dozens of times. > The sql statement used is a very simple statistical query, be like : > select count(*) from `db_name`.tbl_name where datekey='20221001' and > event='xxx' > ('datekey' and 'event' are the partition fields) > My personal feeling is that impala might fetch iceberg's metadata stats and > return results very quickly, but it doesn't. > The catalog of iceberg table is of the hadoop type, and Impala can access it > by creating an external table in hive. By the way, iceberg table will > perform snapshot expiration and data compaction on a daily basis, so there > should be no small file problems. > I found this warning using the explain statement:
> {code:java}
> | WARNING: The following tables are missing relevant table and/or column statistics.
> | iceberg.gamebox_event_iceberg
> {code}
> Query: SHOW TABLE STATS `iceberg`.gamebox_event_iceberg
> +-------+--------+--------+--------------+-------------------+---------+-------------------+------------------------------------------------------+
> | #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                             |
> +-------+--------+--------+--------------+-------------------+---------+-------------------+------------------------------------------------------+
> | 0     | 590509 | 1.91TB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs:///hive/warehouse/iceberg/gamebox_event_iceberg |
> +-------+--------+--------+--------------+-------------------+---------+-------------------+------------------------------------------------------+
> It seems like Impala is not syncing iceberg's table and column statistics. > I'm not sure if this has anything to do with slow queries. > As shown in the screenshot, i think the query time is mainly on planning and > execution backends, but I don't know what is the reason for these two time > consuming. > Attachment is the complete profile for this query. > How do I speed up the query? Can someone help with my question?plz. > !image-2022-11-03-17-37-14-712.png!
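For context on what residual() gives the planner: when a predicate is fully answered by a file's partition metadata, the residual expression for that file collapses to "always true", so pushing the original predicate down to the scanner adds no filtering. A self-contained sketch of that idea; the method and its string results are illustrative, not Iceberg's Expression API.

```java
public class ResidualSketch {
    // Illustrative residual computation: if the predicate is answered by the
    // file's partition value, nothing remains to evaluate per row.
    static String residual(String predCol, String predVal, String partCol, String partVal) {
        if (predCol.equals(partCol) && predVal.equals(partVal)) {
            return "alwaysTrue"; // safe to skip pushing this predicate to the scanner
        }
        return predCol + "=" + predVal; // must still be evaluated row by row
    }

    public static void main(String[] args) {
        // Partition predicate: fully answered by partition metadata.
        System.out.println(residual("datekey", "20221001", "datekey", "20221001")); // alwaysTrue
        // Non-partition predicate: survives as a residual for the scanner.
        System.out.println(residual("event", "xxx", "datekey", "20221001"));        // event=xxx
    }
}
```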
[jira] [Assigned] (IMPALA-11701) Skip pushing down Iceberg predicates to Impala scanner if not needed
[ https://issues.apache.org/jira/browse/IMPALA-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab reassigned IMPALA-11701: - Assignee: Gabor Kaszab (was: Wenzhe Zhou) > Skip pushing down Iceberg predicates to Impala scanner if not needed > > > Key: IMPALA-11701 > URL: https://issues.apache.org/jira/browse/IMPALA-11701 > Project: IMPALA > Issue Type: Sub-task > Components: Backend, Frontend >Reporter: Qizhu Chan >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > Fix For: Impala 4.3.0 > > Attachments: image-2022-11-03-17-37-14-712.png, > profile_cf446a1ab3a5e852_1b1005de.txt
[jira] [Created] (IMPALA-12107) Precondition check fails when creating range partitioned Kudu table with unsupported types
Gabor Kaszab created IMPALA-12107: - Summary: Precondition check fails when creating range partitioned Kudu table with unsupported types Key: IMPALA-12107 URL: https://issues.apache.org/jira/browse/IMPALA-12107 Project: IMPALA Issue Type: Bug Components: Frontend Reporter: Gabor Kaszab
{code:java}
CREATE TABLE example_table (
  id INT,
  value DECIMAL(18,2),
  PRIMARY KEY (id, value)
)
PARTITION BY RANGE (value) (
  PARTITION VALUES <= 1000.00,
  PARTITION 1000.00 < VALUES <= 5000.00,
  PARTITION 5000.00 < VALUES <= 1.00,
  PARTITION 1.00 < VALUES
)
STORED AS KUDU;
{code}
This leads to an IllegalStateException.
{code:java}
I0428 14:17:47.564204 10195 jni-util.cc:288] 8f47bda158e1bba1:1d38855b] java.lang.IllegalStateException
at com.google.common.base.Preconditions.checkState(Preconditions.java:492)
at org.apache.impala.analysis.RangePartition.analyzeBoundaryValue(RangePartition.java:180)
at org.apache.impala.analysis.RangePartition.analyzeBoundaryValues(RangePartition.java:150)
at org.apache.impala.analysis.RangePartition.analyze(RangePartition.java:135)
at org.apache.impala.analysis.KuduPartitionParam.analyzeRangeParam(KuduPartitionParam.java:144)
at org.apache.impala.analysis.KuduPartitionParam.analyze(KuduPartitionParam.java:132)
at org.apache.impala.analysis.CreateTableStmt.analyzeKuduPartitionParams(CreateTableStmt.java:550)
at org.apache.impala.analysis.CreateTableStmt.analyzeSynchronizedKuduTableParams(CreateTableStmt.java:502)
at org.apache.impala.analysis.CreateTableStmt.analyzeKuduFormat(CreateTableStmt.java:352)
at org.apache.impala.analysis.CreateTableStmt.analyze(CreateTableStmt.java:266)
at org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:521)
at org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:468)
at org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2059)
at org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:1967)
at org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1788)
at org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:164)
{code}
Here: https://github.com/apache/impala/blob/112bab64b77d6ed966b1c67bd503ed632da6f208/fe/src/main/java/org/apache/impala/analysis/RangePartition.java#L198 Instead of running into a Precondition check failure, we should detect unsupported types beforehand and fail the query with a proper error message.
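One possible shape of the suggested fix, as a hedged sketch: validate the range-partition key type before boundary-value analysis and raise a user-facing error instead of tripping Preconditions.checkState deep in RangePartition. The supported-type list, class name, and exception type here are illustrative, not Impala's actual analysis code.

```java
import java.util.Arrays;
import java.util.List;

public class RangePartitionTypeCheck {
    // Illustrative whitelist; the real set of supported Kudu range-partition
    // key types would come from Impala's type system.
    static final List<String> SUPPORTED =
        Arrays.asList("TINYINT", "SMALLINT", "INT", "BIGINT", "STRING", "TIMESTAMP");

    static void checkRangePartitionType(String colName, String colType) {
        if (!SUPPORTED.contains(colType)) {
            // A readable analysis error instead of a bare IllegalStateException.
            throw new IllegalArgumentException("RANGE partitioning is not supported for "
                + colType + " column '" + colName + "'");
        }
    }

    public static void main(String[] args) {
        checkRangePartitionType("id", "INT"); // passes
        try {
            checkRangePartitionType("value", "DECIMAL(18,2)");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```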
[jira] [Created] (IMPALA-12089) Be able to skip pushing down a subset of the predicates
Gabor Kaszab created IMPALA-12089: - Summary: Be able to skip pushing down a subset of the predicates Key: IMPALA-12089 URL: https://issues.apache.org/jira/browse/IMPALA-12089 Project: IMPALA Issue Type: Sub-task Components: Frontend Reporter: Gabor Kaszab https://issues.apache.org/jira/browse/IMPALA-11701 introduced logic to skip pushing down predicates to Impala scanners if they are already applied by Iceberg and won't filter any further rows. This is an "all or nothing" approach where we either skip pushing down all the predicates or we push down all of them. As a more sophisticated approach we should be able to push down a subset of the predicates to Impala Scan nodes. For this we should be able to map Iceberg predicates (returned from residual()) to Impala predicates. This might not be that trivial as Iceberg sometimes doesn't return the exact same predicates as it received through planFiles(). E.g. the object ID might be different making the mapping more difficult.
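The mapping difficulty described above, where residual() returns expressions equal in meaning but not identical as objects, suggests matching by structure rather than by reference. A self-contained sketch under that assumption; the string normalization stands in for a real expression comparison and is not Iceberg's or Impala's API.

```java
import java.util.Arrays;
import java.util.List;

public class ResidualMappingSketch {
    // Hypothetical normalization: strip whitespace and case so that two
    // structurally equal predicates compare equal even as different objects.
    static String normalize(String expr) {
        return expr.replaceAll("\\s+", "").toLowerCase();
    }

    // An Impala conjunct can be skipped only if it is NOT covered by any
    // residual, i.e. Iceberg already applied it fully.
    static boolean residualsContain(String impalaConjunct, List<String> residuals) {
        String key = normalize(impalaConjunct);
        return residuals.stream().map(ResidualMappingSketch::normalize).anyMatch(key::equals);
    }

    public static void main(String[] args) {
        List<String> residuals = Arrays.asList("event = 'xxx'");
        // Reference equality would fail here; structural comparison matches.
        System.out.println(residualsContain("event='xxx'", residuals));          // true
        System.out.println(residualsContain("datekey='20221001'", residuals));   // false
    }
}
```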
[jira] [Resolved] (IMPALA-11701) Skip pushing down Iceberg predicates to Impala scanner if not needed
[ https://issues.apache.org/jira/browse/IMPALA-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Kaszab resolved IMPALA-11701. --- Fix Version/s: Impala 4.3.0 Resolution: Fixed > Skip pushing down Iceberg predicates to Impala scanner if not needed > > > Key: IMPALA-11701 > URL: https://issues.apache.org/jira/browse/IMPALA-11701 > Project: IMPALA > Issue Type: Sub-task > Components: Backend, Frontend >Reporter: Qizhu Chan >Assignee: Gabor Kaszab >Priority: Major > Labels: impala-iceberg > Fix For: Impala 4.3.0 > > Attachments: image-2022-11-03-17-37-14-712.png, > profile_cf446a1ab3a5e852_1b1005de.txt