[jira] [Commented] (IMPALA-11265) Iceberg tables have a large memory footprint in catalog cache

2024-07-29 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17869374#comment-17869374
 ] 

Gabor Kaszab commented on IMPALA-11265:
---

I did the experiment myself too. For me the 
functional_parquet.iceberg_partitioned table has a size of 2.8-3.1 MB (not 
always the same, for some reason). Could the difference since your measurements 
be caused by an Iceberg version bump?
Anyway, I checked the size of the 
[BaseTable|https://github.com/apache/iceberg/blob/1.3.x/core/src/main/java/org/apache/iceberg/BaseTable.java]
 object, and it seems that the TableOperations object accounts for almost all 
of the memory, while the other members of this class are negligible in size.

{code:java}
if (value instanceof BaseTable) {
  BaseTable bt = (BaseTable) value;
  // Measure the dominant members separately.
  long size1 = SIZEOF.deepSizeOf(bt.operations());
  long size2 = SIZEOF.deepSizeOf(bt.name());
  long size3 = SIZEOF.deepSizeOf(LoggingMetricsReporter.instance());
  // Dummy check so the sizes can be inspected in a debugger.
  if (size1 < 0 || size2 < 0 || size3 < 0) throw new RuntimeException("something");
}
{code}
With the above code snippet: size1=3145000, size2=184, size3=16.
Note that the MetricsReporter is not exposed from BaseTable in Iceberg 1.3, 
only in newer versions, so I simply measured LoggingMetricsReporter, as that is 
what BaseTable uses anyway.

So the next step here is to dig one level deeper and check what is consuming 
that amount of memory in HadoopTableOperations. At first glance it seems that a 
lot of string configs are stored there, but I will keep investigating.
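For that next step, a small reflection helper like the one below could report per-field sizes. This is only a sketch: the size estimator is passed in as a parameter (standing in for the SIZEOF deep-size utility used in the snippet above), and the Sample class is purely illustrative, not a real Impala or Iceberg type.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

public class FieldSizeDump {
    // Report the size of each declared instance field of 'obj', as computed by
    // 'sizer' (a stand-in for a deep-size estimator such as the SIZEOF utility
    // in the snippet above).
    static Map<String, Long> fieldSizes(Object obj, ToLongFunction<Object> sizer)
            throws IllegalAccessException {
        Map<String, Long> sizes = new LinkedHashMap<>();
        for (Field f : obj.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers())) continue;
            f.setAccessible(true);
            sizes.put(f.getName(), sizer.applyAsLong(f.get(obj)));
        }
        return sizes;
    }

    // Illustrative stand-in for the object under investigation.
    static class Sample {
        String conf = "fs.defaultFS=hdfs://localhost:20500";
        int version = 3;
    }
}
```

Running this against the HadoopTableOperations instance (with the real deep-size estimator plugged in) would show which field holds the bulk of the ~3 MB.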

> Iceberg tables have a large memory footprint in catalog cache
> -
>
> Key: IMPALA-11265
> URL: https://issues.apache.org/jira/browse/IMPALA-11265
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Quanlong Huang
>Priority: Major
>  Labels: impala-iceberg
>
> During the investigation of IMPALA-11260, I found the cache item size of a 
> (IcebergApiTableCacheKey, org.apache.iceberg.BaseTable) pair could be 30MB.
> For instance, here are the cache items of the iceberg table 
> {{{}functional_parquet.iceberg_partitioned{}}}:
> {code:java}
> weigh=3792, keyClass=class 
> org.apache.impala.catalog.local.CatalogdMetaProvider$TableCacheKey, 
> valueClass=class 
> org.apache.impala.catalog.local.CatalogdMetaProvider$TableMetaRefImpl
> weigh=14960, keyClass=class 
> org.apache.impala.catalog.local.CatalogdMetaProvider$IcebergMetaCacheKey, 
> valueClass=class org.apache.impala.thrift.TPartialTableInfo
> weigh=30546992, keyClass=class 
> org.apache.impala.catalog.local.CatalogdMetaProvider$IcebergApiTableCacheKey, 
> valueClass=class org.apache.iceberg.BaseTable
> weigh=496, keyClass=class 
> org.apache.impala.catalog.local.CatalogdMetaProvider$ColStatsCacheKey, 
> valueClass=class org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj
> weigh=496, keyClass=class 
> org.apache.impala.catalog.local.CatalogdMetaProvider$ColStatsCacheKey, 
> valueClass=class org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj
> weigh=496, keyClass=class 
> org.apache.impala.catalog.local.CatalogdMetaProvider$ColStatsCacheKey, 
> valueClass=class org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj
> weigh=512, keyClass=class 
> org.apache.impala.catalog.local.CatalogdMetaProvider$ColStatsCacheKey, 
> valueClass=class org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj
> weigh=472, keyClass=class 
> org.apache.impala.catalog.local.CatalogdMetaProvider$PartitionListCacheKey, 
> valueClass=class java.util.ArrayList
> weigh=10328, keyClass=class 
> org.apache.impala.catalog.local.CatalogdMetaProvider$PartitionCacheKey, 
> valueClass=class 
> org.apache.impala.catalog.local.CatalogdMetaProvider$PartitionMetadataImpl{code}
> Note that this table has just 20 rows, yet the total memory footprint is 
> 30MB.
> For a normal partitioned parquet table, the memory footprint is not that 
> large. For instance, here are the cache items for 
> {{{}functional_parquet.alltypes{}}}:
> {code:java}
> weigh=4216, keyClass=class 
> org.apache.impala.catalog.local.CatalogdMetaProvider$TableCacheKey, 
> valueClass=class 
> org.apache.impala.catalog.local.CatalogdMetaProvider$TableMetaRefImpl
> weigh=480, keyClass=class 
> org.apache.impala.catalog.local.CatalogdMetaProvider$ColStatsCacheKey, 
> valueClass=class org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj
> weigh=472, keyClass=class 
> org.apache.impala.catalog.local.CatalogdMetaProvider$ColStatsCacheKey, 
> valueClass=class org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj
> weigh=488, keyClass=class 
> org.apache.impala.catalog.local.CatalogdMetaProvider$ColStatsCacheKey, 
> valueClass=class org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj
> weigh=488, 

[jira] [Commented] (IMPALA-13244) Timestamp partition error in catalogd when insert data into iceberg table

2024-07-19 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867242#comment-17867242
 ] 

Gabor Kaszab commented on IMPALA-13244:
---

OK, so I guess the issue is the order of the partition columns. You defined the 
table columns in this order: 'xxx', 'code', 'updatetime', while the partition 
columns are defined in the opposite order. So when you insert with an explicit 
column list, the columns are given in the table's column order, not in the 
order in which the partition columns were defined.

I'm not sure this is a real issue here.

> Timestamp partition error in catalogd when insert data into iceberg table 
> --
>
> Key: IMPALA-13244
> URL: https://issues.apache.org/jira/browse/IMPALA-13244
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Affects Versions: Impala 4.4.0
> Environment: centos7.9
>Reporter: Pain Sun
>Priority: Major
>
> create table sql like this:
>       
> CREATE TABLE test111.table1 (
>     xxx STRING,
>     code STRING,
>     updatetime TIMESTAMP
> ) PARTITIONED BY spec(
>     month(updatetime),
>     bucket(10, code),
>     bucket(10, xxx)
> ) STORED AS ICEBERG TBLPROPERTIES(
>     'iceberg.catalog' = 'hadoop.catalog',
>     'iceberg.catalog_location' = '/impalatable',
>     'iceberg.table_identifier' = 'middle.table1',
>     'write.metadata.previous-versions-max' = '3',
>     'write.metadata.delete-after-commit.enabled' = 'true',
>     'commit.manifest.min-count-to-merge' = '3',
>     'commit.manifest-merge.enabled' = 'true',
>     'format-version' = '1'
> );
>  
>  
>  
> then insert data into this table like this:
> insert into
>     test111.table1 (
>         xxx,
>         code,
>         updatetime
>     )
> select
>     'm1' as xxx,
>     'c1' as code,
>     '2024-07-17 13:44:01' as updatetime;
> Catalogd error like this :
> E0719 09:50:57.458815 126128 JniUtil.java:183] 
> 964d388b63170b6b:7c6e06c2] Error in Update catalog for 
> test111.table1. Time spent: 6ms
> I0719 09:50:57.459015 126128 jni-util.cc:302] 
> 964d388b63170b6b:7c6e06c2] java.lang.IllegalStateException
>         at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:496)
>         at 
> org.apache.impala.util.IcebergUtil.parseMonthToTransformMonth(IcebergUtil.java:882)
>         at 
> org.apache.impala.util.IcebergUtil.getPartitionValue(IcebergUtil.java:826)
>         at 
> org.apache.impala.util.IcebergUtil.partitionDataFromDataFile(IcebergUtil.java:800)
>         at 
> org.apache.impala.service.IcebergCatalogOpExecutor.createDataFile(IcebergCatalogOpExecutor.java:445)
>         at 
> org.apache.impala.service.IcebergCatalogOpExecutor.appendFiles(IcebergCatalogOpExecutor.java:487)
>         at 
> org.apache.impala.service.IcebergCatalogOpExecutor.execute(IcebergCatalogOpExecutor.java:366)
>         at 
> org.apache.impala.service.CatalogOpExecutor.updateCatalogImpl(CatalogOpExecutor.java:7443)
>         at 
> org.apache.impala.service.CatalogOpExecutor.updateCatalog(CatalogOpExecutor.java:7180)
>         at 
> org.apache.impala.service.JniCatalog.lambda$updateCatalog$15(JniCatalog.java:504)
>         at 
> org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
>         at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
>         at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
>         at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:100)
>         at 
> org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:245)
>         at 
> org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:259)
>         at 
> org.apache.impala.service.JniCatalog.updateCatalog(JniCatalog.java:503)
> I0719 09:50:57.459033 126128 status.cc:129] 
> 964d388b63170b6b:7c6e06c2] IllegalStateException: null
>     @          0x10546b4
>     @          0x1b94d34
>     @          0x10040ab
>     @           0xfa1c27
>     @           0xf61f84
>     @           0xf4acc3
>     @           0xf5278b
>     @          0x14486aa
>     @          0x143b0fa
>     @          0x1c78d39
>     @          0x1c79fd1
>     @          0x256da47
>     @     0x7fabd2eb8ea5
>     @     0x7fabcfe939fd
> E0719 09:50:57.459059 126128 catalog-server.cc:324] 
> 964d388b63170b6b:7c6e06c2] IllegalStateException: null
>  
> but spark insert success.
>  
> versions :
> impala:  4.4.0
> jar in impala:  iceberg-api-1.3.1.7.2.18.0-369.jar
> spark:  3.3.4
> iceberg:  apache 1.3.1
> iceberg-spark jar:  iceberg-spark-runtime-3.3_2.12-1.3.1.jar
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IMPALA-13244) Timestamp partition error in catalogd when insert data into iceberg table

2024-07-19 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867240#comment-17867240
 ] 

Gabor Kaszab commented on IMPALA-13244:
---

Note that if you rewrite the query a bit, it works as expected:
{code:java}
insert into table1  
select
'm1' as xxx,  
'c1' as code, 
'2024-07-17 13:44:01' as updatetime;
{code}

This succeeds and the file created is the following:

{code:java}
select file_path from default.table1.`files`;
hdfs://localhost:20500/test-warehouse/table1/data/updatetime_month=2024-07/code_bucket=9/xxx_bucket=4/ed41924564367199-298e3f73_462647000_data.0.parq
{code}







[jira] [Commented] (IMPALA-13244) Timestamp partition error in catalogd when insert data into iceberg table

2024-07-19 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867239#comment-17867239
 ] 

Gabor Kaszab commented on IMPALA-13244:
---

Thanks for raising this, [~MadBeeDo]! I tried the repro steps and I also see 
this issue.

What seems completely off is the 'updated_partitions' entry that updateCatalog 
received in the TUpdateCatalogRequest:
{code:java}
updatetime_month=4/code_bucket=9/xxx_bucket=2024-07 -> {TUpdatedPartition@7973} 
"TUpdatedPartition(files:[hdfs://localhost:20500/test-warehouse/table1/data/updatetime_month=4/code_bucket=9/xxx_bucket=2024-07/3044d83c3c9b17d3-ca410be0_227283434_data.0.parq])"
{code}
Apparently, none of the partition columns has the desired value.
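For context on why the Preconditions check in IcebergUtil.parseMonthToTransformMonth (see the stack trace below) blows up: Iceberg's month transform represents a month as the number of months since 1970-01, derived from a partition path value like "2024-07". The following is a hypothetical sketch of such a parser, not Impala's actual code; it shows how a shuffled value like "4" landing in the month slot would fail the format check:

```java
public class MonthTransform {
    // Hypothetical sketch: convert a partition path value like "2024-07" into
    // Iceberg's internal month representation (months since 1970-01). A
    // mis-ordered value such as "4" landing in the month slot trips the check.
    static int parseMonthToTransformMonth(String value) {
        String[] parts = value.split("-");
        // Mirrors the Preconditions.checkState seen in the stack trace.
        if (parts.length != 2) {
            throw new IllegalStateException("Unexpected month format: " + value);
        }
        int year = Integer.parseInt(parts[0]);
        int month = Integer.parseInt(parts[1]);
        return (year - 1970) * 12 + (month - 1);
    }
}
```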



[jira] [Commented] (IMPALA-13242) DROP PARTITION can't drop partitions before a partition evolution if the partition transform was changed

2024-07-18 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867002#comment-17867002
 ] 

Gabor Kaszab commented on IMPALA-13242:
---

Actually, we don't need a partition transform change to repro this:
{code:java}
create table part_evol_tbl (i int, j int) partitioned by spec (i) stored as 
iceberg;
insert into part_evol_tbl values (1, 11), (2, 22);
alter table part_evol_tbl set partition spec (j);
insert into part_evol_tbl values (4, 44);

alter table part_evol_tbl drop partition (i=1);
Query: alter table part_evol_tbl drop partition (i=1)
ERROR: AnalysisException: Partition exprs cannot contain non-partition 
column(s): i
{code}

If there is a column that used to be a partition column but isn't anymore, we 
won't be able to drop the partitions involving that column.

> DROP PARTITION can't drop partitions before a partition evolution if the 
> partition transform was changed
> 
>
> Key: IMPALA-13242
> URL: https://issues.apache.org/jira/browse/IMPALA-13242
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 4.4.0
>Reporter: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> Steps to set up the repro table:
> {code:java}
> create table year_part_tbl (i int, d date) partitioned by spec (year(d)) 
> stored as iceberg;
> insert into year_part_tbl values (1, "2024-07-17"), (2, "2024-07-16");
> alter table year_part_tbl set partition spec (month(d));
> insert into year_part_tbl values (3, "2024-07-18");
> {code}
> After the partition evolution we can't drop the partitions with year()
> {code:java}
> alter table year_part_tbl drop partition (year(d)=2024);
> Query: alter table year_part_tbl drop partition (year(d)=2024)
> ERROR: AnalysisException: Can't filter column 'd' with transform type: 'YEAR'
> {code}
> I guess the issue here is that we compare the filter expression against the 
> latest partition spec and there the transform on the column is month() 
> instead of year().




-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Created] (IMPALA-13242) DROP PARTITION can't drop partitions before a partition evolution if the partition transform was changed

2024-07-18 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-13242:
-

 Summary: DROP PARTITION can't drop partitions before a partition 
evolution if the partition transform was changed
 Key: IMPALA-13242
 URL: https://issues.apache.org/jira/browse/IMPALA-13242
 Project: IMPALA
  Issue Type: Bug
  Components: Frontend
Affects Versions: Impala 4.4.0
Reporter: Gabor Kaszab


Steps to set up the repro table:
{code:java}
create table year_part_tbl (i int, d date) partitioned by spec (year(d)) stored 
as iceberg;

insert into year_part_tbl values (1, "2024-07-17"), (2, "2024-07-16");

alter table year_part_tbl set partition spec (month(d));

insert into year_part_tbl values (3, "2024-07-18");
{code}

After the partition evolution we can't drop the partitions with year()
{code:java}
alter table year_part_tbl drop partition (year(d)=2024);
Query: alter table year_part_tbl drop partition (year(d)=2024)
ERROR: AnalysisException: Can't filter column 'd' with transform type: 'YEAR'
{code}

I guess the issue here is that we compare the filter expression against the 
latest partition spec and there the transform on the column is month() instead 
of year().
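One possible direction, sketched here with a made-up data model (not Impala's analyzer types): accept the DROP PARTITION filter if any historical partition spec, not only the latest one, applied that transform to the column.

```java
import java.util.List;
import java.util.Map;

public class PartitionFilterCheck {
    // Illustrative only: each historical partition spec is summarized as a
    // column -> transform-name map. A filter like year(d)=2024 would then be
    // accepted as long as ANY spec ever partitioned by year(d).
    static boolean transformSeenInAnySpec(List<Map<String, String>> specs,
                                          String column, String transform) {
        return specs.stream().anyMatch(s -> transform.equals(s.get(column)));
    }
}
```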






[jira] [Closed] (IMPALA-12388) Strip file/pos information from tuples once they are not needed

2024-07-09 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab closed IMPALA-12388.
-
Fix Version/s: Not Applicable
   Resolution: Won't Fix

I explored some possible implementations for this; the simplest one was where I 
unconditionally set the relevant null indicators to true for the 
position-delete related slots. This adds the least overhead on top of the 
existing logic in terms of performance.

I then ran perf verification on both TPC-DS and TPC-H, but apparently for some 
queries this brings an actual perf degradation. In the worst case (a 
select-only query) it results in a 5% increase in runtime. There were some 
queries where I observed improvements of around 2-3%, but the overall results 
weren't convincing enough for me to proceed.

Closing this as Won't Fix, as the initial results aren't good enough to 
proceed.
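The NULL-out variant described above can be pictured with a minimal sketch. Impala tuples do carry one null-indicator bit per nullable slot, but the bit positions here are illustrative, not Impala's actual tuple layout:

```java
public class NullOut {
    // Setting a slot's null-indicator bit marks the slot NULL without
    // rewriting or re-materializing the tuple, which is why this variant adds
    // so little overhead on top of the existing logic.
    static byte markSlotNull(byte nullIndicators, int slotBit) {
        return (byte) (nullIndicators | (1 << slotBit));
    }
}
```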

> Strip file/pos information from tuples once they are not needed
> ---
>
> Key: IMPALA-12388
> URL: https://issues.apache.org/jira/browse/IMPALA-12388
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend, Frontend
>Reporter: Zoltán Borók-Nagy
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: Performance, impala-iceberg, performance
> Fix For: Not Applicable
>
>
> When Impala processes Iceberg V2 tables that have position delete files it 
> needs to add extra slots to the input tuples (required by the ANTI JOIN 
> between data files and delete files):
>  * STRING file path
>  * BIGINT position
> This makes the row-size larger by 20 bytes. Please note that this 20 bytes is 
> only the increase in the tuple memory (12 byte STRING slot plus 8 byte BIGINT 
> slot), the file path actually points to a potentially large string (100-200 
> bytes) stored in a heap buffer.
> In the plan fragments of the SCANs we only create a string object per file 
> for the file path (and set it in the template tuple), so the situation is not 
> that bad, but once we send the rows over the network the STRINGs are getting 
> duplicated per record, which can add substantial network and serialization 
> overhead.
> One way to resolve this is to re-materialize the tuples after the Iceberg V2 
> scan is done, and only store the interesting slots. This mechanism also saves 
> us the 20 bytes per tuple overhead, but the re-materialization cost can be 
> high.
> Another, easier solution is to just NULL-out the file path and position slots 
> once they are not needed anymore.
> Of course if the user SELECTs the virtual column {{INPUT_FILE_NAME / 
> FILE_POSITION}} we cannot re-materialize / NULL out.
> Given the following plan:
> {noformat}
>               UNION ALL
>               /       \
>  SCAN                V2 ANTI JOIN
>  data files          /          \
>  without          SCAN          SCAN
>  deletes          data files    delete files
>                   with deletes
> {noformat}
> In the "SCAN  data files without deletes" we shouldn't even fill the file 
> path / position slots. The latter also saves some computational cost.
> In our V2 ANTI JOIN operator (IcebergDeleteNode) we can NULL out the file 
> path / pos slots once the data records are processed.






[jira] [Assigned] (IMPALA-12388) Strip file/pos information from tuples once they are not needed

2024-07-09 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab reassigned IMPALA-12388:
-

Assignee: Gabor Kaszab







[jira] [Assigned] (IMPALA-11752) Handle s3:// paths in Iceberg tables

2024-07-09 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab reassigned IMPALA-11752:
-

Assignee: Gabor Kaszab

> Handle s3:// paths in Iceberg tables
> 
>
> Key: IMPALA-11752
> URL: https://issues.apache.org/jira/browse/IMPALA-11752
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend, Frontend
>Reporter: Zoltán Borók-Nagy
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> Components using 
> [S3FileIO|https://iceberg.apache.org/docs/latest/aws/#s3-fileio] might write 
> out file paths starting with 's3://' instead of 's3a://'. The latter is used 
> by 
> [HadoopFileIO|https://iceberg.apache.org/docs/latest/aws/#hadoop-s3a-filesystem]
>  that Impala is using.
> By default, HadoopFileIO doesn't interpret paths starting with 's3://'. 
> (Probably this could be resolved by setting "fs.s3.impl" to 
> "org.apache.hadoop.fs.s3a.S3AFileSystem" so that an s3a fs instance is 
> created)
> [FeIcebergTable.Utils.FeIcebergTable()|https://github.com/apache/impala/blob/2733d039ad4a830a1ea34c1a75d2b666788e39a9/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java#L671-L689]
>  depends on the file paths returned by recursive file listing matching the 
> file paths in Iceberg metadata files. But the recursive listing returns 
> s3a:// paths, while the metadata contains s3:// paths, which means we'll load 
> files one-by-one as we won't find them in the hash map 'hdfsFileDescMap'.
> Moreover, position delete file processing is also based on exact matches of 
> the file URIs, therefore delete entries with s3:// paths won't have the 
> desired effect.
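One cheap mitigation sketch (a hypothetical helper, not Impala code): normalize the scheme before the hash-map lookup so that paths from Iceberg metadata and paths from the recursive listing compare equal.

```java
public class SchemePath {
    // Hypothetical normalization: rewrite s3:// URIs to s3a:// so that paths
    // written by S3FileIO (in Iceberg metadata) match the paths returned by
    // recursive file listing before the 'hdfsFileDescMap' lookup.
    static String normalizeScheme(String path) {
        if (path.startsWith("s3://")) {
            return "s3a://" + path.substring("s3://".length());
        }
        return path;
    }
}
```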






[jira] [Commented] (IMPALA-12190) Renaming table will cause losing privileges for non-admin users

2024-05-22 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848515#comment-17848515
 ] 

Gabor Kaszab commented on IMPALA-12190:
---

I don't think this can be trivially implemented from the Impala side. I recall 
we also opened a Ranger ticket after analyzing this issue and agreed that 
Ranger should first provide an API that clients can use when resources are 
renamed.

> Renaming table will cause losing privileges for non-admin users
> ---
>
> Key: IMPALA-12190
> URL: https://issues.apache.org/jira/browse/IMPALA-12190
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Gabor Kaszab
>Assignee: Sai Hemanth Gantasala
>Priority: Critical
>  Labels: alter-table, authorization, ranger
>
> Let's say user 'a' gets some privileges on table 't'. When this table gets 
> renamed (even by user 'a'), user 'a' loses its privileges on that table.
>  
> Repro steps:
>  # Start impala with Ranger
>  # start impala-shell as admin (-u admin)
>  # create table tmp (i int, s string) stored as parquet;
>  # grant all on table tmp to user ;
>  # show grant user  on table tmp;
> {code:java}
> Query: show grant user  on table tmp
> +----------------+----------------+----------+-------+--------+-----+--------------+-------------+-----+-----------+--------------+-------------+
> | principal_type | principal_name | database | table | column | uri | storage_type | storage_uri | udf | privilege | grant_option | create_time |
> +----------------+----------------+----------+-------+--------+-----+--------------+-------------+-----+-----------+--------------+-------------+
> | USER           |                | default  | tmp   | *      |     |              |             |     | all       | false        | NULL        |
> +----------------+----------------+----------+-------+--------+-----+--------------+-------------+-----+-----------+--------------+-------------+
> Fetched 1 row(s) in 0.01s {code}
>  #  alter table tmp rename to tmp_1234;
>  # show grant user  on table tmp_1234;
> {code:java}
> Query: show grant user  on table tmp_1234
> Fetched 0 row(s) in 0.17s{code}






[jira] [Reopened] (IMPALA-13067) Some regex make the tests unconditionally pass

2024-05-21 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab reopened IMPALA-13067:
---

Accidentally closed this one.

> Some regex make the tests unconditionally pass
> --
>
> Key: IMPALA-13067
> URL: https://issues.apache.org/jira/browse/IMPALA-13067
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Reporter: Gabor Kaszab
>Priority: Major
>  Labels: test-framework
> Fix For: Impala 4.5.0
>
>
> This issue came out in the Iceberg metadata table tests where this regex was 
> used:
> [1-9]\d*|0
>  
> The "|0" part for some reason made the test framework confused and then 
> regardless of what you provide as an expected result the tests passed. One 
> workaround was to put the regex expression between parentheses. Or simply use 
> "d+". https://issues.apache.org/jira/browse/IMPALA-13055 applied this second 
> workaround on the tests.
> Some analysis would be great why this is the behavior of the test framework, 
> and if it's indeed the issue of the framnework, we should fix it.
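The pitfall can be reproduced in isolation. This is only a sketch of how an ungrouped alternation behaves when embedded into a larger pattern; the test framework's actual matching code may differ:

```java
import java.util.regex.Pattern;

public class RegexAlternationDemo {
    public static void main(String[] args) {
        // Embed the expected cell value between two literal neighbours, as a
        // test framework might when building a full-row pattern (hypothetical
        // layout, not Impala's actual code).
        String ungrouped = "a," + "[1-9]\\d*|0" + ",b";   // a,[1-9]\d*|0,b
        String grouped   = "a,(" + "[1-9]\\d*|0" + "),b"; // a,([1-9]\d*|0),b

        // Alternation has the lowest precedence, so the ungrouped pattern is
        // really (a,[1-9]\d*)|(0,b): "0,b" alone matches the whole thing.
        System.out.println(Pattern.matches(ungrouped, "0,b"));    // true
        System.out.println(Pattern.matches(grouped, "0,b"));      // false
        System.out.println(Pattern.matches(grouped, "a,42,b"));   // true
    }
}
```

Parenthesizing the alternation (or using \d+, which has no alternation) restores the intended per-cell match.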






[jira] [Closed] (IMPALA-13055) Some Iceberg metadata table tests doesn't assert

2024-05-21 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab closed IMPALA-13055.
-
Fix Version/s: Impala 4.5.0
   Resolution: Fixed

> Some Iceberg metadata table tests doesn't assert
> 
>
> Key: IMPALA-13055
> URL: https://issues.apache.org/jira/browse/IMPALA-13055
> Project: IMPALA
>  Issue Type: Test
>Reporter: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
> Fix For: Impala 4.5.0
>
>
> Some tests in the Iceberg metadata table suite use the following regex to 
> verify numbers in the output: [1-9]\d*|0
> However, if this format is given, the test unconditionally passes. One could 
> put the expression within parentheses, or simply verify with \d+.






[jira] [Closed] (IMPALA-13067) Some regex make the tests unconditionally pass

2024-05-21 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab closed IMPALA-13067.
-
Fix Version/s: Impala 4.5.0
   Resolution: Fixed

> Some regex make the tests unconditionally pass
> --
>
> Key: IMPALA-13067
> URL: https://issues.apache.org/jira/browse/IMPALA-13067
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Reporter: Gabor Kaszab
>Priority: Major
>  Labels: test-framework
> Fix For: Impala 4.5.0
>
>
> This issue came out in the Iceberg metadata table tests where this regex was 
> used:
> [1-9]\d*|0
>  
> The "|0" part for some reason made the test framework confused and then 
> regardless of what you provide as an expected result the tests passed. One 
> workaround was to put the regex expression between parentheses. Or simply use 
> "d+". https://issues.apache.org/jira/browse/IMPALA-13055 applied this second 
> workaround on the tests.
> Some analysis would be great why this is the behavior of the test framework, 
> and if it's indeed the issue of the framnework, we should fix it.






[jira] [Commented] (IMPALA-12266) Sporadic failure after migrating a table to Iceberg

2024-05-17 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847169#comment-17847169
 ] 

Gabor Kaszab commented on IMPALA-12266:
---

[~laszlog] I see you increased the priority of this. Note, there is another 
Jira for the root cause: https://issues.apache.org/jira/browse/IMPALA-12712 If 
that's fixed this would be gone too.

> Sporadic failure after migrating a table to Iceberg
> ---
>
> Key: IMPALA-12266
> URL: https://issues.apache.org/jira/browse/IMPALA-12266
> Project: IMPALA
>  Issue Type: Bug
>  Components: fe
>Affects Versions: Impala 4.2.0
>Reporter: Tamas Mate
>Assignee: Gabor Kaszab
>Priority: Critical
>  Labels: impala-iceberg
> Attachments: 
> catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, 
> impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1
>
>
> TestIcebergTable.test_convert_table test failed in a recent verify job's 
> dockerised tests:
> https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629
> {code:none}
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> EINNER EXCEPTION: 
> EMESSAGE: AnalysisException: Failed to load metadata for table: 
> 'parquet_nopartitioned'
> E   CAUSED BY: TableLoadingException: Could not load table 
> test_convert_table_cdba7383.parquet_nopartitioned from catalog
> E   CAUSED BY: TException: 
> TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, 
> error_msgs:[NullPointerException: null]), lookup_status:OK)
> {code}
> {code:none}
> E0704 19:09:22.980131   833 JniUtil.java:183] 
> 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of 
> TABLE:test_convert_table_cdba7383.parquet_nopartitioned. Time spent: 49ms
> I0704 19:09:22.980309   833 jni-util.cc:288] 
> 7145c21173f2c47b:2579db55] java.lang.NullPointerException
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480)
>   at 
> org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397)
>   at 
> org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
>   at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109)
>   at 
> org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238)
>   at 
> org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396)
> I0704 19:09:22.980324   833 status.cc:129] 7145c21173f2c47b:2579db55] 
> NullPointerException: null
> @  0x1012f9f  impala::Status::Status()
> @  0x187f964  impala::JniUtil::GetJniExceptionMsg()
> @   0xfee920  impala::JniCall::Call<>()
> @   0xfccd0f  impala::Catalog::GetPartialCatalogObject()
> @   0xfb55a5  
> impala::CatalogServiceThriftIf::GetPartialCatalogObject()
> @   0xf7a691  
> impala::CatalogServiceProcessorT<>::process_GetPartialCatalogObject()
> @   0xf82151  impala::CatalogServiceProcessorT<>::dispatchCall()
> @   0xee330f  apache::thrift::TDispatchProcessor::process()
> @  0x1329246  
> apache::thrift::server::TAcceptQueueServer::Task::run()
> @  0x1315a89  impala::ThriftThread::RunRunnable()
> @  0x131773d  
> boost::detail::function::void_function_obj_invoker0<>::invoke()
> @  0x195ba8c  impala::Thread::SuperviseThread()
> @  0x195c895  boost::detail::thread_data<>::run()
> @  0x23a03a7  thread_proxy
> @ 0x7faaad2a66ba  start_thread
> @ 0x7f2c151d  clone
> E0704 19:09:23.006968   833 catalog-server.cc:278] 
> 7145c21173f2c47b:2579db55] NullPointerException: null
> {code}






[jira] [Created] (IMPALA-13067) Some regex make the tests unconditionally pass

2024-05-09 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-13067:
-

 Summary: Some regex make the tests unconditionally pass
 Key: IMPALA-13067
 URL: https://issues.apache.org/jira/browse/IMPALA-13067
 Project: IMPALA
  Issue Type: Bug
  Components: Infrastructure
Reporter: Gabor Kaszab


This issue came out in the Iceberg metadata table tests where this regex was 
used:

[1-9]\d*|0

 

The "|0" part for some reason made the test framework confused and then 
regardless of what you provide as an expected result the tests passed. One 
workaround was to put the regex expression between parentheses. Or simply use 
"d+". https://issues.apache.org/jira/browse/IMPALA-13055 applied this second 
workaround on the tests.

Some analysis would be great why this is the behavior of the test framework, 
and if it's indeed the issue of the framnework, we should fix it.






[jira] [Created] (IMPALA-13055) Some Iceberg metadata table tests doesn't assert

2024-05-03 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-13055:
-

 Summary: Some Iceberg metadata table tests doesn't assert
 Key: IMPALA-13055
 URL: https://issues.apache.org/jira/browse/IMPALA-13055
 Project: IMPALA
  Issue Type: Test
Reporter: Gabor Kaszab


Some tests in the Iceberg metadata table suite use the following regex to verify 
numbers in the output: [1-9]\d*|0

However, if this format is given, the test unconditionally passes. One could put 
the expression within parentheses, or simply verify with \d+.






[jira] [Updated] (IMPALA-13055) Some Iceberg metadata table tests doesn't assert

2024-05-03 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-13055:
--
Labels: impala-iceberg  (was: )

> Some Iceberg metadata table tests doesn't assert
> 
>
> Key: IMPALA-13055
> URL: https://issues.apache.org/jira/browse/IMPALA-13055
> Project: IMPALA
>  Issue Type: Test
>Reporter: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> Some tests in the Iceberg metadata table suite use the following regex to 
> verify numbers in the output: [1-9]\d*|0
> However, if this format is given, the test unconditionally passes. One could 
> put the expression within parentheses, or simply verify with \d+.






[jira] [Work started] (IMPALA-13029) Add test for equality deletes with different file format

2024-05-03 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-13029 started by Gabor Kaszab.
-
> Add test for equality deletes with different file format
> 
>
> Key: IMPALA-13029
> URL: https://issues.apache.org/jira/browse/IMPALA-13029
> Project: IMPALA
>  Issue Type: Test
>  Components: Frontend
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> We should test equality deletes in Parquet, ORC and Avro, similarly to the 
> tests we have for position delete file formats.






[jira] [Created] (IMPALA-13029) Add test for equality deletes with different file format

2024-04-23 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-13029:
-

 Summary: Add test for equality deletes with different file format
 Key: IMPALA-13029
 URL: https://issues.apache.org/jira/browse/IMPALA-13029
 Project: IMPALA
  Issue Type: Test
  Components: Frontend
Reporter: Gabor Kaszab


We should test equality deletes in Parquet, ORC and Avro, similarly to the tests 
we have for position delete file formats.






[jira] [Resolved] (IMPALA-12970) Test failure at test_read_equality_deletes in test_iceberg in exhaustive build

2024-04-11 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab resolved IMPALA-12970.
---
Fix Version/s: Impala 4.4.0
   Resolution: Fixed

> Test failure at test_read_equality_deletes in test_iceberg in exhaustive build
> --
>
> Key: IMPALA-12970
> URL: https://issues.apache.org/jira/browse/IMPALA-12970
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Yida Wu
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: broken-build
> Fix For: Impala 4.4.0
>
>
> An error is observed in the data-cache exhaustive build in 
> test_read_equality_deletes with following message:
> {code:java}
> query_test.test_iceberg.TestIcebergV2Table.test_read_equality_deletes[protocol:
>  beeswax | table_format: parquet/none | exec_option: {'test_replan': 1, 
> 'disable_optimized_iceberg_v2_read': 1, 'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0}] (from pytest)
> {code}
> *Error Message*
> {code:java}
> query_test/test_iceberg.py:1456: in test_read_equality_deletes 
> self.run_test_case('QueryTest/iceberg-v2-read-equality-deletes', vector) 
> common/impala_test_suite.py:725: in run_test_case result = exec_fn(query, 
> user=test_section.get('USER', '').strip() or None) 
> common/impala_test_suite.py:660: in __exec_in_impala result = 
> self.__execute_query(target_impalad_client, query, user=user) 
> common/impala_test_suite.py:1013: in __execute_query return 
> impalad_client.execute(query, user=user) common/impala_connection.py:215: in 
> execute fetch_profile_after_close=fetch_profile_after_close) 
> beeswax/impala_beeswax.py:191: in execute handle = 
> self.__execute_query(query_string.strip(), user=user) 
> beeswax/impala_beeswax.py:382: in __execute_query handle = 
> self.execute_query_async(query_string, user=user) 
> beeswax/impala_beeswax.py:376: in execute_query_async handle = 
> self.__do_rpc(lambda: self.imp_service.query(query,)) 
> beeswax/impala_beeswax.py:539: in __do_rpc raise 
> ImpalaBeeswaxException(self.__build_error_message(b), b) E   
> ImpalaBeeswaxException: ImpalaBeeswaxException: EINNER EXCEPTION: <class 'beeswaxd.ttypes.BeeswaxException'> EMESSAGE: 
> ConcurrentModificationException: null
> {code}
> *Stacktrace*
> {code:java}
> query_test/test_iceberg.py:1456: in test_read_equality_deletes
> self.run_test_case('QueryTest/iceberg-v2-read-equality-deletes', vector)
> common/impala_test_suite.py:725: in run_test_case
> result = exec_fn(query, user=test_section.get('USER', '').strip() or None)
> common/impala_test_suite.py:660: in __exec_in_impala
> result = self.__execute_query(target_impalad_client, query, user=user)
> common/impala_test_suite.py:1013: in __execute_query
> return impalad_client.execute(query, user=user)
> common/impala_connection.py:215: in execute
> fetch_profile_after_close=fetch_profile_after_close)
> beeswax/impala_beeswax.py:191: in execute
> handle = self.__execute_query(query_string.strip(), user=user)
> beeswax/impala_beeswax.py:382: in __execute_query
> handle = self.execute_query_async(query_string, user=user)
> beeswax/impala_beeswax.py:376: in execute_query_async
> handle = self.__do_rpc(lambda: self.imp_service.query(query,))
> beeswax/impala_beeswax.py:539: in __do_rpc
> raise ImpalaBeeswaxException(self.__build_error_message(b), b)
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> EINNER EXCEPTION: 
> EMESSAGE: ConcurrentModificationException: null
> {code}
> *Standard Error*
> {code:java}
> SET 
> client_identifier=query_test/test_iceberg.py::TestIcebergV2Table::()::test_read_equality_deletes[protocol:beeswax|table_format:parquet/none|exec_option:{'test_replan':1;'disable_optimized_iceberg_v2_read':1;'batch_size':0;'num_nodes':0;'disable_codegen_rows_threshold':0;'d;
> -- connecting to: localhost:21000
> -- 2024-04-03 07:04:53,469 INFO MainThread: Could not connect to ('::1', 
> 21000, 0, 0)
> Traceback (most recent call last):
>   File 
> "/data/jenkins/workspace/impala-asf-master-exhaustive-data-cache/repos/Impala/infra/python/env-gcc10.4.0/lib/python2.7/site-packages/thrift/transport/TSocket.py",
>  line 137, in open
> handle.connect(sockaddr)
>   File 
> "/data/jenkins/workspace/impala-asf-master-exhaustive-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py",
>  line 228, in meth
> return getattr(self._sock,name)(*args)
> error: [Errno 111] Connection refused
> -- connecting to localhost:21050 with impyla
> -- 2024-04-03 07:04:53,469 INFO MainThread: Could not connect to ('::1', 
> 

[jira] [Assigned] (IMPALA-8809) Refresh a subset of partitions for ACID tables

2024-04-09 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab reassigned IMPALA-8809:


Assignee: (was: Gabor Kaszab)

> Refresh a subset of partitions for ACID tables
> --
>
> Key: IMPALA-8809
> URL: https://issues.apache.org/jira/browse/IMPALA-8809
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 3.3.0
>Reporter: Gabor Kaszab
>Priority: Critical
>  Labels: impala-acid
>
> Enhancing REFRESH logic to handle ACID tables was covered by this change: 
> https://issues.apache.org/jira/browse/IMPALA-8600
> Basically, each user-initiated REFRESH PARTITION is rejected, while the 
> REFRESH_PARTITION events in the event processor actually do a full table 
> load for ACID tables.
> There is room for improvement: when a full table refresh is being executed on 
> an ACID table, we can have 2 scenarios:
> - If there were some schema changes, then reload the full table. Identifying 
> such a scenario should be possible by checking the table-level writeId. 
> However, there is a bug in Hive where it doesn't update that field for 
> partitioned tables (https://issues.apache.org/jira/browse/HIVE-22062). This 
> would be the desired way, but it could also be worked around by checking 
> other fields like lastDdlChanged.
> - If a full table refresh is not needed, then we should fetch the 
> partition-level writeIds and reload only the ones that are out-of-date 
> locally.
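The partition-level scenario could be sketched as below. The helper name (staleParts) and the writeId maps are illustrative, assuming partition-level writeIds can be fetched; this is not Impala's actual API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StalePartitionFinder {
    // Hypothetical helper: compare locally cached partition-level writeIds
    // against the current ones from the metastore and collect only the
    // partitions whose local copy is out-of-date (or missing).
    static List<String> staleParts(Map<String, Long> local, Map<String, Long> remote) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> e : remote.entrySet()) {
            Long cached = local.get(e.getKey());
            if (cached == null || cached < e.getValue()) stale.add(e.getKey());
        }
        return stale;
    }

    public static void main(String[] args) {
        Map<String, Long> local = new LinkedHashMap<>();
        local.put("p1", 5L);
        local.put("p2", 7L);
        Map<String, Long> remote = new LinkedHashMap<>();
        remote.put("p1", 6L);  // advanced remotely -> reload
        remote.put("p2", 7L);  // unchanged -> keep
        remote.put("p3", 1L);  // new partition -> load
        System.out.println(staleParts(local, remote)); // [p1, p3]
    }
}
```

Only the partitions returned here would be reloaded, instead of the full table.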






[jira] [Work stopped] (IMPALA-8809) Refresh a subset of partitions for ACID tables

2024-04-09 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-8809 stopped by Gabor Kaszab.

> Refresh a subset of partitions for ACID tables
> --
>
> Key: IMPALA-8809
> URL: https://issues.apache.org/jira/browse/IMPALA-8809
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Affects Versions: Impala 3.3.0
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Critical
>  Labels: impala-acid
>
> Enhancing REFRESH logic to handle ACID tables was covered by this change: 
> https://issues.apache.org/jira/browse/IMPALA-8600
> Basically, each user-initiated REFRESH PARTITION is rejected, while the 
> REFRESH_PARTITION events in the event processor actually do a full table 
> load for ACID tables.
> There is room for improvement: when a full table refresh is being executed on 
> an ACID table, we can have 2 scenarios:
> - If there were some schema changes, then reload the full table. Identifying 
> such a scenario should be possible by checking the table-level writeId. 
> However, there is a bug in Hive where it doesn't update that field for 
> partitioned tables (https://issues.apache.org/jira/browse/HIVE-22062). This 
> would be the desired way, but it could also be worked around by checking 
> other fields like lastDdlChanged.
> - If a full table refresh is not needed, then we should fetch the 
> partition-level writeIds and reload only the ones that are out-of-date 
> locally.






[jira] [Resolved] (IMPALA-12729) Allow creating primary keys for Iceberg tables

2024-04-09 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab resolved IMPALA-12729.
---
Fix Version/s: Impala 4.4.0
   Resolution: Fixed

> Allow creating primary keys for Iceberg tables
> --
>
> Key: IMPALA-12729
> URL: https://issues.apache.org/jira/browse/IMPALA-12729
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
> Fix For: Impala 4.4.0
>
>
> Some writer engines require primary keys on a table so that they can use them 
> for writing equality deletes (only the PK cols are written to the eq-delete 
> files).
> Impala currently doesn't reject setting PKs for Iceberg tables; however, it 
> seems to ignore them. This succeeds:
> {code:java}
> create table ice_pk (i int, j int, primary key(i)) stored as iceberg;
> {code}
> However, DESCRIBE EXTENDED doesn't show 'identifier-field-ids' in the 
> 'current-schema'.
> On the other hand for a table created by Flink these fields are there:
> {code:java}
> current-schema                                     | 
> {\"type\":\"struct\",\"schema-id\":0,\"identifier-field-ids\":[1],\"fields\":[{\"id\":1,\"name\":\"i\",\"required\":true,\"type\":\"int\"},{\"id\":2,\"name\":\"s\",\"required\":false,\"type\":\"string\"}]}
>  {code}
> Part2:
> SHOW CREATE TABLE should also correctly print the primary key part of the 
> field list.






[jira] [Resolved] (IMPALA-11387) Add virtual column ICEBERG__SEQUENCE__NUMBER

2024-04-09 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-11387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab resolved IMPALA-11387.
---
Fix Version/s: Impala 4.3.0
   Resolution: Fixed

> Add virtual column ICEBERG__SEQUENCE__NUMBER
> 
>
> Key: IMPALA-11387
> URL: https://issues.apache.org/jira/browse/IMPALA-11387
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend, Frontend
>Reporter: Zoltán Borók-Nagy
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
> Fix For: Impala 4.3.0
>
>
> A virtual column ICEBERG__SEQUENCE__NUMBER is needed to handle row-level 
> updates.
> See details at:
>  https://iceberg.apache.org/spec/#scan-planning
> This could be written in the template tuple, similarly to INPUT__FILE__NAME.






[jira] [Resolved] (IMPALA-12694) Test equality delete support with data from NiFi

2024-04-09 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab resolved IMPALA-12694.
---
Fix Version/s: Not Applicable
   Resolution: Fixed

> Test equality delete support with data from NiFi
> 
>
> Key: IMPALA-12694
> URL: https://issues.apache.org/jira/browse/IMPALA-12694
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend, Frontend
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
> Fix For: Not Applicable
>
>
> Iceberg equality delete support in Impala is a subset of what the Iceberg 
> spec allows for equality deletes. Currently, we have sufficient 
> implementation to use eq-deletes created by Flink. As a next step, let's 
> examine if this implementation is sufficient for eq-deletes created by NiFi.
> In theory, NiFi uses Flink's eq-delete implementation so Impala should be 
> fine reading such data. However, at least some manual tests are needed for 
> verification, and if it turns out that there are some uncovered edge cases, 
> we should fill these holes in the implementation (probably in separate Jiras).






[jira] [Resolved] (IMPALA-12600) Support equality deletes when table has partition or schema evolution

2024-04-09 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab resolved IMPALA-12600.
---
Fix Version/s: Impala 4.4.0
   Resolution: Fixed

> Support equality deletes when table has partition or schema evolution
> -
>
> Key: IMPALA-12600
> URL: https://issues.apache.org/jira/browse/IMPALA-12600
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
> Fix For: Impala 4.4.0
>
>
> With the addition of basic equality delete read support, we reject queries for 
> Iceberg tables that have equality delete files and partition or schema 
> evolution. This ticket is to enhance that functionality.






[jira] [Commented] (IMPALA-12970) Test failure at test_read_equality_deletes in test_iceberg in exhaustive build

2024-04-08 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834852#comment-17834852
 ] 

Gabor Kaszab commented on IMPALA-12970:
---

I kept running test_read_position_deletes locally and once in a while I ran 
into this ConcurrentModificationException. So far I have seen 2 different stack 
traces for the error. Both times, test_read_position_deletes_orc failed:

 
{code:java}
select * from functional_parquet.iceberg_v2_partitioned_position_deletes_orc a, 
     functional_parquet.iceberg_partitioned_orc_external b where a.action = 
b.action and b.id=3;
at java.util.ArrayList.sort(ArrayList.java:1466)
at java.util.Collections.sort(Collections.java:143)
at org.apache.impala.planner.IcebergScanNode.<init>(IcebergScanNode.java:105)
at org.apache.impala.planner.IcebergScanNode.<init>(IcebergScanNode.java:86)
at 
org.apache.impala.planner.IcebergScanPlanner.createIcebergScanPlanImpl(IcebergScanPlanner.java:199)
at 
org.apache.impala.planner.IcebergScanPlanner.createIcebergScanPlan(IcebergScanPlanner.java:157)
at 
org.apache.impala.planner.SingleNodePlanner.createScanNode(SingleNodePlanner.java:1884)
 
{code}
 
{code:java}
SELECT action, count(*) from iceberg_v2_partitioned_position_deletes_orc
group by action;
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:911)
at java.util.ArrayList$Itr.next(ArrayList.java:861)
at 
org.apache.impala.planner.HdfsScanNode.computeScanRangeLocations(HdfsScanNode.java:1281)
at org.apache.impala.planner.HdfsScanNode.init(HdfsScanNode.java:447)
at 
org.apache.impala.planner.IcebergScanPlanner.createPositionJoinNode(IcebergScanPlanner.java:259)
at 
org.apache.impala.planner.IcebergScanPlanner.createIcebergScanPlanImpl(IcebergScanPlanner.java:205)
at 
org.apache.impala.planner.IcebergScanPlanner.createIcebergScanPlan(IcebergScanPlanner.java:157)
at 
org.apache.impala.planner.SingleNodePlanner.createScanNode(SingleNodePlanner.java:1884)
{code}
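Both traces fail inside ArrayList's comodification check. A minimal standalone illustration of that failure mode (not Impala's planner code; the list name is made up) is:

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

public class ComodificationDemo {
    // Structurally modifying an ArrayList while iterating it trips the
    // iterator's modCount check and throws ConcurrentModificationException,
    // just like the sort/iteration in the stack traces above.
    static boolean triggersCme() {
        List<Integer> scanRanges = new ArrayList<>(List.of(3, 1, 2));
        try {
            for (Integer r : scanRanges) {
                scanRanges.add(r); // mutation during iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(triggersCme()); // true
    }
}
```

In the planner case, the mutation presumably comes from another code path touching the same file-descriptor list while the scan node iterates or sorts it.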
 

> Test failure at test_read_equality_deletes in test_iceberg in exhaustive build
> --
>
> Key: IMPALA-12970
> URL: https://issues.apache.org/jira/browse/IMPALA-12970
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Yida Wu
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: broken-build
>
> An error is observed in the data-cache exhaustive build in 
> test_read_equality_deletes with following message:
> {code:java}
> query_test.test_iceberg.TestIcebergV2Table.test_read_equality_deletes[protocol:
>  beeswax | table_format: parquet/none | exec_option: {'test_replan': 1, 
> 'disable_optimized_iceberg_v2_read': 1, 'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0}] (from pytest)
> {code}
> *Error Message*
> {code:java}
> query_test/test_iceberg.py:1456: in test_read_equality_deletes 
> self.run_test_case('QueryTest/iceberg-v2-read-equality-deletes', vector) 
> common/impala_test_suite.py:725: in run_test_case result = exec_fn(query, 
> user=test_section.get('USER', '').strip() or None) 
> common/impala_test_suite.py:660: in __exec_in_impala result = 
> self.__execute_query(target_impalad_client, query, user=user) 
> common/impala_test_suite.py:1013: in __execute_query return 
> impalad_client.execute(query, user=user) common/impala_connection.py:215: in 
> execute fetch_profile_after_close=fetch_profile_after_close) 
> beeswax/impala_beeswax.py:191: in execute handle = 
> self.__execute_query(query_string.strip(), user=user) 
> beeswax/impala_beeswax.py:382: in __execute_query handle = 
> self.execute_query_async(query_string, user=user) 
> beeswax/impala_beeswax.py:376: in execute_query_async handle = 
> self.__do_rpc(lambda: self.imp_service.query(query,)) 
> beeswax/impala_beeswax.py:539: in __do_rpc raise 
> ImpalaBeeswaxException(self.__build_error_message(b), b) E   
> ImpalaBeeswaxException: ImpalaBeeswaxException: EINNER EXCEPTION:  'beeswaxd.ttypes.BeeswaxException'> EMESSAGE: 
> ConcurrentModificationException: null
> {code}
> *Stacktrace*
> {code:java}
> query_test/test_iceberg.py:1456: in test_read_equality_deletes
> self.run_test_case('QueryTest/iceberg-v2-read-equality-deletes', vector)
> common/impala_test_suite.py:725: in run_test_case
> result = exec_fn(query, user=test_section.get('USER', '').strip() or None)
> common/impala_test_suite.py:660: in __exec_in_impala
> result = self.__execute_query(target_impalad_client, query, user=user)
> common/impala_test_suite.py:1013: in __execute_query
> return impalad_client.execute(query, user=user)
> common/impala_connection.py:215: in execute
> fetch_profile_after_close=fetch_profile_after_close)
> 

[jira] [Commented] (IMPALA-12970) Test failure at test_read_equality_deletes in test_iceberg in exhaustive build

2024-04-05 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834246#comment-17834246
 ] 

Gabor Kaszab commented on IMPALA-12970:
---

Hey [~baggio000],

I don't think this is related to the equality-delete tests; I occasionally get 
this error when running other Iceberg tests as well, such as test_read_positional_deletes.

> *Stacktrace*
> {code:java}
> query_test/test_iceberg.py:1456: in test_read_equality_deletes
> self.run_test_case('QueryTest/iceberg-v2-read-equality-deletes', vector)
> common/impala_test_suite.py:725: in run_test_case
> result = exec_fn(query, user=test_section.get('USER', '').strip() or None)
> common/impala_test_suite.py:660: in __exec_in_impala
> result = self.__execute_query(target_impalad_client, query, user=user)
> common/impala_test_suite.py:1013: in __execute_query
> return impalad_client.execute(query, user=user)
> common/impala_connection.py:215: in execute
> fetch_profile_after_close=fetch_profile_after_close)
> beeswax/impala_beeswax.py:191: in execute
> handle = self.__execute_query(query_string.strip(), user=user)
> beeswax/impala_beeswax.py:382: in __execute_query
> handle = self.execute_query_async(query_string, user=user)
> beeswax/impala_beeswax.py:376: in execute_query_async
> handle = self.__do_rpc(lambda: self.imp_service.query(query,))
> beeswax/impala_beeswax.py:539: in __do_rpc
> raise ImpalaBeeswaxException(self.__build_error_message(b), b)
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> EINNER EXCEPTION: 
> EMESSAGE: ConcurrentModificationException: null
> {code}
> *Standard Error*
> {code:java}
> SET 
> client_identifier=query_test/test_iceberg.py::TestIcebergV2Table::()::test_read_equality_deletes[protocol:beeswax|table_format:parquet/none|exec_option:{'test_replan':1;'disable_optimized_iceberg_v2_read':1;'batch_size':0;'num_nodes':0;'disable_codegen_rows_threshold':0;'d;
> -- connecting to: localhost:21000
> -- 2024-04-03 07:04:53,469 INFO MainThread: Could not connect to ('::1', 
> 21000, 0, 0)
> Traceback (most recent call last):
>   File 
> "/data/jenkins/workspace/impala-asf-master-exhaustive-data-cache/repos/Impala/infra/python/env-gcc10.4.0/lib/python2.7/site-packages/thrift/transport/TSocket.py",
>  line 137, in open
> handle.connect(sockaddr)
>   File 
> "/data/jenkins/workspace/impala-asf-master-exhaustive-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py",
>  line 228, in meth
> return getattr(self._sock,name)(*args)
> error: [Errno 111] Connection refused
> -- 

[jira] [Updated] (IMPALA-12894) Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files

2024-03-12 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-12894:
--
Attachment: count_star_correctness_repro.tar.gz

> Optimized count(*) for Iceberg gives wrong results after a Spark 
> rewrite_data_files
> ---
>
> Key: IMPALA-12894
> URL: https://issues.apache.org/jira/browse/IMPALA-12894
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 4.3.0
>Reporter: Gabor Kaszab
>Priority: Critical
>  Labels: correctness, impala-iceberg
> Attachments: count_star_correctness_repro.tar.gz
>
>
> The issue was introduced by https://issues.apache.org/jira/browse/IMPALA-11802, 
> which implemented an optimized way to get results for count(*). However, if 
> the table was compacted by Spark, this optimization can give incorrect results.
> The reason is that Spark can [skip dropping delete 
> files|https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_position_delete_files]
>  that point to compacted data files; as a result, after compaction there might 
> be delete files that are no longer applied to any data files.
> Repro:
> With Impala
> {code:java}
> create table default.iceberg_testing (id int, j bigint) STORED AS ICEBERG
> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>               'iceberg.catalog_location'='/tmp/spark_iceberg_catalog/',
>               'iceberg.table_identifier'='iceberg_testing',
>               'format-version'='2');
> insert into iceberg_testing values
> (1, 1), (2, 4), (3, 9), (4, 16), (5, 25);
> update iceberg_testing set j = -100 where id = 4;
> delete from iceberg_testing where id = 4;{code}
> count(*) returns 4 at this point.
> Run compaction in Spark:
> {code:java}
> spark.sql(s"CALL local.system.rewrite_data_files(table => 
> 'default.iceberg_testing', options => map('min-input-files','2') )").show() 
> {code}
> Now count(*) in Impala returns 8 (might require an INVALIDATE METADATA if using 
> the HadoopCatalog). Hive returns correct results, and a SELECT * also returns 
> correct results.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Created] (IMPALA-12894) Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files

2024-03-12 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12894:
-

 Summary: Optimized count(*) for Iceberg gives wrong results after 
a Spark rewrite_data_files
 Key: IMPALA-12894
 URL: https://issues.apache.org/jira/browse/IMPALA-12894
 Project: IMPALA
  Issue Type: Bug
  Components: Frontend
Affects Versions: Impala 4.3.0
Reporter: Gabor Kaszab





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Created] (IMPALA-12866) Add table type to the SCAN node's explain output

2024-03-04 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12866:
-

 Summary: Add table type to the SCAN node's explain output
 Key: IMPALA-12866
 URL: https://issues.apache.org/jira/browse/IMPALA-12866
 Project: IMPALA
  Issue Type: Improvement
  Components: Frontend
Reporter: Gabor Kaszab


It would be nice if the explain output of a SCAN node could show the type of the 
table it reads, e.g. Iceberg or Hive. That would help with debugging.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-12861) File formats are confused when Iceberg tables has mixed formats

2024-03-04 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-12861:
--
Labels: impala-iceberg  (was: )

> File formats are confused when Iceberg tables has mixed formats
> ---
>
> Key: IMPALA-12861
> URL: https://issues.apache.org/jira/browse/IMPALA-12861
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 4.3.0
>Reporter: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
> Attachments: multi_file_table_crash
>
>
> *Repro steps:*
> create table mixed_ice (i int, year int) partitioned by spec (year) stored as 
> iceberg tblproperties('format-version'='2');
>  
> 1) populate one partition with Impala (parquet)
> insert into mixed_ice values (1, 2024), (2, 2024);
>  
> 2) change the write format:
> alter table mixed_ice set tblproperties ('write.format.default'='orc');
>  
> 3) populate another partition with Hive (orc)
> insert into mixed_ice values (1, 2025), (2, 2025), (3, 2025);
>  
> 4) then query just the parquet partition:
> explain select * from mixed_ice where year = 2024;
> {code:java}
> | F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1                       
>              |
> | Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB 
> thread-reservation=1      |
> |   PLAN-ROOT SINK                                                            
>              |
> |   |  output exprs: default.mixed_ice.i, default.mixed_ice.year              
>              |
> |   |  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB 
> thread-reservation=0 |
> |   |                                                                         
>              |
> |   01:EXCHANGE [UNPARTITIONED]                                               
>              |
> |      mem-estimate=16.00KB mem-reservation=0B thread-reservation=0           
>              |
> |      tuple-ids=0 row-size=8B cardinality=2                                  
>              |
> |      in pipelines: 00(GETNEXT)                                              
>              |
> |                                                                             
>              |
> | F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1                              
>              |
> | Per-Host Resources: mem-estimate=64.05MB mem-reservation=32.00KB 
> thread-reservation=2    |
> |   DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED]                
>              |
> |   |  mem-estimate=48.00KB mem-reservation=0B thread-reservation=0           
>              |
> |   00:SCAN HDFS [default.mixed_ice, RANDOM]                                  
>              |
> |      HDFS partitions=1/1 files=1 size=602B                                  
>              |
> |      Iceberg snapshot id: 4964066258730898133                               
>              |
> |      skipped Iceberg predicates: `year` = CAST(2024 AS INT)                 
>              |
> |      stored statistics:                                                     
>              |
> |        table: rows=5 size=945B                                              
>              |
> |        columns: unavailable                                                 
>              |
> |      extrapolated-rows=disabled max-scan-range-rows=5                       
>              |
> |      file formats: [ORC, PARQUET]                                           
>              |
> |      mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1      
>              |
> |      tuple-ids=0 row-size=8B cardinality=2                                  
>              |
> |      in pipelines: 00(GETNEXT)                                              
>              |
> +--+
>  {code}
> Note the "file formats: [ORC, PARQUET]" part, even though this query only 
> reads a Parquet file.
>  
> *Some analysis:*
> When IcebergScanNode [is 
> created|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java#L129]
>  it holds the correct information about file formats (Parquet).
> Later on, the parent class, HdfsScanNode, also tries to populate the file 
> formats [here|#L513].
>  
> It uses what 
> [getSampledOrRawPartitions()|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L431]
>  returns. In this use case 'sampledPartitions_' is null, so it will return 
> 'partitions_'.
>  
> Apparently, this 'partitions_' member holds the partition with the ORC file 
> so it adds ORC to the fileFormats_. 

[jira] [Commented] (IMPALA-12861) File formats are confused when Iceberg tables has mixed formats

2024-03-01 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822616#comment-17822616
 ] 

Gabor Kaszab commented on IMPALA-12861:
---

Additionally, there is an intermittent crash when running the select query in 
the description without explain. Attaching the resolved minidump. 
[^multi_file_table_crash]


[jira] [Updated] (IMPALA-12861) File formats are confused when Iceberg tables has mixed formats

2024-03-01 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-12861:
--
Attachment: multi_file_table_crash


[jira] [Updated] (IMPALA-12862) Expose Iceberg position delete records via metadata table

2024-03-01 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-12862:
--
Issue Type: Improvement  (was: Bug)

> Expose Iceberg position delete records via metadata table
> -
>
> Key: IMPALA-12862
> URL: https://issues.apache.org/jira/browse/IMPALA-12862
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Reporter: Zoltán Borók-Nagy
>Priority: Major
>  Labels: impala-iceberg
>
> To debug issues with position delete files, or to detect table corruption, we 
> could expose the delete records via the metadata table syntax, e.g.:
> {noformat}
> SELECT INPUT__FILE__NAME, file_path, pos
> FROM db.ice_t.position_delete_records;{noformat}
> Adding the virtual column INPUT__FILE__NAME is useful because it tells which 
> delete file contains each record.
> We should re-use IcebergPositionDeleteTable for this: 
> [https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/IcebergPositionDeleteTable.java]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-12861) File formats are confused when Iceberg tables has mixed formats

2024-03-01 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-12861:
--
Description: 
*Repro steps:*
create table mixed_ice (i int, year int) partitioned by spec (year) stored as 
iceberg tblproperties('format-version'='2');
 
1) populate one partition with Impala (parquet)
insert into mixed_ice values (1, 2024), (2, 2024);
 
2) change the write format:
alter table mixed_ice set tblproperties ('write.format.default'='orc');
 
3) populate another partition with Hive (orc)
insert into mixed_ice values (1, 2025), (2, 2025), (3, 2025);
 
4) then query just the parquet partition:
explain select * from mixed_ice where year = 2024;
{code:java}
| F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1                         
           |
| Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB 
thread-reservation=1      |
|   PLAN-ROOT SINK                                                              
           |
|   |  output exprs: default.mixed_ice.i, default.mixed_ice.year                
           |
|   |  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB 
thread-reservation=0 |
|   |                                                                           
           |
|   01:EXCHANGE [UNPARTITIONED]                                                 
           |
|      mem-estimate=16.00KB mem-reservation=0B thread-reservation=0             
           |
|      tuple-ids=0 row-size=8B cardinality=2                                    
           |
|      in pipelines: 00(GETNEXT)                                                
           |
|                                                                               
           |
| F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1                                
           |
| Per-Host Resources: mem-estimate=64.05MB mem-reservation=32.00KB 
thread-reservation=2    |
|   DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED]                  
           |
|   |  mem-estimate=48.00KB mem-reservation=0B thread-reservation=0             
           |
|   00:SCAN HDFS [default.mixed_ice, RANDOM]                                    
           |
|      HDFS partitions=1/1 files=1 size=602B                                    
           |
|      Iceberg snapshot id: 4964066258730898133                                 
           |
|      skipped Iceberg predicates: `year` = CAST(2024 AS INT)                   
           |
|      stored statistics:                                                       
           |
|        table: rows=5 size=945B                                                
           |
|        columns: unavailable                                                   
           |
|      extrapolated-rows=disabled max-scan-range-rows=5                         
           |
|      file formats: [ORC, PARQUET]                                             
           |
|      mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1        
           |
|      tuple-ids=0 row-size=8B cardinality=2                                    
           |
|      in pipelines: 00(GETNEXT)                                                
           |
+--+
 {code}
Note the "file formats: [ORC, PARQUET]" part, even though this query only reads 
a Parquet file.
 
*Some analysis:*
When IcebergScanNode [is 
created|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java#L129]
 it holds the correct information about file formats (Parquet).

Later on, the parent class, HdfsScanNode, also tries to populate the file formats 
[here|#L513].
 
It uses what 
[getSampledOrRawPartitions()|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L431]
 returns. In this use case 'sampledPartitions_' is null, so it will return 
'partitions_'.
 
Apparently, this 'partitions_' member holds the partition with the ORC file, so 
it adds ORC to 'fileFormats_'. Unfortunately, getSampledOrRawPartitions() is 
called in multiple locations within HdfsScanNode, returning the wrong partitions.

*Next steps:*

Check what other issues this getSampledOrRawPartitions() can cause with 
multi-file-format tables. Also check if we can populate 'partitions_' properly.
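The analysis above can be condensed into a small sketch (hypothetical types, not the actual Impala planner classes): deriving the file-format set from the full partition list, rather than from the partitions the scan actually selected, yields exactly the misleading [ORC, PARQUET] output:

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class FileFormatSketch {
    enum FileFormat { PARQUET, ORC }
    record Partition(String name, FileFormat format) {}

    // Mimics deriving fileFormats_ from a partition list: the result is only
    // correct if the list contains just the partitions the scan will read.
    static Set<FileFormat> formatsOf(List<Partition> partitions) {
        Set<FileFormat> formats = EnumSet.noneOf(FileFormat.class);
        for (Partition p : partitions) formats.add(p.format());
        return formats;
    }

    public static void main(String[] args) {
        Partition p2024 = new Partition("year=2024", FileFormat.PARQUET);
        Partition p2025 = new Partition("year=2025", FileFormat.ORC);
        List<Partition> allPartitions = List.of(p2024, p2025); // 'partitions_'
        List<Partition> scanned = List.of(p2024);  // what the Iceberg scan selected
        System.out.println(formatsOf(allPartitions)); // [PARQUET, ORC] -- misleading
        System.out.println(formatsOf(scanned));       // [PARQUET]
    }
}
```

The fix direction suggested by the analysis is to make the source list match the scan's actual selection, i.e. populate 'partitions_' from what IcebergScanNode selected instead of the raw partition set.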


[jira] [Created] (IMPALA-12861) File formats are confused when Iceberg tables have mixed formats

2024-03-01 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12861:
-

 Summary: File formats are confused when Iceberg tables have mixed 
formats
 Key: IMPALA-12861
 URL: https://issues.apache.org/jira/browse/IMPALA-12861
 Project: IMPALA
  Issue Type: Bug
  Components: Frontend
Affects Versions: Impala 4.3.0
Reporter: Gabor Kaszab


*Repro steps:*
create table mixed_ice (i int, year int) partitioned by spec (year) stored as 
iceberg tblproperties('format-version'='2');
 
1) populate one partition with Impala (parquet)
insert into mixed_ice values (1, 2024), (2, 2024);
 
2) change the write format:
alter table mixed_ice set tblproperties ('write.format.default'='orc');
 
3) populate another partition with Hive (orc)
insert into mixed_ice values (1, 2025), (2, 2025), (3, 2025);
 
4) then query just the parquet partition:
explain select * from mixed_ice where year = 2024;
{code:java}
| F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1                         
           |
| Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB 
thread-reservation=1      |
|   PLAN-ROOT SINK                                                              
           |
|   |  output exprs: default.mixed_ice.i, default.mixed_ice.year                
           |
|   |  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB 
thread-reservation=0 |
|   |                                                                           
           |
|   01:EXCHANGE [UNPARTITIONED]                                                 
           |
|      mem-estimate=16.00KB mem-reservation=0B thread-reservation=0             
           |
|      tuple-ids=0 row-size=8B cardinality=2                                    
           |
|      in pipelines: 00(GETNEXT)                                                
           |
|                                                                               
           |
| F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1                                
           |
| Per-Host Resources: mem-estimate=64.05MB mem-reservation=32.00KB 
thread-reservation=2    |
|   DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED]                  
           |
|   |  mem-estimate=48.00KB mem-reservation=0B thread-reservation=0             
           |
|   00:SCAN HDFS [default.mixed_ice, RANDOM]                                    
           |
|      HDFS partitions=1/1 files=1 size=602B                                    
           |
|      Iceberg snapshot id: 4964066258730898133                                 
           |
|      skipped Iceberg predicates: `year` = CAST(2024 AS INT)                   
           |
|      stored statistics:                                                       
           |
|        table: rows=5 size=945B                                                
           |
|        columns: unavailable                                                   
           |
|      extrapolated-rows=disabled max-scan-range-rows=5                         
           |
|      file formats: [ORC, PARQUET]                                             
           |
|      mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1        
           |
|      tuple-ids=0 row-size=8B cardinality=2                                    
           |
|      in pipelines: 00(GETNEXT)                                                
           |
+--+
 {code}
Note the "file formats: [ORC, PARQUET]" part, even though this query only reads 
a Parquet file.
 
*Some analysis:*
When IcebergScanNode [is 
created|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java#L129]
 it holds the correct information about file formats (Parquet).

Later on, the parent class HdfsScanNode also tries to populate the file formats 
[here|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L513].
 
It uses what 
[getSampledOrRawPartitions()|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L431]
 returns. In this use case 'sampledPartitions_' is null, so it returns 
'partitions_'.
 
Apparently, this 'partitions_' member holds the partition with the ORC file, so 
ORC gets added to fileFormats_.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Resolved] (IMPALA-12598) Add support for multiple equality field ID list

2024-03-01 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab resolved IMPALA-12598.
---
Resolution: Fixed

> Add support for multiple equality field ID list
> ---
>
> Key: IMPALA-12598
> URL: https://issues.apache.org/jira/browse/IMPALA-12598
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Frontend
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> Iceberg metadata holds an equality field ID list for the equality-delete 
> files. It's possible to have a different equality field ID list for different 
> equality-delete files, for instance one file deletes by columnA while another 
> file deletes by columnB.
> When you have such a table you should have multiple layers of ANTI JOINs, one 
> join for each equality field ID list.
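The multi-layer ANTI JOIN idea described in the quoted issue can be sketched as a grouping step (illustrative Java; 'DeleteFile' is a stand-in, not Iceberg's actual class): equality-delete files are grouped by their equality field ID list, and each distinct list would contribute one ANTI JOIN layer on top of the data scan.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch: one map entry per distinct equality field ID list.
public class EqDeleteGrouping {
  record DeleteFile(String path, List<Integer> equalityFieldIds) {}

  static Map<List<Integer>, List<DeleteFile>> groupByFieldIds(
      List<DeleteFile> deletes) {
    return deletes.stream()
        .collect(Collectors.groupingBy(DeleteFile::equalityFieldIds));
  }

  public static void main(String[] args) {
    List<DeleteFile> deletes = List.of(
        new DeleteFile("d1.parq", List.of(1)),  // deletes by columnA
        new DeleteFile("d2.parq", List.of(2)),  // deletes by columnB
        new DeleteFile("d3.parq", List.of(1))); // also by columnA
    // Two distinct field ID lists -> two ANTI JOIN layers in the plan.
    System.out.println(groupByFieldIds(deletes).size()); // 2
  }
}
```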






[jira] [Work started] (IMPALA-12729) Allow creating primary keys for Iceberg tables

2024-02-26 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-12729 started by Gabor Kaszab.
-
> Allow creating primary keys for Iceberg tables
> --
>
> Key: IMPALA-12729
> URL: https://issues.apache.org/jira/browse/IMPALA-12729
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> Some writer engines require primary keys on a table so that they can use them 
> for writing equality deletes (only the PK cols are written to the eq-delete 
> files).
> Impala currently doesn't reject setting PKs for Iceberg tables; however, it 
> seems to omit them. This succeeds:
> {code:java}
> create table ice_pk (i int, j int, primary key(i)) stored as iceberg;
> {code}
> However, DESCRIBE EXTENDED doesn't show 'identifier-field-ids' in the 
> 'current-schema'.
> On the other hand for a table created by Flink these fields are there:
> {code:java}
> current-schema                                     | 
> {\"type\":\"struct\",\"schema-id\":0,\"identifier-field-ids\":[1],\"fields\":[{\"id\":1,\"name\":\"i\",\"required\":true,\"type\":\"int\"},{\"id\":2,\"name\":\"s\",\"required\":false,\"type\":\"string\"}]}
>  {code}
> Part2:
> SHOW CREATE TABLE should also correctly print the primary key part of the 
> field list.
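A minimal sketch of why the identifier fields matter to eq-delete writers (hypothetical helper, not an engine API): an equality-delete file carries only the identifier (primary key) columns of the deleted rows, so the writer must know which columns those are, which is exactly what 'identifier-field-ids' records.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: project a full row down to its identifier fields,
// the only values an equality-delete file needs to carry.
public class EqDeleteProjection {
  static Map<String, Object> projectToIdentifiers(
      Map<String, Object> row, List<String> identifierFields) {
    return identifierFields.stream()
        .collect(Collectors.toMap(c -> c, row::get));
  }

  public static void main(String[] args) {
    Map<String, Object> row = Map.of("i", 1, "s", "abc");
    // identifier-field-ids [1] resolves to column name "i".
    System.out.println(projectToIdentifiers(row, List.of("i"))); // {i=1}
  }
}
```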






[jira] [Comment Edited] (IMPALA-12836) Aggregation over a STRUCT throws IllegalStateException

2024-02-22 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819638#comment-17819638
 ] 

Gabor Kaszab edited comment on IMPALA-12836 at 2/22/24 12:42 PM:
-

There was this example query [on a 
conference|https://www.apachecon.com/acna2022/slides/02_Ho_Icebergs_Best_Secret.pdf]
 for Iceberg metadata tables to check the size of each partition:
{code:java}
SELECT `partition`,
sum(file_size_in_bytes) AS partition_size
FROM db.table.`files`
GROUP BY `partition` {code}
Note, in the `files` metadata table the `partition` column is a struct that holds 
one member for each partition column. So I believe this works in Spark and 
would be a nice addition for us too for table analysis purposes.

This query could be re-worked so that we can run it, but then for each table 
you'd have to write a separate query for getting these stats:
{code:java}
SELECT `partition`.col1, .. `partition`.colN, sum(file_size_in_bytes) AS 
partition_size 
FROM db.table.files 
GROUP BY `partition`.col1, .. , `partition`.colN; {code}
 


was (Author: gaborkaszab):
There was this example query [on a 
conference|https://www.apachecon.com/acna2022/slides/02_Ho_Icebergs_Best_Secret.pdf]
 for Iceberg metadata tables to check the size of each partition:
{code:java}
SELECT partition,
sum(file_size_in_bytes) AS partition_size,
FROM db.table.files
GROUP BY partition {code}
Note, in `files` metadata table the `partition` column is a struct that holds 
one member for each partition column. So I believe this works in Spark and 
would be a nice addition for us too for table analysis purposes.

This query could be re-worked so that we can run it, but then for each table 
you'd have to write a separate query for getting these stats:
{code:java}
SELECT `partition`.col1, .. `partition`.colN, sum(file_size_in_bytes) AS 
partition_size, FROM db.table.files GROUP BY `partition`.col1, .. , 
`partition`.colN; {code}

> Aggregation over a STRUCT throws IllegalStateException
> --
>
> Key: IMPALA-12836
> URL: https://issues.apache.org/jira/browse/IMPALA-12836
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 4.4.0
>Reporter: Tamas Mate
>Priority: Major
>
> A Preconditions check will fail when trying to aggregate over a struct.
> Repro query:
> {code}
> Query: select int_struct_col, sum(id) from functional_parquet.allcomplextypes 
> group by int_struct_col
> Query submitted at: 2024-02-22 13:08:20 (Coordinator: 
> http://tmate-desktop:25000)
> ERROR: IllegalStateException: null
> {code}
> {code:java}
> I0222 13:05:21.762225 10675 jni-util.cc:302] 
> 3c44b4fafbbcb6b5:eee03297] java.lang.IllegalStateException
>         at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:486)
>         at 
> org.apache.impala.analysis.SlotRef.addStructChildrenAsSlotRefs(SlotRef.java:268)
>         at org.apache.impala.analysis.SlotRef.(SlotRef.java:93)
>         at 
> org.apache.impala.analysis.AggregateInfoBase.createTupleDesc(AggregateInfoBase.java:135)
>         at 
> org.apache.impala.analysis.AggregateInfoBase.createTupleDescs(AggregateInfoBase.java:101)
>         at 
> org.apache.impala.analysis.AggregateInfo.create(AggregateInfo.java:150)
>         at 
> org.apache.impala.analysis.AggregateInfo.create(AggregateInfo.java:171)
>         at 
> org.apache.impala.analysis.MultiAggregateInfo.analyze(MultiAggregateInfo.java:301)
>         at 
> org.apache.impala.analysis.SelectStmt$SelectAnalyzer.buildAggregateExprs(SelectStmt.java:1149)
>         at 
> org.apache.impala.analysis.SelectStmt$SelectAnalyzer.analyze(SelectStmt.java:355)
>         at 
> org.apache.impala.analysis.SelectStmt$SelectAnalyzer.access$100(SelectStmt.java:282)
>         at org.apache.impala.analysis.SelectStmt.analyze(SelectStmt.java:274)
>         at 
> org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:545)
>         at 
> org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:492)
>         at 
> org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2364)
>         at 
> org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:2110)
>         at 
> org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1883)
>         at 
> org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:169) 
> {code}






[jira] [Comment Edited] (IMPALA-12836) Aggregation over a STRUCT throws IllegalStateException

2024-02-22 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819638#comment-17819638
 ] 

Gabor Kaszab edited comment on IMPALA-12836 at 2/22/24 12:41 PM:
-

There was this example query [on a 
conference|https://www.apachecon.com/acna2022/slides/02_Ho_Icebergs_Best_Secret.pdf]
 for Iceberg metadata tables to check the size of each partition:
{code:java}
SELECT partition,
sum(file_size_in_bytes) AS partition_size,
FROM db.table.files
GROUP BY partition {code}
Note, in `files` metadata table the `partition` column is a struct that holds 
one member for each partition column. So I believe this works in Spark and 
would be a nice addition for us too for table analysis purposes.

This query could be re-worked so that we can run it, but then for each table 
you'd have to write a separate query for getting these stats:
{code:java}
SELECT `partition`.col1, .. `partition`.colN, sum(file_size_in_bytes) AS 
partition_size, FROM db.table.files GROUP BY `partition`.col1, .. , 
`partition`.colN; {code}


was (Author: gaborkaszab):
There was this example query [on a 
conference|https://www.apachecon.com/acna2022/slides/02_Ho_Icebergs_Best_Secret.pdf]
 for Iceberg metadata tables to check the size of each partition:
{code:java}
SELECT partition,
sum(file_size_in_bytes) AS partition_size,
FROM db.table.files
GROUP BY partition {code}
Note, in `files` metadata table the `partition` column is a struct that holds 
one member for each partition column. So I believe this works in Spark and 
would be a nice addition for us too for table analysis purposes.

This query could be re-worked so that we can run it, but then for each table 
you'd have to write a separate query for getting these stats:
{code:java}
SELECT `partition`.col1, .. `partition`.colN, sum(file_size_in_bytes) AS 
partition_size, FROM db.table.files GROUP BY `partition`.col1, .. , 
`partition`.colN; {code}

> Aggregation over a STRUCT throws IllegalStateException
> --
>
> Key: IMPALA-12836
> URL: https://issues.apache.org/jira/browse/IMPALA-12836
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 4.4.0
>Reporter: Tamas Mate
>Priority: Major
>
> A Preconditions check will fail when trying to aggregate over a struct.
> Repro query:
> {code}
> Query: select int_struct_col, sum(id) from functional_parquet.allcomplextypes 
> group by int_struct_col
> Query submitted at: 2024-02-22 13:08:20 (Coordinator: 
> http://tmate-desktop:25000)
> ERROR: IllegalStateException: null
> {code}
> {code:java}
> I0222 13:05:21.762225 10675 jni-util.cc:302] 
> 3c44b4fafbbcb6b5:eee03297] java.lang.IllegalStateException
>         at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:486)
>         at 
> org.apache.impala.analysis.SlotRef.addStructChildrenAsSlotRefs(SlotRef.java:268)
>         at org.apache.impala.analysis.SlotRef.(SlotRef.java:93)
>         at 
> org.apache.impala.analysis.AggregateInfoBase.createTupleDesc(AggregateInfoBase.java:135)
>         at 
> org.apache.impala.analysis.AggregateInfoBase.createTupleDescs(AggregateInfoBase.java:101)
>         at 
> org.apache.impala.analysis.AggregateInfo.create(AggregateInfo.java:150)
>         at 
> org.apache.impala.analysis.AggregateInfo.create(AggregateInfo.java:171)
>         at 
> org.apache.impala.analysis.MultiAggregateInfo.analyze(MultiAggregateInfo.java:301)
>         at 
> org.apache.impala.analysis.SelectStmt$SelectAnalyzer.buildAggregateExprs(SelectStmt.java:1149)
>         at 
> org.apache.impala.analysis.SelectStmt$SelectAnalyzer.analyze(SelectStmt.java:355)
>         at 
> org.apache.impala.analysis.SelectStmt$SelectAnalyzer.access$100(SelectStmt.java:282)
>         at org.apache.impala.analysis.SelectStmt.analyze(SelectStmt.java:274)
>         at 
> org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:545)
>         at 
> org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:492)
>         at 
> org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2364)
>         at 
> org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:2110)
>         at 
> org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1883)
>         at 
> org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:169) 
> {code}






[jira] [Commented] (IMPALA-12836) Aggregation over a STRUCT throws IllegalStateException

2024-02-22 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819638#comment-17819638
 ] 

Gabor Kaszab commented on IMPALA-12836:
---

There was this example query [on a 
conference|https://www.apachecon.com/acna2022/slides/02_Ho_Icebergs_Best_Secret.pdf]
 for Iceberg metadata tables to check the size of each partition:
{code:java}
SELECT partition,
sum(file_size_in_bytes) AS partition_size,
FROM db.table.files
GROUP BY partition {code}
Note, in `files` metadata table the `partition` column is a struct that holds 
one member for each partition column. So I believe this works in Spark and 
would be a nice addition for us too for table analysis purposes.

This query could be re-worked so that we can run it, but then for each table 
you'd have to write a separate query for getting these stats:
{code:java}
SELECT `partition`.col1, .. `partition`.colN, sum(file_size_in_bytes) AS 
partition_size, FROM db.table.files GROUP BY `partition`.col1, .. , 
`partition`.colN; {code}

> Aggregation over a STRUCT throws IllegalStateException
> --
>
> Key: IMPALA-12836
> URL: https://issues.apache.org/jira/browse/IMPALA-12836
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 4.4.0
>Reporter: Tamas Mate
>Priority: Major
>
> A Preconditions check will fail when trying to aggregate over a struct.
> Repro query:
> {code}
> Query: select int_struct_col, sum(id) from functional_parquet.allcomplextypes 
> group by int_struct_col
> Query submitted at: 2024-02-22 13:08:20 (Coordinator: 
> http://tmate-desktop:25000)
> ERROR: IllegalStateException: null
> {code}
> {code:java}
> I0222 13:05:21.762225 10675 jni-util.cc:302] 
> 3c44b4fafbbcb6b5:eee03297] java.lang.IllegalStateException
>         at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:486)
>         at 
> org.apache.impala.analysis.SlotRef.addStructChildrenAsSlotRefs(SlotRef.java:268)
>         at org.apache.impala.analysis.SlotRef.(SlotRef.java:93)
>         at 
> org.apache.impala.analysis.AggregateInfoBase.createTupleDesc(AggregateInfoBase.java:135)
>         at 
> org.apache.impala.analysis.AggregateInfoBase.createTupleDescs(AggregateInfoBase.java:101)
>         at 
> org.apache.impala.analysis.AggregateInfo.create(AggregateInfo.java:150)
>         at 
> org.apache.impala.analysis.AggregateInfo.create(AggregateInfo.java:171)
>         at 
> org.apache.impala.analysis.MultiAggregateInfo.analyze(MultiAggregateInfo.java:301)
>         at 
> org.apache.impala.analysis.SelectStmt$SelectAnalyzer.buildAggregateExprs(SelectStmt.java:1149)
>         at 
> org.apache.impala.analysis.SelectStmt$SelectAnalyzer.analyze(SelectStmt.java:355)
>         at 
> org.apache.impala.analysis.SelectStmt$SelectAnalyzer.access$100(SelectStmt.java:282)
>         at org.apache.impala.analysis.SelectStmt.analyze(SelectStmt.java:274)
>         at 
> org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:545)
>         at 
> org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:492)
>         at 
> org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2364)
>         at 
> org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:2110)
>         at 
> org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1883)
>         at 
> org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:169) 
> {code}






[jira] [Created] (IMPALA-12826) Add better cardinality estimation for Iceberg V2 tables with equality deletes

2024-02-21 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12826:
-

 Summary: Add better cardinality estimation for Iceberg V2 tables 
with equality deletes
 Key: IMPALA-12826
 URL: https://issues.apache.org/jira/browse/IMPALA-12826
 Project: IMPALA
  Issue Type: Sub-task
  Components: Frontend
Reporter: Gabor Kaszab


There is a similar ticket for positional deletes: 
https://issues.apache.org/jira/browse/IMPALA-12371

 






[jira] [Work started] (IMPALA-12600) Support equality deletes when table has partition or schema evolution

2024-01-25 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-12600 started by Gabor Kaszab.
-
> Support equality deletes when table has partition or schema evolution
> -
>
> Key: IMPALA-12600
> URL: https://issues.apache.org/jira/browse/IMPALA-12600
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
>
> When adding basic equality-delete read support, we reject queries for 
> Iceberg tables that have equality-delete files and partition or schema 
> evolution. This ticket is to enhance this functionality.






[jira] [Created] (IMPALA-12729) Allow creating primary keys for Iceberg tables

2024-01-18 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12729:
-

 Summary: Allow creating primary keys for Iceberg tables
 Key: IMPALA-12729
 URL: https://issues.apache.org/jira/browse/IMPALA-12729
 Project: IMPALA
  Issue Type: Improvement
  Components: Frontend
Reporter: Gabor Kaszab


Some writer engines require primary keys on a table so that they can use them 
for writing equality deletes (only the PK cols are written to the eq-delete 
files).

Impala currently doesn't reject setting PKs for Iceberg tables; however, it 
seems to omit them. This succeeds:
{code:java}
create table ice_pk (i int, j int, primary key(i)) stored as iceberg;
{code}
However, DESCRIBE EXTENDED doesn't show 'identifier-field-ids' in the 
'current-schema'.

On the other hand for a table created by Flink these fields are there:
{code:java}
current-schema                                     | 
{\"type\":\"struct\",\"schema-id\":0,\"identifier-field-ids\":[1],\"fields\":[{\"id\":1,\"name\":\"i\",\"required\":true,\"type\":\"int\"},{\"id\":2,\"name\":\"s\",\"required\":false,\"type\":\"string\"}]}
 {code}
Part2:

SHOW CREATE TABLE should also correctly print the primary key part of the field 
list.






[jira] [Assigned] (IMPALA-12598) Add support for multiple equality field ID list

2024-01-16 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab reassigned IMPALA-12598:
-

Assignee: Gabor Kaszab

> Add support for multiple equality field ID list
> ---
>
> Key: IMPALA-12598
> URL: https://issues.apache.org/jira/browse/IMPALA-12598
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Frontend
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> Iceberg metadata holds an equality field ID list for the equality-delete 
> files. It's possible to have a different equality field ID list for different 
> equality-delete files, for instance one file deletes by columnA while another 
> file deletes by columnB.
> When you have such a table you should have multiple layers of ANTI JOINs, one 
> join for each equality field ID list.






[jira] [Work started] (IMPALA-12598) Add support for multiple equality field ID list

2024-01-16 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-12598 started by Gabor Kaszab.
-
> Add support for multiple equality field ID list
> ---
>
> Key: IMPALA-12598
> URL: https://issues.apache.org/jira/browse/IMPALA-12598
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Frontend
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> Iceberg metadata holds an equality field ID list for the equality-delete 
> files. It's possible to have a different equality field ID list for different 
> equality-delete files, for instance one file deletes by columnA while another 
> file deletes by columnB.
> When you have such a table you should have multiple layers of ANTI JOINs, one 
> join for each equality field ID list.






[jira] [Work started] (IMPALA-12694) Test equality delete support with data from NiFi

2024-01-09 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-12694 started by Gabor Kaszab.
-
> Test equality delete support with data from NiFi
> 
>
> Key: IMPALA-12694
> URL: https://issues.apache.org/jira/browse/IMPALA-12694
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend, Frontend
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> Iceberg equality delete support in Impala is a subset of what the Iceberg 
> spec allows for equality deletes. Currently, we have sufficient 
> implementation to use eq-deletes created by Flink. As a next step, let's 
> examine if this implementation is sufficient for eq-deletes created by NiFi.
> In theory, NiFi uses Flink's eq-delete implementation, so Impala should be 
> fine reading such data. However, at least some manual tests are needed for 
> verification, and if it turns out that there are some uncovered edge cases, 
> we should fill these holes in the implementation (probably in separate jiras).






[jira] [Created] (IMPALA-12694) Test equality delete support with data from NiFi

2024-01-09 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12694:
-

 Summary: Test equality delete support with data from NiFi
 Key: IMPALA-12694
 URL: https://issues.apache.org/jira/browse/IMPALA-12694
 Project: IMPALA
  Issue Type: Improvement
  Components: Backend, Frontend
Reporter: Gabor Kaszab
Assignee: Gabor Kaszab


Iceberg equality delete support in Impala is a subset of what the Iceberg spec 
allows for equality deletes. Currently, we have sufficient implementation to 
use eq-deletes created by Flink. As a next step, let's examine if this 
implementation is sufficient for eq-deletes created by NiFi.

In theory, NiFi uses Flink's eq-delete implementation, so Impala should be fine 
reading such data. However, at least some manual tests are needed for 
verification, and if it turns out that there are some uncovered edge cases, we 
should fill these holes in the implementation (probably in separate jiras).






[jira] [Commented] (IMPALA-9821) Rewrite ds_hll_sketch() and ds_hll_union() and other datasketch generating functions to return Binary

2024-01-08 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804253#comment-17804253
 ] 

Gabor Kaszab commented on IMPALA-9821:
--

Made the title of this ticket more generic to cover all the other sketch types 
too. This is a breaking change, so it needs a new major Impala version.

> Rewrite ds_hll_sketch() and ds_hll_union() and other datasketch generating 
> functions to return Binary
> -
>
> Key: IMPALA-9821
> URL: https://issues.apache.org/jira/browse/IMPALA-9821
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend
>Reporter: Gabor Kaszab
>Priority: Major
>
> While the Binary implementation is ongoing, ds_hll_sketch() and ds_hll_union() 
> return serialized sketches in String format. Once Binary is 
> available in Impala, these can return the serialized sketches in Binary format.
> Currently when sketches are written by Hive as BINARY to ORC table and this 
> table is loaded to Impala where the sketch columns are STRINGs then we get an 
> error
> {code:java}
> ERROR: Type mismatch: table column STRING is map to column binary in ORC file
> {code}
> Interestingly, this works with the Parquet format.
> Once we have Binary support, make sure to add coverage for an ORC table that 
> is created and populated by Hive and read for estimation by Impala.






[jira] [Updated] (IMPALA-9821) Rewrite ds_hll_sketch() and ds_hll_union() and other datasketch generating functions to return Binary

2024-01-08 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-9821:
-
Summary: Rewrite ds_hll_sketch() and ds_hll_union() and other datasketch 
generating functions to return Binary  (was: Rewrite ds_hll_sketch() and 
ds_hll_union() functions to return Binary)

> Rewrite ds_hll_sketch() and ds_hll_union() and other datasketch generating 
> functions to return Binary
> -
>
> Key: IMPALA-9821
> URL: https://issues.apache.org/jira/browse/IMPALA-9821
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend
>Reporter: Gabor Kaszab
>Priority: Major
>
> While the Binary implementation is ongoing, ds_hll_sketch() and ds_hll_union() 
> return serialized sketches in String format. Once Binary is 
> available in Impala, these can return the serialized sketches in Binary format.
> Currently when sketches are written by Hive as BINARY to ORC table and this 
> table is loaded to Impala where the sketch columns are STRINGs then we get an 
> error
> {code:java}
> ERROR: Type mismatch: table column STRING is map to column binary in ORC file
> {code}
> Interestingly the works with Parquet format.
> Once we have binary support make sure to add coverage for ORC table where the 
> table is created and populated by Hive and read for estimating by Impala.






[jira] [Resolved] (IMPALA-12673) Iceberg table migration fails for '/' in partition values

2024-01-05 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab resolved IMPALA-12673.
---
Resolution: Fixed

> Iceberg table migration fails for '/' in partition values
> 
>
> Key: IMPALA-12673
> URL: https://issues.apache.org/jira/browse/IMPALA-12673
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
> Fix For: Impala 4.4.0
>
>
> Due to a bug in Iceberg, we don't allow migrating tables to Iceberg when the 
> table has a partition value containing a '/' character. Now that the fix for 
> this Iceberg bug has been picked up by Impala, we can allow migrating such 
> tables.






[jira] [Created] (IMPALA-12673) Iceberg table migration fails for '/' in partition values

2024-01-03 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12673:
-

 Summary: Iceberg table migration fails for '/' in partition values
 Key: IMPALA-12673
 URL: https://issues.apache.org/jira/browse/IMPALA-12673
 Project: IMPALA
  Issue Type: Bug
  Components: Frontend
Reporter: Gabor Kaszab
 Fix For: Impala 4.4.0


Due to a bug in Iceberg, we don't allow migrating tables to Iceberg when the 
table has a partition value containing a '/' character. Now that the fix for 
this Iceberg bug has been picked up by Impala, we can allow migrating such 
tables.






[jira] [Assigned] (IMPALA-12673) Iceberg table migration fails for '/' in partition values

2024-01-03 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab reassigned IMPALA-12673:
-

Assignee: Gabor Kaszab

> Iceberg table migration fails for '/' in partition values
> 
>
> Key: IMPALA-12673
> URL: https://issues.apache.org/jira/browse/IMPALA-12673
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
> Fix For: Impala 4.4.0
>
>
> Due to a bug in Iceberg, we don't allow migrating tables to Iceberg when the 
> table has a partition value containing a '/' character. Now that the fix for 
> this Iceberg bug has been picked up by Impala, we can allow migrating such 
> tables.






[jira] [Resolved] (IMPALA-12597) Basic equality delete support

2023-12-20 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab resolved IMPALA-12597.
---
Fix Version/s: Impala 4.4.0
   Resolution: Fixed

> Basic equality delete support
> -
>
> Key: IMPALA-12597
> URL: https://issues.apache.org/jira/browse/IMPALA-12597
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend, Frontend
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
> Fix For: Impala 4.4.0
>
>
> To split up the Equality-delete read support task, let's deliver a patch for 
> some initial support first. The idea here is that apparently Flink (one of 
> the engines that can write equality delete files) can write only a subset of 
> the possible equality delete use cases that are allowed by the Iceberg spec.
> So as a first step let's deliver the functionality that is required to read 
> the EQ-deletes written by Flink. The use case: Flink writes EQ-deletes for 
> tables in upsert mode (a primary key is a must in this case). To guarantee 
> the uniqueness of the primary key fields, for each insert (that is in fact 
> an upsert) Flink writes one delete file to remove the previous row with the 
> given PK (even if there hasn't been any) and then writes data files with the 
> new data.
> How we can narrow down the functionality to be implemented on the Impala 
> side:
>  * The set of PK columns is not alterable, so we don't have to handle the 
> case where different EQ-delete files have different equality field ID lists.
>  * Flink's ALTER TABLE for Iceberg tables doesn't allow partition and schema 
> evolution. We can reject queries on eq-delete tables where there was 
> partition or schema evolution.
>  * As eq-deletes are written to NOT NULL PKs, we could omit the case where 
> there are NULLs in the eq-delete file. (Update: this seemed easy to solve, 
> so it will be part of this patch.)
>  * For partitioned tables Flink requires the partition columns to be part of 
> the PK. As a result each EQ-delete file will have the partition values too, 
> so there is no need for extra logic to check that the partition spec ID and 
> the partition values match between the data and delete files.






[jira] [Created] (IMPALA-12649) Use max(data_sequence_number) for joining equality delete rows

2023-12-18 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12649:
-

 Summary: Use max(data_sequence_number) for joining equality delete 
rows
 Key: IMPALA-12649
 URL: https://issues.apache.org/jira/browse/IMPALA-12649
 Project: IMPALA
  Issue Type: Sub-task
  Components: Frontend
Reporter: Gabor Kaszab


Improvement idea for the future:

If Flink always writes EQ-delete files and uses the same primary key a lot, we 
will have the same entry in the HashMap with multiple data sequence numbers. 
Then during probing, for each hash table lookup we need to loop over all the 
sequence numbers and check them. Actually we only need the largest data 
sequence number; the lower sequence numbers with the same primary keys don't 
add any value.

So we could add an aggregation node to the right side of the join, like "PK1, 
PK2, ..., max(data_sequence_number), group by PK1, PK2, ...".

Now, we would need to decide when to add this node to the plan and when we 
shouldn't. We should also avoid having an EXCHANGE between the aggregation node 
and the JOIN node; it would be redundant since both would use the same 
partition key expressions (the primary keys).

If we had "hash teams" in Impala, we could always add this aggregator operator, 
as it would be in the same "hash team" with the JOIN operator, i.e. we wouldn't 
need to build the hash table twice. Microsoft's paper about hash joins and hash 
teams: 
[https://citeseerx.ist.psu.edu/document?repid=rep1=pdf=fc1c78cbef5062cf49fdb309b1935af08b759d2d]
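The pre-aggregation idea above can be sketched in miniature. This is an illustrative sketch only, with made-up names; Impala's real build side uses its own hash table implementation, and the actual change would be an AggregationNode in the planner:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: collapse equality-delete rows so that each primary key
// keeps only its largest data sequence number before the ANTI JOIN hash table
// is built, equivalent to "PK, max(data_sequence_number) GROUP BY PK".
class MaxSeqDedup {
  // A delete row: the primary key value plus the data sequence number.
  record DeleteRow(String pk, long dataSequenceNumber) {}

  static Map<String, Long> dedupe(List<DeleteRow> deleteRows) {
    Map<String, Long> maxSeqPerPk = new HashMap<>();
    for (DeleteRow row : deleteRows) {
      // merge() keeps the larger sequence number per primary key.
      maxSeqPerPk.merge(row.pk(), row.dataSequenceNumber(), Math::max);
    }
    return maxSeqPerPk;
  }

  public static void main(String[] args) {
    Map<String, Long> result = dedupe(List.of(
        new DeleteRow("pk1", 3), new DeleteRow("pk1", 7),
        new DeleteRow("pk2", 5)));
    System.out.println(result); // one entry per PK, highest sequence number wins
  }
}
```

With this collapse done on the build side, each probe-side lookup finds at most one sequence number per primary key instead of looping over all of them.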






[jira] [Created] (IMPALA-12620) Missing field ID in the eq-delete file could filter out rows with null values

2023-12-13 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12620:
-

 Summary: Missing field ID in the eq-delete file could filter out 
rows with null values
 Key: IMPALA-12620
 URL: https://issues.apache.org/jira/browse/IMPALA-12620
 Project: IMPALA
  Issue Type: Sub-task
Reporter: Gabor Kaszab


If a malformed equality delete file doesn't have some of the equality field 
IDs, the Parquet schema resolver will identify these as missing fields but 
won't fail the query. Instead, missing fields are filled with NULL values. But 
when some of the columns in the equality delete tuples are NULL, anti-joining 
them with the data rows makes them match the NULL values in the data rows. As a 
result, a malformed equality delete file could cause rows to be omitted from 
the result where the data row contains NULL in a field whose ID is missing from 
the equality delete file.

 

E.g. the test data is

i,     s
(1, "str1")
(NULL, "str2")

and the equality field ID is 1 (corresponding to column i).

When an equality delete file doesn't have column i and doesn't have field ID 1, 
the second row will be missing from the result.
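The failure mode can be shown with a toy equality check. This is an illustrative sketch, not Impala code; it only demonstrates why a NULL-filled delete tuple wrongly matches a data row that genuinely contains NULL:

```java
import java.util.Objects;

// Illustrative sketch: if a missing equality field is filled with NULL in the
// delete tuple, a naive equality comparison matches it against data rows that
// genuinely contain NULL, so the anti-join drops those rows.
class NullMatchDemo {
  static boolean naiveMatch(Integer dataVal, Integer deleteVal) {
    // Objects.equals(null, null) is true, so the delete tuple filled with
    // NULL "matches" the data row (NULL, "str2") from the example above.
    return Objects.equals(dataVal, deleteVal);
  }

  public static void main(String[] args) {
    System.out.println(naiveMatch(null, null)); // true: row wrongly deleted
    System.out.println(naiveMatch(1, null));    // false: row survives
  }
}
```

A correct join condition would instead treat NULL as matching nothing (SQL equality semantics), so rows with NULL data values survive a NULL-filled delete tuple.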






[jira] [Commented] (IMPALA-12266) Sporadic failure after migrating a table to Iceberg

2023-12-11 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795262#comment-17795262
 ] 

Gabor Kaszab commented on IMPALA-12266:
---

Hey [~stigahuang],

I'm not actively working on this due to lack of bandwidth; I'm just monitoring 
the situation. I did manage to repro the issue locally (see my comment from 
August, already that long ago?? :) ) but wasn't able to progress from there. To 
me this seems like a timing issue, where a query right after a CONVERT TABLE 
might not see the converted table, but if you re-run it, it will succeed. I was 
wondering if SYNC_DDL would help, but didn't have the time to try it out.

I'd be really grateful if you could take a look!

> Sporadic failure after migrating a table to Iceberg
> ---
>
> Key: IMPALA-12266
> URL: https://issues.apache.org/jira/browse/IMPALA-12266
> Project: IMPALA
>  Issue Type: Bug
>  Components: fe
>Affects Versions: Impala 4.2.0
>Reporter: Tamas Mate
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
> Attachments: 
> catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, 
> impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1
>
>
> TestIcebergTable.test_convert_table test failed in a recent verify job's 
> dockerised tests:
> https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629
> {code:none}
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> EINNER EXCEPTION: 
> EMESSAGE: AnalysisException: Failed to load metadata for table: 
> 'parquet_nopartitioned'
> E   CAUSED BY: TableLoadingException: Could not load table 
> test_convert_table_cdba7383.parquet_nopartitioned from catalog
> E   CAUSED BY: TException: 
> TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, 
> error_msgs:[NullPointerException: null]), lookup_status:OK)
> {code}
> {code:none}
> E0704 19:09:22.980131   833 JniUtil.java:183] 
> 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of 
> TABLE:test_convert_table_cdba7383.parquet_nopartitioned. Time spent: 49ms
> I0704 19:09:22.980309   833 jni-util.cc:288] 
> 7145c21173f2c47b:2579db55] java.lang.NullPointerException
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480)
>   at 
> org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397)
>   at 
> org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
>   at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109)
>   at 
> org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238)
>   at 
> org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396)
> I0704 19:09:22.980324   833 status.cc:129] 7145c21173f2c47b:2579db55] 
> NullPointerException: null
> @  0x1012f9f  impala::Status::Status()
> @  0x187f964  impala::JniUtil::GetJniExceptionMsg()
> @   0xfee920  impala::JniCall::Call<>()
> @   0xfccd0f  impala::Catalog::GetPartialCatalogObject()
> @   0xfb55a5  
> impala::CatalogServiceThriftIf::GetPartialCatalogObject()
> @   0xf7a691  
> impala::CatalogServiceProcessorT<>::process_GetPartialCatalogObject()
> @   0xf82151  impala::CatalogServiceProcessorT<>::dispatchCall()
> @   0xee330f  apache::thrift::TDispatchProcessor::process()
> @  0x1329246  
> apache::thrift::server::TAcceptQueueServer::Task::run()
> @  0x1315a89  impala::ThriftThread::RunRunnable()
> @  0x131773d  
> boost::detail::function::void_function_obj_invoker0<>::invoke()
> @  0x195ba8c  impala::Thread::SuperviseThread()
> @  0x195c895  boost::detail::thread_data<>::run()
> @  0x23a03a7  thread_proxy
> @ 0x7faaad2a66ba  start_thread
> @ 0x7f2c151d  clone
> E0704 19:09:23.006968   833 catalog-server.cc:278] 
> 7145c21173f2c47b:2579db55] NullPointerException: null
> {code}




[jira] [Created] (IMPALA-12608) Push down conjuncts to the equality delete scanner

2023-12-08 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12608:
-

 Summary: Push down conjuncts to the equality delete scanner
 Key: IMPALA-12608
 URL: https://issues.apache.org/jira/browse/IMPALA-12608
 Project: IMPALA
  Issue Type: Sub-task
  Components: Frontend
Reporter: Gabor Kaszab


In the initial implementation, when we create the scan node for the Iceberg 
equality delete files we don't push down any conjuncts to it. However, for 
better performance we can filter the conjuncts that are relevant for the 
equality delete scanner and push them down.






[jira] [Commented] (IMPALA-11388) Add support for equality-based deletes

2023-12-05 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-11388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793344#comment-17793344
 ] 

Gabor Kaszab commented on IMPALA-11388:
---

FYI, I decided to make this an EPIC and split the work up into multiple items 
so that we can deliver functionality gradually.

> Add support for equality-based deletes
> --
>
> Key: IMPALA-11388
> URL: https://issues.apache.org/jira/browse/IMPALA-11388
> Project: IMPALA
>  Issue Type: Epic
>  Components: Frontend
>Reporter: Zoltán Borók-Nagy
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> Iceberg V2 adds support for row-level modifications.
> One way to implement this is via equality-based delete files:
> https://iceberg.apache.org/spec/#equality-delete-files
> https://iceberg.apache.org/spec/#scan-planning
> We could implement this by doing an ANTI HASH JOIN between the data and 
> delete files, similarly to what we do for Hive full ACID tables:
> https://github.com/apache/impala/blob/f5fc08573352d0a1943296209791a4db17268086/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L1729-L1735
> The complexity comes when different delete files use different sets of 
> columns. In that case we will need multiple ANTI HASH JOINs on top of each 
> other.
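The ANTI HASH JOIN idea quoted above can be sketched in miniature. This is an illustrative sketch only; names are made up and Impala's real join executes in the C++ backend:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hedged sketch of a LEFT ANTI HASH JOIN: data rows survive only if their
// delete key is absent from the hash table built from the delete files.
class AntiJoinSketch {
  static List<String> antiJoin(List<String> dataKeys, List<String> deleteKeys) {
    Set<String> deleted = new HashSet<>(deleteKeys); // build side: delete files
    return dataKeys.stream()                         // probe side: data files
        .filter(k -> !deleted.contains(k))           // keep rows with no match
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // "r2" appears in a delete file, so it is filtered out of the result.
    System.out.println(antiJoin(List.of("r1", "r2", "r3"), List.of("r2")));
  }
}
```

When delete files use different column sets, one such join per equality field ID list would be stacked on top of the scan, as the description notes.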






[jira] [Work started] (IMPALA-12597) Basic equality delete support

2023-12-05 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-12597 started by Gabor Kaszab.
-
> Basic equality delete support
> -
>
> Key: IMPALA-12597
> URL: https://issues.apache.org/jira/browse/IMPALA-12597
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend, Frontend
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> To split up the Equality-delete read support task, let's deliver a patch for 
> some initial support first. The idea here is that apparently Flink (one of 
> the engines that can write equality delete files) can write only a subset of 
> the possible equality delete use cases that are allowed by the Iceberg spec.
> So as a first step let's deliver the functionality that is required to read 
> the EQ-deletes written by Flink. The use case: Flink writes EQ-deletes for 
> tables in upsert mode (a primary key is a must in this case). To guarantee 
> the uniqueness of the primary key fields, for each insert (that is in fact 
> an upsert) Flink writes one delete file to remove the previous row with the 
> given PK (even if there hasn't been any) and then writes data files with the 
> new data.
> How we can narrow down the functionality to be implemented on the Impala 
> side:
>  * The set of PK columns is not alterable, so we don't have to handle the 
> case where different EQ-delete files have different equality field ID lists.
>  * Flink's ALTER TABLE for Iceberg tables doesn't allow partition and schema 
> evolution. We can reject queries on eq-delete tables where there was 
> partition or schema evolution.
>  * As eq-deletes are written to NOT NULL PKs, we could omit the case where 
> there are NULLs in the eq-delete file. (Update: this seemed easy to solve, 
> so it will be part of this patch.)
>  * For partitioned tables Flink requires the partition columns to be part of 
> the PK. As a result each EQ-delete file will have the partition values too, 
> so there is no need for extra logic to check that the partition spec ID and 
> the partition values match between the data and delete files.






[jira] [Created] (IMPALA-12600) Support equality deletes when table has partition or schema evolution

2023-12-05 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12600:
-

 Summary: Support equality deletes when table has partition or 
schema evolution
 Key: IMPALA-12600
 URL: https://issues.apache.org/jira/browse/IMPALA-12600
 Project: IMPALA
  Issue Type: Sub-task
Reporter: Gabor Kaszab


With the basic equality delete read support added, we reject queries for 
Iceberg tables that have equality delete files and have undergone partition or 
schema evolution. This ticket is to extend the functionality to support such 
tables.






[jira] [Created] (IMPALA-12599) Support equality delete files that don't contain the partition values

2023-12-05 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12599:
-

 Summary: Support equality delete files that don't contain the 
partition values
 Key: IMPALA-12599
 URL: https://issues.apache.org/jira/browse/IMPALA-12599
 Project: IMPALA
  Issue Type: Sub-task
  Components: Frontend
Reporter: Gabor Kaszab


When you write equality delete files with Flink, the partition columns also 
have to be part of the primary key. As a result, the partition values will be 
written into the equality delete files. However, the Iceberg spec is more 
flexible than that: it is also valid for the partition values not to be written 
into the eq-delete files.

To be able to read such tables, Impala should also check that the partition 
spec and the partition values match between the data and delete files when 
applying the delete rows. This could be achieved by adding virtual columns and 
conjuncts for the partition spec IDs and also for the partition values. These 
virtual columns already exist, but they have to be added to the scan nodes, and 
the conjuncts have to be created for the ANTI JOIN node.
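The extended join condition can be sketched as a predicate. This is an illustrative sketch with hypothetical names, not Impala's planner code; it only shows the extra conjuncts the ANTI JOIN would gain:

```java
import java.util.Objects;

// Hedged sketch: when eq-delete files lack partition values, a delete row may
// only remove data rows from the same partition. The join condition therefore
// gains conjuncts on the virtual partition spec ID and partition value columns
// in addition to the equality (PK) columns.
class PartitionAwareMatch {
  static boolean deletes(int dataSpecId, String dataPartVal, String dataPk,
                         int delSpecId, String delPartVal, String delPk) {
    return dataSpecId == delSpecId              // same partition spec
        && Objects.equals(dataPartVal, delPartVal) // same partition value
        && Objects.equals(dataPk, delPk);       // equality field match
  }

  public static void main(String[] args) {
    System.out.println(deletes(0, "p=1", "k1", 0, "p=1", "k1")); // matches
    System.out.println(deletes(0, "p=1", "k1", 0, "p=2", "k1")); // different partition
  }
}
```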






[jira] [Created] (IMPALA-12598) Add support for multiple equality field ID list

2023-12-05 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12598:
-

 Summary: Add support for multiple equality field ID list
 Key: IMPALA-12598
 URL: https://issues.apache.org/jira/browse/IMPALA-12598
 Project: IMPALA
  Issue Type: Sub-task
  Components: Frontend
Reporter: Gabor Kaszab


Iceberg metadata holds an equality field ID list for each equality-delete file. 
Different equality-delete files can have different equality field ID lists; for 
instance, one file deletes by columnA while another file deletes by columnB.

When you have such a table, you need multiple layers of ANTI JOINs, one join 
for each equality field ID list.






[jira] [Created] (IMPALA-12597) Basic equality delete support

2023-12-05 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12597:
-

 Summary: Basic equality delete support
 Key: IMPALA-12597
 URL: https://issues.apache.org/jira/browse/IMPALA-12597
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend, Frontend
Reporter: Gabor Kaszab


To split up the Equality-delete read support task, let's deliver a patch for 
some initial support first. The idea here is that apparently Flink (one of the 
engines that can write equality delete files) can write only a subset of the 
possible equality delete use cases that are allowed by the Iceberg spec.

So as a first step let's deliver the functionality that is required to read the 
EQ-deletes written by Flink. The use case: Flink writes EQ-deletes for tables 
in upsert mode (a primary key is a must in this case). To guarantee the 
uniqueness of the primary key fields, for each insert (that is in fact an 
upsert) Flink writes one delete file to remove the previous row with the given 
PK (even if there hasn't been any) and then writes data files with the new 
data.

How we can narrow down the functionality to be implemented on the Impala side:
 * The set of PK columns is not alterable, so we don't have to handle the case 
where different EQ-delete files have different equality field ID lists.
 * Flink's ALTER TABLE for Iceberg tables doesn't allow partition and schema 
evolution. We can reject queries on eq-delete tables where there was partition 
or schema evolution.
 * As eq-deletes are written to NOT NULL PKs, we could omit the case where 
there are NULLs in the eq-delete file. (Update: this seemed easy to solve, so 
it will be part of this patch.)
 * For partitioned tables Flink requires the partition columns to be part of 
the PK. As a result each EQ-delete file will have the partition values too, so 
there is no need for extra logic to check that the partition spec ID and the 
partition values match between the data and delete files.






[jira] [Assigned] (IMPALA-12597) Basic equality delete support

2023-12-05 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab reassigned IMPALA-12597:
-

Assignee: Gabor Kaszab

> Basic equality delete support
> -
>
> Key: IMPALA-12597
> URL: https://issues.apache.org/jira/browse/IMPALA-12597
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend, Frontend
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> To split up the Equality-delete read support task, let's deliver a patch for 
> some initial support first. The idea here is that apparently Flink (one of 
> the engines that can write equality delete files) can write only a subset of 
> the possible equality delete use cases that are allowed by the Iceberg spec.
> So as a first step let's deliver the functionality that is required to read 
> the EQ-deletes written by Flink. The use case: Flink writes EQ-deletes for 
> tables in upsert mode (a primary key is a must in this case). To guarantee 
> the uniqueness of the primary key fields, for each insert (that is in fact 
> an upsert) Flink writes one delete file to remove the previous row with the 
> given PK (even if there hasn't been any) and then writes data files with the 
> new data.
> How we can narrow down the functionality to be implemented on the Impala 
> side:
>  * The set of PK columns is not alterable, so we don't have to handle the 
> case where different EQ-delete files have different equality field ID lists.
>  * Flink's ALTER TABLE for Iceberg tables doesn't allow partition and schema 
> evolution. We can reject queries on eq-delete tables where there was 
> partition or schema evolution.
>  * As eq-deletes are written to NOT NULL PKs, we could omit the case where 
> there are NULLs in the eq-delete file. (Update: this seemed easy to solve, 
> so it will be part of this patch.)
>  * For partitioned tables Flink requires the partition columns to be part of 
> the PK. As a result each EQ-delete file will have the partition values too, 
> so there is no need for extra logic to check that the partition spec ID and 
> the partition values match between the data and delete files.






[jira] [Updated] (IMPALA-11388) Add support for equality-based deletes

2023-12-05 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-11388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-11388:
--
Issue Type: Epic  (was: New Feature)

> Add support for equality-based deletes
> --
>
> Key: IMPALA-11388
> URL: https://issues.apache.org/jira/browse/IMPALA-11388
> Project: IMPALA
>  Issue Type: Epic
>  Components: Frontend
>Reporter: Zoltán Borók-Nagy
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> Iceberg V2 adds support for row-level modifications.
> One way to implement this is via equality-based delete files:
> https://iceberg.apache.org/spec/#equality-delete-files
> https://iceberg.apache.org/spec/#scan-planning
> We could implement this by doing an ANTI HASH JOIN between the data and 
> delete files, similarly to what we do for Hive full ACID tables:
> https://github.com/apache/impala/blob/f5fc08573352d0a1943296209791a4db17268086/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L1729-L1735
> The complexity comes when different delete files use different sets of 
> columns. In that case we will need multiple ANTI HASH JOINs on top of each 
> other.






[jira] [Updated] (IMPALA-11388) Add support for equality-based deletes

2023-12-05 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-11388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-11388:
--
Epic Link: (was: IMPALA-11386)

> Add support for equality-based deletes
> --
>
> Key: IMPALA-11388
> URL: https://issues.apache.org/jira/browse/IMPALA-11388
> Project: IMPALA
>  Issue Type: Epic
>  Components: Frontend
>Reporter: Zoltán Borók-Nagy
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> Iceberg V2 adds support for row-level modifications.
> One way to implement this is via equality-based delete files:
> https://iceberg.apache.org/spec/#equality-delete-files
> https://iceberg.apache.org/spec/#scan-planning
> We could implement this by doing an ANTI HASH JOIN between the data and 
> delete files, similarly to what we do for Hive full ACID tables:
> https://github.com/apache/impala/blob/f5fc08573352d0a1943296209791a4db17268086/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L1729-L1735
> The complexity comes when different delete files use different sets of 
> columns. In that case we will need multiple ANTI HASH JOINs on top of each 
> other.






[jira] [Resolved] (IMPALA-12308) Implement DIRECTED distribution mode for Iceberg tables

2023-11-22 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab resolved IMPALA-12308.
---
Fix Version/s: Impala 4.4.0
   Resolution: Fixed

> Implement DIRECTED distribution mode for Iceberg tables
> ---
>
> Key: IMPALA-12308
> URL: https://issues.apache.org/jira/browse/IMPALA-12308
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend, Frontend
>Reporter: Zoltán Borók-Nagy
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg, performance
> Fix For: Impala 4.4.0
>
>
> Currently there are two distribution modes for JOIN-operators:
> * BROADCAST: RHS is delivered to all executors of LHS
> * PARTITIONED: both LHS and RHS are shuffled across executors
> We implement reading of an Iceberg V2 table (with position delete files) via 
> an ANTI JOIN operator. LHS is the SCAN operator of the data records, RHS is 
> the SCAN operator of the delete records. The delete records contain 
> (file_path, pos) information about the deleted rows.
> This means we can introduce another distribution mode, just for Iceberg V2 
> tables with position deletes: DIRECTED distribution mode.
> At scheduling we must save information about the data SCAN operators, i.e. 
> on which nodes they are going to be executed. The LHS doesn't need to be 
> shuffled over the network.
> The delete records of the RHS can use the scheduling information to transfer 
> delete records to the hosts that process the corresponding data files.
> This minimizes network communication.
> We can also add further optimizations to the Iceberg V2 operator 
> (IcebergDeleteNode):
> * Compare the pointers of the file paths instead of doing a string compare
> * Each tuple in a row batch belongs to the same file, and positions are in 
> ascending order
> ** Only one lookup is needed from the hash table
> ** We can add fast paths to skip testing the whole row batch (when the row 
> batch's position range is outside of the delete position range)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-12543) test_iceberg_self_events failed in JDK11 build

2023-11-06 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783138#comment-17783138
 ] 

Gabor Kaszab commented on IMPALA-12543:
---

Hey [~rizaon],

Does this test fail consistently? IMPALA-11387 seems pretty unrelated to me. 
Isn't it possible that this test is simply flaky?

> test_iceberg_self_events failed in JDK11 build
> --
>
> Key: IMPALA-12543
> URL: https://issues.apache.org/jira/browse/IMPALA-12543
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Riza Suminto
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: broken-build
>
> test_iceberg_self_events failed in JDK11 build with following error.
>  
> {code:java}
> Error Message
> assert 0 == 1
> Stacktrace
> custom_cluster/test_events_custom_configs.py:637: in test_iceberg_self_events
>     check_self_events("ALTER TABLE {0} ADD COLUMN j INT".format(tbl_name))
> custom_cluster/test_events_custom_configs.py:624: in check_self_events
>     assert tbls_refreshed_before == tbls_refreshed_after
> E   assert 0 == 1 {code}
> This test still passed before IMPALA-11387 was merged.
>  






[jira] [Updated] (IMPALA-12457) Conversion from non-supported column types for Iceberg tables

2023-09-22 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-12457:
--
Issue Type: Improvement  (was: New Feature)

> Conversion from non-supported column types for Iceberg tables
> -
>
> Key: IMPALA-12457
> URL: https://issues.apache.org/jira/browse/IMPALA-12457
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend
>Reporter: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> Assume you have a Hive table with one VARCHAR(N) column. The following now 
> fails:
> CREATE TABLE ice_tbl STORED AS ICEBERG AS SELECT * FROM hive_tbl;
> It fails because VARCHAR(N) is not a supported Iceberg column type. Note, 
> plain VARCHAR works because it's just a string under the hood.
> I think this behaviour is just fine, and Hive also gives an error for the 
> above. However, Hive has a switch, 'iceberg.mr.schema.auto.conversion', 
> that, when turned on, makes Hive automatically convert VARCHAR(N) into 
> STRING. SMALLINT and TINYINT could also be converted into INT.
> It would be nice to have something similar in Impala.
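A minimal sketch of the kind of auto-conversion the ticket asks for, mirroring the behaviour Hive enables via 'iceberg.mr.schema.auto.conversion'. The function name and rule set here are assumptions for illustration, not Impala code:

```python
import re

def convert_type(col_type: str) -> str:
    """Map a column type to a supported Iceberg type (hypothetical helper):
    VARCHAR(N) -> STRING, TINYINT/SMALLINT -> INT, everything else kept."""
    t = col_type.strip().lower()
    if re.fullmatch(r"varchar\(\d+\)", t):
        return "string"   # length-bounded varchar loses its bound
    if t in ("tinyint", "smallint"):
        return "int"      # widen to a supported integer type
    return t              # already a supported Iceberg type

print(convert_type("VARCHAR(10)"))  # string
print(convert_type("smallint"))     # int
```

With such a mapping, the failing CTAS above could transparently rewrite the source schema instead of erroring out.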






[jira] [Created] (IMPALA-12457) Conversion from non-supported column types for Iceberg tables

2023-09-22 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12457:
-

 Summary: Conversion from non-supported column types for Iceberg 
tables
 Key: IMPALA-12457
 URL: https://issues.apache.org/jira/browse/IMPALA-12457
 Project: IMPALA
  Issue Type: New Feature
  Components: Frontend
Reporter: Gabor Kaszab


Assume you have a Hive table with one VARCHAR(N) column. The following now 
fails:

CREATE TABLE ice_tbl STORED AS ICEBERG AS SELECT * FROM hive_tbl;

It fails because VARCHAR(N) is not a supported Iceberg column type. Note, 
plain VARCHAR works because it's just a string under the hood.

I think this behaviour is just fine, and Hive also gives an error for the 
above. However, Hive has a switch, 'iceberg.mr.schema.auto.conversion', that, 
when turned on, makes Hive automatically convert VARCHAR(N) into STRING. 
SMALLINT and TINYINT could also be converted into INT.

It would be nice to have something similar in Impala.






[jira] [Updated] (IMPALA-12409) Don't allow EXTERNAL Iceberg tables to point another Iceberg table in Hive catalog

2023-08-31 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-12409:
--
Description: 
We shouldn't allow users to create an EXTERNAL Iceberg table that points to 
another Iceberg table, i.e. the following should be forbidden:
{noformat}
CREATE EXTERNAL TABLE ice_ext
STORED BY ICEBERG
TBLPROPERTIES ('iceberg.table_identifier'='db.tbl');{noformat}

  was:
We shouldn't allow users creating an EXTERNAL Iceberg table that points to 
another Iceberg catalog. I.e. the following should be forbidden:
{noformat}
CREATE EXTERNAL TABLE ice_ext
STORED BY ICEBERG
TBLPROPERTIES ('iceberg.table_identifier'='db.tbl');{noformat}


> Don't allow EXTERNAL Iceberg tables to point another Iceberg table in Hive 
> catalog
> --
>
> Key: IMPALA-12409
> URL: https://issues.apache.org/jira/browse/IMPALA-12409
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog, Frontend
>Reporter: Zoltán Borók-Nagy
>Assignee: Zoltán Borók-Nagy
>Priority: Major
>  Labels: impala-iceberg
>
> We shouldn't allow users to create an EXTERNAL Iceberg table that points to 
> another Iceberg table, i.e. the following should be forbidden:
> {noformat}
> CREATE EXTERNAL TABLE ice_ext
> STORED BY ICEBERG
> TBLPROPERTIES ('iceberg.table_identifier'='db.tbl');{noformat}






[jira] [Commented] (IMPALA-12410) Impala's CONVERT TO ICEBERG statement does not retain table properties

2023-08-31 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760756#comment-17760756
 ] 

Gabor Kaszab commented on IMPALA-12410:
---

Note, we have to be careful with migrating the table properties. E.g. if a 
user had set 'iceberg.table_identifier' then we don't want to migrate that 
property, as it could point to another Iceberg table, which is in fact 
restricted by https://issues.apache.org/jira/browse/IMPALA-12409

I think the properties with the 'iceberg.' prefix shouldn't be kept during 
migration. Not sure about 'name', but we might want to drop that as well.
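The filtering proposed in this comment could look roughly like the sketch below. The helper name and the exact drop list ('name' and 'metadata_location' in addition to the 'iceberg.' prefix) are assumptions based on this discussion:

```python
# Hypothetical sketch: keep only table properties safe to carry over when
# migrating a table, dropping Iceberg-reserved ones.
DROPPED_EXACT = {"name", "metadata_location"}

def migratable_properties(props: dict) -> dict:
    """Return the subset of properties that should survive migration."""
    return {k: v for k, v in props.items()
            if not k.startswith("iceberg.") and k not in DROPPED_EXACT}

props = {"owner": "alice",
         "iceberg.table_identifier": "db.tbl",  # must not be migrated
         "name": "db.tbl",
         "comment": "demo"}
kept = migratable_properties(props)
print(kept)  # {'owner': 'alice', 'comment': 'demo'}
```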

> Impala's CONVERT TO ICEBERG statement does not retain table properties
> --
>
> Key: IMPALA-12410
> URL: https://issues.apache.org/jira/browse/IMPALA-12410
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Reporter: Zoltán Borók-Nagy
>Priority: Major
>  Labels: impala-iceberg
>
> Impala's CONVERT TO ICEBERG statement does not retain table properties.
> Table properties should be retained except the ones used by Iceberg, e.g.:
>  * metadata_location
>  * iceberg.table_identifier
>  * name
>  * 
> iceberg.mr.table.identifier






[jira] [Commented] (IMPALA-12266) Sporadic failure after migrating a table to Iceberg

2023-08-23 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758001#comment-17758001
 ] 

Gabor Kaszab commented on IMPALA-12266:
---

With the repro steps I found 3 different issues that can randomly occur: 1) 
the one mentioned in the description, 2) "Could not resolve path", 3) "Table 
does not exist". I believe that all 3 have the same root cause.

> Sporadic failure after migrating a table to Iceberg
> ---
>
> Key: IMPALA-12266
> URL: https://issues.apache.org/jira/browse/IMPALA-12266
> Project: IMPALA
>  Issue Type: Bug
>  Components: fe
>Affects Versions: Impala 4.2.0
>Reporter: Tamas Mate
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
> Attachments: 
> catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, 
> impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1
>
>
> TestIcebergTable.test_convert_table test failed in a recent verify job's 
> dockerised tests:
> https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629
> {code:none}
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> EINNER EXCEPTION: 
> EMESSAGE: AnalysisException: Failed to load metadata for table: 
> 'parquet_nopartitioned'
> E   CAUSED BY: TableLoadingException: Could not load table 
> test_convert_table_cdba7383.parquet_nopartitioned from catalog
> E   CAUSED BY: TException: 
> TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, 
> error_msgs:[NullPointerException: null]), lookup_status:OK)
> {code}
> {code:none}
> E0704 19:09:22.980131   833 JniUtil.java:183] 
> 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of 
> TABLE:test_convert_table_cdba7383.parquet_nopartitioned. Time spent: 49ms
> I0704 19:09:22.980309   833 jni-util.cc:288] 
> 7145c21173f2c47b:2579db55] java.lang.NullPointerException
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480)
>   at 
> org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397)
>   at 
> org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
>   at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109)
>   at 
> org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238)
>   at 
> org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396)
> I0704 19:09:22.980324   833 status.cc:129] 7145c21173f2c47b:2579db55] 
> NullPointerException: null
> @  0x1012f9f  impala::Status::Status()
> @  0x187f964  impala::JniUtil::GetJniExceptionMsg()
> @   0xfee920  impala::JniCall::Call<>()
> @   0xfccd0f  impala::Catalog::GetPartialCatalogObject()
> @   0xfb55a5  
> impala::CatalogServiceThriftIf::GetPartialCatalogObject()
> @   0xf7a691  
> impala::CatalogServiceProcessorT<>::process_GetPartialCatalogObject()
> @   0xf82151  impala::CatalogServiceProcessorT<>::dispatchCall()
> @   0xee330f  apache::thrift::TDispatchProcessor::process()
> @  0x1329246  
> apache::thrift::server::TAcceptQueueServer::Task::run()
> @  0x1315a89  impala::ThriftThread::RunRunnable()
> @  0x131773d  
> boost::detail::function::void_function_obj_invoker0<>::invoke()
> @  0x195ba8c  impala::Thread::SuperviseThread()
> @  0x195c895  boost::detail::thread_data<>::run()
> @  0x23a03a7  thread_proxy
> @ 0x7faaad2a66ba  start_thread
> @ 0x7f2c151d  clone
> E0704 19:09:23.006968   833 catalog-server.cc:278] 
> 7145c21173f2c47b:2579db55] NullPointerException: null
> {code}






[jira] [Updated] (IMPALA-12266) Sporadic failure after migrating a table to Iceberg

2023-08-22 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-12266:
--
Labels: impala-iceberg  (was: )

> Sporadic failure after migrating a table to Iceberg
> ---
>
> Key: IMPALA-12266
> URL: https://issues.apache.org/jira/browse/IMPALA-12266
> Project: IMPALA
>  Issue Type: Bug
>  Components: fe
>Affects Versions: Impala 4.2.0
>Reporter: Tamas Mate
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
> Attachments: 
> catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, 
> impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1
>
>
> TestIcebergTable.test_convert_table test failed in a recent verify job's 
> dockerised tests:
> https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629
> {code:none}
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> EINNER EXCEPTION: 
> EMESSAGE: AnalysisException: Failed to load metadata for table: 
> 'parquet_nopartitioned'
> E   CAUSED BY: TableLoadingException: Could not load table 
> test_convert_table_cdba7383.parquet_nopartitioned from catalog
> E   CAUSED BY: TException: 
> TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, 
> error_msgs:[NullPointerException: null]), lookup_status:OK)
> {code}
> {code:none}
> E0704 19:09:22.980131   833 JniUtil.java:183] 
> 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of 
> TABLE:test_convert_table_cdba7383.parquet_nopartitioned. Time spent: 49ms
> I0704 19:09:22.980309   833 jni-util.cc:288] 
> 7145c21173f2c47b:2579db55] java.lang.NullPointerException
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480)
>   at 
> org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397)
>   at 
> org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
>   at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109)
>   at 
> org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238)
>   at 
> org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396)
> I0704 19:09:22.980324   833 status.cc:129] 7145c21173f2c47b:2579db55] 
> NullPointerException: null
> @  0x1012f9f  impala::Status::Status()
> @  0x187f964  impala::JniUtil::GetJniExceptionMsg()
> @   0xfee920  impala::JniCall::Call<>()
> @   0xfccd0f  impala::Catalog::GetPartialCatalogObject()
> @   0xfb55a5  
> impala::CatalogServiceThriftIf::GetPartialCatalogObject()
> @   0xf7a691  
> impala::CatalogServiceProcessorT<>::process_GetPartialCatalogObject()
> @   0xf82151  impala::CatalogServiceProcessorT<>::dispatchCall()
> @   0xee330f  apache::thrift::TDispatchProcessor::process()
> @  0x1329246  
> apache::thrift::server::TAcceptQueueServer::Task::run()
> @  0x1315a89  impala::ThriftThread::RunRunnable()
> @  0x131773d  
> boost::detail::function::void_function_obj_invoker0<>::invoke()
> @  0x195ba8c  impala::Thread::SuperviseThread()
> @  0x195c895  boost::detail::thread_data<>::run()
> @  0x23a03a7  thread_proxy
> @ 0x7faaad2a66ba  start_thread
> @ 0x7f2c151d  clone
> E0704 19:09:23.006968   833 catalog-server.cc:278] 
> 7145c21173f2c47b:2579db55] NullPointerException: null
> {code}






[jira] [Commented] (IMPALA-12266) Sporadic failure after migrating a table to Iceberg

2023-08-22 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757380#comment-17757380
 ] 

Gabor Kaszab commented on IMPALA-12266:
---

I managed to repro this with running the following SQL in a loop:
{code:java}
create table tmp_conv_tbl (i int) stored as parquet;
insert into tmp_conv_tbl values (1), (2), (3);
alter table tmp_conv_tbl convert to iceberg;
alter table tmp_conv_tbl set tblproperties ('format-version'='2');
drop table tmp_conv_tbl; {code}
For me the DROP TABLE statement failed with a "Table does not exist" error. I 
guess it depends on which command is run on a different coordinator after the 
table conversion.

Note that this repro happened in local catalog mode; however, I wouldn't be 
surprised if it repro-ed in normal catalog mode too.

This is how I enabled local catalog mode:
{code:java}
bin/start-impala-cluster.py --impalad_args='--use_local_catalog=true' 
--catalogd_args='--catalog_topic_mode=minimal' {code}

> Sporadic failure after migrating a table to Iceberg
> ---
>
> Key: IMPALA-12266
> URL: https://issues.apache.org/jira/browse/IMPALA-12266
> Project: IMPALA
>  Issue Type: Bug
>  Components: fe
>Affects Versions: Impala 4.2.0
>Reporter: Tamas Mate
>Assignee: Gabor Kaszab
>Priority: Major
> Attachments: 
> catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, 
> impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1
>
>
> TestIcebergTable.test_convert_table test failed in a recent verify job's 
> dockerised tests:
> https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629
> {code:none}
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> EINNER EXCEPTION: 
> EMESSAGE: AnalysisException: Failed to load metadata for table: 
> 'parquet_nopartitioned'
> E   CAUSED BY: TableLoadingException: Could not load table 
> test_convert_table_cdba7383.parquet_nopartitioned from catalog
> E   CAUSED BY: TException: 
> TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, 
> error_msgs:[NullPointerException: null]), lookup_status:OK)
> {code}
> {code:none}
> E0704 19:09:22.980131   833 JniUtil.java:183] 
> 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of 
> TABLE:test_convert_table_cdba7383.parquet_nopartitioned. Time spent: 49ms
> I0704 19:09:22.980309   833 jni-util.cc:288] 
> 7145c21173f2c47b:2579db55] java.lang.NullPointerException
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480)
>   at 
> org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397)
>   at 
> org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
>   at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109)
>   at 
> org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238)
>   at 
> org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396)
> I0704 19:09:22.980324   833 status.cc:129] 7145c21173f2c47b:2579db55] 
> NullPointerException: null
> @  0x1012f9f  impala::Status::Status()
> @  0x187f964  impala::JniUtil::GetJniExceptionMsg()
> @   0xfee920  impala::JniCall::Call<>()
> @   0xfccd0f  impala::Catalog::GetPartialCatalogObject()
> @   0xfb55a5  
> impala::CatalogServiceThriftIf::GetPartialCatalogObject()
> @   0xf7a691  
> impala::CatalogServiceProcessorT<>::process_GetPartialCatalogObject()
> @   0xf82151  impala::CatalogServiceProcessorT<>::dispatchCall()
> @   0xee330f  apache::thrift::TDispatchProcessor::process()
> @  0x1329246  
> apache::thrift::server::TAcceptQueueServer::Task::run()
> @  0x1315a89  impala::ThriftThread::RunRunnable()
> @  0x131773d  
> boost::detail::function::void_function_obj_invoker0<>::invoke()
> @  0x195ba8c  impala::Thread::SuperviseThread()
> @  0x195c895  boost::detail::thread_data<>::run()
> @ 

[jira] [Updated] (IMPALA-12266) Sporadic failure after migrating a table to Iceberg

2023-08-22 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-12266:
--
Summary: Sporadic failure after migrating a table to Iceberg  (was: Flaky 
TestIcebergTable.test_convert_table NPE)

> Sporadic failure after migrating a table to Iceberg
> ---
>
> Key: IMPALA-12266
> URL: https://issues.apache.org/jira/browse/IMPALA-12266
> Project: IMPALA
>  Issue Type: Bug
>  Components: fe
>Affects Versions: Impala 4.2.0
>Reporter: Tamas Mate
>Assignee: Gabor Kaszab
>Priority: Major
> Attachments: 
> catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, 
> impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1
>
>
> TestIcebergTable.test_convert_table test failed in a recent verify job's 
> dockerised tests:
> https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629
> {code:none}
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> EINNER EXCEPTION: 
> EMESSAGE: AnalysisException: Failed to load metadata for table: 
> 'parquet_nopartitioned'
> E   CAUSED BY: TableLoadingException: Could not load table 
> test_convert_table_cdba7383.parquet_nopartitioned from catalog
> E   CAUSED BY: TException: 
> TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, 
> error_msgs:[NullPointerException: null]), lookup_status:OK)
> {code}
> {code:none}
> E0704 19:09:22.980131   833 JniUtil.java:183] 
> 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of 
> TABLE:test_convert_table_cdba7383.parquet_nopartitioned. Time spent: 49ms
> I0704 19:09:22.980309   833 jni-util.cc:288] 
> 7145c21173f2c47b:2579db55] java.lang.NullPointerException
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480)
>   at 
> org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397)
>   at 
> org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
>   at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109)
>   at 
> org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238)
>   at 
> org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396)
> I0704 19:09:22.980324   833 status.cc:129] 7145c21173f2c47b:2579db55] 
> NullPointerException: null
> @  0x1012f9f  impala::Status::Status()
> @  0x187f964  impala::JniUtil::GetJniExceptionMsg()
> @   0xfee920  impala::JniCall::Call<>()
> @   0xfccd0f  impala::Catalog::GetPartialCatalogObject()
> @   0xfb55a5  
> impala::CatalogServiceThriftIf::GetPartialCatalogObject()
> @   0xf7a691  
> impala::CatalogServiceProcessorT<>::process_GetPartialCatalogObject()
> @   0xf82151  impala::CatalogServiceProcessorT<>::dispatchCall()
> @   0xee330f  apache::thrift::TDispatchProcessor::process()
> @  0x1329246  
> apache::thrift::server::TAcceptQueueServer::Task::run()
> @  0x1315a89  impala::ThriftThread::RunRunnable()
> @  0x131773d  
> boost::detail::function::void_function_obj_invoker0<>::invoke()
> @  0x195ba8c  impala::Thread::SuperviseThread()
> @  0x195c895  boost::detail::thread_data<>::run()
> @  0x23a03a7  thread_proxy
> @ 0x7faaad2a66ba  start_thread
> @ 0x7f2c151d  clone
> E0704 19:09:23.006968   833 catalog-server.cc:278] 
> 7145c21173f2c47b:2579db55] NullPointerException: null
> {code}






[jira] [Commented] (IMPALA-12266) Flaky TestIcebergTable.test_convert_table NPE

2023-08-22 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757376#comment-17757376
 ] 

Gabor Kaszab commented on IMPALA-12266:
---

This is actually more than just a flaky test, as it occurs in various 
scenarios. I'll rename the ticket to reflect this.

> Flaky TestIcebergTable.test_convert_table NPE
> -
>
> Key: IMPALA-12266
> URL: https://issues.apache.org/jira/browse/IMPALA-12266
> Project: IMPALA
>  Issue Type: Bug
>  Components: fe
>Affects Versions: Impala 4.2.0
>Reporter: Tamas Mate
>Assignee: Gabor Kaszab
>Priority: Major
> Attachments: 
> catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, 
> impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1
>
>
> TestIcebergTable.test_convert_table test failed in a recent verify job's 
> dockerised tests:
> https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629
> {code:none}
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> EINNER EXCEPTION: 
> EMESSAGE: AnalysisException: Failed to load metadata for table: 
> 'parquet_nopartitioned'
> E   CAUSED BY: TableLoadingException: Could not load table 
> test_convert_table_cdba7383.parquet_nopartitioned from catalog
> E   CAUSED BY: TException: 
> TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, 
> error_msgs:[NullPointerException: null]), lookup_status:OK)
> {code}
> {code:none}
> E0704 19:09:22.980131   833 JniUtil.java:183] 
> 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of 
> TABLE:test_convert_table_cdba7383.parquet_nopartitioned. Time spent: 49ms
> I0704 19:09:22.980309   833 jni-util.cc:288] 
> 7145c21173f2c47b:2579db55] java.lang.NullPointerException
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480)
>   at 
> org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397)
>   at 
> org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
>   at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109)
>   at 
> org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238)
>   at 
> org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396)
> I0704 19:09:22.980324   833 status.cc:129] 7145c21173f2c47b:2579db55] 
> NullPointerException: null
> @  0x1012f9f  impala::Status::Status()
> @  0x187f964  impala::JniUtil::GetJniExceptionMsg()
> @   0xfee920  impala::JniCall::Call<>()
> @   0xfccd0f  impala::Catalog::GetPartialCatalogObject()
> @   0xfb55a5  
> impala::CatalogServiceThriftIf::GetPartialCatalogObject()
> @   0xf7a691  
> impala::CatalogServiceProcessorT<>::process_GetPartialCatalogObject()
> @   0xf82151  impala::CatalogServiceProcessorT<>::dispatchCall()
> @   0xee330f  apache::thrift::TDispatchProcessor::process()
> @  0x1329246  
> apache::thrift::server::TAcceptQueueServer::Task::run()
> @  0x1315a89  impala::ThriftThread::RunRunnable()
> @  0x131773d  
> boost::detail::function::void_function_obj_invoker0<>::invoke()
> @  0x195ba8c  impala::Thread::SuperviseThread()
> @  0x195c895  boost::detail::thread_data<>::run()
> @  0x23a03a7  thread_proxy
> @ 0x7faaad2a66ba  start_thread
> @ 0x7f2c151d  clone
> E0704 19:09:23.006968   833 catalog-server.cc:278] 
> 7145c21173f2c47b:2579db55] NullPointerException: null
> {code}






[jira] [Updated] (IMPALA-12308) Implement DIRECTED distribution mode for Iceberg tables

2023-08-07 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-12308:
--
Issue Type: Improvement  (was: Bug)

> Implement DIRECTED distribution mode for Iceberg tables
> ---
>
> Key: IMPALA-12308
> URL: https://issues.apache.org/jira/browse/IMPALA-12308
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend, Frontend
>Reporter: Zoltán Borók-Nagy
>Priority: Major
>  Labels: impala-iceberg, performance
>
> Currently there are two distribution modes for JOIN-operators:
> * BROADCAST: RHS is delivered to all executors of LHS
> * PARTITIONED: both LHS and RHS are shuffled across executors
> We implement reading of an Iceberg V2 table (with position delete files) via 
> an ANTI JOIN operator. LHS is the SCAN operator of the data records, RHS is 
> the SCAN operator of the delete records. The delete record contain 
> (file_path, pos) information of the deleted rows.
> This means we can invent another distribution mode, just for Iceberg V2 
> tables with position deletes: DIRECTED distribution mode.
> At scheduling we must save the information about data SCAN operators, i.e. on 
> which nodes are they going to be executed. The LHS don't need to be shuffled 
> over the network.
> The delete records of RHS can use the scheduling information to transfer 
> delete records to the hosts that process the corresponding data file.
> This minimizes network communication.
> We can also add further optimizations to the Iceberg V2 operator 
> (IcebergDeleteNode):
> * Compare the pointers of the file paths instead of doing string compare
> * Each tuple in a rowbatch belong to the same file, and positions are in 
> ascending order
> ** Onlyone lookup is needed from the Hash table
> ** We can add fast paths to skip testing the whole rowbatch (when the row 
> batch's position range is outside of the delete position range)
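The two IcebergDeleteNode optimizations above can be sketched as follows. This is a minimal Java model under assumptions, not Impala's actual C++ implementation; the class and member names (RowBatchMatcher, deletePositions) are hypothetical.

```java
import java.util.Arrays;

// Sketch: interned file paths compared by reference, plus a single merge pass
// over ascending positions with a range-check fast path.
class RowBatchMatcher {
  private final String internedPath;    // canonical (interned) data file path
  private final long[] deletePositions; // sorted delete positions for that file

  RowBatchMatcher(String internedPath, long[] deletePositions) {
    this.internedPath = internedPath;
    this.deletePositions = deletePositions;
  }

  /** Returns one flag per row: true if the row survives (is NOT deleted). */
  boolean[] filter(String batchPath, long[] batchPositions) {
    boolean[] survives = new boolean[batchPositions.length];
    Arrays.fill(survives, true);
    // Reference comparison instead of String.equals(): valid only if both
    // sides hand out the same interned String instance.
    if (batchPath != internedPath || deletePositions.length == 0
        || batchPositions.length == 0) {
      return survives;
    }
    long lo = batchPositions[0];
    long hi = batchPositions[batchPositions.length - 1];
    // Fast path: the whole batch lies outside the delete position range.
    if (hi < deletePositions[0] || lo > deletePositions[deletePositions.length - 1]) {
      return survives;
    }
    // One binary-search lookup seeds a single forward merge over the batch.
    int d = Arrays.binarySearch(deletePositions, lo);
    if (d < 0) d = -d - 1;
    for (int i = 0; i < batchPositions.length && d < deletePositions.length; i++) {
      while (d < deletePositions.length && deletePositions[d] < batchPositions[i]) d++;
      if (d < deletePositions.length && deletePositions[d] == batchPositions[i]) {
        survives[i] = false;  // row is covered by a position delete
      }
    }
    return survives;
  }
}
```

The reference comparison is only safe if the planner guarantees both paths come from the same interned pool; otherwise a full string compare is still required.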



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Assigned] (IMPALA-12308) Implement DIRECTED distribution mode for Iceberg tables

2023-08-07 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab reassigned IMPALA-12308:
-

Assignee: Gabor Kaszab

> Implement DIRECTED distribution mode for Iceberg tables
> ---
>
> Key: IMPALA-12308
> URL: https://issues.apache.org/jira/browse/IMPALA-12308
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend, Frontend
>Reporter: Zoltán Borók-Nagy
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg, performance
>
> Currently there are two distribution modes for JOIN-operators:
> * BROADCAST: RHS is delivered to all executors of LHS
> * PARTITIONED: both LHS and RHS are shuffled across executors
> We implement reading of an Iceberg V2 table (with position delete files) via 
> an ANTI JOIN operator. LHS is the SCAN operator of the data records, RHS is 
> the SCAN operator of the delete records. The delete records contain the 
> (file_path, pos) information of the deleted rows.
> This means we can invent another distribution mode, just for Iceberg V2 
> tables with position deletes: DIRECTED distribution mode.
> At scheduling time we must save the information about the data SCAN operators, 
> i.e. on which nodes they are going to be executed. The LHS doesn't need to be 
> shuffled over the network.
> The delete records of RHS can use the scheduling information to transfer 
> delete records to the hosts that process the corresponding data file.
> This minimizes network communication.
> We can also add further optimizations to the Iceberg V2 operator 
> (IcebergDeleteNode):
> * Compare the pointers of the file paths instead of doing string compare
> * Each tuple in a row batch belongs to the same file, and positions are in 
> ascending order 
> ** Only one lookup is needed from the hash table
> ** We can add fast paths to skip testing the whole rowbatch (when the row 
> batch's position range is outside of the delete position range)






[jira] [Commented] (IMPALA-12266) Flaky TestIcebergTable.test_convert_table NPE

2023-07-17 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17743796#comment-17743796
 ] 

Gabor Kaszab commented on IMPALA-12266:
---

I checked the logs the other day, and it seems to me that the table migration 
to Iceberg was successful, but the first query on the converted table after the 
migration failed; I recall it was a simple select count(*). It's a bit strange 
that this is flaky. I suspect that the issue only occurs in the GVO build, most 
probably with local catalog mode turned on. So there might be a timing issue 
where we have converted the table but some of the coordinators don't yet see 
it under the original name.

> Flaky TestIcebergTable.test_convert_table NPE
> -
>
> Key: IMPALA-12266
> URL: https://issues.apache.org/jira/browse/IMPALA-12266
> Project: IMPALA
>  Issue Type: Bug
>  Components: fe
>Affects Versions: Impala 4.2.0
>Reporter: Tamas Mate
>Assignee: Gabor Kaszab
>Priority: Major
> Attachments: 
> catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, 
> impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1
>
>
> TestIcebergTable.test_convert_table test failed in a recent verify job's 
> dockerised tests:
> https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629
> {code:none}
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> EINNER EXCEPTION: 
> EMESSAGE: AnalysisException: Failed to load metadata for table: 
> 'parquet_nopartitioned'
> E   CAUSED BY: TableLoadingException: Could not load table 
> test_convert_table_cdba7383.parquet_nopartitioned from catalog
> E   CAUSED BY: TException: 
> TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, 
> error_msgs:[NullPointerException: null]), lookup_status:OK)
> {code}
> {code:none}
> E0704 19:09:22.980131   833 JniUtil.java:183] 
> 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of 
> TABLE:test_convert_table_cdba7383.parquet_nopartitioned. Time spent: 49ms
> I0704 19:09:22.980309   833 jni-util.cc:288] 
> 7145c21173f2c47b:2579db55] java.lang.NullPointerException
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513)
>   at 
> org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480)
>   at 
> org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397)
>   at 
> org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
>   at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
>   at 
> org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109)
>   at 
> org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238)
>   at 
> org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396)
> I0704 19:09:22.980324   833 status.cc:129] 7145c21173f2c47b:2579db55] 
> NullPointerException: null
> @  0x1012f9f  impala::Status::Status()
> @  0x187f964  impala::JniUtil::GetJniExceptionMsg()
> @   0xfee920  impala::JniCall::Call<>()
> @   0xfccd0f  impala::Catalog::GetPartialCatalogObject()
> @   0xfb55a5  
> impala::CatalogServiceThriftIf::GetPartialCatalogObject()
> @   0xf7a691  
> impala::CatalogServiceProcessorT<>::process_GetPartialCatalogObject()
> @   0xf82151  impala::CatalogServiceProcessorT<>::dispatchCall()
> @   0xee330f  apache::thrift::TDispatchProcessor::process()
> @  0x1329246  
> apache::thrift::server::TAcceptQueueServer::Task::run()
> @  0x1315a89  impala::ThriftThread::RunRunnable()
> @  0x131773d  
> boost::detail::function::void_function_obj_invoker0<>::invoke()
> @  0x195ba8c  impala::Thread::SuperviseThread()
> @  0x195c895  boost::detail::thread_data<>::run()
> @  0x23a03a7  thread_proxy
> @ 0x7faaad2a66ba  start_thread
> @ 0x7f2c151d  clone
> E0704 19:09:23.006968   833 catalog-server.cc:278] 
> 7145c21173f2c47b:2579db55] NullPointerException: null
> {code}





[jira] [Resolved] (IMPALA-11013) Support migrating external tables to Iceberg tables

2023-07-04 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-11013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab resolved IMPALA-11013.
---
Fix Version/s: Impala 4.3.0
   Resolution: Fixed

> Support migrating external tables to Iceberg tables
> ---
>
> Key: IMPALA-11013
> URL: https://issues.apache.org/jira/browse/IMPALA-11013
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog, Frontend
>Reporter: Zoltán Borók-Nagy
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
> Fix For: Impala 4.3.0
>
>
> E.g. Hive supports migrating external tables to Iceberg tables via the 
> following command:
> {noformat}
> ALTER TABLE t SET TBLPROPERTIES 
> ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler');
> {noformat}
> Maybe we could support table migration with the same command.






[jira] [Created] (IMPALA-12251) Table migration to run on multiple partitions in parallel

2023-06-28 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12251:
-

 Summary: Table migration to run on multiple partitions in parallel
 Key: IMPALA-12251
 URL: https://issues.apache.org/jira/browse/IMPALA-12251
 Project: IMPALA
  Issue Type: New Feature
  Components: Frontend
Reporter: Gabor Kaszab


https://issues.apache.org/jira/browse/IMPALA-11013 introduces table migration 
from legacy Hive tables to Iceberg tables. The parallelization in that patch is 
based on the files within a partition. But if there are a lot of partitions and 
only a few files in each, this approach is not performant.

Instead, as an improvement, we can implement parallelization based on 
partitions and then decide which approach to use based on the ratio of the 
number of partitions to the average number of files per partition.
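The proposed decision could be sketched roughly like this. The names and the threshold value are assumptions for illustration, not actual Impala frontend code.

```java
// Hypothetical sketch: parallelize across partitions when there are many
// partitions with few files each, otherwise across files within a partition.
class MigrationParallelism {
  enum Strategy { BY_PARTITION, BY_FILES_WITHIN_PARTITION }

  static Strategy choose(int numPartitions, long totalFiles) {
    if (numPartitions == 0) return Strategy.BY_FILES_WITHIN_PARTITION;
    double avgFilesPerPartition = (double) totalFiles / numPartitions;
    // Ratio of partition count to average files per partition: a large value
    // means partition-level parallelism offers more units of work.
    return numPartitions / avgFilesPerPartition > 1.0
        ? Strategy.BY_PARTITION
        : Strategy.BY_FILES_WITHIN_PARTITION;
  }
}
```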






[jira] [Assigned] (IMPALA-11013) Support migrating external tables to Iceberg tables

2023-06-15 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-11013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab reassigned IMPALA-11013:
-

Assignee: Gabor Kaszab  (was: Andrew Sherman)

> Support migrating external tables to Iceberg tables
> ---
>
> Key: IMPALA-11013
> URL: https://issues.apache.org/jira/browse/IMPALA-11013
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog, Frontend
>Reporter: Zoltán Borók-Nagy
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> E.g. Hive supports migrating external tables to Iceberg tables via the 
> following command:
> {noformat}
> ALTER TABLE t SET TBLPROPERTIES 
> ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler');
> {noformat}
> Maybe we could support table migration with the same command.






[jira] [Assigned] (IMPALA-12190) Renaming table will cause losing privileges for non-admin users

2023-06-13 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab reassigned IMPALA-12190:
-

Assignee: (was: Gabor Kaszab)

> Renaming table will cause losing privileges for non-admin users
> ---
>
> Key: IMPALA-12190
> URL: https://issues.apache.org/jira/browse/IMPALA-12190
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Gabor Kaszab
>Priority: Critical
>  Labels: alter-table, authorization, ranger
>
> Let's say user 'a' gets some privileges on table 't'. When this table gets 
> renamed (even by user 'a') then user 'a' loses its privileges on that table.
>  
> Repro steps:
>  # Start impala with Ranger
>  # start impala-shell as admin (-u admin)
>  # create table tmp (i int, s string) stored as parquet;
>  # grant all on table tmp to user ;
>  # grant all on table tmp to user ;
> {code:java}
> Query: show grant user  on table tmp
> | principal_type | principal_name | database | table | column | uri | storage_type | storage_uri | udf | privilege | grant_option | create_time |
> | USER           |                | default  | tmp   | *      |     |              |             |     | all       | false        | NULL        |
> Fetched 1 row(s) in 0.01s {code}
>  #  alter table tmp rename to tmp_1234;
>  # show grant user  on table tmp_1234;
> {code:java}
> Query: show grant user  on table tmp_1234
> Fetched 0 row(s) in 0.17s{code}






[jira] [Updated] (IMPALA-12209) format-version is not present in DESCRIBE FORMATTED and SHOW CREATE TABLE outputs

2023-06-13 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-12209:
--
Description: 
Repro:

 
{code:java}
create table tmp (i int, s string) stored as iceberg tblproperties 
('format-version'='2');
describe extended/formatted tmp;
show create table tmp; 
{code}
Current behaviour:

Neither of the two commands above includes 'format-version' in the output. 
Additionally, if you run what is returned from SHOW CREATE TABLE then you end 
up creating a V1 table instead of V2.

The reason might be that 'format-version' in the metadata.json is not stored 
within the table properties but one level above:
{code:java}
hdfs dfs -cat 
hdfs://localhost:20500/test-warehouse/tmp/metadata/0-55bcfe84-1819-4fb7-ade8-9c132b117880.metadata.json
{
  "format-version" : 2,
  "table-uuid" : "9f11c0c4-02c7-4688-823c-fe95dbe3ff72",
  "location" : "hdfs://localhost:20500/test-warehouse/tmp",
  "last-sequence-number" : 0,
  "last-updated-ms" : 1686640775184,
  "last-column-id" : 2,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "i",
      "required" : false,
      "type" : "int"
    }, {
      "id" : 2,
      "name" : "s",
      "required" : false,
      "type" : "string"
    } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ ]
  } ],
  "last-partition-id" : 999,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "engine.hive.enabled" : "true",
    "external.table.purge" : "TRUE",
    "write.merge.mode" : "merge-on-read",
    "write.format.default" : "parquet",
    "write.delete.mode" : "merge-on-read",
    "OBJCAPABILITIES" : "EXTREAD,EXTWRITE",
    "write.update.mode" : "merge-on-read",
    "storage_handler" : "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler"
  },
  "current-snapshot-id" : -1,
  "refs" : { },
  "snapshots" : [ ],
  "statistics" : [ ],
  "snapshot-log" : [ ],
  "metadata-log" : [ ]
 {code}

  was:
Repro:

 
{code:java}
create table tmp (i int, s string) stored as iceberg tblproperties 
('format-version'='2');
describe extended/formatted tmp;
show create table tmp; 
{code}
Current behaviour:

Non of the following 2 commands contain 'format-version' in the output. 
Additionally, if you run what is returned from SHOW CREATE TABLE then you end 
up creating a V1 table instead of V2.

The reson might be that format-version in the metadata.json is not stored 
within the tableproperties but it's on level above:
{code:java}
hdfs dfs -cat 
hdfs://localhost:20500/test-warehouse/tmp/metadata/0-55bcfe84-1819-4fb7-ade8-9c132b117880.metadata.json
{
  "format-version" : 2,
  "table-uuid" : "9f11c0c4-02c7-4688-823c-fe95dbe3ff72",
  "location" : "hdfs://localhost:20500/test-warehouse/tmp",
  "last-sequence-number" : 0,
  "last-updated-ms" : 1686640775184,
  "last-column-id" : 2,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "i",
      "required" : false,
      "type" : "int"
    }, {
      "id" : 2,
      "name" : "s",
      "required" : false,
      "type" : "string"
    } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ ]
  } ],
  "last-partition-id" : 999,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "engine.hive.enabled" : "true",
    "external.table.purge" : "TRUE",
    "write.merge.mode" : "merge-on-read",
    "write.format.default" : "parquet",
    "write.delete.mode" : "merge-on-read",
    "OBJCAPABILITIES" : "EXTREAD,EXTWRITE",
    "write.update.mode" : "merge-on-read",
    "storage_handler" : "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler"
  },
  "current-snapshot-id" : -1,
  "refs" : { },
  "snapshots" : [ ],
  "statistics" : [ ],
  "snapshot-log" : [ ],
  "metadata-log" : [ ]
 {code}


> format-version is not present in DESCRIBE FORMATTED and SHOW CREATE TABLE 
> outputs
> -
>
> Key: IMPALA-12209
> URL: https://issues.apache.org/jira/browse/IMPALA-12209
> Project: IMPALA
>  Issue Type: Bug
>  Components: from
>Reporter: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
>
> Repro:
>  
> {code:java}
> create table tmp (i int, s string) stored as iceberg tblproperties 
> ('format-version'='2');
> describe extended/formatted tmp;
> show create table tmp; 
> {code}
> Current behaviour:
> Neither of the two commands above includes 'format-version' in the output. 
> Additionally, if you run what is returned from SHOW CREATE TABLE then you end 
> up creating a V1 table instead of V2.

[jira] [Commented] (IMPALA-11710) Table properties are not updated in Iceberg metadata files

2023-06-13 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-11710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17731984#comment-17731984
 ] 

Gabor Kaszab commented on IMPALA-11710:
---

One addition: when creating an Iceberg table without providing tblproperties, 
'external.table.purge' defaults to true and we can alter this property to 
false later on. This is also persisted in the metadata.json file. However, 
once it's false it can't be changed back to true again.

> Table properties are not updated in Iceberg metadata files
> --
>
> Key: IMPALA-11710
> URL: https://issues.apache.org/jira/browse/IMPALA-11710
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Noemi Pap-Takacs
>Priority: Major
>  Labels: impala-iceberg
>
> This issue occurs in true external Hive Catalog tables.
> Iceberg stores the default file format in a table property called 
> 'write.format.default'.  HMS also stores this value loaded from the Iceberg 
> metadata json file.
> However, when this table property is altered through Impala, it is only 
> changed in HMS, but does not update the Iceberg snapshot. When the table data 
> is reloaded from Iceberg metadata, the old value will appear in HMS and the 
> change is lost.
> This bug does not affect table properties that are not stored in Iceberg, 
> because they will not be reloaded.
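The lifecycle described above can be shown with a toy model (the class and method names are hypothetical, not Impala code): a property kept both in HMS and in Iceberg metadata is lost on reload if the ALTER only touches the HMS copy.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the bug: the property lives in two places. An ALTER that only
// updates the HMS copy is overwritten by the next reload from Iceberg metadata.
class PropertySyncModel {
  final Map<String, String> hmsProps = new HashMap<>();
  final Map<String, String> icebergProps = new HashMap<>();

  // The buggy path: only HMS is updated, no Iceberg metadata commit.
  void alterThroughHmsOnly(String k, String v) { hmsProps.put(k, v); }

  // The fixed path: commit the change to Iceberg metadata as well.
  void alterWithIcebergCommit(String k, String v) {
    icebergProps.put(k, v);
    hmsProps.put(k, v);
  }

  // Reload overwrites the HMS view with what Iceberg metadata persisted.
  void reloadFromIcebergMetadata() {
    hmsProps.clear();
    hmsProps.putAll(icebergProps);
  }
}
```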






[jira] [Created] (IMPALA-12209) format-version is not present in DESCRIBE FORMATTED and SHOW CREATE TABLE outputs

2023-06-13 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12209:
-

 Summary: format-version is not present in DESCRIBE FORMATTED and 
SHOW CREATE TABLE outputs
 Key: IMPALA-12209
 URL: https://issues.apache.org/jira/browse/IMPALA-12209
 Project: IMPALA
  Issue Type: Bug
  Components: from
Reporter: Gabor Kaszab


Repro:

 
{code:java}
create table tmp (i int, s string) stored as iceberg tblproperties 
('format-version'='2');
describe extended/formatted tmp;
show create table tmp; 
{code}
Current behaviour:

Neither of the two commands above includes 'format-version' in the output. 
Additionally, if you run what is returned from SHOW CREATE TABLE then you end 
up creating a V1 table instead of V2.

The reason might be that 'format-version' in the metadata.json is not stored 
within the table properties but one level above:
{code:java}
hdfs dfs -cat 
hdfs://localhost:20500/test-warehouse/tmp/metadata/0-55bcfe84-1819-4fb7-ade8-9c132b117880.metadata.json
{
  "format-version" : 2,
  "table-uuid" : "9f11c0c4-02c7-4688-823c-fe95dbe3ff72",
  "location" : "hdfs://localhost:20500/test-warehouse/tmp",
  "last-sequence-number" : 0,
  "last-updated-ms" : 1686640775184,
  "last-column-id" : 2,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "i",
      "required" : false,
      "type" : "int"
    }, {
      "id" : 2,
      "name" : "s",
      "required" : false,
      "type" : "string"
    } ]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ ]
  } ],
  "last-partition-id" : 999,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "engine.hive.enabled" : "true",
    "external.table.purge" : "TRUE",
    "write.merge.mode" : "merge-on-read",
    "write.format.default" : "parquet",
    "write.delete.mode" : "merge-on-read",
    "OBJCAPABILITIES" : "EXTREAD,EXTWRITE",
    "write.update.mode" : "merge-on-read",
    "storage_handler" : "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler"
  },
  "current-snapshot-id" : -1,
  "refs" : { },
  "snapshots" : [ ],
  "statistics" : [ ],
  "snapshot-log" : [ ],
  "metadata-log" : [ ]
 {code}






[jira] [Commented] (IMPALA-11552) Support migrating Iceberg v1 tables to Iceberg v2

2023-06-13 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-11552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17731937#comment-17731937
 ] 

Gabor Kaszab commented on IMPALA-11552:
---

Can we close this?

> Support migrating Iceberg v1 tables to Iceberg v2
> -
>
> Key: IMPALA-11552
> URL: https://issues.apache.org/jira/browse/IMPALA-11552
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Manish Maheshwari
>Priority: Major
>  Labels: impala-iceberg
>
> Support migrating Iceberg v1 tables to Iceberg v2 






[jira] [Commented] (IMPALA-11710) Table properties are not updated in Iceberg metadata files

2023-06-09 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-11710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730952#comment-17730952
 ] 

Gabor Kaszab commented on IMPALA-11710:
---

Ran into the same issue with a different table property. For the record, let me 
add the repro steps:

 # create table tmp_ice (i int, s string) stored as iceberg tblproperties 
('external.table.purge'='false');
 # alter table tmp_ice set tblproperties('external.table.purge'='true'); at 
this point {{describe formatted tmp_ice;}} shows it set to true, as expected.
 # insert into tmp_ice values (1, "str1"); after inserting a row, the property 
is set back to false again (checked with {{describe formatted}} and also 
{{show create table}}).

> Table properties are not updated in Iceberg metadata files
> --
>
> Key: IMPALA-11710
> URL: https://issues.apache.org/jira/browse/IMPALA-11710
> Project: IMPALA
>  Issue Type: Bug
>Reporter: Noemi Pap-Takacs
>Priority: Major
>  Labels: impala-iceberg
>
> This issue occurs in true external Hive Catalog tables.
> Iceberg stores the default file format in a table property called 
> 'write.format.default'.  HMS also stores this value loaded from the Iceberg 
> metadata json file.
> However, when this table property is altered through Impala, it is only 
> changed in HMS, but does not update the Iceberg snapshot. When the table data 
> is reloaded from Iceberg metadata, the old value will appear in HMS and the 
> change is lost.
> This bug does not affect table properties that are not stored in Iceberg, 
> because they will not be reloaded.






[jira] [Assigned] (IMPALA-12190) Renaming table will cause losing privileges for non-admin users

2023-06-07 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab reassigned IMPALA-12190:
-

Assignee: Gabor Kaszab

> Renaming table will cause losing privileges for non-admin users
> ---
>
> Key: IMPALA-12190
> URL: https://issues.apache.org/jira/browse/IMPALA-12190
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Critical
>  Labels: alter-table, authorization, ranger
>
> Let's say user 'a' gets some privileges on table 't'. When this table gets 
> renamed (even by user 'a') then user 'a' loses its privileges on that table.
>  
> Repro steps:
>  # Start impala with Ranger
>  # start impala-shell as admin (-u admin)
>  # create table tmp (i int, s string) stored as parquet;
>  # grant all on table tmp to user ;
>  # grant all on table tmp to user ;
> {code:java}
> Query: show grant user  on table tmp
> | principal_type | principal_name | database | table | column | uri | storage_type | storage_uri | udf | privilege | grant_option | create_time |
> | USER           |                | default  | tmp   | *      |     |              |             |     | all       | false        | NULL        |
> Fetched 1 row(s) in 0.01s {code}
>  #  alter table tmp rename to tmp_1234;
>  # show grant user  on table tmp_1234;
> {code:java}
> Query: show grant user  on table tmp_1234
> Fetched 0 row(s) in 0.17s{code}






[jira] [Updated] (IMPALA-12190) Renaming table will cause losing privileges for non-admin users

2023-06-07 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-12190:
--
Priority: Critical  (was: Major)

> Renaming table will cause losing privileges for non-admin users
> ---
>
> Key: IMPALA-12190
> URL: https://issues.apache.org/jira/browse/IMPALA-12190
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Gabor Kaszab
>Priority: Critical
>  Labels: alter-table, authorization, ranger
>
> Let's say user 'a' gets some privileges on table 't'. When this table gets 
> renamed (even by user 'a') then user 'a' loses its privileges on that table.
>  
> Repro steps:
>  # Start impala with Ranger
>  # start impala-shell as admin (-u admin)
>  # create table tmp (i int, s string) stored as parquet;
>  # grant all on table tmp to user ;
>  # grant all on table tmp to user ;
> {code:java}
> Query: show grant user  on table tmp
> | principal_type | principal_name | database | table | column | uri | storage_type | storage_uri | udf | privilege | grant_option | create_time |
> | USER           |                | default  | tmp   | *      |     |              |             |     | all       | false        | NULL        |
> Fetched 1 row(s) in 0.01s {code}
>  #  alter table tmp rename to tmp_1234;
>  # show grant user  on table tmp_1234;
> {code:java}
> Query: show grant user  on table tmp_1234
> Fetched 0 row(s) in 0.17s{code}






[jira] [Created] (IMPALA-12190) Renaming table will cause losing privileges for non-admin users

2023-06-07 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12190:
-

 Summary: Renaming table will cause losing privileges for non-admin 
users
 Key: IMPALA-12190
 URL: https://issues.apache.org/jira/browse/IMPALA-12190
 Project: IMPALA
  Issue Type: Bug
  Components: Catalog
Reporter: Gabor Kaszab


Let's say user 'a' gets some privileges on table 't'. When this table gets 
renamed (even by user 'a') then user 'a' loses its privileges on that table.

 

Repro steps:
 # Start impala with Ranger
 # start impala-shell as admin (-u admin)
 # create table tmp (i int, s string) stored as parquet;
 # grant all on table tmp to user ;
 # grant all on table tmp to user ;

{code:java}
Query: show grant user  on table tmp
| principal_type | principal_name | database | table | column | uri | storage_type | storage_uri | udf | privilege | grant_option | create_time |
| USER           |                | default  | tmp   | *      |     |              |             |     | all       | false        | NULL        |
Fetched 1 row(s) in 0.01s {code}

 #  alter table tmp rename to tmp_1234;
 # show grant user  on table tmp_1234;

{code:java}
Query: show grant user  on table tmp_1234
Fetched 0 row(s) in 0.17s{code}






[jira] [Resolved] (IMPALA-12153) Parquet STRUCT reader doesn't fill position slots

2023-06-05 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab resolved IMPALA-12153.
---
Fix Version/s: Impala 4.3.0
   Resolution: Fixed

> Parquet STRUCT reader doesn't fill position slots
> -
>
> Key: IMPALA-12153
> URL: https://issues.apache.org/jira/browse/IMPALA-12153
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Zoltán Borók-Nagy
>Assignee: Zoltán Borók-Nagy
>Priority: Major
> Fix For: Impala 4.3.0
>
>
> The Parquet STRUCT reader doesn't fill the collection position slot, neither 
> the file position slot.
> E.g.:
> {noformat}
> select id, file__position, pos, item
> from complextypestbl c, c.nested_struct.c.d.item;
> SET expand_complex_types=True;
> select file__position, * from complextypestbl;{noformat}






[jira] [Commented] (IMPALA-11701) Skip pushing down Iceberg predicates to Impala scanner if not needed

2023-05-08 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17720443#comment-17720443
 ] 

Gabor Kaszab commented on IMPALA-11701:
---

[~lipenglin] Frankly, I haven't looked into the residual() code to see what 
stats it takes into account.

> Skip pushing down Iceberg predicates to Impala scanner if not needed
> 
>
> Key: IMPALA-11701
> URL: https://issues.apache.org/jira/browse/IMPALA-11701
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend, Frontend
>Reporter: Qizhu Chan
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
> Fix For: Impala 4.3.0
>
> Attachments: image-2022-11-03-17-37-14-712.png, 
> profile_cf446a1ab3a5e852_1b1005de.txt
>
>
> I use Impala to query an Iceberg table, but the query efficiency is not ideal: 
> compared with querying a Hive-format table over the same data, it takes 
> dozens of times longer.
> The SQL statement used is a very simple statistical query, like:
> select count(*)  from `db_name`.tbl_name where datekey='20221001' and 
> event='xxx'
> ('datekey' and 'event' are the partition fields)
> My feeling is that Impala should fetch Iceberg's metadata stats and 
> return results very quickly, but it doesn't.
> The catalog of the Iceberg table is of the hadoop type, and Impala can access 
> it by creating an external table in Hive. By the way, the Iceberg table 
> undergoes snapshot expiration and data compaction on a daily basis, so there 
> should be no small-file problems.
> I found this warning using the explain statement:
> {code:java}
> | WARNING: The following tables are missing relevant table and/or column 
> statistics. |
> | iceberg.gamebox_event_iceberg
> {code}
> Query: SHOW TABLE STATS `iceberg`.gamebox_event_iceberg
> +-------+--------+--------+--------------+-------------------+---------+-------------------+------------------------------------------------------+
> | #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                             |
> +-------+--------+--------+--------------+-------------------+---------+-------------------+------------------------------------------------------+
> | 0     | 590509 | 1.91TB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs:///hive/warehouse/iceberg/gamebox_event_iceberg |
> +-------+--------+--------+--------------+-------------------+---------+-------------------+------------------------------------------------------+
> It seems like Impala is not syncing Iceberg's table and column statistics. 
> I'm not sure whether this is related to the slow queries.
> As shown in the screenshot, I think the query time is mainly spent in 
> planning and on the execution backends, but I don't know what causes 
> these two costs.
> The attachment is the complete profile for this query.
> How do I speed up the query? Can someone help with my question, please?
>  !image-2022-11-03-17-37-14-712.png! 






[jira] [Assigned] (IMPALA-11701) Skip pushing down Iceberg predicates to Impala scanner if not needed

2023-05-04 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab reassigned IMPALA-11701:
-

Assignee: Gabor Kaszab  (was: Wenzhe Zhou)

> Skip pushing down Iceberg predicates to Impala scanner if not needed
> 
>
> Key: IMPALA-11701
> URL: https://issues.apache.org/jira/browse/IMPALA-11701
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend, Frontend
>Reporter: Qizhu Chan
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
> Fix For: Impala 4.3.0
>
> Attachments: image-2022-11-03-17-37-14-712.png, 
> profile_cf446a1ab3a5e852_1b1005de.txt
>
>






[jira] [Created] (IMPALA-12107) Precondition check fails when creating range partitioned Kudu table with unsupported types

2023-04-28 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12107:
-

 Summary: Precondition check fails when creating range partitioned 
Kudu table with unsupported types
 Key: IMPALA-12107
 URL: https://issues.apache.org/jira/browse/IMPALA-12107
 Project: IMPALA
  Issue Type: Bug
  Components: Frontend
Reporter: Gabor Kaszab



{code:java}
CREATE TABLE example_table (
  id INT,
  value DECIMAL(18,2),
  PRIMARY KEY (id, value)
)
PARTITION BY RANGE (value) (
  PARTITION VALUES <= 1000.00,
  PARTITION 1000.00 < VALUES <= 5000.00,
  PARTITION 5000.00 < VALUES <= 1.00,
  PARTITION 1.00 < VALUES
)
STORED AS KUDU;
{code}

This leads to an IllegalStateException.

{code:java}
I0428 14:17:47.564204 10195 jni-util.cc:288] 8f47bda158e1bba1:1d38855b] 
java.lang.IllegalStateException
at 
com.google.common.base.Preconditions.checkState(Preconditions.java:492)
at 
org.apache.impala.analysis.RangePartition.analyzeBoundaryValue(RangePartition.java:180)
at 
org.apache.impala.analysis.RangePartition.analyzeBoundaryValues(RangePartition.java:150)
at 
org.apache.impala.analysis.RangePartition.analyze(RangePartition.java:135)
at 
org.apache.impala.analysis.KuduPartitionParam.analyzeRangeParam(KuduPartitionParam.java:144)
at 
org.apache.impala.analysis.KuduPartitionParam.analyze(KuduPartitionParam.java:132)
at 
org.apache.impala.analysis.CreateTableStmt.analyzeKuduPartitionParams(CreateTableStmt.java:550)
at 
org.apache.impala.analysis.CreateTableStmt.analyzeSynchronizedKuduTableParams(CreateTableStmt.java:502)
at 
org.apache.impala.analysis.CreateTableStmt.analyzeKuduFormat(CreateTableStmt.java:352)
at 
org.apache.impala.analysis.CreateTableStmt.analyze(CreateTableStmt.java:266)
at 
org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:521)
at 
org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:468)
at 
org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2059)
at 
org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:1967)
at 
org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1788)
at 
org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:164)
{code}

Here:
https://github.com/apache/impala/blob/112bab64b77d6ed966b1c67bd503ed632da6f208/fe/src/main/java/org/apache/impala/analysis/RangePartition.java#L198
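For context, a Guava-style checkState(boolean) throws a bare IllegalStateException with no message, which is why the stack trace above shows only the exception type and leaves the user with no hint about the cause. A dependency-free sketch of that behavior:

```java
class CheckStateDemo {
    // Mimics com.google.common.base.Preconditions.checkState(boolean):
    // the single-argument overload throws IllegalStateException with a
    // null message, as seen in the trace above.
    static void checkState(boolean expression) {
        if (!expression) {
            throw new IllegalStateException();  // no message attached
        }
    }
}
```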

Instead of running into a Precondition check failure, we should detect 
unsupported types beforehand and fail the query with a proper error 
message.
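A hypothetical sketch of such a pre-check (all names and the supported-type list here are invented for illustration; the real logic would live in the analyzer, e.g. around RangePartition.analyzeBoundaryValue()):

```java
import java.util.Set;

class RangePartitionCheck {
    // Illustrative allow-list of types accepted as range partition boundary
    // values; the authoritative list is whatever Impala's analyzer supports.
    private static final Set<String> SUPPORTED =
        Set.of("TINYINT", "SMALLINT", "INT", "BIGINT", "STRING", "TIMESTAMP");

    // Validate up front and raise a user-facing error instead of letting a
    // Precondition fail later with a bare IllegalStateException.
    static void checkBoundaryType(String colName, String typeName) {
        if (!SUPPORTED.contains(typeName)) {
            throw new IllegalArgumentException(String.format(
                "%s is not a supported type for range partition column '%s'.",
                typeName, colName));
        }
    }
}
```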






[jira] [Created] (IMPALA-12089) Be able to skip pushing down a subset of the predicates

2023-04-24 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-12089:
-

 Summary: Be able to skip pushing down a subset of the predicates
 Key: IMPALA-12089
 URL: https://issues.apache.org/jira/browse/IMPALA-12089
 Project: IMPALA
  Issue Type: Sub-task
  Components: Frontend
Reporter: Gabor Kaszab


https://issues.apache.org/jira/browse/IMPALA-11701 introduced logic to skip 
pushing down predicates to Impala scanners if they are already applied by 
Iceberg and won't filter any further rows. This is an "all or nothing" approach 
where we either skip pushing down all the predicates or we push down all of 
them.

As a more sophisticated approach, we should be able to push down only a subset 
of the predicates to Impala Scan nodes. For this we need to map Iceberg 
predicates (returned from residual()) to Impala predicates. This might not be 
trivial, as Iceberg doesn't always return the exact same predicate objects it 
received through planFiles(); e.g. the object ID might differ, making the 
mapping harder.
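A toy sketch of the per-predicate idea, modeling predicates as plain strings (real code would have to match Iceberg Expression objects against Impala Exprs, which is exactly the hard mapping described above):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class ResidualMapping {
    // Keep only the Impala predicates that still appear in Iceberg's residual
    // expressions, i.e. the ones Iceberg could not fully apply during
    // planFiles(). Predicates fully handled by Iceberg are dropped instead of
    // being re-evaluated by the scanner (the "subset" variant of the
    // all-or-nothing logic from IMPALA-11701).
    static List<String> predicatesToPushDown(
            List<String> impalaPredicates, Set<String> icebergResiduals) {
        return impalaPredicates.stream()
            .filter(icebergResiduals::contains)
            .collect(Collectors.toList());
    }
}
```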






[jira] [Resolved] (IMPALA-11701) Skip pushing down Iceberg predicates to Impala scanner if not needed

2023-04-24 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab resolved IMPALA-11701.
---
Fix Version/s: Impala 4.3.0
   Resolution: Fixed

> Skip pushing down Iceberg predicates to Impala scanner if not needed
> 
>
> Key: IMPALA-11701
> URL: https://issues.apache.org/jira/browse/IMPALA-11701
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend, Frontend
>Reporter: Qizhu Chan
>Assignee: Gabor Kaszab
>Priority: Major
>  Labels: impala-iceberg
> Fix For: Impala 4.3.0
>
> Attachments: image-2022-11-03-17-37-14-712.png, 
> profile_cf446a1ab3a5e852_1b1005de.txt
>
>





