AntonOvsyannikov opened a new issue, #9135: URL: https://github.com/apache/iceberg/issues/9135
### Query engine

Spark 3.5.0 with ODBC Hive Thrift Server

### Question

Hello, I am not sure whether this is a real bug, a Thrift Server issue, or whether we are doing something completely wrong with Iceberg; could you please take a look?

We are using Iceberg tables in a Spark cluster, accessed through the ODBC Hive Thrift Server. The problem appears when there are several connections to the Thrift Server (and, I suppose, different Spark sessions behind them) inserting into and reading from the same Iceberg table. Rows appear in the table in a rather non-obvious way: the data is not always aligned between connections, and it takes about 20-30 seconds to sync.

We are developing an application (in Python, with SQLAlchemy and [PyHive](https://github.com/dropbox/PyHive)) with dense DB I/O, which requires connection pooling, and the app expects data to be committed and visible at the right point. We have no idea how to deal with such asynchronous behavior. Maybe there are some settings for the Spark catalog / Hive metastore, or other config options, to fix this? Or some way to flush snapshots across all connections? (A rough sketch of our application's access pattern, and of the kind of flush we were asking about, is at the end of this post.)

One can reproduce the problem with the beeline utility. Open beeline in one window and connect to the server.

```
# window 1
spark@44c8c4990457:/opt/spark$ ./bin/beeline
Beeline version 2.3.9 by Apache Hive
beeline> !connect jdbc:hive2://spark-host:10000
Connected to: Spark SQL (version 3.5.0)
Driver: Hive JDBC (version 2.3.9)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://spark-host:10000> use icetest;
0: jdbc:hive2://spark-host:10000> create table foo (id integer) using iceberg;
0: jdbc:hive2://spark-host:10000> show create table foo;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE spark_catalog.icetest.foo ( id INT) USING iceberg LOCATION 's3a://s3host/user/hive/warehouse/icetest.db/foo' TBLPROPERTIES ( 'current-snapshot-id' = 'none', 'format' = 'iceberg/parquet', 'format-version' = '2', 'write.parquet.compression-codec' = 'zstd') |
+----------------------------------------------------+
1 row selected (0.119 seconds)
0: jdbc:hive2://spark-host:10000> insert into foo values (1);
0: jdbc:hive2://spark-host:10000> show tblproperties foo;
+----------------------------------+----------------------+
|               key                |        value         |
+----------------------------------+----------------------+
| current-snapshot-id              | 7015400927877614755  |
| format                           | iceberg/parquet      |
| format-version                   | 2                    |
| write.parquet.compression-codec  | zstd                 |
+----------------------------------+----------------------+
4 rows selected (0.126 seconds)
0: jdbc:hive2://spark-host:10000> select * from foo;
+-----+
| id  |
+-----+
| 1   |
+-----+
1 row selected (0.202 seconds)
0: jdbc:hive2://spark-host:10000>
```

So far so good. Now open beeline in a second window.

```
# window 2
0: jdbc:hive2://spark-host:10000> use icetest;
0: jdbc:hive2://spark-host:10000> show tblproperties foo;
+----------------------------------+----------------------+
|               key                |        value         |
+----------------------------------+----------------------+
| current-snapshot-id              | 7015400927877614755  |
| format                           | iceberg/parquet      |
| format-version                   | 2                    |
| write.parquet.compression-codec  | zstd                 |
+----------------------------------+----------------------+
4 rows selected (0.138 seconds)
0: jdbc:hive2://spark-host:10000> select * from foo;
+-----+
| id  |
+-----+
| 1   |
+-----+
1 row selected (0.219 seconds)
```

Still fine so far. But let's try to insert data from the second connection.
```
# window 2
0: jdbc:hive2://spark-host:10000> insert into foo values (2);
0: jdbc:hive2://spark-host:10000> show tblproperties foo;
+----------------------------------+----------------------+
|               key                |        value         |
+----------------------------------+----------------------+
| current-snapshot-id              | 5769261824382796435  |
| format                           | iceberg/parquet      |
| format-version                   | 2                    |
| write.parquet.compression-codec  | zstd                 |
+----------------------------------+----------------------+
4 rows selected (0.107 seconds)
0: jdbc:hive2://spark-host:10000> select * from foo;
+-----+
| id  |
+-----+
| 1   |
| 2   |
+-----+
2 rows selected (0.233 seconds)
```

Also fine. But let's get back to window 1 and query the table there.

```
# window 1
0: jdbc:hive2://spark-host:10000> show tblproperties foo;
+----------------------------------+----------------------+
|               key                |        value         |
+----------------------------------+----------------------+
| current-snapshot-id              | 7015400927877614755  |
| format                           | iceberg/parquet      |
| format-version                   | 2                    |
| write.parquet.compression-codec  | zstd                 |
+----------------------------------+----------------------+
4 rows selected (0.106 seconds)
0: jdbc:hive2://spark-host:10000> select * from foo;
+-----+
| id  |
+-----+
| 1   |
+-----+
```

Still the old data and the old snapshot id :( Only after a while, around 20-30 seconds, does the data get aligned.

```
# window 1
0: jdbc:hive2://spark-host:10000> select * from foo;
+-----+
| id  |
+-----+
| 1   |
+-----+
1 row selected (0.173 seconds)
0: jdbc:hive2://spark-host:10000> select * from foo;
+-----+
| id  |
+-----+
| 1   |
| 2   |
+-----+
2 rows selected (0.309 seconds)
```

One can also force alignment by doing an insert in the first connection.
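For context, here is roughly how our application talks to the Thrift Server. This is a minimal sketch rather than our real code: the host name `spark-host`, the `icetest` database, the pool sizes, and the inserted value are illustrative, and error/transaction handling is omitted. The point is that SQLAlchemy's pool hands concurrent work to different Thrift connections (and, presumably, different Spark sessions behind them), which is exactly the two-window situation shown above.

```python
# Minimal sketch of our access pattern (host, database, pool sizes are illustrative).
# PyHive registers the "hive://" SQLAlchemy dialect, which we point at the
# Spark Thrift Server.
from sqlalchemy import create_engine, text

engine = create_engine(
    "hive://spark-host:10000/icetest",  # Spark Thrift Server, database icetest
    pool_size=5,                        # pooled connections ~ several Thrift sessions
    max_overflow=5,
)

# Two distinct pooled connections, like the two beeline windows above.
reader = engine.connect()
writer = engine.connect()

# The reader touches the table first, like window 1.
print(reader.execute(text("SELECT * FROM foo")).fetchall())

# A different connection commits a new row, like window 2.
writer.execute(text("INSERT INTO foo VALUES (3)"))

# The reader may keep returning the pre-insert rows for ~20-30 seconds,
# even though the insert above has already been committed.
print(reader.execute(text("SELECT * FROM foo")).fetchall())

reader.close()
writer.close()
```

Because the pool decides which connection serves which request, the application cannot easily guarantee that a read issued after a commit goes through a session that already sees that commit.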
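And to make the second half of the question concrete, the kind of per-connection "flush" we had in mind is sketched below. Both ideas are guesses on our side rather than known fixes: issuing Spark SQL's `REFRESH TABLE` before reads, or configuring the Iceberg catalog on the Thrift Server with metadata caching turned down (the `cache-enabled` / `cache.expiration-interval-ms` catalog options). If one of these, or something else entirely, is the intended way to get read-after-commit visibility across Thrift sessions, that would answer our question.

```python
# Guessed workaround 1: ask Spark to refresh its cached view of the table
# before every read on a pooled connection.
from sqlalchemy import create_engine, text

engine = create_engine("hive://spark-host:10000/icetest")

with engine.connect() as conn:
    # REFRESH TABLE invalidates the session's cached metadata for foo; whether
    # it reliably picks up the latest Iceberg snapshot here is part of the question.
    conn.execute(text("REFRESH TABLE foo"))
    rows = conn.execute(text("SELECT * FROM foo")).fetchall()

# Guessed workaround 2: configure the Iceberg catalog on the Thrift Server with
# caching reduced or disabled, e.g. in spark-defaults.conf (property names taken
# from the Iceberg Spark catalog options, not verified for this setup):
#   spark.sql.catalog.spark_catalog.cache-enabled=false
#   # or keep the cache but expire it faster than the default:
#   spark.sql.catalog.spark_catalog.cache.expiration-interval-ms=1000
```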