AntonOvsyannikov opened a new issue, #9135: URL: https://github.com/apache/iceberg/issues/9135
### Query engine

Spark 3.5.0 with ODBC Hive Thrift Server

### Question

Hello, I am not sure whether this is a real bug, a Thrift Server issue, or whether we are doing something completely wrong with Iceberg; could you please take a look?

We are using Iceberg tables in a Spark cluster, accessed through the ODBC Hive Thrift Server. The problem appears when there are several connections to the Thrift Server (and, I suppose, different Spark sessions behind them) inserting into and reading from the same Iceberg table. Rows appear in the table in a rather non-obvious way: the data is not always aligned between connections, and it takes about 20-30 seconds to sync.

We are developing an application (in Python, with SQLAlchemy and [PyHive](https://github.com/dropbox/PyHive)) with dense DB I/O, which requires connection pooling, and the app expects data to be committed and visible at the right point. We have no idea how to deal with such asynchronous behavior. Maybe there are some settings for the Spark catalog / Hive metastore, or other config options, to fix this? Or some way to flush snapshots across all connections? (A rough sketch of our application's access pattern, and of the kind of flush we were asking about, is at the end of this post.)

One can reproduce the problem with the beeline utility. Open beeline in one window and connect to the server.

```
# window 1
spark@44c8c4990457:/opt/spark$ ./bin/beeline
Beeline version 2.3.9 by Apache Hive
beeline> !connect jdbc:hive2://spark-host:10000
Connected to: Spark SQL (version 3.5.0)
Driver: Hive JDBC (version 2.3.9)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://spark-host:10000> use icetest;
0: jdbc:hive2://spark-host:10000> create table foo (id integer) using iceberg;
0: jdbc:hive2://spark-host:10000> show create table foo;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE spark_catalog.icetest.foo ( id INT) USING iceberg LOCATION 's3a://s3host/user/hive/warehouse/icetest.db/foo' TBLPROPERTIES ( 'current-snapshot-id' = 'none', 'format' = 'iceberg/parquet', 'format-version' = '2', 'write.parquet.compression-codec' = 'zstd') |
+----------------------------------------------------+
1 row selected (0.119 seconds)
0: jdbc:hive2://spark-host:10000> insert into foo values (1);
0: jdbc:hive2://spark-host:10000> show tblproperties foo;
+----------------------------------+----------------------+
|               key                |        value         |
+----------------------------------+----------------------+
| current-snapshot-id              | 7015400927877614755  |
| format                           | iceberg/parquet      |
| format-version                   | 2                    |
| write.parquet.compression-codec  | zstd                 |
+----------------------------------+----------------------+
4 rows selected (0.126 seconds)
0: jdbc:hive2://spark-host:10000> select * from foo;
+-----+
| id  |
+-----+
| 1   |
+-----+
1 row selected (0.202 seconds)
0: jdbc:hive2://spark-host:10000>
```

So far so good. Now open beeline in a second window.

```
# window 2
0: jdbc:hive2://spark-host:10000> use icetest;
0: jdbc:hive2://spark-host:10000> show tblproperties foo;
+----------------------------------+----------------------+
|               key                |        value         |
+----------------------------------+----------------------+
| current-snapshot-id              | 7015400927877614755  |
| format                           | iceberg/parquet      |
| format-version                   | 2                    |
| write.parquet.compression-codec  | zstd                 |
+----------------------------------+----------------------+
4 rows selected (0.138 seconds)
0: jdbc:hive2://spark-host:10000> select * from foo;
+-----+
| id  |
+-----+
| 1   |
+-----+
1 row selected (0.219 seconds)
```

Still fine so far. But let's try to insert data from the second connection.
```
# window 2
0: jdbc:hive2://spark-host:10000> insert into foo values (2);
0: jdbc:hive2://spark-host:10000> show tblproperties foo;
+----------------------------------+----------------------+
|               key                |        value         |
+----------------------------------+----------------------+
| current-snapshot-id              | 5769261824382796435  |
| format                           | iceberg/parquet      |
| format-version                   | 2                    |
| write.parquet.compression-codec  | zstd                 |
+----------------------------------+----------------------+
4 rows selected (0.107 seconds)
0: jdbc:hive2://spark-host:10000> select * from foo;
+-----+
| id  |
+-----+
| 1   |
| 2   |
+-----+
2 rows selected (0.233 seconds)
```

Also fine. But let's get back to window 1 and query the table there.

```
# window 1
0: jdbc:hive2://spark-host:10000> show tblproperties foo;
+----------------------------------+----------------------+
|               key                |        value         |
+----------------------------------+----------------------+
| current-snapshot-id              | 7015400927877614755  |
| format                           | iceberg/parquet      |
| format-version                   | 2                    |
| write.parquet.compression-codec  | zstd                 |
+----------------------------------+----------------------+
4 rows selected (0.106 seconds)
0: jdbc:hive2://spark-host:10000> select * from foo;
+-----+
| id  |
+-----+
| 1   |
+-----+
```

Still the old data and the old snapshot id :( Only after a while, around 20-30 seconds, does the data get aligned.

```
# window 1
0: jdbc:hive2://spark-host:10000> select * from foo;
+-----+
| id  |
+-----+
| 1   |
+-----+
1 row selected (0.173 seconds)
0: jdbc:hive2://spark-host:10000> select * from foo;
+-----+
| id  |
+-----+
| 1   |
| 2   |
+-----+
2 rows selected (0.309 seconds)
```

One can also force alignment by doing an insert in the first connection.
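For context, here is roughly how our application talks to the Thrift Server. This is a minimal sketch rather than our real code: the host name `spark-host`, the `icetest` database, the pool sizes, and the inserted value are illustrative, and error/transaction handling is omitted. The point is that SQLAlchemy's pool hands concurrent work to different Thrift connections (and, presumably, different Spark sessions behind them), which is exactly the two-window situation shown above.

```python
# Minimal sketch of our access pattern (host, database, pool sizes are illustrative).
# PyHive registers the "hive://" SQLAlchemy dialect, which we point at the
# Spark Thrift Server.
from sqlalchemy import create_engine, text

engine = create_engine(
    "hive://spark-host:10000/icetest",  # Spark Thrift Server, database icetest
    pool_size=5,                        # pooled connections ~ several Thrift sessions
    max_overflow=5,
)

# Two distinct pooled connections, like the two beeline windows above.
reader = engine.connect()
writer = engine.connect()

# The reader touches the table first, like window 1.
print(reader.execute(text("SELECT * FROM foo")).fetchall())

# A different connection commits a new row, like window 2.
writer.execute(text("INSERT INTO foo VALUES (3)"))

# The reader may keep returning the pre-insert rows for ~20-30 seconds,
# even though the insert above has already been committed.
print(reader.execute(text("SELECT * FROM foo")).fetchall())

reader.close()
writer.close()
```

Because the pool decides which connection serves which request, the application cannot easily guarantee that a read issued after a commit goes through a session that already sees that commit.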
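And to make the second half of the question concrete, the kind of per-connection "flush" we had in mind is sketched below. Both ideas are guesses on our side rather than known fixes: issuing Spark SQL's `REFRESH TABLE` before reads, or configuring the Iceberg catalog on the Thrift Server with metadata caching turned down (the `cache-enabled` / `cache.expiration-interval-ms` catalog options). If one of these, or something else entirely, is the intended way to get read-after-commit visibility across Thrift sessions, that would answer our question.

```python
# Guessed workaround 1: ask Spark to refresh its cached view of the table
# before every read on a pooled connection.
from sqlalchemy import create_engine, text

engine = create_engine("hive://spark-host:10000/icetest")

with engine.connect() as conn:
    # REFRESH TABLE invalidates the session's cached metadata for foo; whether
    # it reliably picks up the latest Iceberg snapshot here is part of the question.
    conn.execute(text("REFRESH TABLE foo"))
    rows = conn.execute(text("SELECT * FROM foo")).fetchall()

# Guessed workaround 2: configure the Iceberg catalog on the Thrift Server with
# caching reduced or disabled, e.g. in spark-defaults.conf (property names taken
# from the Iceberg Spark catalog options, not verified for this setup):
#   spark.sql.catalog.spark_catalog.cache-enabled=false
#   # or keep the cache but expire it faster than the default:
#   spark.sql.catalog.spark_catalog.cache.expiration-interval-ms=1000
```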