rmoff opened a new issue, #9947:
URL: https://github.com/apache/iceberg/issues/9947
### Query engine
Flink 1.18.1
Iceberg 1.5.0
### Question
I populate an Iceberg table using `CREATE TABLE … AS SELECT` (CTAS) as a streaming
query. On disk I can see data files being written.
However, if I select from the Iceberg table through Flink SQL, **nothing is
returned.**
If I then manually `INSERT` into the same table and query it again, I see the
value that I inserted, but nothing else.
Why isn't the CTAS behaving as I would expect?
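Is it perhaps related to checkpointing - e.g. does the Iceberg sink only commit data files when a checkpoint completes? If so, a checkpoint interval could presumably be set in the SQL client along these lines (a sketch only; the `10s` value is arbitrary, and no such setting appears in the session below):
```sql
-- Hypothetical configuration, not part of the session shown below:
-- give the streaming job a periodic checkpoint at which the sink could commit.
SET 'execution.checkpointing.interval' = '10s';
```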
## Source table (Kafka topic)
```sql
Flink SQL> CREATE TABLE t_k_test_topic (
> `msg` STRING
> ) WITH (
> 'connector' = 'kafka',
> 'topic' = 'test_topic',
> 'properties.bootstrap.servers' = 'broker:29092',
> 'scan.startup.mode' = 'earliest-offset',
> 'format' = 'raw'
> );
[INFO] Execute statement succeed.
Flink SQL> SET 'sql-client.execution.result-mode' = 'tableau';
[INFO] Execute statement succeed.
Flink SQL> SET 'execution.runtime-mode' = 'streaming';
[INFO] Execute statement succeed.
Flink SQL> SELECT * FROM t_k_test_topic;
+----+--------------------------------+
| op | msg |
+----+--------------------------------+
| +I | foobar |
| +I | foobar again |
^CQuery terminated, received a total of 2 rows
```
## Create and populate Iceberg table
```sql
Flink SQL> CREATE TABLE t_iceberg_test WITH (
> 'connector' = 'iceberg',
> 'catalog-type'='hive',
> 'catalog-name'='dev',
> 'warehouse' = 's3a://warehouse',
> 'hive-conf-dir' = './conf')
> AS
> SELECT * FROM t_k_test_topic;
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: 13cebde34af11b7fe032c4059a813cbf
Flink SQL> show jobs;
+----------------------------------+-------------------------------------------------------------+---------+-------------------------+
|                           job id |                                                    job name |  status |              start time |
+----------------------------------+-------------------------------------------------------------+---------+-------------------------+
| 13cebde34af11b7fe032c4059a813cbf | insert-into_default_catalog.default_database.t_iceberg_test | RUNNING | 2024-03-13T12:05:25.146 |
+----------------------------------+-------------------------------------------------------------+---------+-------------------------+
1 rows in set
Flink SQL> stop job '13cebde34af11b7fe032c4059a813cbf';
[INFO] Execute statement succeed.
```
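(For the record, the job was stopped with a plain `STOP JOB`. The SQL client also supports stopping with a savepoint; a sketch of that variant is below - I don't know whether it would change what ends up committed.)
```sql
-- Sketch only: stop the streaming CTAS job and take a savepoint on the way out.
STOP JOB '13cebde34af11b7fe032c4059a813cbf' WITH SAVEPOINT;
```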
Looking on disk, there are data files with data in them:
```
❯ docker exec mc bash -c "mc ls -r minio/warehouse/default_database.db/"
[2024-03-13 12:05:46 UTC]   438B STANDARD t_iceberg_test/data/00000-0-20461db1-c6c4-4e88-95e2-463a354786a8-00001.parquet
[2024-03-13 12:05:24 UTC] 1.0KiB STANDARD t_iceberg_test/metadata/00000-63f36419-05e4-4892-b55a-ce4bc2a764b0.metadata.json
```
## ❌ Query the Iceberg table - it's empty
```sql
Flink SQL> SET 'execution.runtime-mode' = 'batch';
[INFO] Execute statement succeed.
Flink SQL> select * from t_iceberg_test;
Empty set
```
### But there is data in the Iceberg-generated Parquet file on disk
```
❯ docker exec mc bash -c "mc cat minio/warehouse/default_database.db/t_iceberg_test/data/00000-0-20461db1-c6c4-4e88-95e2-463a354786a8-00001.parquet" > /tmp/data.parquet && \
  duckdb :memory: "SELECT * FROM read_parquet('/tmp/data.parquet')"
-- Loading resources from /Users/rmoff/.duckdbrc
┌──────────────┐
│ msg │
│ varchar │
├──────────────┤
│ foobar │
│ foobar again │
└──────────────┘
```
## Manually insert a row
```sql
Flink SQL> DESCRIBE t_iceberg_test;
+------+--------+------+-----+--------+-----------+
| name | type | null | key | extras | watermark |
+------+--------+------+-----+--------+-----------+
| msg | STRING | TRUE | | | |
+------+--------+------+-----+--------+-----------+
1 row in set
Flink SQL> INSERT INTO t_iceberg_test VALUES ('snafu');
[INFO] Submitting SQL update statement to the cluster...
[INFO] SQL update statement has been successfully submitted to the cluster:
Job ID: c01c01e1f1eee815173895d7b52771c5
```
## ✅ The manually inserted row shows up as the only row in the table
```sql
Flink SQL> select * from t_iceberg_test;
+-------+
| msg |
+-------+
| snafu |
+-------+
1 row in set
```
## …but the data files on disk have three rows (two from the `CTAS`, one from the `INSERT`)
```bash
❯ docker exec mc bash -c "mc ls -r minio/warehouse/default_database.db/t_iceberg_test/"
[2024-03-13 12:05:46 UTC]   438B STANDARD data/00000-0-20461db1-c6c4-4e88-95e2-463a354786a8-00001.parquet
[2024-03-13 12:06:15 UTC]   421B STANDARD data/00000-0-f7410475-58e7-4a50-9866-e3938096ea8d-00001.parquet
[2024-03-13 12:05:24 UTC] 1.0KiB STANDARD metadata/00000-63f36419-05e4-4892-b55a-ce4bc2a764b0.metadata.json
[2024-03-13 12:06:15 UTC] 2.3KiB STANDARD metadata/00001-f4e1c293-57b4-4883-897a-ff998a371449.metadata.json
[2024-03-13 12:06:15 UTC] 6.5KiB STANDARD metadata/bb7d5d60-4e38-4cbd-bb66-b42e956e57fd-m0.avro
[2024-03-13 12:06:15 UTC] 4.1KiB STANDARD metadata/snap-859291865824980759-1-bb7d5d60-4e38-4cbd-bb66-b42e956e57fd.avro
```
```bash
❯ docker exec mc bash -c "mc cat minio/warehouse/default_database.db/t_iceberg_test/data/00000-0-f7410475-58e7-4a50-9866-e3938096ea8d-00001.parquet" > /tmp/data.parquet && \
  duckdb :memory: "SELECT * FROM read_parquet('/tmp/data.parquet')"
-- Loading resources from /Users/rmoff/.duckdbrc
┌─────────┐
│ msg │
│ varchar │
├─────────┤
│ snafu │
└─────────┘
❯ docker exec mc bash -c "mc cat minio/warehouse/default_database.db/t_iceberg_test/data/00000-0-20461db1-c6c4-4e88-95e2-463a354786a8-00001.parquet" > /tmp/data.parquet && \
  duckdb :memory: "SELECT * FROM read_parquet('/tmp/data.parquet')"
-- Loading resources from /Users/rmoff/.duckdbrc
┌──────────────┐
│ msg │
│ varchar │
├──────────────┤
│ foobar │
│ foobar again │
└──────────────┘
```
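For completeness, this is how I would run a streaming read against the Iceberg table to watch for new commits, in case it helps with reproducing (a sketch using the read options from the Iceberg Flink docs; the monitor interval is arbitrary):
```sql
-- Sketch: continuously monitor the Iceberg table for newly committed snapshots.
SET 'execution.runtime-mode' = 'streaming';
SELECT * FROM t_iceberg_test /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */;
```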