a8356555 opened a new issue, #9022:
URL: https://github.com/apache/iceberg/issues/9022
### Apache Iceberg version
1.4.0
### Query engine
Flink
### Please describe the bug 🐞
FlinkSQL Read From Iceberg using ORDER BY clause caused some data loss.
Flink version: 1.6.2
1. I have written a table to iceberg using flink
```sql
CREATE CATALOG `mycatalog`
WITH (
'type' = 'iceberg',
'catalog-impl' = 'org.apache.iceberg.aws.glue.GlueCatalog',
'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',
'warehouse' = 's3://mywarehouse'
);
CREATE TABLE IF NOT EXISTS glue01.mydb.mytable
(
`id` INT,
`price` INT,
`op` STRING,
`date` TIMESTAMP,
`t` TIMESTAMP
)
PARTITIONED BY (`date`,`op`);
INSERT INTO glue01.mydb.mytable
VALUES
(1, 100, 'c', TIMESTAMP '2023-01-01', TIMESTAMP '2023-01-01 00:00:00'),
(2, 50, 'c', TIMESTAMP '2023-01-01', TIMESTAMP '2023-01-01 00:00:01'),
(3, 25, 'c', TIMESTAMP '2023-01-01', TIMESTAMP '2023-01-01 00:00:02');
INSERT INTO glue01.mydb.mytable
VALUES
(1, 123, 'u', TIMESTAMP '2023-01-02', TIMESTAMP '2023-01-02 00:00:00'),
(2, 321, 'u', TIMESTAMP '2023-01-02', TIMESTAMP '2023-01-02 00:00:10');
INSERT INTO glue01.mydb.mytable
VALUES
(1, 100, 'r', TIMESTAMP '2023-01-03', TIMESTAMP '2023-01-03 00:00:00'),
(2, 50, 'r', TIMESTAMP '2023-01-03', TIMESTAMP '2023-01-03 00:00:00'),
(3, 25, 'r', TIMESTAMP '2023-01-03', TIMESTAMP '2023-01-03 00:00:00');
INSERT INTO glue01.mydb.mytable
VALUES
(3, 25, 'r', TIMESTAMP '2023-01-04', TIMESTAMP '2023-01-04 00:00:00'),
(3, 25, 'd', TIMESTAMP '2023-01-04', TIMESTAMP '2023-01-04 00:10:00');
INSERT INTO glue01.mydb.mytable
VALUES
(4, 100, 'c', TIMESTAMP '2023-01-05', TIMESTAMP '2023-01-05 00:00:00'),
(4, 123, 'u', TIMESTAMP '2023-01-05', TIMESTAMP '2023-01-05 00:10:00');
```
2. Read Data.
```sql
CREATE TEMPORARY TABLE `mytable` (
`id` INT,
`price` INT,
`op` STRING,
`t` TIMESTAMP(3),
WATERMARK FOR `t` AS `t`
) WITH (
'connector' = 'iceberg',
'warehouse' = 's3://mywarehouse',
'catalog-impl' = 'org.apache.iceberg.aws.glue.GlueCatalog',
'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',
'catalog-name' = 'mycatalog',
'catalog-database' = 'mydb',
'catalog-table' = 'mytable'
)
SELECT
`id`,
`price`,
`op`,
`t`
from `mytable`
/*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/
```
the results are not ordered
```
#
+----+-------------+-------------+--------------------------------+-------------------------+
# | op | id | price | op |
t |
#
+----+-------------+-------------+--------------------------------+-------------------------+
# | +I | 3 | 25 | r |
2023-01-04 00:00:00.000 |
# | +I | 3 | 25 | d |
2023-01-04 00:10:00.000 |
# | +I | 4 | 123 | u |
2023-01-05 00:10:00.000 |
# | +I | 4 | 100 | c |
2023-01-05 00:00:00.000 |
# | +I | 1 | 123 | u |
2023-01-02 00:00:00.000 |
# | +I | 2 | 321 | u |
2023-01-02 00:00:10.000 |
# | +I | 1 | 100 | r |
2023-01-03 00:00:00.000 |
# | +I | 2 | 50 | r |
2023-01-03 00:00:00.000 |
# | +I | 3 | 25 | r |
2023-01-03 00:00:00.000 |
# | +I | 1 | 100 | c |
2023-01-01 00:00:00.000 |
# | +I | 2 | 50 | c |
2023-01-01 00:00:01.000 |
# | +I | 3 | 25 | c |
2023-01-01 00:00:02.000 |
```
I want the results ordered by `t`
```sql
select
`id`,
`price`,
`op`,
`t`
from `_table_b2f233fadf3d47478fac1f13620a08ca`
/*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/
order by `t`
```
but some data lost
```
#
+----+-------------+-------------+--------------------------------+-------------------------+
# | op | id | price | op |
t |
#
+----+-------------+-------------+--------------------------------+-------------------------+
# | +I | 3 | 25 | r |
2023-01-04 00:00:00.000 |
# | +I | 3 | 25 | d |
2023-01-04 00:10:00.000 |
# | +I | 4 | 123 | u |
2023-01-05 00:10:00.000 |
```
How could I retrieve data in order without a data loss?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]