zeddit commented on issue #132:
URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1803131783
### 2. adding sorted_by properties, or the term of sort_order in iceberg
because we want a sorted iceberg table to store and return time series data.
so we create a sorted_by table in Trino with
```
CREATE TABLE test_table2(
date date
)
WITH (
format = 'PARQUET',
location = 's3a://test/test_table2',
sorted_by = ARRAY['date']
);
```
and we check the table properties in pyiceberg CLI to make sure sort_order
properties is set
```
$ pyiceberg describe test_table2
Table format version 2
Metadata location
s3a://test/test_table2/metadata/00025-cd229550-87da-4b49-8013-28d862e8…
Table UUID e9c38f74-f556-43b2-a9f9-6facdd884723
Last Updated 1699465355600
Partition spec []
Sort order [
1 ASC NULLS FIRST
]
Current schema Schema, id=0
└── 1: date: optional date
```
Then we insert the rows again to this new table. We still insert them one by
one, with a same sequence like '2021-01-01', '2021-01-04', '2021-01-03',
'2021-01-07', '2021-01-02', '2021-01-05', '2021-01-09', '2021-01-06',
'2021-01-08'.
Then we start to get data out of the new table.
In trino, when using select * from test_table1, there still is no consistent
order, the result returned varies between runs.
In pyiceberg, it's good to see there is a fixed order, however, the order is
not the Ascending order that we stated and want for the table. It's because we
insert rows one by one, and they are in different data-files underlying.
When we compact the table into one data-file with Trino `alter table
test_table2 execute optimize;`.
The result both in trino and pyiceberg is fixed and ordered. that is because
there is only one data-file and the result order is the one that how data is
written into that data-file.
When specifying `sorted_by` properties, a local sort will be performed to
ensure that in each data-file, there will be in order.
Then we try to append the same amount of data into the sorted table with
```
insert into test_table2 select * from test_table1;
```
we got the result below:
<img width="123" alt="截屏2023-11-09 12 04 17"
src="https://github.com/apache/iceberg-python/assets/30164206/b05a3e25-eacd-45a1-8d63-7442fb3fe12d">
It's not correct, what we expected is
<img width="126" alt="截屏2023-11-09 12 04 43"
src="https://github.com/apache/iceberg-python/assets/30164206/976d1022-4b1e-488c-8866-4f465db40176">
it's because the second insert store rows in another data-file, which is
just concat with the one before.
when using `alter table test_table2 execute optimize;` to merge them, the
result is as expected.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]