zeddit commented on issue #132:
URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1803117179
Here are my experiments and main findings.
### 1. Checking for consistent ordering in pyiceberg
First, I created an empty table with no partitioning and no `sorted_by` property,
to find out whether pyiceberg returns deterministic results. If pyiceberg
loaded data in a random order, there would be no way to achieve the goal.
I created the table with a Hive catalog in Trino:
```
CREATE TABLE test_table1(
date date
)
WITH (
format = 'PARQUET',
location = 's3a://test/test_table1'
);
```
and insert some values into it with
```
INSERT INTO test_table1 VALUES (date '2021-01-04');
```
I inserted a sequence of dates one by one, e.g. '2021-01-01',
'2021-01-04', '2021-01-03', '2021-01-07', '2021-01-02'. The sequence is
deliberately not sorted.
Every insertion creates a snapshot, and because the rows are inserted one by
one, each row is written to its own data file.
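For anyone reproducing this, a small sketch that generates the single-row inserts in the same unsorted order. The helper name `build_inserts` is made up for illustration; the statements still have to be run through whatever Trino client you use.

```python
from datetime import date

# Insertion order quoted above; deliberately not sorted.
INSERT_ORDER = ["2021-01-01", "2021-01-04", "2021-01-03", "2021-01-07", "2021-01-02"]


def build_inserts(table: str, days: list[str]) -> list[str]:
    """Return one single-row INSERT per date, preserving the given order."""
    statements = []
    for day in days:
        date.fromisoformat(day)  # raises ValueError on a malformed date
        statements.append(f"INSERT INTO {table} VALUES (date '{day}');")
    return statements


for stmt in build_inserts("test_table1", INSERT_ORDER):
    print(stmt)
```

Running each statement separately is what produces one snapshot and one data file per row.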
After building the table, I read it back and observed the results.
In Trino, `select * from test_table1` does not return a consistent result:
any row can appear in any position.
In pyiceberg, however, the order of rows is fixed, and it is the reverse of
the insertion order.
I ran `load_table` and read the table 100 times and compared the results of
every run; they all returned the same order.
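A rough sketch of how the repeated-read comparison could be scripted. The catalog name `"default"` and the identifier `"default.test_table1"` are assumptions about the local setup, and `is_deterministic` / `read_with_pyiceberg` are hypothetical helper names, not part of pyiceberg.

```python
from typing import Callable, List, Sequence


def is_deterministic(read_rows: Callable[[], Sequence], runs: int = 100) -> bool:
    """Call read_rows `runs` times and report whether every result equals the first."""
    baseline = list(read_rows())
    return all(list(read_rows()) == baseline for _ in range(runs - 1))


def read_with_pyiceberg() -> List:
    """Load the table and return the `date` column in scan order.

    Assumptions: a catalog named "default" is configured for pyiceberg and the
    table is reachable as "default.test_table1".
    """
    from pyiceberg.catalog import load_catalog  # lazy import; needs pyiceberg installed

    table = load_catalog("default").load_table("default.test_table1")
    return table.scan().to_arrow().column("date").to_pylist()


# Against a real catalog:
# print(is_deterministic(read_with_pyiceberg, runs=100))
```

Passing the check only shows that the order was stable across these runs; it is not a guarantee from the library.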
When I insert new rows into the table, they show up in the first few lines
of the result. For example, after adding '2021-01-05', '2021-01-09',
'2021-01-06', '2021-01-08' one by one:
<img width="123" alt="Screenshot 2023-11-09 11 37 58"
src="https://github.com/apache/iceberg-python/assets/30164206/0f30573c-a2f8-4d66-b312-e225ef49f4c0">
I also checked the source code in pyiceberg/io/pyarrow.py. Data files are
read in a consistent order, and within each data file the rows keep the
order in which they were written.
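To make the observed behaviour concrete, here is a toy model of it. This is not pyiceberg's implementation; it only assumes what the experiment showed: files are visited newest first, and rows inside a file keep their write order. `simulate_scan` is a hypothetical name.

```python
from typing import List


def simulate_scan(data_files: List[List[str]]) -> List[str]:
    """Model of the observed read order.

    data_files is listed oldest first, one inner list per data file.
    Assumption: files are visited newest first, and rows inside a file
    keep their write order.
    """
    rows: List[str] = []
    for file_rows in reversed(data_files):  # newest file first
        rows.extend(file_rows)              # write order preserved within a file
    return rows


# One row per file, in the insertion order used in the experiment.
files = [["2021-01-01"], ["2021-01-04"], ["2021-01-03"], ["2021-01-07"], ["2021-01-02"]]
print(simulate_scan(files))
# ['2021-01-02', '2021-01-07', '2021-01-03', '2021-01-04', '2021-01-01']

files.append(["2021-01-05"])
print(simulate_scan(files)[0])  # the newly inserted row comes first: 2021-01-05
```

With one row per file, this model reproduces the reversed insertion order and the new rows appearing at the top.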
Then I ran `alter table test_table1 execute optimize;` to merge them into
one data file.
After that, both the Trino client and pyiceberg return a consistent result,
because there is only one data file and the row order inside it is preserved.
However, if the table has no `sorted_by` property, there is no ordering
guarantee when the data files are compacted.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]