Re: [I] to_pandas() API which converts iceberg table scan to a pd.DataFrame will lost datetime data type and row order [iceberg-python]

via GitHub Wed, 08 Nov 2023 19:45:17 -0800


zeddit commented on issue #132:
URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1803117179


   Here are my experiments and main findings.
   ### 1. Checking for consistent ordering in pyiceberg
   First, I created an empty table with no partitioning and no `sorted_by` 
property, to find out whether pyiceberg returns deterministic results. If 
pyiceberg loaded data in random order, there would be no way to achieve the goal.
   
   I created the table with a Hive catalog in Trino:
   ```
   CREATE TABLE test_table1(
       date date
   )
   WITH (
       format = 'PARQUET',
       location = 's3a://test/test_table1'
   );
   ```
   and inserted values into it, one row per statement, with
   ```
   INSERT INTO test_table1 VALUES (date '2021-01-04');
   ```
   I inserted a sequence of dates one by one, in this unsorted order: 
'2021-01-01', '2021-01-04', '2021-01-03', '2021-01-07', '2021-01-02'.
   Every insertion creates a snapshot, and because the rows are inserted one by 
one, each row ends up in its own data file.
   
   After building the table, I read it back and observed the results.
   In Trino, `select * from test_table1` does not return a consistent result; 
any row can appear in any position.
   In pyiceberg, by contrast, the row order is fixed, and it is the reverse of 
the insertion order.
   I ran load_table 100 times and compared the results of every run; they 
all returned the same order.
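
The repeated-read check can be sketched roughly as follows. This is a minimal illustration, not the script used for the experiment: `check_deterministic`, `read_rows` and `fake_read_rows` are hypothetical names, and the fake reader stands in for a real table scan so that the block runs without a catalog.

```python
from typing import Callable, List, Sequence


def check_deterministic(read_rows: Callable[[], Sequence], runs: int = 100) -> bool:
    """Return True if every call to read_rows() yields the same rows in the same order."""
    baseline: List = list(read_rows())
    for _ in range(runs - 1):
        if list(read_rows()) != baseline:
            return False
    return True


def fake_read_rows() -> List[str]:
    # Stand-in for a real scan (assumption): the inserted dates, newest first.
    inserted = ["2021-01-01", "2021-01-04", "2021-01-03", "2021-01-07", "2021-01-02"]
    return list(reversed(inserted))


stable = check_deterministic(fake_read_rows, runs=100)
print(stable)
```

Against a real table, `read_rows` could be something like `lambda: catalog.load_table("default.test_table1").scan().to_arrow().to_pylist()`, where the catalog setup and the `default.test_table1` identifier are assumptions about the environment.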
   
   When I inserted new rows into the table, they showed up in the first few 
lines of the result, e.g. after adding '2021-01-05', '2021-01-09', 
'2021-01-06', '2021-01-08' one by one:
   <img width="123" alt="Screenshot 2023-11-09 11 37 58" 
src="https://github.com/apache/iceberg-python/assets/30164206/0f30573c-a2f8-4d66-b312-e225ef49f4c0">
   
   I also checked the source code in pyiceberg/io/pyarrow.py. Data files are 
read in a consistent order, and within each data file the rows keep the order 
in which they were written.
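
As a toy model of the behaviour observed so far (not pyiceberg's actual implementation), the sketch below treats the table as a list of data files in commit order and reads them newest first, keeping the write order inside each file. The function name `read_newest_first` is hypothetical, and the exact position of the newly added rows follows from the "reverse of insertion order" assumption rather than from the library code.

```python
from typing import List


def read_newest_first(data_files: List[List[str]]) -> List[str]:
    """Concatenate rows file by file, newest file first, keeping in-file order."""
    rows: List[str] = []
    for file_rows in reversed(data_files):
        rows.extend(file_rows)
    return rows


# One data file per single-row INSERT, listed in commit order.
data_files = [["2021-01-01"], ["2021-01-04"], ["2021-01-03"], ["2021-01-07"], ["2021-01-02"]]
first_read = read_newest_first(data_files)

# Later single-row inserts add newer files, so those rows move to the front.
for d in ["2021-01-05", "2021-01-09", "2021-01-06", "2021-01-08"]:
    data_files.append([d])
second_read = read_newest_first(data_files)

print(first_read)
print(second_read[:4])
```

In this model the older rows keep their relative order after the new inserts; only the front of the result changes.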
   
   Then I ran `alter table test_table1 execute optimize;` to merge them into 
one data file.
   After that, both the Trino client and pyiceberg return a consistent result, 
because there is only one data file and the order within it is preserved.
   However, if the table has no `sorted_by` property, there is no order 
guarantee when the data files are compacted.
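
Since the compacted order is not guaranteed without a sort specification, one client-side option is to sort explicitly after reading. A minimal sketch, assuming the rows have already been materialized as a list of dicts with a `date` key (for example via a pyarrow table's `to_pylist()`); the sample rows are the first five dates from the experiment, and this is a workaround on the reader's side, not something the scan does by itself.

```python
from datetime import date

# Rows as they might come back from an unordered read (sample data).
rows = [
    {"date": date(2021, 1, 2)},
    {"date": date(2021, 1, 7)},
    {"date": date(2021, 1, 3)},
    {"date": date(2021, 1, 4)},
    {"date": date(2021, 1, 1)},
]

# Impose the order explicitly instead of relying on data-file layout.
sorted_rows = sorted(rows, key=lambda r: r["date"])
print([r["date"].isoformat() for r in sorted_rows])
```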
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


