puchengy opened a new issue, #46:
URL: https://github.com/apache/iceberg-python/issues/46
### Apache Iceberg version
None
### Please describe the bug 🐞
In the v1 spec, a data file's `spec_id` is optional. Spark's `files` metadata table still reports it, but PyIceberg returns `spec_id=None` for the same files. Any idea why?
Spark:
```
spark-sql> select * from pyang.test_ray_iceberg_read.files;

row 1:
content             0
file_path           s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-02/userid_bucket_16=4/00000-2-72876d76-7f6a-4b82-812e-5390351917ef-00001.parquet
file_format         PARQUET
spec_id             1
partition           {"dt":"2022-01-02","userid_bucket_16":4}
record_count        1
file_size_in_bytes  871
column_sizes        {1:36,2:37,3:46}
value_counts        {1:1,2:1,3:1}
null_value_counts   {1:0,2:0,3:0}
nan_value_counts    {}
lower_bounds        {1:,2:2,3:2022-01-02}
upper_bounds        {1:,2:2,3:2022-01-02}
key_metadata        NULL
split_offsets       [4]
equality_ids        NULL
sort_order_id       0
readable_metrics    {"col":{"column_size":37,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2","upper_bound":"2"},"dt":{"column_size":46,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2022-01-02","upper_bound":"2022-01-02"},"userid":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":2,"upper_bound":2}}

row 2:
content             0
file_path           s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-01/00000-1-f2b3a0c1-a3e3-482a-bf24-9831626c5be7-00001.parquet
file_format         PARQUET
spec_id             0
partition           {"dt":"2022-01-01","userid_bucket_16":null}
record_count        1
file_size_in_bytes  870
column_sizes        {1:36,2:36,3:46}
value_counts        {1:1,2:1,3:1}
null_value_counts   {1:0,2:0,3:0}
nan_value_counts    {}
lower_bounds        {1:,2:1,3:2022-01-01}
upper_bounds        {1:,2:1,3:2022-01-01}
key_metadata        NULL
split_offsets       [4]
equality_ids        NULL
sort_order_id       0
readable_metrics    {"col":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"1","upper_bound":"1"},"dt":{"column_size":46,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2022-01-01","upper_bound":"2022-01-01"},"userid":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":1,"upper_bound":1}}

Time taken: 0.494 seconds, Fetched 2 row(s)
```
PyIceberg (0.4.0):
```
>>> tasks2[0]
FileScanTask(file=DataFile[file_path='s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-02/userid_bucket_16=4/00000-2-72876d76-7f6a-4b82-812e-5390351917ef-00001.parquet',
file_format=FileFormat.PARQUET, partition=Record[dt='2022-01-02',
userid_bucket_16=4], record_count=1, file_size_in_bytes=871, column_sizes={1:
36, 2: 37, 3: 46}, value_counts={1: 1, 2: 1, 3: 1}, null_value_counts={1: 0, 2:
0, 3: 0}, nan_value_counts={}, lower_bounds={1: b'\x02\x00\x00\x00', 2: b'2',
3: b'2022-01-02'}, upper_bounds={1: b'\x02\x00\x00\x00', 2: b'2', 3:
b'2022-01-02'}, key_metadata=None, split_offsets=[4], sort_order_id=0,
content=DataFileContent.DATA, equality_ids=None, spec_id=None],
delete_files=set(), start=0, length=871)
>>> tasks2[1]
FileScanTask(file=DataFile[file_path='s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-01/00000-1-f2b3a0c1-a3e3-482a-bf24-9831626c5be7-00001.parquet',
file_format=FileFormat.PARQUET, partition=Record[dt='2022-01-01'],
record_count=1, file_size_in_bytes=870, column_sizes={1: 36, 2: 36, 3: 46},
value_counts={1: 1, 2: 1, 3: 1}, null_value_counts={1: 0, 2: 0, 3: 0},
nan_value_counts={}, lower_bounds={1: b'\x01\x00\x00\x00', 2: b'1', 3:
b'2022-01-01'}, upper_bounds={1: b'\x01\x00\x00\x00', 2: b'1', 3:
b'2022-01-01'}, key_metadata=None, split_offsets=[4], sort_order_id=0,
content=DataFileContent.DATA, equality_ids=None, spec_id=None],
delete_files=set(), start=0, length=870)
```
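For context: in the Iceberg spec, `spec_id` is not stored on each v1 data file; instead, the manifest file carries a `partition_spec_id`, and readers are expected to apply metadata inheritance so every entry picks up the enclosing manifest's value. That is presumably how Spark fills in the column, and the `None` above suggests PyIceberg 0.4.0 does not perform that inheritance. The sketch below illustrates the inheritance rule with simplified, hypothetical stand-in classes (the real ones live in `pyiceberg.manifest`; names and fields here are assumptions for illustration, not PyIceberg's actual API):

```python
from dataclasses import dataclass, replace
from typing import List, Optional


# Hypothetical, simplified stand-ins for Iceberg manifest structures.
@dataclass(frozen=True)
class DataFile:
    file_path: str
    spec_id: Optional[int] = None  # v1 manifests do not write this field


@dataclass(frozen=True)
class ManifestFile:
    manifest_path: str
    partition_spec_id: int  # stored on the manifest, not on each data file
    entries: List[DataFile]


def inherit_spec_id(manifest: ManifestFile) -> List[DataFile]:
    """Apply metadata inheritance: a data file with no spec_id of its own
    takes the partition_spec_id of the manifest that contains it."""
    return [
        entry if entry.spec_id is not None
        else replace(entry, spec_id=manifest.partition_spec_id)
        for entry in manifest.entries
    ]


manifest = ManifestFile(
    manifest_path="m0.avro",
    partition_spec_id=1,
    entries=[DataFile(file_path="00000-2-example.parquet")],
)
print(inherit_spec_id(manifest)[0].spec_id)  # → 1, inherited from the manifest
```

Under this reading, the bug would be that PyIceberg builds `DataFile` objects straight from the v1 Avro records without attaching the manifest's `partition_spec_id`.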
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]