puchengy opened a new issue, #46:
URL: https://github.com/apache/iceberg-python/issues/46
### Apache Iceberg version
None
### Please describe the bug 🐞
In the v1 spec, a data file's `spec_id` is optional. Spark's `files` metadata table still reports it, but PyIceberg returns `spec_id=None` for the same files. Any idea why?
Spark:
```
spark-sql> select * from pyang.test_ray_iceberg_read.files;

row 1:
content             0
file_path           s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-02/userid_bucket_16=4/00000-2-72876d76-7f6a-4b82-812e-5390351917ef-00001.parquet
file_format         PARQUET
spec_id             1
partition           {"dt":"2022-01-02","userid_bucket_16":4}
record_count        1
file_size_in_bytes  871
column_sizes        {1:36,2:37,3:46}
value_counts        {1:1,2:1,3:1}
null_value_counts   {1:0,2:0,3:0}
nan_value_counts    {}
lower_bounds        {1:,2:2,3:2022-01-02}
upper_bounds        {1:,2:2,3:2022-01-02}
key_metadata        NULL
split_offsets       [4]
equality_ids        NULL
sort_order_id       0
readable_metrics    {"col":{"column_size":37,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2","upper_bound":"2"},"dt":{"column_size":46,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2022-01-02","upper_bound":"2022-01-02"},"userid":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":2,"upper_bound":2}}

row 2:
content             0
file_path           s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-01/00000-1-f2b3a0c1-a3e3-482a-bf24-9831626c5be7-00001.parquet
file_format         PARQUET
spec_id             0
partition           {"dt":"2022-01-01","userid_bucket_16":null}
record_count        1
file_size_in_bytes  870
column_sizes        {1:36,2:36,3:46}
value_counts        {1:1,2:1,3:1}
null_value_counts   {1:0,2:0,3:0}
nan_value_counts    {}
lower_bounds        {1:,2:1,3:2022-01-01}
upper_bounds        {1:,2:1,3:2022-01-01}
key_metadata        NULL
split_offsets       [4]
equality_ids        NULL
sort_order_id       0
readable_metrics    {"col":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"1","upper_bound":"1"},"dt":{"column_size":46,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2022-01-01","upper_bound":"2022-01-01"},"userid":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":1,"upper_bound":1}}

Time taken: 0.494 seconds, Fetched 2 row(s)
```
PyIceberg (0.4.0):
```
>>> tasks2[0]
FileScanTask(file=DataFile[file_path='s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-02/userid_bucket_16=4/00000-2-72876d76-7f6a-4b82-812e-5390351917ef-00001.parquet',
file_format=FileFormat.PARQUET, partition=Record[dt='2022-01-02',
userid_bucket_16=4], record_count=1, file_size_in_bytes=871, column_sizes={1:
36, 2: 37, 3: 46}, value_counts={1: 1, 2: 1, 3: 1}, null_value_counts={1: 0, 2:
0, 3: 0}, nan_value_counts={}, lower_bounds={1: b'\x02\x00\x00\x00', 2: b'2',
3: b'2022-01-02'}, upper_bounds={1: b'\x02\x00\x00\x00', 2: b'2', 3:
b'2022-01-02'}, key_metadata=None, split_offsets=[4], sort_order_id=0,
content=DataFileContent.DATA, equality_ids=None, spec_id=None],
delete_files=set(), start=0, length=871)
>>> tasks2[1]
FileScanTask(file=DataFile[file_path='s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-01/00000-1-f2b3a0c1-a3e3-482a-bf24-9831626c5be7-00001.parquet',
file_format=FileFormat.PARQUET, partition=Record[dt='2022-01-01'],
record_count=1, file_size_in_bytes=870, column_sizes={1: 36, 2: 36, 3: 46},
value_counts={1: 1, 2: 1, 3: 1}, null_value_counts={1: 0, 2: 0, 3: 0},
nan_value_counts={}, lower_bounds={1: b'\x01\x00\x00\x00', 2: b'1', 3:
b'2022-01-01'}, upper_bounds={1: b'\x01\x00\x00\x00', 2: b'1', 3:
b'2022-01-01'}, key_metadata=None, split_offsets=[4], sort_order_id=0,
content=DataFileContent.DATA, equality_ids=None, spec_id=None],
delete_files=set(), start=0, length=870)
```
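For context: in the Iceberg spec, `spec_id` is not stored on each v1 data file; instead, the manifest file carries a `partition_spec_id`, and readers are expected to apply metadata inheritance so every entry picks up the enclosing manifest's value. That is presumably how Spark fills in the column, and the `None` above suggests PyIceberg 0.4.0 does not perform that inheritance. The sketch below illustrates the inheritance rule with simplified, hypothetical stand-in classes (the real ones live in `pyiceberg.manifest`; names and fields here are assumptions for illustration, not PyIceberg's actual API):

```python
from dataclasses import dataclass, replace
from typing import List, Optional


# Hypothetical, simplified stand-ins for Iceberg manifest structures.
@dataclass(frozen=True)
class DataFile:
    file_path: str
    spec_id: Optional[int] = None  # v1 manifests do not write this field


@dataclass(frozen=True)
class ManifestFile:
    manifest_path: str
    partition_spec_id: int  # stored on the manifest, not on each data file
    entries: List[DataFile]


def inherit_spec_id(manifest: ManifestFile) -> List[DataFile]:
    """Apply metadata inheritance: a data file with no spec_id of its own
    takes the partition_spec_id of the manifest that contains it."""
    return [
        entry if entry.spec_id is not None
        else replace(entry, spec_id=manifest.partition_spec_id)
        for entry in manifest.entries
    ]


manifest = ManifestFile(
    manifest_path="m0.avro",
    partition_spec_id=1,
    entries=[DataFile(file_path="00000-2-example.parquet")],
)
print(inherit_spec_id(manifest)[0].spec_id)  # → 1, inherited from the manifest
```

Under this reading, the bug would be that PyIceberg builds `DataFile` objects straight from the v1 Avro records without attaching the manifest's `partition_spec_id`.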
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]