arunb2w opened a new issue, #6244:
URL: https://github.com/apache/iceberg/issues/6244
I tried creating a sample Iceberg table with the schema below:
```
CREATE TABLE glue_dev.db.datatype_test (
id bigint,
data string,
category string
)
USING iceberg
TBLPROPERTIES ('read.split.target-size'='134217728',
'write.metadata.metrics.default'='full')
```
Then I inserted around 100 records and rewrote the data files afterwards, so
that all the inserted data would be compacted into a single file:
```
# one single-row INSERT per loop iteration
for num in range(1, 100):
    spark.sql(f"INSERT INTO glue_dev.db.datatype_test "
              f"VALUES ({num}, 'data{num}', 'catagory{num}')")
```
After that, I queried the data_files metadata table:
```
select * from glue_dev.db.datatype_test.data_files limit 10;
content:            0
file_path:          s3://bucket/folder/db.db/datatype_test/data/00000-0-4bb9ce80-c7a7-4192-98c6-ed6e7289a981-00001.parquet
file_format:        PARQUET
spec_id:            0
record_count:       559
file_size_in_bytes: 4234
column_sizes:       {1:991,2:1188,3:1258}
value_counts:       {1:559,2:559,3:559}
null_value_counts:  {1:0,2:0,3:0}
nan_value_counts:   {}
lower_bounds:       {1:,2:data1,3:catagory1}
upper_bounds:       {1:/,2:data99,3:catagory99}
key_metadata:       NULL
split_offsets:      [4]
equality_ids:       NULL
sort_order_id:      0
Time taken: 15.614 seconds, Fetched 1 row(s)
```
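For context, my understanding (an assumption on my part, based on the Iceberg table spec's single-value binary serialization) is that a bigint bound is stored as an 8-byte little-endian value, which would render as unreadable characters when the metadata table prints it as text. A minimal Python sketch of what that looks like:

```python
import struct

# Assumption: Iceberg's single-value serialization stores a long/bigint
# bound as an 8-byte little-endian integer (per the table spec).
lower = struct.pack("<q", 1)
upper = struct.pack("<q", 99)

print(lower)  # b'\x01\x00\x00\x00\x00\x00\x00\x00' -- unprintable as text
print(upper)  # b'c\x00\x00\x00\x00\x00\x00\x00'

# Decoding the bytes recovers the original values:
assert struct.unpack("<q", lower)[0] == 1
assert struct.unpack("<q", upper)[0] == 99
```

If that is the case, the bytes themselves may be correct even though the printed form looks wrong.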
In this metadata, if you look at the lower_bounds and upper_bounds entries
for the id column, which is of type bigint, they do not represent the correct
values. Does that mean Iceberg is not storing the metadata correctly?
In that case, if I join on the id column, how will Iceberg properly scan/skip
files, since the lower and upper bound metadata is not correct?
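My mental model of how the bounds should drive file skipping is the following (a hypothetical sketch, not Iceberg's actual code; the file names and values are invented):

```python
# Hypothetical sketch of min/max file pruning using per-file column bounds.
# Not Iceberg's implementation; file names and values are made up.
files = [
    {"path": "file-a.parquet", "lower": 1, "upper": 99},
    {"path": "file-b.parquet", "lower": 100, "upper": 250},
]

def may_contain(f, value):
    # A file can be skipped when the probe value falls outside its bounds.
    return f["lower"] <= value <= f["upper"]

print([f["path"] for f in files if may_contain(f, 150)])  # ['file-b.parquet']
```

If the stored bounds were actually wrong, this kind of pruning would skip files that contain matching rows.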
This is test data, but I have an actual table with around 10,000 files, and
its integer columns exhibit the same behaviour.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]