binayakd commented on code in PR #1354:
URL: https://github.com/apache/iceberg-python/pull/1354#discussion_r1853355886
##########
tests/io/test_pyarrow_stats.py:
##########
@@ -681,6 +681,73 @@ def test_stats_types(table_schema_nested: Schema) -> None:
]
+def construct_test_table_without_stats() -> Tuple[pq.FileMetaData, Union[TableMetadataV1, TableMetadataV2]]:
+    table_metadata = {
+        "format-version": 2,
+        "location": "s3://bucket/test/location",
+        "last-column-id": 7,
+        "current-schema-id": 0,
+        "schemas": [
+            {
+                "type": "struct",
+                "schema-id": 0,
+                "fields": [
+                    {"id": 1, "name": "strings", "required": False, "type": "string"},
+                    {"id": 2, "name": "floats", "required": False, "type": "float"}
+                ]
+            }
+        ],
+        "default-spec-id": 0,
+        "partition-specs": [{"spec-id": 0, "fields": []}],
+        "properties": {},
+    }
+
+    table_metadata = TableMetadataUtil.parse_obj(table_metadata)
+    arrow_schema = schema_to_pyarrow(table_metadata.schemas[0])
+    _strings = ["zzzzzzzzzzzzzzzzzzzz", "rrrrrrrrrrrrrrrrrrrr", None, "aaaaaaaaaaaaaaaaaaaa"]
+    _floats = [3.14, math.nan, 1.69, 100]
+
+    table = pa.Table.from_pydict(
+        {
+            "strings": _strings,
+            "floats": _floats
+        },
+        schema=arrow_schema,
+    )
+
+    metadata_collector: List[Any] = []
+
+    with pa.BufferOutputStream() as f:
+        with pq.ParquetWriter(f, table.schema, metadata_collector=metadata_collector, write_statistics=False) as writer:
+            writer.write_table(table)
+
+    return metadata_collector[0], table_metadata
+
+
+def test_is_stats_set_false() -> None:
+    metadata, table_metadata = construct_test_table_without_stats()
+    schema = get_current_schema(table_metadata)
+    statistics = data_file_statistics_from_parquet_metadata(
+        parquet_metadata=metadata,
+        stats_columns=compute_statistics_plan(schema, table_metadata.properties),
+        parquet_column_mapping=parquet_path_to_id_mapping(schema),
+    )
+    datafile = DataFile(**statistics.to_serialized_dict())
+
+    # assert attributes except for column_aggregates and null_value_counts are present
Review Comment:
Rewrote the test to use the shared test table helper, but with only the "strings" column having statistics written:
```python
# write statistics only for the "strings" column
metadata, table_metadata = construct_test_table(write_statistics=["strings"])
```
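For context, `write_statistics` on `pq.ParquetWriter` accepts either a bool or a list of column names, which is what lets the helper enable stats per column. A standalone sketch of that pattern (illustrative only, not the PR's actual helper):
```python
from typing import Any, List

import pyarrow as pa
import pyarrow.parquet as pq

# Minimal two-column table mirroring the test fixture's shape.
table = pa.table({"strings": ["a", "z", None], "floats": [1.0, 2.0, 3.0]})
metadata_collector: List[Any] = []

with pa.BufferOutputStream() as f:
    with pq.ParquetWriter(
        f,
        table.schema,
        metadata_collector=metadata_collector,
        # A list of column names enables statistics only for those columns.
        write_statistics=["strings"],
    ) as writer:
        writer.write_table(table)

metadata = metadata_collector[0]
assert metadata.row_group(0).column(0).is_stats_set is True
assert metadata.row_group(0).column(1).is_stats_set is False
```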
Added asserts to make sure the input metadata only has stats for the first ("strings") column, and that the rest don't have any stats, especially the "floats" column (since the non-primitive columns get skipped in the iteration):
```python
# expect only "strings" column to have statistics in metadata
assert metadata.row_group(0).column(0).is_stats_set is True
assert metadata.row_group(0).column(0).statistics is not None
# expect all other columns to have no statistics
for r in range(metadata.num_row_groups):
    for pos in range(1, metadata.num_columns):
        assert metadata.row_group(r).column(pos).is_stats_set is False
        assert metadata.row_group(r).column(pos).statistics is None
```
From what I understand, `col_aggs` is used to compute the `upper_bound` and `lower_bound` of the `datafile`, so we then assert that the `upper_bound`, `lower_bound` and `null_value_counts` props of the `datafile` reflect only the values from the "strings" column, and that no error is thrown. Note the expected bounds are truncated to 16 characters by the default `truncate(16)` metrics mode, with the upper bound's last character incremented (hence the trailing `{`):
```python
# expect only "strings" column values to be reflected in the
# upper_bound, lower_bound and null_value_counts props of datafile
assert len(datafile.lower_bounds) == 1
assert datafile.lower_bounds[1].decode() == "aaaaaaaaaaaaaaaa"
assert len(datafile.upper_bounds) == 1
assert datafile.upper_bounds[1].decode() == "zzzzzzzzzzzzzzz{"
assert len(datafile.null_value_counts) == 1
assert datafile.null_value_counts[1] == 1
```
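To illustrate why the expected upper bound ends in `{`: a truncated upper bound must still compare greater than or equal to every value it summarizes, so the last kept character gets incremented. A rough sketch of that logic (illustrative re-implementation, not PyIceberg's actual helper):
```python
def truncate_upper_bound(value: str, width: int = 16) -> str:
    # Illustrative only: keep the first `width` characters, bumping the
    # last one so the result still sorts >= any string with that prefix.
    if len(value) <= width:
        return value
    return value[: width - 1] + chr(ord(value[width - 1]) + 1)

# 20 z's in; 15 z's plus "{" (chr(ord("z") + 1)) out, matching the assert above
assert truncate_upper_bound("z" * 20) == "z" * 15 + "{"
```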
This should cover the case of some columns having stats and some not? Not sure if it's a valid case. Hopefully this makes sense?