Daniel Becker has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/22177 )
Change subject: IMPALA-13594: Read Puffin stats also from older snapshots
......................................................................

IMPALA-13594: Read Puffin stats also from older snapshots

Before this change, Puffin stats were only read from the current
snapshot. Now we also consider older snapshots, and for each column we
choose the most recent available stats. Note that this means that the
stats for different columns may come from different snapshots.

In case there are both HMS and Puffin stats for a column, the more
recent one will be used - for HMS stats we use the
'impala.lastComputeStatsTime' table property, and for Puffin stats we
use the snapshot timestamp to determine which is more recent.

Testing:
 - updated existing test cases and added new ones in
   test_iceberg_with_puffin.py
 - reorganised the tests in TestIcebergTableWithPuffinStats in
   test_iceberg_with_puffin.py: tests that modify table properties and
   other state that other tests rely on are now run separately to
   provide a clean environment for all tests.

Change-Id: Ia37abe8c9eab6d91946c8f6d3df5fb0889704a39
---
M fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
M fe/src/main/java/org/apache/impala/catalog/PuffinStatsLoader.java
M java/puffin-data-generator/src/main/java/org/apache/impala/puffindatagenerator/PuffinDataGenerator.java
A testdata/ice_puffin/00000-c24f24ca-05a1-493f-ae7b-659daf21b5a9.metadata.json
A testdata/ice_puffin/00001-ae590078-1d64-45cb-892f-80b58829d673.metadata.json
A testdata/ice_puffin/00002-5302617e-4ca6-4e44-a513-0f2082b05700.metadata.json
A testdata/ice_puffin/00003-442f9acd-964c-43d7-92b8-e0737a39719a.metadata.json
A testdata/ice_puffin/00004-18244103-c1f4-4733-99ae-10b56c36f900.metadata.json
M testdata/ice_puffin/generated/all_files_corrupt.metadata.json
M testdata/ice_puffin/generated/all_stats.stats
M testdata/ice_puffin/generated/all_stats_in_1_file.metadata.json
M testdata/ice_puffin/generated/corrupt_file.stats
M testdata/ice_puffin/generated/corrupt_file1.stats
M testdata/ice_puffin/generated/corrupt_file2.stats
M testdata/ice_puffin/generated/current_snapshot_id.stats
M testdata/ice_puffin/generated/duplicate_stats_in_1_file.metadata.json
M testdata/ice_puffin/generated/duplicate_stats_in_1_file.stats
M testdata/ice_puffin/generated/duplicate_stats_in_2_files.metadata.json
M testdata/ice_puffin/generated/duplicate_stats_in_2_files1.stats
M testdata/ice_puffin/generated/duplicate_stats_in_2_files2.stats
M testdata/ice_puffin/generated/existing_file.stats
M testdata/ice_puffin/generated/file_contains_invalid_field_id.metadata.json
M testdata/ice_puffin/generated/file_contains_invalid_field_id.stats
M testdata/ice_puffin/generated/invalidAndCorruptSketches.metadata.json
M testdata/ice_puffin/generated/invalidAndCorruptSketches.stats
M testdata/ice_puffin/generated/metadata_ndv_ok_sketches_corrupt.stats
M testdata/ice_puffin/generated/metadata_ndv_ok_stats_file_corrupt.metadata.json
M testdata/ice_puffin/generated/missing_file.metadata.json
M testdata/ice_puffin/generated/multiple_field_ids.metadata.json
M testdata/ice_puffin/generated/multiple_field_ids.stats
M testdata/ice_puffin/generated/non_corrupt_file.stats
M testdata/ice_puffin/generated/not_all_blobs_current.metadata.json
M testdata/ice_puffin/generated/not_all_blobs_current.stats
M testdata/ice_puffin/generated/not_current_snapshot_id.stats
M testdata/ice_puffin/generated/one_file_corrupt_one_not.metadata.json
M testdata/ice_puffin/generated/one_file_current_one_not.metadata.json
A testdata/ice_puffin/generated/some_blobs_current_some_not_in_2_files.metadata.json
A testdata/ice_puffin/generated/some_blobs_current_some_not_in_2_files1.stats
A testdata/ice_puffin/generated/some_blobs_current_some_not_in_2_files2.stats
M testdata/ice_puffin/generated/stats_divided.metadata.json
M testdata/ice_puffin/generated/stats_divided1.stats
M testdata/ice_puffin/generated/stats_divided2.stats
M testdata/ice_puffin/generated/stats_for_unsupported_type.metadata.json
M testdata/ice_puffin/generated/stats_for_unsupported_type.stats
A testdata/ice_puffin/snap-2630643801692665966-1-5010cf53-bb8c-4dd2-94a6-ce516a3152d6.avro
A testdata/ice_puffin/snap-3941638984336887328-1-5896bbb0-146d-4f27-be25-c23bf13bf8ab.avro
A testdata/ice_puffin/snap-4323499932319869599-1-88451644-3db8-481b-8dfa-618535418394.avro
A testdata/ice_puffin/snap-6623980626006176926-1-5832292a-f62d-4c25-a10a-bf1a46098ead.avro
M tests/custom_cluster/test_iceberg_with_puffin.py
49 files changed, 2,491 insertions(+), 433 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/22177/4
--
To view, visit http://gerrit.cloudera.org:8080/22177
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia37abe8c9eab6d91946c8f6d3df5fb0889704a39
Gerrit-Change-Number: 22177
Gerrit-PatchSet: 4
Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com>
Gerrit-Reviewer: Peter Rozsa <pro...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
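
For readers of this notification, the selection rule described in the commit message can be sketched roughly as follows. This is an illustrative Python sketch only, not the code in PuffinStatsLoader.java or IcebergTable.java. The names ColumnStats, newest_puffin_stats and merge_with_hms are hypothetical, stats are reduced to a single NDV value per field id, both timestamps are assumed to be in the same unit already, and keeping the Puffin value on a timestamp tie is an assumption, since the commit message does not say how ties are handled.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical container: one NDV value plus where and when it came from."""
    ndv: int
    timestamp: int  # snapshot timestamp (Puffin) or last compute stats time (HMS)
    source: str     # "puffin" or "hms"


def newest_puffin_stats(
    snapshots: Iterable[Tuple[int, Dict[int, int]]]
) -> Dict[int, ColumnStats]:
    """Pick, per field id, the Puffin stats from the most recent snapshot.

    `snapshots` yields (snapshot_timestamp, {field_id: ndv}) pairs in any
    order. Different columns may end up with stats from different snapshots.
    """
    best: Dict[int, ColumnStats] = {}
    for snapshot_ts, ndvs in snapshots:
        for field_id, ndv in ndvs.items():
            current = best.get(field_id)
            if current is None or snapshot_ts > current.timestamp:
                best[field_id] = ColumnStats(ndv, snapshot_ts, "puffin")
    return best


def merge_with_hms(
    puffin: Dict[int, ColumnStats],
    hms_ndvs: Dict[int, int],
    last_compute_stats_time: int,
) -> Dict[int, ColumnStats]:
    """Prefer whichever of the HMS and Puffin stats is more recent per column.

    All HMS stats share one timestamp, standing in for the
    'impala.lastComputeStatsTime' table property. On a tie the Puffin value
    is kept (an assumption of this sketch).
    """
    merged = dict(puffin)
    for field_id, ndv in hms_ndvs.items():
        current = merged.get(field_id)
        if current is None or last_compute_stats_time > current.timestamp:
            merged[field_id] = ColumnStats(ndv, last_compute_stats_time, "hms")
    return merged


if __name__ == "__main__":
    # Field 1 has stats in both snapshots, field 2 only in the older one.
    puffin = newest_puffin_stats([(100, {1: 10, 2: 20}), (200, {1: 11})])
    # HMS stats computed at time 150 cover fields 1, 2 and 3.
    for field_id, stats in sorted(merge_with_hms(puffin, {1: 5, 2: 7, 3: 9}, 150).items()):
        print(field_id, stats)
```

In the example at the bottom, field 1 keeps the Puffin value from the newer snapshot, field 2 switches to the HMS value because its only Puffin stats are older than the HMS timestamp, and field 3 has HMS stats only.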