bk-mz commented on issue #9833:
URL: https://github.com/apache/iceberg/issues/9833#issuecomment-1989424746
>do the files have incorrect partition tuples (the values actually used for
filtering)?
how do I check this?
Something is odd, because I see bloated small-file partitions, i.e. each
batch I run 3 consecutive compactions:
```
CALL glue.system.rewrite_position_delete_files(table => 'table', where =>
"data_load_ts between TIMESTAMP '2024-02-10 20:39:47.669' and TIMESTAMP
'2024-03-11 19:39:47.669'", options => map('partial-progress.enabled', 'true',
'min-file-size-bytes', '26843545', 'max-file-size-bytes', '134217728',
'min-input-files', '10', 'max-concurrent-file-group-rewrites', '500'))
+----------------------------+------------------------+---------------------+-----------------+
|rewritten_delete_files_count|added_delete_files_count|rewritten_bytes_count|added_bytes_count|
+----------------------------+------------------------+---------------------+-----------------+
|693 |689 |10353631
|10322951 |
+----------------------------+------------------------+---------------------+-----------------+
CALL glue.system.rewrite_data_files(table => 'table', where => "data_load_ts
between TIMESTAMP '2024-03-04 20:41:16.346' and TIMESTAMP '2024-03-11
19:41:16.346'", options => map('partial-progress.enabled', 'true',
'min-file-size-bytes', '53687091', 'max-file-size-bytes', '268435456',
'min-input-files', '20', 'max-concurrent-file-group-rewrites', '500'))
+--------------------------+----------------------+---------------------+-----------------------+
|rewritten_data_files_count|added_data_files_count|rewritten_bytes_count|failed_data_files_count|
+--------------------------+----------------------+---------------------+-----------------------+
|60 |6 |131249114 |0
|
+--------------------------+----------------------+---------------------+-----------------------+
CALL glue.system.rewrite_data_files(table => 'table', where => "data_load_ts
<= TIMESTAMP '2024-03-04 20:41:42.204'", options =>
map('partial-progress.enabled', 'true', 'min-file-size-bytes', '53687091',
'max-file-size-bytes', '268435456', 'min-input-files', '20',
'max-concurrent-file-group-rewrites', '1000'))
+--------------------------+----------------------+---------------------+-----------------------+
|rewritten_data_files_count|added_data_files_count|rewritten_bytes_count|failed_data_files_count|
+--------------------------+----------------------+---------------------+-----------------------+
|20 |2 |655906 |0
|
+--------------------------+----------------------+---------------------+-----------------------+
```
(where clause changes per each batch).
Then, this is what I see in the compactions logs:
```sql
with data as (
select
committed_at,
snapshot_id,
summary.`changed-partition-count` as changed_partition_count,
if(summary.`added-data-files` is null, "compact_delete_files",
"compact_base_files") as op,
if(summary.`added-data-files` is null,
summary.`added-position-delete-files`, summary.`added-data-files`) as
added_files,
if(summary.`added-data-files` is null,
summary.`removed-position-delete-files`, summary.`deleted-data-files`) as
removed_files
from
messaging_data_platform_pii_sandbox.mdp_temp_inflight_pffm_iceberg2.snapshots
where operation = "replace")
select
committed_at,
snapshot_id,
changed_partition_count,
op,
concat(removed_files, "->", added_files) as change,
removed_files / added_files as compact_ratio
from data
limit 100;
```
Results:
```
2024-03-10 20:36:07.15 8797169961871736418 23
compact_delete_files 165->157 1.0509554140127388
2024-03-10 20:36:08.41 7845087824966754379 23
compact_delete_files 263->249 1.0562248995983936
2024-03-10 20:36:09.532 7903647840988554268 23
compact_delete_files 634->616 1.0292207792207793
2024-03-10 20:36:10.764 483626480370270807 23
compact_delete_files 835->814 1.0257985257985258
2024-03-10 20:36:12.728 5477859694032525431 23
compact_delete_files 441->426 1.0352112676056338
2024-03-10 20:36:16.581 1255334267473732600 23
compact_delete_files 803->784 1.024234693877551
2024-03-10 20:36:18.947 7227297553321373728 23
compact_delete_files 505->478 1.0564853556485356
2024-03-10 20:36:21.98 1281338329182940375 23
compact_delete_files 582->566 1.028268551236749
2024-03-10 20:36:23.08 6410347697455606449 23
compact_delete_files 353->317 1.113564668769716
2024-03-10 20:36:24.808 7588875997900067709 20
compact_delete_files 599->588 1.0187074829931972
2024-03-10 20:40:05.974 198476720206951232 2
compact_base_files 40->3 13.333333333333334
2024-03-10 20:40:08.515 7210798375839628837 2
compact_base_files 42->4 10.5
2024-03-10 20:40:10.876 5692616137794405663 2
compact_base_files 41->4 10.25
2024-03-10 20:40:13.437 6290596725370482099 2
compact_base_files 40->4 10.0
2024-03-10 20:40:16.507 217541509135133596 2
compact_base_files 42->4 10.5
2024-03-10 20:40:20.011 4802230468835188293 1
compact_base_files 20->2 10.0
2024-03-10 20:41:04.273 6246580026217410227 23
compact_delete_files 166->155 1.070967741935484
2024-03-10 20:41:05.719 3297695269624318303 23
compact_delete_files 264->239 1.104602510460251
2024-03-10 20:41:06.785 5847626578154331062 23
compact_delete_files 666->654 1.018348623853211
2024-03-10 20:41:08.07 4267888321478088912 23
compact_delete_files 804->778 1.0334190231362468
2024-03-10 20:41:10.013 8135762954708433780 23
compact_delete_files 437->420 1.0404761904761906
2024-03-10 20:41:14.315 2664190367967793873 23
compact_delete_files 845->828 1.0205314009661837
2024-03-10 20:41:16.688 3946211637382110477 23
compact_delete_files 365->343 1.064139941690962
2024-03-10 20:41:18.796 4932807097664769635 23
compact_delete_files 546->511 1.0684931506849316
2024-03-10 20:41:19.909 6618458420791632675 23
compact_delete_files 448->439 1.020501138952164
2024-03-10 20:41:21.741 6703590499156726364 19
compact_delete_files 609->585 1.041025641025641
2024-03-10 20:45:05.725 9121859527157436009 2
compact_base_files 40->3 13.333333333333334
2024-03-10 20:45:08.159 4157493376091722749 2
compact_base_files 40->4 10.0
2024-03-10 20:45:10.364 7307518553901315194 2
compact_base_files 40->4 10.0
2024-03-10 20:45:12.53 7391382820317167039 2
compact_base_files 40->4 10.0
2024-03-10 20:45:15.294 1634346428437060541 2
compact_base_files 40->4 10.0
2024-03-10 20:45:18.664 6471891107683048438 1
compact_base_files 20->2 10.0
2024-03-10 20:45:23.872 2705895529971003001 1
compact_base_files 20->2 10.0
2024-03-10 20:46:09.038 3459810320855697041 23
compact_delete_files 182->170 1.0705882352941176
2024-03-10 20:46:10.458 2661568600570547866 23
compact_delete_files 454->430 1.0558139534883721
2024-03-10 20:46:11.621 1341779779088511826 23
compact_delete_files 577->564 1.0230496453900708
2024-03-10 20:46:12.949 1453266135319995705 23
compact_delete_files 859->835 1.02874251497006
2024-03-10 20:46:14.829 4995825204442784024 23
compact_delete_files 277->259 1.0694980694980696
2024-03-10 20:46:19.456 6932435756418096621 23
compact_delete_files 815->786 1.0368956743002544
2024-03-10 20:46:21.472 2430383133564740551 23
compact_delete_files 337->325 1.0369230769230768
2024-03-10 20:46:23.548 3455842334691225549 23
compact_delete_files 545->528 1.0321969696969697
2024-03-10 20:46:24.884 6197136702470865431 23
compact_delete_files 417->405 1.0296296296296297
2024-03-10 20:46:26.57 3018837951760552054 19
compact_delete_files 658->639 1.029733959311424
2024-03-10 20:50:05.474 5737241757182222621 2
compact_base_files 40->4 10.0
```
`compact_delete_files` actually do nothing, i.e. rewrite data into itself,
or even produce more files than needed.
Are there any obvious places to check for inconsistency in setup or setting?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]