bk-mz commented on issue #9833:
URL: https://github.com/apache/iceberg/issues/9833#issuecomment-1973061741
I investigated a little.
It seems that Iceberg maps each partition to some form of internal id, e.g. the
`2024-02-29-06` partition is translated to `474425`. Apparently, running both
`rewrite_data_files` and `rewrite_position_delete_files` caused Iceberg to leak
those internal partition ids to the filesystem.
```
spark-sql ()> SELECT * FROM database.table.partitions;
{"data_load_ts_hour":474111} 0 31581863 67 5518171238 0 0 0 0 2024-03-01 11:39:04.284 2980406515838442447
{"data_load_ts_hour":474110} 0 27528941 59 4744718083 0 0 0 0 2024-03-01 11:39:04.284 2980406515838442447
{"data_load_ts_hour":474113} 0 35247584 75 6106815135 0 0 0 0 2024-03-01 11:39:04.284 2980406515838442447
{"data_load_ts_hour":474112} 0 35767820 76 6203474378 0 0 0 0 2024-03-01 11:39:04.284 2980406515838442447
{"data_load_ts_hour":474115} 0 33848781 73 5714870794 0 0 0 0 2024-03-01 11:39:04.284 2980406515838442447
{"data_load_ts_hour":474114} 0 33251894 72 5706434958 0 0 0 0 2024-03-01 11:39:04.284 2980406515838442447
{"data_load_ts_hour":474117} 0 26825760 56 4575503869 0 0 0 0 2024-03-01 11:39:04.284 2980406515838442447
{"data_load_ts_hour":474116} 0 29780249 64 5100337983 0 0 0 0 2024-03-01 11:39:04.284 2980406515838442447
{"data_load_ts_hour":474109} 0 19755026 43 3250584769 0 0 0 0 2024-03-01 11:39:04.284 2980406515838442447
{"data_load_ts_hour":474108} 0 11820983 24 1801821967 0 0 0 0 2024-03-01 11:39:04.284 2980406515838442447
{"data_load_ts_hour":474127} 0 3751415 8 546119138 0 0 0 0 2024-03-01 11:39:04.284 2980406515838442447
{"data_load_ts_hour":474126} 0 4094247 8 583096432 0 0 0 0 2024-03-01 11:39:04.284 2980406515838442447
{"data_load_ts_hour":474129} 0 4341823 8 647139274 0 0 0 0 2024-03-01 11:39:04.284 2980406515838442447
{"data_load_ts_hour":474128} 0 4645898 8 661700686 0 0 0 0 2024-03-01 11:39:04.284 2980406515838442447
{"data_load_ts_hour":474131} 0 7696352 16 1157927863 0 0 0 0 2024-03-01 11:39:04.284 2980406515838442447
{"data_load_ts_hour":474130} 0 5359994 11 782552958 0 0 0 0 2024-03-01 11:39:04.284 2980406515838442447
```
I wasn't able to reproduce the issue.
For merging position delete files I switched to a multi-stage
`rewrite_data_files`, varying the `where` clause and `delete-file-threshold`.
For fresh partitions, which are more likely to be updated, I run
`rewrite_data_files {delete-file-threshold: 10}`. For older partitions I run
`rewrite_data_files {delete-file-threshold: 1}`.
The latter merges all delete files into the base files, while the former only
rewrites those base files that have at least 10 delete files associated with them.
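The two stages above could be sketched as follows. This is only an illustration: the helper `rewrite_data_files_call`, the catalog name `my_catalog`, the column `data_load_ts`, and the cutoff timestamp are all hypothetical. The option is spelled `delete-file-threshold`, as in the Iceberg Spark procedure docs. The helper only builds the `CALL` statements; each string would then be passed to `spark.sql(...)`.

```python
from typing import Optional


def rewrite_data_files_call(
    catalog: str,
    table: str,
    delete_file_threshold: int,
    where: Optional[str] = None,
) -> str:
    """Build a Spark SQL CALL statement for Iceberg's rewrite_data_files.

    Hypothetical helper: it only formats the statement, it does not run it.
    """
    if delete_file_threshold < 1:
        raise ValueError("delete_file_threshold must be >= 1")
    args = [
        f"table => '{table}'",
        f"options => map('delete-file-threshold', '{delete_file_threshold}')",
    ]
    if where is not None:
        # The predicate is embedded in a single-quoted SQL string, so reject
        # single quotes instead of guessing at an escaping scheme.
        if "'" in where:
            raise ValueError("use double quotes for literals inside `where`")
        args.append(f"where => '{where}'")
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"


# Assumed cutoff between "fresh" and "older" partitions (hypothetical value).
CUTOFF = '"2024-02-29 00:00:00"'

# Stage 1: fresh partitions, rewrite only files with at least 10 delete files.
fresh_stmt = rewrite_data_files_call(
    "my_catalog", "database.table", 10, where=f"data_load_ts >= {CUTOFF}"
)

# Stage 2: older partitions, rewrite every file that has any delete file.
older_stmt = rewrite_data_files_call(
    "my_catalog", "database.table", 1, where=f"data_load_ts < {CUTOFF}"
)

if __name__ == "__main__":
    print(fresh_stmt)
    print(older_stmt)
```

Keeping the threshold high for fresh partitions avoids rewriting the same base files repeatedly while they are still receiving updates.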
Can anybody clarify this weird mapping of Iceberg partitions, e.g.
`{"data_load_ts_hour":474117}`?
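One possible reading, stated as an assumption: the Iceberg table spec defines the `hour` partition transform as the number of hours from 1970-01-01 00:00:00, so if `data_load_ts_hour` is produced by that transform, the values in the listing would be hour ordinals rather than opaque ids. A minimal sketch for decoding them follows; the function names are hypothetical and not part of any Iceberg API.

```python
from datetime import datetime, timedelta, timezone

# Reference point for the assumed `hour` transform: the Unix epoch in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_hour_partition(hours_since_epoch: int) -> datetime:
    """Convert an hour ordinal (hours since the Unix epoch) to a UTC datetime."""
    return EPOCH + timedelta(hours=hours_since_epoch)


def encode_hour_partition(ts: datetime) -> int:
    """Convert a timezone-aware datetime to whole hours since the Unix epoch."""
    return (ts - EPOCH) // timedelta(hours=1)


if __name__ == "__main__":
    # Values taken from the `partitions` metadata table output above.
    for value in (474108, 474117, 474131):
        print(value, decode_hour_partition(value).strftime("%Y-%m-%d-%H"))
    # 474117 decodes to 2024-02-01-21 (UTC) under this assumption.
```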