zachdisc opened a new pull request, #9731:
URL: https://github.com/apache/iceberg/pull/9731
## What
This adds a simple `sort` method to the `RewriteManifests` Spark action
that lets users specify the order of partition columns to consider when
grouping manifests.
Illustration:
```
RewriteManifests.Result result =
    actions
        .rewriteManifests(table)
        .sort("c", "b", "a") // <-- this is the new API piece
        .execute();
```
See issue https://github.com/apache/iceberg/issues/9615
## Why
Iceberg's metadata is organized into a forest of manifest_files which point
to data files sharing common partitions. By default, and during
`RewriteManifests`, the partition grouping is determined by the field order of
the default partition `Spec`. If the primary query pattern filters mostly on
the last partition field in the table's spec, the manifests are poorly suited
to quickly planning and pruning around that field.
For example:
```
CREATE TABLE
...
PARTITIONED BY (region, storeId, bucket(ipAddress, 100), days(event_time))
```
This creates manifests that are grouped first by `region`, so the contents of
each `manifest_file` may span a wide range of `event_time` values. For a
primary query pattern that filters on `event_time` and doesn't care about
`region`, `storeId`, etc., few manifests can be pruned during planning, which
leads to inefficient queries.
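
To make the effect concrete, here is a toy sketch in plain Java (16+ for records). It is not Iceberg code: `DataFile`, `cluster`, and `manifestsToScan` are hypothetical names, a "manifest" is just a fixed-size slice of a sorted file list, and pruning is modeled as exact containment rather than Iceberg's partition summaries. It only illustrates why the clustering order matters for a query on a single day.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ManifestClusteringSketch {
  // Hypothetical stand-in for a data file's partition tuple (not an Iceberg class).
  record DataFile(String region, int day) {}

  // Order implied by a spec of (region, day) versus an explicit day-first order.
  static final Comparator<DataFile> SPEC_ORDER =
      Comparator.comparing(DataFile::region).thenComparingInt(DataFile::day);
  static final Comparator<DataFile> DAY_FIRST =
      Comparator.comparingInt(DataFile::day).thenComparing(DataFile::region);

  // Sort the files, then cut the sorted list into fixed-size "manifests".
  static List<List<DataFile>> cluster(
      List<DataFile> files, Comparator<DataFile> order, int perManifest) {
    List<DataFile> sorted = new ArrayList<>(files);
    sorted.sort(order);
    List<List<DataFile>> manifests = new ArrayList<>();
    for (int i = 0; i < sorted.size(); i += perManifest) {
      manifests.add(sorted.subList(i, Math.min(i + perManifest, sorted.size())));
    }
    return manifests;
  }

  // Count manifests holding at least one file for the day, i.e. ones that cannot be pruned.
  static long manifestsToScan(List<List<DataFile>> manifests, int day) {
    return manifests.stream()
        .filter(m -> m.stream().anyMatch(f -> f.day() == day))
        .count();
  }

  // Four regions times four days: 16 files in total.
  static List<DataFile> sampleFiles() {
    List<DataFile> files = new ArrayList<>();
    for (String region : List.of("us", "eu", "ap", "sa")) {
      for (int day = 0; day < 4; day++) {
        files.add(new DataFile(region, day));
      }
    }
    return files;
  }

  public static void main(String[] args) {
    List<DataFile> files = sampleFiles();
    System.out.println("spec order: " + manifestsToScan(cluster(files, SPEC_ORDER, 4), 2));
    System.out.println("day first:  " + manifestsToScan(cluster(files, DAY_FIRST, 4), 2));
  }
}
```

With region-first clustering every manifest in this toy data set contains the queried day, while day-first clustering confines that day to a single manifest.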
## Requested Feedback and decisions
* I chose to make the input to `sort` be the _raw column names_ used in
partitioning, not the internal hidden ones, i.e. `event_time` instead of
`event_time_day`, and `foo` instead of `foo_bucket_1234`. Thoughts? I could
readily allow both, or accept only the real, hidden partition column names if
people prefer.
* I would next like to have a more capable functional interface, `sort(row ->
{... return groupingString})`, but was struggling to express a Java
`Function`-like input that could be used on a `Dataset<Row>`'s `data_file`
struct. Any pointers on how to parse a `Row`'s struct into a native POJO and
supply a function that can be used like a UDF here would be appreciated!
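
As a rough illustration of the first point, resolving raw column names to hidden partition field names could look like the plain-Java sketch below. It does not use the Iceberg API: `PartitionField` here is a simplified, hypothetical record, `resolve` and `sampleSpec` are made-up helpers, and the hidden field names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

public class SortColumnResolver {
  // Hypothetical, simplified partition field: the source column it is derived
  // from and the hidden partition field name. Not the Iceberg class of the same name.
  record PartitionField(String sourceColumn, String name) {}

  // Map raw column names to hidden partition field names, keeping the requested order.
  // If a column feeds several partition fields, this sketch simply takes the first.
  static List<String> resolve(List<PartitionField> spec, String... rawColumns) {
    List<String> resolved = new ArrayList<>();
    for (String column : rawColumns) {
      PartitionField match = spec.stream()
          .filter(f -> f.sourceColumn().equals(column))
          .findFirst()
          .orElseThrow(() ->
              new IllegalArgumentException("Not a partition source column: " + column));
      resolved.add(match.name());
    }
    return resolved;
  }

  // Spec mirroring the example table; the hidden names are assumptions for illustration.
  static List<PartitionField> sampleSpec() {
    return List.of(
        new PartitionField("region", "region"),
        new PartitionField("storeId", "storeId"),
        new PartitionField("ipAddress", "ipAddress_bucket"),
        new PartitionField("event_time", "event_time_day"));
  }

  public static void main(String[] args) {
    System.out.println(resolve(sampleSpec(), "event_time", "ipAddress", "region"));
  }
}
```

Accepting both raw and hidden names would only need a second match on `name()` in the filter; rejecting unknown names up front keeps typos from silently producing an unsorted rewrite.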