+1. I also like the idea of having more data profiling info for the partition, but I worry about storing hostnames and IP addresses and keeping them up to date as things change, especially if you have hundreds of hosts. I'd rather leave that to the name node.
On Fri, 20 Nov 2020 at 17:48, Ryan Blue <[email protected]> wrote:

> Thanks Vivekanand!
>
> I made some comments on the doc. Overall, I think a partition index is a
> good idea. We've thought about adding sketches that contain skew estimates
> for certain columns in a partition so that we can do better join
> estimation. Getting a start on how we would store data like this is a good
> step.
>
> I'm a bit more skeptical about locality information, since it would get
> out of date and require rewriting old, large manifests.
>
> On Fri, Nov 20, 2020 at 1:44 AM Vivekanand Vellanki <[email protected]>
> wrote:
>
>> Hi,
>>
>> I would like to propose additional fields in Iceberg manifest files
>> <https://docs.google.com/document/d/1G6GeOXkGSiSTcu0lDS6VA1FtJ_uz9FO4tF2Pffmx9LU/edit#>
>> to support the following scenarios:
>>
>> - Partition index to include per-partition stats to help support
>>   planning
>> - Data locality information to support split assignment in
>>   distributed query engines
>>
>> Comments are welcome.
>>
>> --
>> Thanks
>> Vivek
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
