Thanks for the feedback. I responded to the comments in the doc. Regarding locality information, I introduced a timestamp field that records when the information was populated. Engines can use this timestamp to decide whether the data locality information is still valid. Further, whenever manifest files are rewritten, as part of MergeAppend or compaction, this information would be updated.
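
To make the timestamp idea concrete, here is a rough sketch of how an engine might decide whether to trust the locality information. The names `LocalityInfo`, `populated_at_ms`, `usable_hosts`, and `max_age_ms` are hypothetical and are not taken from the doc or the Iceberg spec; the freshness threshold is assumed to be an engine-side setting.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LocalityInfo:
    """Hypothetical per-file locality entry stored in a manifest."""
    hosts: List[str]        # hosts believed to hold the file's blocks
    populated_at_ms: int    # epoch millis when the hosts were recorded


def usable_hosts(info: Optional[LocalityInfo], now_ms: int, max_age_ms: int) -> List[str]:
    """Return locality hints only if they are fresh enough to trust.

    An empty list means the engine should fall back to its default
    split assignment (assumption: the engine has such a fallback).
    """
    if info is None:
        return []
    age_ms = now_ms - info.populated_at_ms
    # A negative age means the timestamp is in the future (for example,
    # clock skew); this sketch treats that as untrusted as well.
    if age_ms < 0 or age_ms > max_age_ms:
        return []
    return info.hosts
```

With this approach, stale locality information is simply ignored rather than forcing a rewrite of old manifests; it gets refreshed the next time a manifest is rewritten anyway.
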

On Fri, Nov 20, 2020 at 11:18 PM Ryan Blue <[email protected]> wrote:

> Thanks Vivekanand!
>
> I made some comments on the doc. Overall, I think a partition index is a
> good idea. We've thought about adding sketches that contain skew estimates
> for certain columns in a partition so that we can do better join
> estimation. Getting a start on how we would store data like this is a good
> step.
>
> I'm a bit more skeptical about locality information, since it would get
> out of date and require rewriting old, large manifests.
>
> On Fri, Nov 20, 2020 at 1:44 AM Vivekanand Vellanki <[email protected]>
> wrote:
>
>> Hi,
>>
>> I would like to propose additional fields in Iceberg manifest files
>> <https://docs.google.com/document/d/1G6GeOXkGSiSTcu0lDS6VA1FtJ_uz9FO4tF2Pffmx9LU/edit#>
>> to support the following scenarios:
>>
>>    - Partition index to include per-partition stats to help support
>>    planning
>>    - Data locality information to support split assignment in
>>    distributed query engines
>>
>> Comments are welcome.
>>
>> --
>> Thanks
>> Vivek
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
