Thank you Piotr for all of the work you’ve put into this. I just checked the spec. I have a few newbie questions.
a. Instead of using an existing columnar format like Parquet (one file for one type of stats) to store indexes, is there a reason why we developed our own format? Have any benchmarks been taken of Puffin against other formats?

b. How these Puffin files are linked to Iceberg's metadata files is still a missing link for me. As the Puffin spec says, these stats are table level (updated per snapshot). So, do we need an Iceberg spec change to store the file names of these Puffin files so that remove_orphan_files will not clean them up accidentally? (This is also needed for expire_snapshots.)

c. NDVs are column-level stats. So, I expect the latest Puffin file of a snapshot will have one row of stats for each column. But if we are to implement secondary indexes or table-level partition stats, there can be many rows (millions) in Puffin, depending on the dataset. So, for every commit, do we need to read the previous snapshot's Puffin file and write back a new file with updated stats? (The file might become huge as the data grows.) I think it will affect the commit time. Any thoughts on this?

d. Slightly related to the above point, do we plan to support collecting the stats asynchronously, like "ANALYZE table", and then modifying the table metadata with the stats file names? (This might need an Iceberg commit to write new table metadata.)

e. Even though table-level partition stats are available from the partitions metadata table (along with filter pushdown support), computing the metadata table per query will be expensive. Hence, we are looking forward to storing them in the Puffin format. But I'm not sure about storing them as a single file with millions of rows. I would like to collaborate and discuss this further.

Thanks,
Ajantha

On Mon, Jun 13, 2022 at 2:45 AM Miao Wang <[email protected]> wrote:

> +1 on the format! It looks great!
>
> Thanks for materializing the initial design idea.
>
> Miao
>
> *From: *Kyle Bendickson <[email protected]>
> *Date: *Sunday, June 12, 2022 at 1:55 PM
> *To: *[email protected] <[email protected]>
> *Subject: *Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes
>
> +1 [non-binding]
>
> Thank you Piotr for all of the work you’ve put into this.
>
> This should greatly benefit not only Iceberg on Trino, but hopefully can be used in many novel ways due to its well thought out generic design and incorporation of the ability to extend with new sketches.
>
> Looking forward to the improvements this will bring.
>
> - Kyle
>
> On Fri, Jun 10, 2022 at 1:47 PM Alexander Jo <[email protected]> wrote:
>
> +1, let's do it!
>
> On Fri, Jun 10, 2022 at 2:47 PM John Zhuge <[email protected]> wrote:
>
> +1 Looking forward to the features it enables.
>
> On Fri, Jun 10, 2022 at 10:11 AM Yufei Gu <[email protected]> wrote:
>
> +1. Looking forward to the partition stats.
>
> Best,
>
> Yufei
>
> On Thu, Jun 9, 2022 at 6:32 PM Daniel Weeks <[email protected]> wrote:
>
> +1 as well. Excited about the progress here.
>
> -Dan
>
> On Thu, Jun 9, 2022, 6:25 PM Junjie Chen <[email protected]> wrote:
>
> +1, really nice! Indexes are coming!
>
> On Fri, Jun 10, 2022 at 8:04 AM Szehon Ho <[email protected]> wrote:
>
> +1, it's an exciting step for Iceberg, look forward to all the new statistics and secondary indices it will allow.
>
> Had a few questions of what the reference to Puffin file(s) will be in the Iceberg spec, but it's orthogonal to Puffin file format itself.
>
> Thanks,
>
> Szehon
>
> On Thu, Jun 9, 2022 at 3:32 PM Ryan Blue <[email protected]> wrote:
>
> +1 from me!
>
> There may also be people that haven't followed the design discussions and we can start a DISCUSS thread if needed.
> But if everyone is comfortable with the design and implementation, I think it's ready for a vote as well.
>
> Huge thanks to Piotr for getting this ready! I think the format is going to be really useful for both stats and indexes in Iceberg.
>
> On Thu, Jun 9, 2022 at 3:35 AM Piotr Findeisen <[email protected]> wrote:
>
> Hi Everyone,
>
> I propose that we adopt Puffin file format as a file format for statistics and indexes in Iceberg tables.
>
> Puffin file format specification:
> https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
> (previous discussions: https://github.com/apache/iceberg/pull/4944, https://github.com/apache/iceberg-docs/pull/69)
>
> Intended use:
>
> * statistics in Iceberg tables (see https://github.com/apache/iceberg/pull/4945 and associated proposed implementation https://github.com/apache/iceberg/pull/4741)
> * in the future: storage for secondary indexes
>
> Puffin file reader and writer implementation:
> https://github.com/apache/iceberg/pull/4537
>
> Thanks,
> PF
>
> --
> Ryan Blue
> Tabular
>
> --
> Best Regards
>
> --
> John Zhuge
