Do you have the link at hand for the thread where this was discussed on the Parquet list? The docs seem quite old, and the PR stale, so I would like to understand the situation better. If it is possible to do this in Parquet, that would be great, but Avro and ORC would still suffer.
On Mon, May 26, 2025 at 22:07, Amogh Jahagirdar <2am...@gmail.com> wrote:

> Hey Peter,
>
> Thanks for bringing this issue up. I think I agree with Fokko; wide tables
> leading to Parquet metadata bloat and poor Thrift deserialization
> performance is a long-standing issue, and I believe there is motivation in
> the community to address it. So to me it seems better to address it in
> Parquet itself rather than have the Iceberg library facilitate a pattern
> that works around the limitations.
>
> Thanks,
> Amogh Jahagirdar
>
> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org> wrote:
>
>> Hi Peter,
>>
>> Thanks for bringing this up. Wouldn't it make more sense to fix this in
>> Parquet itself? It has been a long-running issue on Parquet, and there is
>> still active interest from the community. There is a PR to replace the
>> footer with FlatBuffers, which dramatically improves performance
>> <https://github.com/apache/arrow/pull/43793>. The underlying proposal
>> can be found here
>> <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
>>
>> Kind regards,
>> Fokko
>>
>> On Mon, May 26, 2025 at 20:35, yun zou <yunzou.colost...@gmail.com> wrote:
>>
>>> +1, I am really interested in this topic. Performance has always been a
>>> problem when dealing with wide tables, not just for reads and writes but
>>> also during compilation. Most ML use cases typically exhibit a vectorized
>>> read/write pattern; I am also wondering if there is any way, at the
>>> metadata level, to help the whole compilation and execution process. I do
>>> not have any answer for this yet, but I would be really interested in
>>> exploring this further.
>>>
>>> Best Regards,
>>> Yun
>>>
>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang <py...@pinterest.com.invalid> wrote:
>>>
>>>> Hi Peter, I am interested in this proposal. What's more, I am curious
>>>> whether there is a similar story on the write side as well (how to
>>>> generate these split files), and specifically, are you targeting
>>>> feature-backfill use cases in ML?
>>>>
>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> In machine learning use cases, it is common to encounter tables with a
>>>>> very high number of columns, sometimes in the range of several
>>>>> thousand. I've seen cases with up to 15,000 columns. Storing such wide
>>>>> tables in a single Parquet file is often suboptimal, as Parquet can
>>>>> become a bottleneck even when only a subset of columns is queried.
>>>>>
>>>>> A common approach to mitigate this is to split the data across
>>>>> multiple Parquet files. With the upcoming File Format API, we could
>>>>> introduce a layer that combines these files into a single iterator,
>>>>> enabling efficient reading of wide and very wide tables.
>>>>>
>>>>> To support this, we would need to revise the metadata specification.
>>>>> Instead of the current `_file` column, we could introduce a `_files`
>>>>> column containing:
>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>> - `_file_path`: the path to the corresponding file
>>>>>
>>>>> Has there been any prior discussion around this idea?
>>>>> Is anyone else interested in exploring this further?
>>>>>
>>>>> Best regards,
>>>>> Peter
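
To make the `_files` / `_file_column_ids` / `_file_path` idea from the quoted proposal a bit more concrete, here is a minimal Java sketch of how per-file column-ID metadata could drive projection pruning. The `FileSlice` record, `pruneForProjection` method, and the file paths are purely hypothetical names for illustration; they are not part of any existing Iceberg API, and the real design would live in the File Format API layer discussed above.

    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class WideTableSplitSketch {

      // Hypothetical stand-in for one entry of the proposed `_files` column:
      // the path of a column-subset file plus the column IDs it contains.
      record FileSlice(String filePath, Set<Integer> fileColumnIds) {}

      // Given the column IDs requested by a query, keep only the files that
      // actually contain at least one of them. A combined reader would then
      // stitch the per-file row iterators back together into full rows.
      static List<FileSlice> pruneForProjection(List<FileSlice> slices,
                                                Set<Integer> projectedIds) {
        return slices.stream()
            .filter(s -> s.fileColumnIds().stream().anyMatch(projectedIds::contains))
            .collect(Collectors.toList());
      }

      public static void main(String[] args) {
        List<FileSlice> slices = List.of(
            new FileSlice("s3://bucket/table/data/part-0-colgroup-a.parquet", Set.of(1, 2, 3)),
            new FileSlice("s3://bucket/table/data/part-0-colgroup-b.parquet", Set.of(5001, 5002)),
            new FileSlice("s3://bucket/table/data/part-0-colgroup-c.parquet", Set.of(10001)));

        // A query touching only column 5002 needs a single file instead of all three.
        System.out.println(pruneForProjection(slices, Set.of(5002)));
      }
    }

The sketch only shows the pruning side; the harder parts (keeping the column-subset files row-aligned so they can be zipped back into one iterator, and recording the split layout in the manifests) are exactly what the spec change would need to define.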