Hi Team,
Thanks everyone for the reviews on
https://github.com/apache/iceberg/pull/12298!
I have addressed most of the comments, but a few questions remain which
might merit a wider audience:
1. We should decide on the expected filtering behavior when the filters
are pushed down to the readers. Currently the file format readers apply the
filters on a best-effort basis. Some readers (Avro) just skip them
altogether. There was a suggestion on the PR that we might enforce stricter
requirements, so that the readers either reject part of the filters or
apply them fully.
2. Batch sizes are currently parameters of the reader builders, so they
can be set for non-vectorized readers too, which could be confusing.
3. Currently the Spark batch reader uses separate configuration objects
(ParquetBatchReadConf and OrcBatchReadConf), as requested by the reviewers
of the Comet PR. There was a suggestion on the current PR to use a common
configuration instead.
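
To make the first question concrete, here is a minimal sketch of what best-effort filtering implies for a caller. It is not code from the PR: `FormatReader`, `IGNORING`, `CHUNK_PRUNING` and `readWithResidual` are hypothetical names, and the fixed chunk size of 3 only stands in for whatever pruning granularity a real format has. The assumption is that a lenient reader may return a superset of the matching rows, so a caller that needs exact results has to re-check every row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: none of these types are Iceberg classes.
final class LenientFilterSketch {

  /** A reader that may apply a pushed-down filter fully, partially, or not at all. */
  interface FormatReader {
    List<Integer> read(List<Integer> rows, Predicate<Integer> filter);
  }

  /** Ignores the filter altogether and returns every row. */
  static final FormatReader IGNORING = (rows, filter) -> new ArrayList<>(rows);

  /** Drops whole chunks of 3 rows without any match, but keeps matching chunks unfiltered. */
  static final FormatReader CHUNK_PRUNING = (rows, filter) -> {
    List<Integer> out = new ArrayList<>();
    int chunkSize = 3;
    for (int start = 0; start < rows.size(); start += chunkSize) {
      List<Integer> chunk = rows.subList(start, Math.min(start + chunkSize, rows.size()));
      if (chunk.stream().anyMatch(filter)) {
        out.addAll(chunk);
      }
    }
    return out;
  };

  /** Re-applies the filter to whatever the reader returned, so the result is exact. */
  static List<Integer> readWithResidual(
      FormatReader reader, List<Integer> rows, Predicate<Integer> filter) {
    List<Integer> result = new ArrayList<>();
    for (Integer row : reader.read(rows, filter)) {
      if (filter.test(row)) {
        result.add(row);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Integer> rows = List.of(1, 2, 3, 10, 11, 12, 4, 20, 30);
    Predicate<Integer> filter = v -> v >= 10;
    // The pruning reader alone still leaks the non-matching row 4.
    System.out.println(CHUNK_PRUNING.read(rows, filter));
    // With the residual check both readers give the same exact result.
    System.out.println(readWithResidual(IGNORING, rows, filter));
    System.out.println(readWithResidual(CHUNK_PRUNING, rows, filter));
  }
}
```

Under a stricter contract, the residual check in `readWithResidual` would have to move into each reader, or the reader would have to report which parts of the filter it could not apply.
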
I would be interested in hearing your thoughts about these topics.
My current take:
1. *File format filters*: I am leaning towards keeping the current
lenient behavior, especially since Bloom filters cannot do exact filtering
(they can produce false positives) yet are often used to filter out
unwanted records. Another option would be to implement secondary filtering
inside the file formats themselves, which I think would add complexity and
possibly duplicate code. Whatever the decision here, I would suggest moving
this to a follow-up PR, as the current changeset is big enough as it is.
2. *Batch size configuration*: Currently this is the only property which
differs between the batch readers and the non-vectorized readers. I see 3
possible solutions:
   1. Create different builders for vectorized and non-vectorized reads -
   I don't think the current solution is confusing enough to be worth the
   extra class
   2. Move it into the reader configuration property set - This could
   work, but it would "hide" a configuration option which is valid for
   both the Parquet and ORC readers
   3. Keep things as they are now - I would choose this one, but I don't
   have a strong opinion here
3. *Spark configuration*: TBH, I'm open to both solutions and happy to
move in the direction the community decides on
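
The batch size options can be compared with a small sketch. This is only an illustration under stated assumptions, not the builder from the PR: `ReadBuilder`, `recordsPerBatch`, `set`, `effectiveBatchSize`, the `read.batch-size` key and the default of 5000 are all hypothetical. It shows a typed builder parameter (keeping things as they are) next to a string property (the property set option), and how either one is silently ignored for a non-vectorized read, which is the possible source of confusion.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: this builder is illustrative only, not the API from the PR.
final class BatchSizeSketch {
  static final String BATCH_SIZE_KEY = "read.batch-size"; // hypothetical property key
  static final int DEFAULT_BATCH_SIZE = 5000; // arbitrary default for the sketch

  static final class ReadBuilder {
    private final Map<String, String> properties = new HashMap<>();
    private boolean vectorized;
    private Integer recordsPerBatch; // null until set explicitly

    ReadBuilder vectorized(boolean enabled) {
      this.vectorized = enabled;
      return this;
    }

    /** Typed parameter: visible on the builder, but also settable for row-by-row reads. */
    ReadBuilder recordsPerBatch(int size) {
      this.recordsPerBatch = size;
      return this;
    }

    /** Generic property: nothing extra on the builder, but the option is less discoverable. */
    ReadBuilder set(String key, String value) {
      properties.put(key, value);
      return this;
    }

    /** Returns the batch size a vectorized read would use, or -1 for row-by-row reads. */
    int effectiveBatchSize() {
      if (!vectorized) {
        return -1; // any configured batch size is ignored here
      }
      if (recordsPerBatch != null) {
        return recordsPerBatch;
      }
      String fromProperties = properties.get(BATCH_SIZE_KEY);
      return fromProperties != null ? Integer.parseInt(fromProperties) : DEFAULT_BATCH_SIZE;
    }
  }

  public static void main(String[] args) {
    System.out.println(
        new ReadBuilder().vectorized(true).recordsPerBatch(1000).effectiveBatchSize());
    System.out.println(
        new ReadBuilder().vectorized(true).set(BATCH_SIZE_KEY, "2000").effectiveBatchSize());
    System.out.println(new ReadBuilder().recordsPerBatch(1000).effectiveBatchSize());
  }
}
```

Separate builders for vectorized and non-vectorized reads would remove `recordsPerBatch` from the row-by-row builder entirely, at the cost of a second class.
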
Thanks,
Peter
Jean-Baptiste Onofré <[email protected]> wrote (on Fri, Mar 14, 2025,
16:31):
> Hi Peter
>
> Thanks for the update. I will do a new pass on the PR.
>
> Regards
> JB
>
> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <[email protected]>
> wrote:
> >
> > Hi Team,
> > I have rebased the File Format API proposal (
> https://github.com/apache/iceberg/pull/12298) to include the new changes
> needed for the Variant types. I would love to hear your feedback,
> especially Dan and Ryan, as you were the most active during our
> discussions. If I can help in any way to make the review easier, please let
> me know.
> > Thanks,
> > Peter
> >
> > Péter Váry <[email protected]> wrote (on Fri, Feb 28, 2025,
> 17:50):
> >>
> >> Hi everyone,
> >> Thanks for all of the actionable, relevant feedback on the PR (
> https://github.com/apache/iceberg/pull/12298).
> >> I updated the code to address most of it. Please check if you agree
> with the general approach.
> >> If there is consensus about the general approach, I could separate
> the PR into smaller pieces so that they are easier to review and merge
> step by step.
> >> Thanks,
> >> Peter
> >>
> >> Jean-Baptiste Onofré <[email protected]> wrote (on Thu, Feb 20, 2025,
> 14:14):
> >>>
> >>> Hi Peter
> >>>
> >>> sorry for the late reply on this.
> >>>
> >>> I did a pass on the proposal, it's very interesting and well written.
> >>> I like the DataFile API, and it is definitely worth discussing all together.
> >>>
> >>> Maybe we can schedule a dedicated meeting to discuss the DataFile API?
> >>>
> >>> Thoughts ?
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <
> [email protected]> wrote:
> >>> >
> >>> > Hi Team,
> >>> >
> >>> > As mentioned earlier on our Community Sync I am exploring the
> possibility to define a FileFormat API for accessing different file
> formats. I have put together a proposal based on my findings.
> >>> >
> >>> > -------------------
> >>> > Iceberg currently supports 3 different file formats: Avro, Parquet,
> ORC. With the introduction of Iceberg V3 specification many new features
> are added to Iceberg. Some of these features like new column types, default
> values require changes at the file format level. The changes are added by
> individual developers with different focus on the different file formats.
> As a result not all of the features are available for every supported file
> format.
> >>> > Also there are emerging file formats like Vortex [1] or Lance [2]
> which either by specialization, or by applying newer research results could
> provide better alternatives for certain use-cases like random access for
> data, or storing ML models.
> >>> > -------------------
> >>> >
> >>> > Please check the detailed proposal [3] and the google document [4],
> and comment there or reply on the dev list if you have any suggestions.
> >>> >
> >>> > Thanks,
> >>> > Peter
> >>> >
> >>> > [1] - https://github.com/spiraldb/vortex
> >>> > [2] - https://lancedb.github.io/lance/
> >>> > [3] - https://github.com/apache/iceberg/issues/12225
> >>> > [4] -
> https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds
> >>> >
>