During the PR review [1], we began exploring what we could use as an intermediate layer to reduce the need for engines and file formats to implement the full matrix of file format to object model conversions.
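To illustrate the shape of such an intermediate layer, here is a minimal sketch with hypothetical names (`RegistrySketch`, `register`, `readBuilderFor` are illustrative, not the API from the PR): a shared registry keyed by the (file format, object model) pair means each format contributes one entry per supported model, instead of every engine hand-wiring a reader for every format.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: the names below are illustrative only, not the
// actual Iceberg API proposed in PR #12774.
public class RegistrySketch {
  // key = "format/model", value = factory producing a reader builder
  private static final Map<String, Supplier<Object>> REGISTRY = new HashMap<>();

  static {
    // Each file format registers one factory per object model it supports.
    register("parquet", "spark", () -> "parquet-spark-read-builder");
    register("avro", "flink", () -> "avro-flink-read-builder");
  }

  static void register(String format, String model, Supplier<Object> factory) {
    REGISTRY.put(format + "/" + model, factory);
  }

  static Object readBuilderFor(String format, String model) {
    Supplier<Object> factory = REGISTRY.get(format + "/" + model);
    if (factory == null) {
      throw new IllegalArgumentException(
          "No reader registered for " + format + "/" + model);
    }
    return factory.get();
  }

  public static void main(String[] args) {
    // An engine asks the registry instead of knowing each format directly.
    System.out.println(readBuilderFor("parquet", "spark"));
  }
}
```

With K formats and N object models, direct conversions grow on the order of K×N, while a registry like this needs only one registration per actually supported pair, and unsupported pairs fail fast with a clear error.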
To support this discussion, I’ve created and run a set of performance benchmarks and compiled a document outlining the potential benefits and trade-offs [2].
Feedback is welcome; feel free to comment on the document, the PR, or directly in this thread.

Thanks,
Peter

[1] - PR discussion - https://github.com/apache/iceberg/pull/12774#discussion_r2093626096
[2] - File Format and engine object model transformation performance - https://docs.google.com/document/d/1GdA8IowKMtS3QVdm8s-0X-ZRYetcHv2bhQ9mrSd3fd4

Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, May 7, 2025, 13:15):
> Hi everyone,
> The proposed API part is reviewed and ready to go. See:
> https://github.com/apache/iceberg/pull/12774
> Thanks to everyone who reviewed it already!
>
> Many of you wanted to review, but I know that time constraints apply to
> everyone. I would still very much like to hear your voices, so I will not
> merge the PR this week. Please review it if you can.
>
> Thanks,
> Peter
>
> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Apr 16, 2025,
> 7:02):
>
>> Hi Renjie,
>> The first one for the proposed new API is here:
>> https://github.com/apache/iceberg/pull/12774
>> Thanks, Peter
>>
>> On Wed, Apr 16, 2025, 05:40 Renjie Liu <liurenjie2...@gmail.com> wrote:
>>
>>> Hi, Peter:
>>>
>>> Thanks for the effort. I totally agree with splitting them into smaller
>>> PRs to move forward.
>>>
>>> I'm quite interested in this topic; please ping me on those split PRs
>>> and I'll help review.
>>>
>>> On Mon, Apr 14, 2025 at 11:22 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>> wrote:
>>>
>>>> Hi Peter
>>>>
>>>> Awesome! Thank you so much!
>>>> I will do a new pass.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <peter.vary.apa...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Hi JB,
>>>> >
>>>> > Separated out the proposed interfaces into a new PR:
>>>> https://github.com/apache/iceberg/pull/12774.
>>>> > Reviewers can check that out if they are only interested in what the
>>>> new API would look like.
>>>> >
>>>> > Thanks,
>>>> > Peter
>>>> >
>>>> > Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Apr 10, 2025,
>>>> 18:25):
>>>> >>
>>>> >> Hi Peter
>>>> >>
>>>> >> Thanks for the ping about the PR.
>>>> >>
>>>> >> Maybe, to facilitate the review and move forward faster, we should
>>>> >> split the PR into smaller PRs:
>>>> >> - one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel,
>>>> >> AppenderBuilder, DataWriterBuilder, ...)
>>>> >> - one for each file format provider (Parquet, Avro, ORC)
>>>> >>
>>>> >> Thoughts? I can help with the split if needed.
>>>> >>
>>>> >> Regards
>>>> >> JB
>>>> >>
>>>> >> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <
>>>> peter.vary.apa...@gmail.com> wrote:
>>>> >> >
>>>> >> > Since the 1.9.0 release candidate has been created, I would like
>>>> to resurrect this PR: https://github.com/apache/iceberg/pull/12298 to
>>>> ensure that we have as long a testing period as possible for it.
>>>> >> >
>>>> >> > To recap, here is what the PR does after the review rounds:
>>>> >> >
>>>> >> > Created three interface classes which are implemented by the file
>>>> formats:
>>>> >> >
>>>> >> > ReadBuilder - Builder for reading data from data files
>>>> >> > AppenderBuilder - Builder for writing data to data files
>>>> >> > ObjectModel - Provides ReadBuilders and AppenderBuilders for a
>>>> specific data file format and object model pair
>>>> >> >
>>>> >> > Updated the Parquet, Avro, and ORC implementations for these
>>>> interfaces, and deprecated the old reader/writer APIs
>>>> >> > Created interface classes which will be used by the actual
>>>> readers/writers of the data files:
>>>> >> >
>>>> >> > AppenderBuilder - Builder for writing a file
>>>> >> > DataWriterBuilder - Builder for generating a data file
>>>> >> > PositionDeleteWriterBuilder - Builder for generating a position
>>>> delete file
>>>> >> > EqualityDeleteWriterBuilder - Builder for generating an equality
>>>> delete file
>>>> >> > No ReadBuilder here - the file format reader builder is reused
>>>> >> >
>>>> >> > Created a WriterBuilder class which implements the interfaces above
>>>> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder)
>>>> based on a provided file-format-specific AppenderBuilder
>>>> >> > Created an ObjectModelRegistry which stores the available
>>>> ObjectModels, and from which engines and users can request the readers
>>>> (ReadBuilder) and writers
>>>> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder).
>>>> >> > Created the appropriate ObjectModels:
>>>> >> >
>>>> >> > GenericObjectModels - for reading and writing Iceberg Records
>>>> >> > SparkObjectModels - for reading (vectorized and non-vectorized)
>>>> and writing Spark InternalRow/ColumnarBatch objects
>>>> >> > FlinkObjectModels - for reading and writing Flink RowData objects
>>>> >> > An Arrow object model is also registered for vectorized reads of
>>>> Parquet files into Arrow ColumnarBatch objects
>>>> >> >
>>>> >> > Updated the production code where the reading and writing happens
>>>> to use the ObjectModelRegistry and the new reader/writer interfaces to
>>>> access data files
>>>> >> > Kept the testing code intact to ensure that the new API/code does
>>>> not break anything
>>>> >> >
>>>> >> > The original change was not small, and grew substantially during
>>>> the review rounds. So if you have questions, or if I can do anything to
>>>> make the review easier, don't hesitate to ask. I am happy to do anything
>>>> to move this forward.
>>>> >> >
>>>> >> > Thanks,
>>>> >> > Peter
>>>> >> >
>>>> >> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, Mar 26,
>>>> 2025, 14:54):
>>>> >> >>
>>>> >> >> Hi everyone,
>>>> >> >>
>>>> >> >> I have updated the File Format API PR (
>>>> https://github.com/apache/iceberg/pull/12298) based on the answers and
>>>> review comments.
>>>> >> >>
>>>> >> >> I would like to merge this only after the 1.9.0 release so we
>>>> have more time to find any issues and solve them before this goes into a
>>>> release for the users.
>>>> >> >>
>>>> >> >> For this I have updated the deprecation comments accordingly.
>>>> >> >> I would like to ask you to review the PR, so we can iron out any
>>>> requested changes and be ready to merge as soon as possible after the
>>>> 1.9.0 release.
>>>> >> >>
>>>> >> >> Thanks,
>>>> >> >> Peter
>>>> >> >>
>>>> >> >> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri, Mar 21,
>>>> 2025, 14:32):
>>>> >> >>>
>>>> >> >>> Hi Renjie,
>>>> >> >>>
>>>> >> >>> > 1. File format filters
>>>> >> >>> >
>>>> >> >>> > Do the filters include filter expressions from both the user
>>>> query and the delete filter?
>>>> >> >>>
>>>> >> >>> The current discussion is about the filters from the user query.
>>>> >> >>>
>>>> >> >>> About the delete filter:
>>>> >> >>> Based on the suggestions on the PR, I have moved the delete
>>>> filter out of the main API. Created a `SupportsDeleteFilter` interface
>>>> for it which would allow pushing down the filter to the Parquet
>>>> vectorized readers in Spark, as this is the only place where we have
>>>> currently implemented this feature.
>>>> >> >>>
>>>> >> >>> Renjie Liu <liurenjie2...@gmail.com> wrote (on Fri, Mar 21, 2025,
>>>> 14:11):
>>>> >> >>>>
>>>> >> >>>> Hi, Peter:
>>>> >> >>>>
>>>> >> >>>> Thanks for the effort on this.
>>>> >> >>>>
>>>> >> >>>> 1. File format filters
>>>> >> >>>>
>>>> >> >>>> Do the filters include filter expressions from both the user
>>>> query and the delete filter?
>>>> >> >>>>
>>>> >> >>>> For filters from the user query, I agree with you that we
>>>> should keep the current behavior.
>>>> >> >>>>
>>>> >> >>>> For delete filters associated with data files, at first I
>>>> thought file format readers should not care about this. But now I
>>>> realize that maybe we need to push it to the file reader as well; this
>>>> is useful when the `IS_DELETED` metadata column is not necessary and we
>>>> could use these filters (position deletes, etc.) to further prune data.
>>>> >> >>>>
>>>> >> >>>> But anyway, I agree that we could postpone it to a follow-up PR.
>>>> >> >>>>
>>>> >> >>>> 2. Batch size configuration
>>>> >> >>>>
>>>> >> >>>> I'm leaning toward option 2.
>>>> >> >>>>
>>>> >> >>>> 3. Spark configuration
>>>> >> >>>>
>>>> >> >>>> I'm leaning towards using different configuration objects.
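The `SupportsDeleteFilter` interface discussed above could be sketched roughly as follows; the interface name comes from the thread, but its shape and the surrounding builder classes here are assumptions for illustration only.

```java
// Sketch only: the real SupportsDeleteFilter in the PR may differ; the
// builder classes and the int[]-based filter type here are made up.
public class DeleteFilterSketch {
  /** Opt-in mixin for reader builders that accept a pushed-down delete filter. */
  interface SupportsDeleteFilter<F> {
    void deleteFilter(F filter);
  }

  // Per the thread, only the Spark vectorized Parquet path currently
  // implements the pushdown, so only this builder opts in.
  static class VectorizedParquetReadBuilder implements SupportsDeleteFilter<int[]> {
    int[] deletedPositions = new int[0];

    @Override
    public void deleteFilter(int[] positions) {
      this.deletedPositions = positions;
    }
  }

  static class AvroReadBuilder {
    // no delete-filter pushdown support
  }

  public static void main(String[] args) {
    Object builder = new VectorizedParquetReadBuilder();
    // Callers feature-test instead of forcing every format to support pushdown.
    if (builder instanceof SupportsDeleteFilter) {
      @SuppressWarnings("unchecked")
      SupportsDeleteFilter<int[]> sink = (SupportsDeleteFilter<int[]>) builder;
      sink.deleteFilter(new int[] {3, 7});
    }
    System.out.println(((VectorizedParquetReadBuilder) builder).deletedPositions.length);
  }
}
```

Keeping the pushdown in a separate opt-in interface keeps the main reader API small while letting a single format/engine pair support the feature.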
>>>> >> >>>>
>>>> >> >>>>
>>>> >> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <
>>>> peter.vary.apa...@gmail.com> wrote:
>>>> >> >>>>>
>>>> >> >>>>> Hi Team,
>>>> >> >>>>> Thanks everyone for the reviews on
>>>> https://github.com/apache/iceberg/pull/12298!
>>>> >> >>>>> I have addressed most of the comments, but a few questions
>>>> still remain which might merit a bit wider audience:
>>>> >> >>>>>
>>>> >> >>>>> We should decide on the expected filtering behavior when the
>>>> filters are pushed down to the readers. Currently the filters are
>>>> applied on a best-effort basis by the file format readers. Some readers
>>>> (Avro) just skip them altogether. There was a suggestion on the PR that
>>>> we might enforce stricter requirements: the readers either reject part
>>>> of the filters, or they apply them fully.
>>>> >> >>>>> Batch sizes are currently parameters for the reader builders,
>>>> which could be set for non-vectorized readers too, which could be
>>>> confusing.
>>>> >> >>>>> Currently the Spark batch reader uses different configuration
>>>> objects, ParquetBatchReadConf and OrcBatchReadConf, as requested by the
>>>> reviewers of the Comet PR. There was a suggestion on the current PR to
>>>> use a common configuration instead.
>>>> >> >>>>>
>>>> >> >>>>> I would be interested in hearing your thoughts about these
>>>> topics.
>>>> >> >>>>>
>>>> >> >>>>> My current take:
>>>> >> >>>>>
>>>> >> >>>>> File format filters: I am leaning towards keeping the current
>>>> lenient behavior. Especially since Bloom filters are not able to do full
>>>> filtering, and are often used as a way to filter out unwanted records.
>>>> Another option would be to implement a secondary filtering step inside
>>>> the file formats themselves, which I think would cause extra complexity
>>>> and possible code duplication. Whatever the decision here, I would
>>>> suggest moving this out to a follow-up PR as the current changeset is
>>>> big enough as it is.
>>>> >> >>>>> Batch size configuration: Currently this is the only property
>>>> which differs between the batch readers and the non-vectorized readers.
>>>> I see 3 possible solutions:
>>>> >> >>>>>
>>>> >> >>>>> Create different builders for vectorized and non-vectorized
>>>> reads - I don't think the current solution is confusing enough to be
>>>> worth the extra class
>>>> >> >>>>> We could put this into the reader configuration property set -
>>>> This could work, but would "hide" the possible configuration mode which
>>>> is valid for both Parquet and ORC readers
>>>> >> >>>>> We could keep things as they are now - I would choose this
>>>> one, but I don't have a strong opinion here
>>>> >> >>>>>
>>>> >> >>>>> Spark configuration: TBH, I'm open to both solutions and happy
>>>> to move in the direction the community decides on
>>>> >> >>>>>
>>>> >> >>>>> Thanks,
>>>> >> >>>>> Peter
>>>> >> >>>>>
>>>> >> >>>>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Fri, Mar 14,
>>>> 2025, 16:31):
>>>> >> >>>>>>
>>>> >> >>>>>> Hi Peter
>>>> >> >>>>>>
>>>> >> >>>>>> Thanks for the update. I will do a new pass on the PR.
>>>> >> >>>>>>
>>>> >> >>>>>> Regards
>>>> >> >>>>>> JB
>>>> >> >>>>>>
>>>> >> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <
>>>> peter.vary.apa...@gmail.com> wrote:
>>>> >> >>>>>> >
>>>> >> >>>>>> > Hi Team,
>>>> >> >>>>>> > I have rebased the File Format API proposal (
>>>> https://github.com/apache/iceberg/pull/12298) to include the new
>>>> changes needed for the Variant types. I would love to hear your
>>>> feedback, especially Dan and Ryan, as you were the most active during
>>>> our discussions. If I can help in any way to make the review easier,
>>>> please let me know.
>>>> >> >>>>>> > Thanks,
>>>> >> >>>>>> > Peter
>>>> >> >>>>>> >
>>>> >> >>>>>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (on Fri,
>>>> Feb 28, 2025, 17:50):
>>>> >> >>>>>> >>
>>>> >> >>>>>> >> Hi everyone,
>>>> >> >>>>>> >> Thanks for all of the actionable, relevant feedback on the
>>>> PR (https://github.com/apache/iceberg/pull/12298).
>>>> >> >>>>>> >> I have updated the code to address most of it. Please
>>>> check whether you agree with the general approach.
>>>> >> >>>>>> >> If there is a consensus about the general approach, I
>>>> could separate the PR into smaller pieces so we can have an easier time
>>>> reviewing and merging those step-by-step.
>>>> >> >>>>>> >> Thanks,
>>>> >> >>>>>> >> Peter
>>>> >> >>>>>> >>
>>>> >> >>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu,
>>>> Feb 20, 2025, 14:14):
>>>> >> >>>>>> >>>
>>>> >> >>>>>> >>> Hi Peter
>>>> >> >>>>>> >>>
>>>> >> >>>>>> >>> Sorry for the late reply on this.
>>>> >> >>>>>> >>>
>>>> >> >>>>>> >>> I did a pass on the proposal; it's very interesting and
>>>> well written.
>>>> >> >>>>>> >>> I like the DataFile API, and it is definitely worth
>>>> discussing all together.
>>>> >> >>>>>> >>>
>>>> >> >>>>>> >>> Maybe we can schedule a specific meeting to discuss the
>>>> DataFile API?
>>>> >> >>>>>> >>>
>>>> >> >>>>>> >>> Thoughts?
>>>> >> >>>>>> >>>
>>>> >> >>>>>> >>> Regards
>>>> >> >>>>>> >>> JB
>>>> >> >>>>>> >>>
>>>> >> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <
>>>> peter.vary.apa...@gmail.com> wrote:
>>>> >> >>>>>> >>> >
>>>> >> >>>>>> >>> > Hi Team,
>>>> >> >>>>>> >>> >
>>>> >> >>>>>> >>> > As mentioned earlier on our Community Sync, I am
>>>> exploring the possibility of defining a FileFormat API for accessing
>>>> different file formats. I have put together a proposal based on my
>>>> findings.
>>>> >> >>>>>> >>> >
>>>> >> >>>>>> >>> > -------------------
>>>> >> >>>>>> >>> > Iceberg currently supports 3 different file formats:
>>>> Avro, Parquet, ORC. With the introduction of the Iceberg V3
>>>> specification, many new features are added to Iceberg.
Some of these features, like new column
>>>> types and default values, require changes at the file format level. The
>>>> changes are added by individual developers with different focus on the
>>>> different file formats. As a result, not all of the features are
>>>> available for every supported file format.
>>>> >> >>>>>> >>> > Also there are emerging file formats like Vortex [1] or
>>>> Lance [2] which, either by specialization or by applying newer research
>>>> results, could provide better alternatives for certain use cases like
>>>> random access for data, or storing ML models.
>>>> >> >>>>>> >>> > -------------------
>>>> >> >>>>>> >>> >
>>>> >> >>>>>> >>> > Please check the detailed proposal [3] and the Google
>>>> document [4], and comment there or reply on the dev list if you have any
>>>> suggestions.
>>>> >> >>>>>> >>> >
>>>> >> >>>>>> >>> > Thanks,
>>>> >> >>>>>> >>> > Peter
>>>> >> >>>>>> >>> >
>>>> >> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex
>>>> >> >>>>>> >>> > [2] - https://lancedb.github.io/lance/
>>>> >> >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
>>>> >> >>>>>> >>> > [4] -
>>>> https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds
>>>> >> >>>>>> >>> >
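To make the recap earlier in the thread more concrete, here is a minimal, self-contained sketch of the three roles implemented by the file formats (ReadBuilder, AppenderBuilder, ObjectModel). The shapes below are assumptions for illustration, not the actual signatures from PR #12774, and an in-memory list stands in for a real Parquet/Avro/ORC file.

```java
import java.util.ArrayList;
import java.util.List;

public class FileFormatApiSketch {
  // The three roles named in the recap; shapes here are assumed.
  interface ReadBuilder<D> {
    Iterable<D> build();
  }

  interface AppenderBuilder<D> {
    void add(D datum);
    List<D> complete();
  }

  interface ObjectModel<D> {
    String formatName();
    ReadBuilder<D> readBuilder(List<D> file);
    AppenderBuilder<D> appenderBuilder();
  }

  // An in-memory "format" standing in for Parquet/Avro/ORC.
  static class InMemoryModel implements ObjectModel<String> {
    @Override
    public String formatName() {
      return "in-memory";
    }

    @Override
    public ReadBuilder<String> readBuilder(List<String> file) {
      return () -> file;
    }

    @Override
    public AppenderBuilder<String> appenderBuilder() {
      List<String> rows = new ArrayList<>();
      return new AppenderBuilder<String>() {
        @Override
        public void add(String datum) {
          rows.add(datum);
        }

        @Override
        public List<String> complete() {
          return rows;
        }
      };
    }
  }

  public static void main(String[] args) {
    ObjectModel<String> model = new InMemoryModel();
    AppenderBuilder<String> writer = model.appenderBuilder();
    writer.add("row-1");
    writer.add("row-2");
    List<String> file = writer.complete();
    int count = 0;
    for (String row : model.readBuilder(file).build()) {
      count++;
    }
    System.out.println(count);
  }
}
```

The point of the split is that an engine only talks to these roles; which concrete format sits behind them is resolved at registration time rather than compiled into the engine.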