I have also put together a solution where the engine-specific format
transformation is separated from the writer, so the engines need to take
care of it themselves.
This is somewhat complicated on the implementation side (see:
[RowDataTransformer](
https://github.com/apache/iceberg/pull/12298/files#diff-562fa4cc369c908a157f59a9235fd3f389096451e7901686fba37c87b53dee08),
and [InternalRowTransformer](
https://github.com/apache/iceberg/pull/12298/files#diff-546f9dc30e3207d1d2bc0a2722976b55f5a04dcf85a22855e4f400500c317140)),
but simplifies the API.
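For readers following along, the separation can be sketched as follows. This is a simplified, hypothetical illustration of the pattern only (the names and shapes are made up for the example and are far simpler than the actual RowDataTransformer/InternalRowTransformer classes in the PR):

```java
// Hypothetical sketch (not the actual Iceberg code): the writer stays
// format-specific, and each engine supplies a transformer that converts
// its own row type into the record type the file format expects.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class TransformerSketch {
  // A writer that only understands the file format's record type D.
  static class FormatWriter<D> {
    final List<D> written = new ArrayList<>();

    void write(D record) {
      written.add(record);
    }
  }

  // Engine-side wrapper: applies the engine -> format transformation
  // before delegating to the format writer.
  static class TransformingWriter<E, D> {
    private final FormatWriter<D> delegate;
    private final Function<E, D> transformer;

    TransformingWriter(FormatWriter<D> delegate, Function<E, D> transformer) {
      this.delegate = delegate;
      this.transformer = transformer;
    }

    void write(E engineRow) {
      delegate.write(transformer.apply(engineRow));
    }
  }

  public static void main(String[] args) {
    FormatWriter<String> formatWriter = new FormatWriter<>();
    // Stand-in for an engine row -> format record transformation,
    // e.g. Flink RowData -> generic record, stubbed here as int -> String.
    TransformingWriter<Integer, String> writer =
        new TransformingWriter<>(formatWriter, i -> "record-" + i);
    writer.write(1);
    writer.write(2);
    System.out.println(formatWriter.written);
  }
}
```

The point of the split is that the writer stays object-model-agnostic, and each engine owns the one transformation it needs, instead of the writer carrying the full engine x file format conversion matrix.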

@rdblue: Please check the proposed solution. I think this is what you
suggested.

Péter Váry <peter.vary.apa...@gmail.com> wrote (Mon, Jun 30, 2025, 18:42):

> During the PR review [1], we began exploring what we could use as an
> intermediate layer to reduce the need for engines and file formats to
> implement the full matrix of file format - object model conversions.
>
> To support this discussion, I’ve created and run a set of performance
> benchmarks and compiled a document outlining the potential benefits and
> trade-offs [2].
>
> Feedback is welcome; feel free to comment on the document, the PR, or
> directly in this thread.
>
> Thanks,
> Peter
>
> [1] - PR discussion -
> https://github.com/apache/iceberg/pull/12774#discussion_r2093626096
> [2] - File Format and engine object model transformation performance -
> https://docs.google.com/document/d/1GdA8IowKMtS3QVdm8s-0X-ZRYetcHv2bhQ9mrSd3fd4
>
> Péter Váry <peter.vary.apa...@gmail.com> wrote (Wed, May 7, 2025, 13:15):
>
>> Hi everyone,
>> The proposed API part is reviewed and ready to go. See:
>> https://github.com/apache/iceberg/pull/12774
>> Thanks to everyone who reviewed it already!
>>
>> Many of you wanted to review, but I know everyone is under time
>> constraints. I would still very much like to hear your voices, so I
>> will not merge the PR this week. Please review it if you can.
>>
>> Thanks,
>> Peter
>>
>> Péter Váry <peter.vary.apa...@gmail.com> wrote (Wed, Apr 16, 2025, 7:02):
>>
>>> Hi Renjie,
>>> The first one for the proposed new API is here:
>>> https://github.com/apache/iceberg/pull/12774
>>> Thanks, Peter
>>>
>>> On Wed, Apr 16, 2025, 05:40 Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>
>>>> Hi, Peter:
>>>>
>>>> Thanks for the effort. I totally agree with splitting them into smaller
>>>> PRs to move forward.
>>>>
>>>> I'm quite interested in this topic, so please ping me on those split
>>>> PRs and I'll help review.
>>>>
>>>> On Mon, Apr 14, 2025 at 11:22 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>> wrote:
>>>>
>>>>> Hi Peter
>>>>>
>>>>> Awesome ! Thank you so much !
>>>>> I will do a new pass.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <
>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>> >
>>>>> > Hi JB,
>>>>> >
>>>>> > Separated out the proposed interfaces to a new PR:
>>>>> https://github.com/apache/iceberg/pull/12774.
>>>>> > Reviewers can check that out if they are only interested in how the
>>>>> new API will look.
>>>>> >
>>>>> > Thanks,
>>>>> > Peter
>>>>> >
>>>>> > Jean-Baptiste Onofré <j...@nanthrax.net> wrote (Thu, Apr 10, 2025, 18:25):
>>>>> >>
>>>>> >> Hi Peter
>>>>> >>
>>>>> >> Thanks for the ping about the PR.
>>>>> >>
>>>>> >> Maybe, to facilitate the review and move forward faster, we should
>>>>> >> split the PR into smaller PRs:
>>>>> >> - one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel,
>>>>> >> AppenderBuilder, DataWriterBuilder, ...)
>>>>> >> - one for each file format provider (Parquet, Avro, ORC)
>>>>> >>
>>>>> >> Thoughts ? I can help on the split if needed.
>>>>> >>
>>>>> >> Regards
>>>>> >> JB
>>>>> >>
>>>>> >> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <
>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>> >> >
>>>>> >> > Since the 1.9.0 release candidate has been created, I would like
>>>>> to resurrect this PR: https://github.com/apache/iceberg/pull/12298 to
>>>>> ensure that we have as long a testing period as possible for it.
>>>>> >> >
>>>>> >> > To recap, here is what the PR does after the review rounds:
>>>>> >> >
>>>>> >> > Created 3 interface classes which are implemented by the file formats:
>>>>> >> > - ReadBuilder - builder for reading data from data files
>>>>> >> > - AppenderBuilder - builder for writing data to data files
>>>>> >> > - ObjectModel - provides ReadBuilders and AppenderBuilders for the specific data file format and object model pair
>>>>> >> >
>>>>> >> > Updated the Parquet, Avro, and ORC implementations for these interfaces, and deprecated the old reader/writer APIs.
>>>>> >> > Created interface classes which are used by the actual readers/writers of the data files:
>>>>> >> > - AppenderBuilder - builder for writing a file
>>>>> >> > - DataWriterBuilder - builder for generating a data file
>>>>> >> > - PositionDeleteWriterBuilder - builder for generating a position delete file
>>>>> >> > - EqualityDeleteWriterBuilder - builder for generating an equality delete file
>>>>> >> > (There is no ReadBuilder here - the file format reader builder is reused.)
>>>>> >> >
>>>>> >> > Created a WriterBuilder class which implements the interfaces above (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) based on a provided file format specific AppenderBuilder.
>>>>> >> > Created an ObjectModelRegistry which stores the available ObjectModels, and from which engines and users can request the readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder).
>>>>> >> > Created the appropriate ObjectModels:
>>>>> >> > - GenericObjectModels - for reading and writing Iceberg Records
>>>>> >> > - SparkObjectModels - for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objects
>>>>> >> > - FlinkObjectModels - for reading and writing Flink RowData objects
>>>>> >> > - An Arrow object model is also registered for vectorized reads of Parquet files into Arrow ColumnarBatch objects
>>>>> >> >
>>>>> >> > Updated the production code where reading and writing happens to use the ObjectModelRegistry and the new reader/writer interfaces to access data files.
>>>>> >> > Kept the testing code intact to ensure that the new API/code does not break anything.
>>>>> >> >
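For context, the registry idea above can be sketched roughly like this. All names and signatures below are illustrative only, not the actual Iceberg code:

```java
// Hypothetical sketch of the ObjectModelRegistry idea: object models are
// registered per (file format, object model name) pair, and engines look
// up a reader builder through the registry instead of calling the file
// format classes directly.
import java.util.HashMap;
import java.util.Map;

public class RegistrySketch {
  interface ReadBuilder {
    String describe();
  }

  // An object model binds a file format to the engine row type it produces.
  interface ObjectModel {
    String format();      // e.g. "parquet"
    String name();        // e.g. "generic", "spark", "flink"
    ReadBuilder readBuilder();
  }

  static class ObjectModelRegistry {
    private final Map<String, ObjectModel> models = new HashMap<>();

    void register(ObjectModel model) {
      models.put(model.format() + "/" + model.name(), model);
    }

    // Engines and users request readers for a (format, object model) pair.
    ReadBuilder readBuilder(String format, String modelName) {
      ObjectModel model = models.get(format + "/" + modelName);
      if (model == null) {
        throw new IllegalArgumentException(
            "No object model registered for " + format + "/" + modelName);
      }
      return model.readBuilder();
    }
  }

  public static void main(String[] args) {
    ObjectModelRegistry registry = new ObjectModelRegistry();
    registry.register(new ObjectModel() {
      public String format() { return "parquet"; }
      public String name() { return "generic"; }
      public ReadBuilder readBuilder() { return () -> "parquet/generic reader"; }
    });
    System.out.println(registry.readBuilder("parquet", "generic").describe());
  }
}
```

A new file format or object model then only has to register itself; the engines keep going through the same lookup.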
>>>>> >> > The original change was not small, and grew substantially during
>>>>> the review rounds. So if you have questions, or I can do anything to make
>>>>> the review easier, don't hesitate to ask. I am happy to do anything to 
>>>>> move
>>>>> this forward.
>>>>> >> >
>>>>> >> > Thanks,
>>>>> >> > Peter
>>>>> >> >
>>>>> >> > Péter Váry <peter.vary.apa...@gmail.com> wrote (Wed, Mar 26, 2025, 14:54):
>>>>> >> >>
>>>>> >> >> Hi everyone,
>>>>> >> >>
>>>>> >> >> I have updated the File Format API PR (
>>>>> https://github.com/apache/iceberg/pull/12298) based on the answers
>>>>> and review comments.
>>>>> >> >>
>>>>> >> >> I would like to merge this only after the 1.9.0 release so we
>>>>> have more time finding any issues and solving them before this goes to a
>>>>> release for the users.
>>>>> >> >>
>>>>> >> >> For this I have updated the deprecation comments accordingly.
>>>>> >> >> I would like to ask you to review the PR, so we iron out any
>>>>> possible requested changes and be ready for the merge as soon as possible
>>>>> after the 1.9.0 release.
>>>>> >> >>
>>>>> >> >> Thanks,
>>>>> >> >> Peter
>>>>> >> >>
>>>>> >> >> Péter Váry <peter.vary.apa...@gmail.com> wrote (Fri, Mar 21, 2025, 14:32):
>>>>> >> >>>
>>>>> >> >>> Hi Renjie,
>>>>> >> >>>
>>>>> >> >>> > 1. File format filters
>>>>> >> >>> >
>>>>> >> >>> > Do the filters include filter expressions from both the user
>>>>> query and the delete filter?
>>>>> >> >>>
>>>>> >> >>> The current discussion is about the filters from the user query.
>>>>> >> >>>
>>>>> >> >>> About the delete filter:
>>>>> >> >>> Based on the suggestions on the PR, I have moved the delete
>>>>> filter out from the main API. I created a `SupportsDeleteFilter` interface
>>>>> for it, which allows pushing the filter down to the Parquet vectorized
>>>>> readers in Spark, as this is the only place where we currently implement
>>>>> this feature.
>>>>> >> >>>
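For illustration, the opt-in capability pattern could look roughly like this (a sketch with made-up names and a placeholder filter type, not the actual `SupportsDeleteFilter` definition):

```java
// Illustrative sketch of an opt-in capability interface, similar in
// spirit to SupportsDeleteFilter: only readers that can apply a delete
// filter implement the extra interface, so the main ReadBuilder API
// stays free of delete-filter methods.
public class DeleteFilterSketch {
  interface ReadBuilder { }

  // Opt-in capability; F is a placeholder for the real filter type.
  interface SupportsDeleteFilter<F> extends ReadBuilder {
    void deleteFilter(F filter);
  }

  // A reader that can prune deleted rows during the read.
  static class VectorizedParquetReadBuilder implements SupportsDeleteFilter<int[]> {
    int[] deletedPositions = new int[0];

    public void deleteFilter(int[] filter) {
      this.deletedPositions = filter;
    }
  }

  // A reader that never sees delete filters.
  static class AvroReadBuilder implements ReadBuilder { }

  public static void main(String[] args) {
    ReadBuilder builder = new VectorizedParquetReadBuilder();
    // Callers push the filter down only when the builder opts in.
    if (builder instanceof SupportsDeleteFilter) {
      ((SupportsDeleteFilter<int[]>) builder).deleteFilter(new int[] {3, 7});
    }
    System.out.println(((VectorizedParquetReadBuilder) builder).deletedPositions.length);
  }
}
```

Readers that cannot use the filter simply do not implement the interface, and callers fall back to applying deletes outside the reader.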
>>>>> >> >>>
>>>>> >> >>> Renjie Liu <liurenjie2...@gmail.com> wrote (Fri, Mar 21, 2025, 14:11):
>>>>> >> >>>>
>>>>> >> >>>> Hi, Peter:
>>>>> >> >>>>
>>>>> >> >>>> Thanks for the effort on this.
>>>>> >> >>>>
>>>>> >> >>>> 1. File format filters
>>>>> >> >>>>
>>>>> >> >>>> Do the filters include filter expressions from both the user
>>>>> query and the delete filter?
>>>>> >> >>>>
>>>>> >> >>>> For filters from the user query, I agree with you that we should
>>>>> keep the current behavior.
>>>>> >> >>>>
>>>>> >> >>>> For delete filters associated with data files, at first I
>>>>> thought file format readers should not care about them. But now I realize
>>>>> that maybe we need to push them to the file reader as well: this is useful
>>>>> when the `IS_DELETED` metadata column is not needed and we could use these
>>>>> filters (position deletes, etc.) to further prune data.
>>>>> >> >>>>
>>>>> >> >>>> But anyway, I agree that we could postpone it to a follow-up PR.
>>>>> >> >>>>
>>>>> >> >>>> 2. Batch size configuration
>>>>> >> >>>>
>>>>> >> >>>> I'm leaning toward option 2.
>>>>> >> >>>>
>>>>> >> >>>> 3. Spark configuration
>>>>> >> >>>>
>>>>> >> >>>> I'm leaning towards using different configuration objects.
>>>>> >> >>>>
>>>>> >> >>>>
>>>>> >> >>>>
>>>>> >> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <
>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>> >> >>>>>
>>>>> >> >>>>> Hi Team,
>>>>> >> >>>>> Thanks everyone for the reviews on
>>>>> https://github.com/apache/iceberg/pull/12298!
>>>>> >> >>>>> I have addressed most of the comments, but a few questions still
>>>>> remain which might merit a bit wider audience:
>>>>> >> >>>>>
>>>>> >> >>>>> - We should decide on the expected filtering behavior when
>>>>> filters are pushed down to the readers. Currently the filters are applied
>>>>> on a best-effort basis by the file format readers. Some readers (Avro)
>>>>> just skip them altogether. There was a suggestion on the PR that we might
>>>>> enforce stricter requirements: the readers either reject part of the
>>>>> filters, or they apply them fully.
>>>>> >> >>>>> - Batch sizes are currently parameters of the reader builders,
>>>>> so they can be set for non-vectorized readers too, which could be confusing.
>>>>> >> >>>>> - Currently the Spark batch reader uses different configuration
>>>>> objects, ParquetBatchReadConf and OrcBatchReadConf, as requested by the
>>>>> reviewers of the Comet PR. There was a suggestion on the current PR to use
>>>>> a common configuration instead.
>>>>> >> >>>>>
>>>>> >> >>>>> I would be interested in hearing your thoughts about these
>>>>> topics.
>>>>> >> >>>>>
>>>>> >> >>>>> My current take:
>>>>> >> >>>>>
>>>>> >> >>>>> File format filters: I am leaning towards keeping the current
>>>>> lenient behavior, especially since Bloom filters cannot do exact filtering
>>>>> and are often used as a way to filter out unwanted records. Another option
>>>>> would be to implement secondary filtering inside the file formats
>>>>> themselves, which I think would cause extra complexity and possible
>>>>> code duplication. Whatever the decision here, I would suggest moving this
>>>>> out to a follow-up PR, as the current changeset is big enough as it is.
>>>>> >> >>>>> Batch size configuration: Currently this is the only property
>>>>> which differs between the batch readers and the non-vectorized readers. I
>>>>> see 3 possible solutions:
>>>>> >> >>>>> - Create different builders for vectorized and non-vectorized
>>>>> reads - I don't think the current solution is confusing enough to be worth
>>>>> the extra class
>>>>> >> >>>>> - Put this into the reader configuration property set - this
>>>>> could work, but would "hide" a configuration option which is valid
>>>>> for both Parquet and ORC readers
>>>>> >> >>>>> - Keep things as they are now - I would choose this one,
>>>>> but I don't have a strong opinion here
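As a sketch of the second option (the property key and default below are made up for the example), the batch size could live in the reader's generic property set, where non-vectorized readers would simply ignore it:

```java
// Illustrative sketch: batch size as a plain reader property instead of
// a dedicated builder method. The property name and default are
// hypothetical, not actual Iceberg configuration keys.
import java.util.HashMap;
import java.util.Map;

public class BatchSizeSketch {
  static final String BATCH_SIZE_PROP = "read.batch-size"; // hypothetical key
  static final int DEFAULT_BATCH_SIZE = 4096;              // hypothetical default

  static class ReadBuilder {
    private final Map<String, String> properties = new HashMap<>();

    ReadBuilder set(String key, String value) {
      properties.put(key, value);
      return this;
    }

    // Vectorized readers read the batch size from the property set;
    // non-vectorized readers never look at this key.
    int batchSize() {
      return Integer.parseInt(
          properties.getOrDefault(BATCH_SIZE_PROP, String.valueOf(DEFAULT_BATCH_SIZE)));
    }
  }

  public static void main(String[] args) {
    ReadBuilder builder = new ReadBuilder().set(BATCH_SIZE_PROP, "1024");
    System.out.println(builder.batchSize());
  }
}
```

This keeps the builder API identical for both reader kinds, at the cost of making the batch-size knob less discoverable.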
>>>>> >> >>>>>
>>>>> >> >>>>> Spark configuration: TBH, I'm open to both solutions and happy
>>>>> to move in the direction the community decides on.
>>>>> >> >>>>>
>>>>> >> >>>>> Thanks,
>>>>> >> >>>>> Peter
>>>>> >> >>>>>
>>>>> >> >>>>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (Fri, Mar 14, 2025, 16:31):
>>>>> >> >>>>>>
>>>>> >> >>>>>> Hi Peter
>>>>> >> >>>>>>
>>>>> >> >>>>>> Thanks for the update. I will do a new pass on the PR.
>>>>> >> >>>>>>
>>>>> >> >>>>>> Regards
>>>>> >> >>>>>> JB
>>>>> >> >>>>>>
>>>>> >> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <
>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>> >> >>>>>> >
>>>>> >> >>>>>> > Hi Team,
>>>>> >> >>>>>> > I have rebased the File Format API proposal (
>>>>> https://github.com/apache/iceberg/pull/12298) to include the new
>>>>> changes needed for the Variant types. I would love to hear your feedback,
>>>>> especially Dan and Ryan, as you were the most active during our
>>>>> discussions. If I can help in any way to make the review easier, please 
>>>>> let
>>>>> me know.
>>>>> >> >>>>>> > Thanks,
>>>>> >> >>>>>> > Peter
>>>>> >> >>>>>> >
>>>>> >> >>>>>> > Péter Váry <peter.vary.apa...@gmail.com> wrote (Fri, Feb 28, 2025, 17:50):
>>>>> >> >>>>>> >>
>>>>> >> >>>>>> >> Hi everyone,
>>>>> >> >>>>>> >> Thanks for all of the actionable, relevant feedback on
>>>>> the PR (https://github.com/apache/iceberg/pull/12298).
>>>>> >> >>>>>> >> Updated the code to address most of them. Please check if
>>>>> you agree with the general approach.
>>>>> >> >>>>>> >> If there is a consensus about the general approach, I
>>>>> could separate the PR into smaller pieces so we can review and merge
>>>>> them step by step more easily.
>>>>> >> >>>>>> >> Thanks,
>>>>> >> >>>>>> >> Peter
>>>>> >> >>>>>> >>
>>>>> >> >>>>>> >> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (Thu, Feb 20, 2025, 14:14):
>>>>> >> >>>>>> >>>
>>>>> >> >>>>>> >>> Hi Peter
>>>>> >> >>>>>> >>>
>>>>> >> >>>>>> >>> sorry for the late reply on this.
>>>>> >> >>>>>> >>>
>>>>> >> >>>>>> >>> I did a pass on the proposal; it's very interesting and
>>>>> well written.
>>>>> >> >>>>>> >>> I like the DataFile API, and it's definitely worth
>>>>> discussing all together.
>>>>> >> >>>>>> >>>
>>>>> >> >>>>>> >>> Maybe we can schedule a specific meeting to discuss
>>>>> the DataFile API?
>>>>> >> >>>>>> >>>
>>>>> >> >>>>>> >>> Thoughts ?
>>>>> >> >>>>>> >>>
>>>>> >> >>>>>> >>> Regards
>>>>> >> >>>>>> >>> JB
>>>>> >> >>>>>> >>>
>>>>> >> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <
>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>> >> >>>>>> >>> >
>>>>> >> >>>>>> >>> > Hi Team,
>>>>> >> >>>>>> >>> >
>>>>> >> >>>>>> >>> > As mentioned earlier at our Community Sync, I am
>>>>> exploring the possibility of defining a FileFormat API for accessing
>>>>> different file formats. I have put together a proposal based on my findings.
>>>>> >> >>>>>> >>> >
>>>>> >> >>>>>> >>> > -------------------
>>>>> >> >>>>>> >>> > Iceberg currently supports 3 different file formats:
>>>>> Avro, Parquet, and ORC. With the introduction of the Iceberg V3
>>>>> specification, many new features are added to Iceberg. Some of these
>>>>> features, like new column types and default values, require changes at the
>>>>> file format level. The changes are added by individual developers with
>>>>> different focus on the different file formats. As a result, not all of the
>>>>> features are available for every supported file format.
>>>>> >> >>>>>> >>> > Also, there are emerging file formats like Vortex [1]
>>>>> or Lance [2] which, either by specialization or by applying newer research
>>>>> results, could provide better alternatives for certain use cases like
>>>>> random access to data or storing ML models.
>>>>> >> >>>>>> >>> > -------------------
>>>>> >> >>>>>> >>> >
>>>>> >> >>>>>> >>> > Please check the detailed proposal [3] and the Google
>>>>> document [4], and comment there or reply on the dev list if you have any
>>>>> suggestions.
>>>>> >> >>>>>> >>> >
>>>>> >> >>>>>> >>> > Thanks,
>>>>> >> >>>>>> >>> > Peter
>>>>> >> >>>>>> >>> >
>>>>> >> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex
>>>>> >> >>>>>> >>> > [2] - https://lancedb.github.io/lance/
>>>>> >> >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
>>>>> >> >>>>>> >>> > [4] -
>>>>> https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds
>>>>> >> >>>>>> >>> >
>>>>>
>>>>
