I explored several approaches to fully or partially preserve generic type parameters in WriteBuilder and ReadBuilder without increasing the scope of changes. This would allow users to create a builder with type parameters defined at instantiation (when retrieved from the registry). Once created, the builder could be passed around in the code, and any subsequent method calls could rely on these generics to enforce correct Java types.
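To make the idea concrete, here is a rough sketch of how such a typed builder could be used from the caller's side (a sketch only - the generic signatures here are my assumption for illustration, not the final API):

  // the registry pins the object model type once, at builder creation ...
  WriteBuilder<InternalRow> writeBuilder =
      FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile);
  // ... so later calls are checked at compile time
  FileAppender<InternalRow> appender =
      writeBuilder.schema(table.schema()).build();

  ReadBuilder<InternalRow> readBuilder =
      FormatModelRegistry.readBuilder(PARQUET, InternalRow.class, inputFile);
  CloseableIterable<InternalRow> rows =
      readBuilder.project(table.schema()).build();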
I believe I’ve found a viable approach and would like your feedback on it. I’ve added this as an additional open question in the document: https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds/edit?pli=1&tab=t.0#heading=h.r19pykd7h8ek
Thanks,
Peter

Péter Váry <[email protected]> wrote (on Fri, Oct 3, 2025, 17:02):

> Hi Team,
> I've updated the design document with the outcomes of our recent discussion (https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds).
>
> *PR Updates*
>
> - API: https://github.com/apache/iceberg/pull/12774
> - PoC: Due to a necessary rebase and significant changes, I’ve created a temporary PR: https://github.com/apache/iceberg/pull/14230. If there are no objections, I plan to force-push to the original PR https://github.com/apache/iceberg/pull/12298 so we can continue the discussion there.
>
> *Open Questions*
> I’ve added two questions to the doc for further input:
>
> 1. Should we expose FormatModelRegistry.writeBuilder?
> 2. Should we generalize the use of the new Parquet.ReadBuilder.ReaderFunction and BatchReaderFunction introduced by Russell in https://github.com/apache/iceberg/pull/14040, and apply them to the new API?
>
> You can find more context on these questions in the doc:
>
> - Question 1: https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds/edit?pli=1&tab=t.0#heading=h.fntuqbcl74w1
> - Question 2: https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds/edit?pli=1&tab=t.0#heading=h.g4afybuzp2z8
>
> *Proposed Next Steps*
> Given the large scope of the changes, I suggest we break the work into smaller, manageable phases:
>
> 1. Review the current proposal and PoC to identify any remaining major questions.
> 2. Finalize the API PR after a detailed review.
> 3. Iterate on the API until consensus is reached, then merge.
> 4. Develop the Technology Compatibility Kit (TCK) and the initial Parquet implementation of ReadBuilder/WriteBuilder.
>    1. Implement ReadBuilder/WriteBuilder for Avro.
>    2. Implement ReadBuilder/WriteBuilder for ORC.
> 5. Implement factories using FormatModelRegistry:
>    1. FileFormatFactory (Generic, Flink, Spark)
>    2. FileAppenderFactory (Generic, Flink)
>
> I’m aware there are differing opinions on this, so I’d like to open a discussion around breaking the work into smaller, more manageable parts. Maintaining a PR of this size is time-consuming and likely not the most effective use of time for both reviewers and contributors.
>
> Would love to hear your thoughts on the updated interfaces, the document, and the proposed plan.
> Thanks,
> Peter
>
> Péter Váry <[email protected]> wrote (on Tue, Sep 30, 2025, 14:49):
>
>> *Thanks to everyone who participated in the File Format Sync!*
>> I think we moved forward quite a bit!
>> You can find the recording here: https://www.youtube.com/watch?v=BvGejwOfn6w
>>
>> Topics discussed:
>>
>> - *Exposing Variants*: We agreed to expose only *ReadBuilder.outputSchema* and *WriteBuilder.inputSchema* in the new API. These schemas will guide Variant shredding. We will not allow setting the *VariantShreddingFunction* on the new API. Exposing the engine-specific schema solves another problem too: when a DataFrame contains values stored as *tinyint* or *smallint*, the writer can seamlessly convert and persist them as *integer* columns in the Iceberg table - without requiring any conversion logic from the engine itself.
>> - *Generic vs. Non-Generic Builders*: We weighed the pros and cons of using generic types in ReadBuilder and WriteBuilder. While generics offer type safety, they don’t provide direct benefits since *FormatModelRegistry* must handle all *FormatModel*s and rely on callers to return correct types. Moreover, introducing generics would complicate implementation due to the need for proxy classes. Based on this, we decided *not to use generic parameters* in the builders provided by the *FormatModels*.
>> - *Deprecating Old APIs*: Since we’re not introducing generics, we concluded that the old APIs don’t need to be rewritten and can be reused.
>> - *Method Names*: To maintain consistency, we’ll keep the existing method names and only make changes where necessary.
>> - *Re-registering Format Models*: We need a mechanism to detect and allow re-registration of the same Format Model with the same keys, which may occur when a Flink Job Manager restarts or Spark reuses a JVM.
>> - *ContentFileWriteBuilder Interfaces vs. Classes*: We chose to keep these as interfaces for greater flexibility, especially in scenarios involving inheritance.
>> - *Packaging*: We’ll move the relevant classes into a dedicated package called *format*.
>>
>> I’ll update the documentation and PRs accordingly.
>>
>> If you couldn’t attend the meeting but have concerns about any of the decisions above, please leave a comment so we can revisit them.
>> Thanks,
>> Peter
>>
>> Péter Váry <[email protected]> wrote (on Thu, Sep 25, 2025, 11:46):
>>
>>> Hi Team,
>>>
>>> As discussed at the last Community Sync (with the help of Kevin Liu - Thanks!), we created a recurring *File Format API Sync on every second Monday at 9:00-10:00 AM PST / 6:00-7:00 PM CEST, starting on the 29th of September*. You can join using the following link: https://meet.google.com/fow-bauj-mmg or using the dev calendar.
>>>
>>> Also updated the proposal document: https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds
>>> Added a list of open questions to the end.
>>>
>>> I would be happy to see anyone who is interested, so we can speed up the process and move forward with the proposal.
>>> If you can't make it, or you can't wait :), feel free to ask your questions on the PRs (https://github.com/apache/iceberg/pull/12774, https://github.com/apache/iceberg/pull/12298), on the dev list, or on Slack.
>>>
>>> Thanks,
>>> Peter
>>>
>>> Péter Váry <[email protected]> wrote (on Wed, Sep 24, 2025, 14:35):
>>>
>>>> > I think there’s another option that would be cleaner than option 0. Right now, the same FormatModel is being used for both position deletes and data because they both share the object model type, like InternalRow for Spark. But we’ve decided that we don’t intend to support writing position deletes with row data, so there is no longer a need for PositionDelete<InternalRow> and we can instead use PositionDelete<Void> (or introduce an interface to generalize this).
>>>>
>>>> That’s a great suggestion. I’ve implemented it with a slight variation: I used a PositionDelete<?> instead of PositionDelete<Void> to avoid requiring users to update their code.
>>>> You can see how the proposed changes look in the PRs.
>>>>
>>>> > Remove unnecessary renames, like setAll -> set
>>>>
>>>> Several method name changes were made based on earlier reviewer suggestions.
>>>> At the time, each change seemed justified. That said, I agree it's worth reviewing them. To help with this, I created a spreadsheet that gives an overview of the changes. I've color-coded them for clarity:
>>>>
>>>> - *Green*: unchanged
>>>> - *Light green*: added for all readers (previously missing functionality)
>>>> - *Yellow*: changed based on discussions
>>>> - *Red*: removed
>>>> - *Light yellow*: return type binding might have changed (only for build methods)
>>>>
>>>> You can find the document here: https://docs.google.com/spreadsheets/d/1cBwyrO9-0x-OgquNhfBpkjfF__VRZgKBqujCnlAkXeY/edit?usp=sharing
>>>>
>>>> > Use and extend the existing format builders instead of creating new ones to carry the return type.
>>>>
>>>> I think this deserves a deeper discussion. In the current PR, each builder has two generic parameters:
>>>>
>>>> - *WriteBuilder*
>>>>    - Input type: the type of object accepted (used to parameterize the Appender)
>>>>    - Input schema: the expected input schema used for Variant shredding
>>>> - *ReadBuilder*
>>>>    - Output type: the type of object returned (used to parameterize the CloseableIterable)
>>>>    - Output schema: the expected output schema used for Variant shredding
>>>>
>>>> While we could consider using Object for both types or rely on the caller to set them correctly, both types are actually defined by the FormatModel. If users provide incorrect types, it would lead to runtime exceptions rather than compile-time errors - which I believe is poor API design.
>>>>
>>>> If we stick with generic parameters, we can’t reuse the existing builders. In that case, we have two options (I’m fine with either, but the second makes deprecating the old API easier):
>>>>
>>>> 1. Create a new class that implements the new API and internally uses the old class.
>>>> 2. Create a new class with the new API, copy the old implementation into it with necessary changes, and refactor the old class to delegate to the new one. The current PoC follows this second approach.
>>>>
>>>> I’d love to hear the community’s thoughts on this.
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> Ryan Blue <[email protected]> wrote (on Tue, Sep 23, 2025, 20:28):
>>>>
>>>>> I think there’s another option that would be cleaner than option 0. Right now, the same FormatModel is being used for both position deletes and data because they both share the object model type, like InternalRow for Spark. But we’ve decided that we don’t intend to support writing position deletes with row data, so there is no longer a need for PositionDelete<InternalRow> and we can instead use PositionDelete<Void> (or introduce an interface to generalize this).
>>>>>
>>>>> That means we are either producing FileAppender<InternalRow> or FileAppender<PositionDelete<?>> and we already have a way to do that using the registry. I think it would be better/cleaner to register the same format twice, once for the row type and once for PositionDelete. That would be fairly easy to implement since the registry-based writer factory already resolves <https://github.com/apache/iceberg/pull/12298/files#diff-654036b55ad0af72c786a5c598a47c06400538a461d2db819535cb544a4a5d61R156-R157> the FormatModel every time. That also makes it easier to pass the content type for at least position deletes in the FormatModel that is registered.
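>>>>>
>>>>> As a rough sketch, the double registration could look like the following (ParquetFormatModel, rowWriters, and the register signature are made up for illustration, not the PoC's exact API):
>>>>>
>>>>> // register the same file format twice: once for the engine row type,
>>>>> // once for position deletes
>>>>> FormatModelRegistry.register(new ParquetFormatModel<>(InternalRow.class, rowWriters));
>>>>> FormatModelRegistry.register(new ParquetFormatModel<>(PositionDelete.class, positionDeleteWriters));
>>>>>
>>>>> // lookups resolve by (file format, object model class), so each content type
>>>>> // gets its own appender: FileAppender<InternalRow> vs FileAppender<PositionDelete<?>>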
>>>>>
>>>>> Unfortunately, I’m traveling tomorrow and can’t make it to the sync. But I did another pass at reviewing this today. In addition to updating the position deletes, I think there are some other changes that would help move this forward:
>>>>>
>>>>> 1. Remove unnecessary renames, like setAll -> set
>>>>> 2. Use and extend the existing format builders instead of creating new ones to carry the return type. I don’t think we need to pass the row type through the builder since the current way has been working just fine for years.
>>>>>
>>>>> In general, I think this should avoid changing quite so much. This is already a lot to review and additional changes make it harder.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Mon, Sep 15, 2025 at 3:58 PM Steven Wu <[email protected]> wrote:
>>>>>
>>>>>> Peter, thanks for summarizing the 4 options. Both 0 and 1 seem good to me, as they are explicit and easier to deprecate and remove the position deletes in the future. Maybe option 0 is a tiny bit better as it is similar to the existing FileWriterFactory API.
>>>>>>
>>>>>> I will leave PR related comments in the PR directly.
>>>>>>
>>>>>> On Mon, Sep 15, 2025 at 8:38 AM Péter Váry <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks for the feedback @Russell and @Renjie!
>>>>>>>
>>>>>>> Updated the PR accordingly.
>>>>>>> Also removed the possibility to set the row schema for the position delete writer. We will not need that after the PDWR deprecation.
>>>>>>>
>>>>>>> You can see one possible implementation in https://github.com/apache/iceberg/pull/12298 - we can discuss that separately. I just made sure that the new API is able to serve all of the current needs.
>>>>>>>
>>>>>>> @Ryan: What are your thoughts?
>>>>>>>
>>>>>>> Are we at a stage where we can vote on the current API?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Peter
>>>>>>>
>>>>>>> Renjie Liu <[email protected]> wrote (on Mon, Sep 15, 2025, 12:08):
>>>>>>>
>>>>>>>> I would also vote for option 0. This API has a clean separation and makes refactoring easier; e.g. when we completely deprecate v2 tables, we could mark the *positionDeleteWriteBuilder* method as deprecated, and it would be easier to remove its usage.
>>>>>>>>
>>>>>>>> On Fri, Sep 12, 2025 at 11:24 PM Russell Spitzer <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Now that I fully understand the situation, I think option 0 as you've written it is probably the best thing to do, as long as PositionDelete is a class. I think with hindsight it probably shouldn't have been a class and always been an interface, so that our internal code could produce rows which implement PositionDelete rather than PositionDeletes that wrap rows.
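>>>>>>>>>
>>>>>>>>> Roughly what I mean, as an illustrative sketch only (today PositionDelete is a concrete class in Iceberg):
>>>>>>>>>
>>>>>>>>> // if PositionDelete were an interface ...
>>>>>>>>> interface PositionDelete<R> {
>>>>>>>>>   CharSequence path();
>>>>>>>>>   long pos();
>>>>>>>>>   R row();
>>>>>>>>> }
>>>>>>>>> // ... an engine-specific row type could implement it directly,
>>>>>>>>> // instead of being wrapped by a PositionDelete instance.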
>>>>>>>>>
>>>>>>>>> On Fri, Sep 12, 2025 at 8:02 AM Péter Váry <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Let me summarize the state a bit:
>>>>>>>>>>
>>>>>>>>>> The FileFormat interface needs to expose two distinct methods:
>>>>>>>>>>
>>>>>>>>>> - WriteBuilder<InternalRow>
>>>>>>>>>> - WriteBuilder<PositionDelete<InternalRow>>
>>>>>>>>>>    - After the PDWR deprecation this will be WriteBuilder<PositionDelete>
>>>>>>>>>>    - After the V2 deprecation this will not be needed anymore
>>>>>>>>>>
>>>>>>>>>> Based on the file format methods, the Registry must support four builder types:
>>>>>>>>>>
>>>>>>>>>> - WriteBuilder<InternalRow>
>>>>>>>>>> - DataWriteBuilder<InternalRow>
>>>>>>>>>> - EqualityDeleteWriteBuilder<InternalRow>
>>>>>>>>>> - PositionDeleteWriteBuilder<InternalRow>
>>>>>>>>>>
>>>>>>>>>> *API Design Considerations*
>>>>>>>>>> There is an argument that the two WriteBuilder methods provided by FileFormat are essentially the same, differing only in the writerFunction. While this is technically correct for current implementations, I believe the API should clearly distinguish between the two writer types to highlight the differences.
>>>>>>>>>>
>>>>>>>>>> *Discussed Approaches*
>>>>>>>>>>
>>>>>>>>>> *0. Two Explicit Methods on FormatModel* (removed based on previous comments, but I personally still prefer this)
>>>>>>>>>>
>>>>>>>>>> WriteBuilder<InternalRow> writeBuilder(OutputFile outputFile);
>>>>>>>>>> WriteBuilder<PositionDelete<InternalRow>> positionDeleteWriteBuilder(OutputFile outputFile);
>>>>>>>>>>
>>>>>>>>>> Pros: Clear separation of responsibilities
>>>>>>>>>>
>>>>>>>>>> *1. One Builder + One Converter*
>>>>>>>>>>
>>>>>>>>>> WriteBuilder<InternalRow> writeBuilder(OutputFile outputFile);
>>>>>>>>>> Function<PositionDelete<D>, D> positionDeleteConverter(Schema schema);
>>>>>>>>>>
>>>>>>>>>> Pros: Keeps the interface compact
>>>>>>>>>> Cons: Requires additional documentation and an understanding of why the conversion logic is needed
>>>>>>>>>>
>>>>>>>>>> *2. Single Method with Javadoc Clarification* (most similar to the current approach)
>>>>>>>>>>
>>>>>>>>>> WriteBuilder writeBuilder(OutputFile outputFile);
>>>>>>>>>>
>>>>>>>>>> Pros: Minimalistic
>>>>>>>>>> Cons: Least explicit; relies entirely on documentation
>>>>>>>>>>
>>>>>>>>>> *2/b. Single Builder with Type Parameter* (based on Russell's suggestion)
>>>>>>>>>>
>>>>>>>>>> WriteBuilder writeBuilder(OutputFile outputFile);
>>>>>>>>>> // Usage: builder.build(Class<D> inputType)
>>>>>>>>>>
>>>>>>>>>> Pros: Flexible
>>>>>>>>>> Cons: Relies on documentation to clarify the available input types
>>>>>>>>>>
>>>>>>>>>> *Bonus*
>>>>>>>>>> Options 0 and 1 make it easier to phase out PositionDelete filtering once V2 tables are deprecated.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>> Péter Váry <[email protected]> wrote (on Thu, Sep 11, 2025, 18:36):
>>>>>>>>>>
>>>>>>>>>>> > Wouldn't PositionDelete<InternalRow> also be an InternalRow in this example? I think that's what I'm confused about.
>>>>>>>>>>>
>>>>>>>>>>> With the *second approach*, the WriteBuilder doesn’t need to handle PositionDelete objects directly. The conversion layer takes care of that, so the WriteBuilder only needs to work with InternalRow.
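>>>>>>>>>>>
>>>>>>>>>>> For illustration, such a conversion function could look roughly like this on the Spark side (a sketch only; it assumes a three-field GenericInternalRow holding file path, position, and row):
>>>>>>>>>>>
>>>>>>>>>>> Function<PositionDelete<InternalRow>, InternalRow> converter = delete -> {
>>>>>>>>>>>   GenericInternalRow deleteRecord = new GenericInternalRow(3);
>>>>>>>>>>>   deleteRecord.update(0, UTF8String.fromString(delete.path().toString()));
>>>>>>>>>>>   deleteRecord.update(1, delete.pos());
>>>>>>>>>>>   deleteRecord.update(2, delete.row());
>>>>>>>>>>>   return deleteRecord;
>>>>>>>>>>> };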
>>>>>>>>>>>
>>>>>>>>>>> With the *first approach*, we shift that responsibility to the WriteBuilder, which then has to support both InternalRow and PositionDelete<InternalRow>.
>>>>>>>>>>>
>>>>>>>>>>> In both cases, the FormatModelRegistry API will still expose the more concrete types (PositionDelete / InternalRow). However, under the *second approach*, the lower-level API only needs to handle InternalRow, simplifying its interface.
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Peter
>>>>>>>>>>>
>>>>>>>>>>> Russell Spitzer <[email protected]> wrote (on Thu, Sep 11, 2025, 17:12):
>>>>>>>>>>>
>>>>>>>>>>>> Wouldn't PositionDelete<InternalRow> also be an InternalRow in this example? I think that's what I'm confused about.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 11, 2025 at 5:35 AM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, Russell, for taking a look at this!
>>>>>>>>>>>>>
>>>>>>>>>>>>> We need to expose four methods on the user-facing API (FormatModelRegistry):
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. *writeBuilder* – for writing arbitrary files without Iceberg metadata. In the Iceberg codebase, this is exposed via FlinkAppenderFactory and the GenericAppenderFactory for creating FileAppender<RowData> and FileAppender<Record> only.
>>>>>>>>>>>>> 2. *dataWriteBuilder* – for creating and collecting metadata for Iceberg DataFiles.
>>>>>>>>>>>>> 3. *equalityDeleteWriteBuilder* – for creating and collecting metadata for Iceberg EqualityDeleteFiles.
>>>>>>>>>>>>> 4. *positionDeleteWriteBuilder* – for creating and collecting metadata for Iceberg PositionDeleteFiles.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We’d like to implement all four using a single WriteBuilder created by the FormatModels.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Your suggestion is a good one - it helps formalize the requirements for the build method and also surfaces an important design question:
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Who should be responsible for handling the differences between normal rows (InternalRow) and position deletes (PositionDelete<InternalRow>)*?
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Should we have a more complex WriteBuilder class that can create both DataFileAppender and PositionDeleteAppender?
>>>>>>>>>>>>> - Or should we push this responsibility to the engine-specific code, where we already have some logic (e.g., pathTransformFunc) needed by each engine to create the PositionDeleteAppender?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>> Russell Spitzer <[email protected]> wrote (on Thu, Sep 11, 2025, 0:11):
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm a little confused here; I think Ryan mentioned this in the comment here: https://github.com/apache/iceberg/pull/12774/files#r2254967177
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> From my understanding there are two options?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) We either are producing FormatModels that take a generic row type D and produce writers that all take D and write files.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2) We are creating IcebergModel specific writers that take DataFile, PositionDeleteFile, EqualityDeleteFile etc ...
>>>>>>>>>>>>>> and write files.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The PositionDelete Converter issue seems to stem from attempting to do both model 1 (being very generic) and 2, wanting special code to deal with PositionDeleteFile<R> objects.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It looks like the code in #12774 is mostly doing model 1, but we are trying to add in a specific converter for 2?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maybe I'm totally lost here, but I was assuming we would do something a little scala-y like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> public <T> FileAppender<T> build(Class<T> type) {
>>>>>>>>>>>>>>   if (type == DataFile.class) return (FileAppender<T>) new DataFileAppender();
>>>>>>>>>>>>>>   if (type == DeleteFile.class) return (FileAppender<T>) new DeleteFileAppender();
>>>>>>>>>>>>>>   // ...
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So that we only register a single signature, and if a writer specific implementation needs to do something special it can? I'm trying to catch back up to speed on this PR so it may help to do a quick summary of the current state and intent. (At least for me)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Sep 9, 2025 at 3:42 AM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Renjie,
>>>>>>>>>>>>>>> Thanks for taking a look!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Let me clarify a few points:
>>>>>>>>>>>>>>> - The converter API is only required for writing position delete files for V2 tables.
>>>>>>>>>>>>>>> - Currently, there are no plans to support vectorized writing via the Java API.
>>>>>>>>>>>>>>> - Even if we decide to support vectorized writes, I don't think we would like to implement it for Positional Deletes, which are deprecated in the new spec.
>>>>>>>>>>>>>>> - Also, once the positional deletes - which contain the deleted rows - are deprecated (as planned), the conversion of the Position Deletes with only file name and position would be trivial, even for the vectorized writes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So from my perspective, the converter method exists purely for backward compatibility, and we intend to remove it as soon as possible. Sacrificing good practices for the sake of a deprecated feature doesn’t seem worthwhile to me.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Renjie Liu <[email protected]> wrote (on Mon, Sep 8, 2025, 12:34):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi, Peter:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would vote for the first approach. In spite of the compromises described, the API is still cleaner. Also, I think there are some problems with the converter API. For example, for vectorized implementations such as Comet, which accept columnar batches rather than rows, the converter method would make things more complicated.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Aug 30, 2025 at 2:49 PM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I’ve initiated a discussion thread regarding the deprecation of Position Deletes containing row data. You can follow it here: https://lists.apache.org/thread/8jw6pb2vq3ghmdqf1yvy8n5n6gg1fq5s
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We can proceed with the discussion about the native reader/writer deprecation once we have decided on the final API, as the chosen design may influence our approach.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since then, one more question has come up - hopefully the last:
>>>>>>>>>>>>>>>>> *How should we handle Position Delete Writers?*
>>>>>>>>>>>>>>>>> The File Format API should return builders for either rows or PositionDelete objects. Currently the method `WriteBuilder.createWriterFunc(Function<MessageType, ParquetValueWriter<?>>)` defines the accepted input parameters for the writer. Users are responsible for ensuring that the writer function and the return type of `WriteBuilder.build()` are compatible. In the new API, we no longer expose writer functions. We still expose FileContent, since writer configurations vary by content type, but we don’t expose the types.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There are two proposals for handling types for the WriteBuilders:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. *Implicit Type Definition via FileContent* - the builder parameter for FileContent would implicitly define the input type for the writer returned by build(), or
>>>>>>>>>>>>>>>>> 2. *Engine level conversion* - Engines would convert PositionDelete objects to their native types.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In code:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - In the 1st proposal, the FormatModel.writeBuilder(OutputFile outputFile) can return anything:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> WriteBuilder builder = FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile);
>>>>>>>>>>>>>>>>> FileAppender<InternalRow> appender = builder
>>>>>>>>>>>>>>>>>     .schema(table.schema())
>>>>>>>>>>>>>>>>>     .content(FileContent.DATA)
>>>>>>>>>>>>>>>>>     // ...
>>>>>>>>>>>>>>>>>     .build();
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> // Exposed, but FormatModelRegistry.positionDeleteWriteBuilder should be used instead
>>>>>>>>>>>>>>>>> WriteBuilder builder = FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile);
>>>>>>>>>>>>>>>>> FileAppender<PositionDelete<InternalRow>> appender = builder
>>>>>>>>>>>>>>>>>     .schema(table.schema())
>>>>>>>>>>>>>>>>>     .content(FileContent.POSITION_DELETES)
>>>>>>>>>>>>>>>>>     // ...
>>>>>>>>>>>>>>>>>     .build();
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - In the 2nd proposal, the FormatModel needs another method:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Function<PositionDelete<D>, D> positionDeleteConverter(Schema schema);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> example implementation:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> return delete -> {
>>>>>>>>>>>>>>>>>   deleteRecord.update(0, UTF8String.fromString(delete.path().toString()));
>>>>>>>>>>>>>>>>>   deleteRecord.update(1, delete.pos());
>>>>>>>>>>>>>>>>>   deleteRecord.update(2, delete.row());
>>>>>>>>>>>>>>>>>   return deleteRecord;
>>>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> // Content is only used for writer property configuration
>>>>>>>>>>>>>>>>> WriteBuilder<InternalRow> builder = sparkFormatModel.writeBuilder(outputFile);
>>>>>>>>>>>>>>>>> FileAppender<InternalRow> appender = builder
>>>>>>>>>>>>>>>>>     .schema(table.schema())
>>>>>>>>>>>>>>>>>     .content(FileContent.DATA)
>>>>>>>>>>>>>>>>>     // ...
>>>>>>>>>>>>>>>>>     .build();
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Drawbacks
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Proposal 1:
>>>>>>>>>>>>>>>>>    - Type checking for the FileAppenders occurs only at runtime, so user errors surface late.
>>>>>>>>>>>>>>>>>    - The File Format specification must clearly specify which builder type corresponds to which file content parameter - generics would offer better clarity.
>>>>>>>>>>>>>>>>>    - Inconsistent patterns between WriteBuilder and ReadBuilder, as the latter can define output types via generics.
>>>>>>>>>>>>>>>>> - Proposal 2:
>>>>>>>>>>>>>>>>>    - Requires FormatModels to implement a converter method to transform PositionDelete<InternalRow> into InternalRow.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since we deprecated writing position delete files in the V3 spec, this extra method in the 2nd proposal will be deprecated too. As a result, in the long run, we will have a nice, clean API. OTOH, if we accept the compromise described in the 1st proposal, the results of our decision will remain, even when the functions are removed.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Looking forward to your thoughts.
>>>>>>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Aug 14, 2025, 14:12 Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> During yesterday’s community sync, we discussed the current state of the File Format API proposal and identified two key questions that require input from the broader community:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *1. Dropping support for Position Delete files with Row Data*
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The current Iceberg V2 spec [1] defines two types of position delete files:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Files that store only the file name and row position.
>>>>>>>>>>>>>>>>>> - Files that also store the deleted row data.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Although this feature is defined in the spec and some tests exist in the Iceberg codebase, we’re not aware of any actual implementation using the second type (with row data). Supporting V2 table writing via the new File Format API would be simpler if we dropped support for this feature.
>>>>>>>>>>>>>>>>>> If you know of any use case or reason to retain support for position deletes with row data, please let us know.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> *2. Deprecating Native File Format Readers/Writers in the API*
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The current API contains format-specific readers/writers for Parquet, Avro, and ORC. With the introduction of the InternalData and File Format APIs, Iceberg users can now write files using:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - InternalData API for metadata files (manifest, manifest list, partition stats).
>>>>>>>>>>>>>>>>>> - File Format API for data and delete files.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I propose we deprecate the original format-specific writers and guide users to use the new APIs based on the target file type. If you’re aware of any use cases that still require the original format-specific writers, please share them.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [1] - Position Delete File Spec: https://iceberg.apache.org/spec/?h=delete#position-delete-files
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Péter Váry <[email protected]> wrote (on Tue, Jul 22, 2025, 16:09):
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I also put together a solution where the engine-specific format transformation is separated from the writer, and the engines need to take care of it separately.
>>>>>>>>>>>>>>>>>>> This is somewhat complicated on the implementation side (see: [RowDataTransformer](https://github.com/apache/iceberg/pull/12298/files#diff-562fa4cc369c908a157f59a9235fd3f389096451e7901686fba37c87b53dee08) and [InternalRowTransformer](https://github.com/apache/iceberg/pull/12298/files#diff-546f9dc30e3207d1d2bc0a2722976b55f5a04dcf85a22855e4f400500c317140)), but it simplifies the API.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> @rdblue: Please check the proposed solution. I think this is what you have suggested.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Péter Váry <[email protected]> wrote (on Mon, Jun 30, 2025, 18:42):
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> During the PR review [1], we began exploring what we could use as an intermediate layer to reduce the need for engines and file formats to implement the full matrix of file format / object model conversions.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> To support this discussion, I’ve created and run a set of performance benchmarks and compiled a document outlining the potential benefits and trade-offs [2].
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Feedback is welcome; feel free to comment on the document, the PR, or directly in this thread.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [1] - PR discussion - https://github.com/apache/iceberg/pull/12774#discussion_r2093626096
>>>>>>>>>>>>>>>>>>>> [2] - File Format and engine object model transformation performance - https://docs.google.com/document/d/1GdA8IowKMtS3QVdm8s-0X-ZRYetcHv2bhQ9mrSd3fd4
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Péter Váry <[email protected]> wrote (on Wed, May 7, 2025, 13:15):
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>>>> The proposed API part is reviewed and ready to go. See: https://github.com/apache/iceberg/pull/12774
>>>>>>>>>>>>>>>>>>>>> Thanks to everyone who reviewed it already!
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Many of you wanted to review, but I know that the time constraints are there for everyone. I still very much would like to hear your voices, so I will not merge the PR this week. Please review it if you can.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Péter Váry <[email protected]> wrote (on Wed, Apr 16, 2025, 7:02):
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Renjie,
>>>>>>>>>>>>>>>>>>>>>> The first one for the proposed new API is here: https://github.com/apache/iceberg/pull/12774
>>>>>>>>>>>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, Apr 16, 2025, 05:40 Renjie Liu <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi, Peter:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks for the effort. I totally agree with splitting them into smaller PRs to move forward.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I'm quite interested in this topic, so please ping me in those split PRs and I'll help to review.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Mon, Apr 14, 2025 at 11:22 PM Jean-Baptiste Onofré <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Peter
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Awesome! Thank you so much!
>>>>>>>>>>>>>>>>>>>>>>>> I will do a new pass.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>>>> JB
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>> > Hi JB,
>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>> > Separated out the proposed interfaces to a new PR: https://github.com/apache/iceberg/pull/12774.
>>>>>>>>>>>>>>>>>>>>>>>> > Reviewers can check that out if they are only interested in what the new API would look like.
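>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>> > As a rough sketch, the shape is roughly the following (signatures are illustrative only; the PR is authoritative):
>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>> > interface ObjectModel<D> {
>>>>>>>>>>>>>>>>>>>>>>>> >   // readers and writers for one (file format, object model) pair
>>>>>>>>>>>>>>>>>>>>>>>> >   ReadBuilder<D> readBuilder(InputFile inputFile);
>>>>>>>>>>>>>>>>>>>>>>>> >   AppenderBuilder<D> appenderBuilder(OutputFile outputFile);
>>>>>>>>>>>>>>>>>>>>>>>> > }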
>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> > Peter
>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>> > Jean-Baptiste Onofré <[email protected]> wrote (on Thu, Apr 10, 2025, 18:25):
>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> Hi Peter
>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> Thanks for the ping about the PR.
>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> Maybe, to facilitate the review and move forward faster, we should split the PR into smaller PRs:
>>>>>>>>>>>>>>>>>>>>>>>> >> - one with the interfaces (ReadBuilder, AppenderBuilder, ObjectModel, AppenderBuilder, DataWriterBuilder, ...)
>>>>>>>>>>>>>>>>>>>>>>>> >> - one for each file provider (Parquet, Avro, ORC)
>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> Thoughts? I can help with the split if needed.
>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> Regards
>>>>>>>>>>>>>>>>>>>>>>>> >> JB
>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>>>>>> >> > Since the 1.9.0 release candidate has been created, I would like to resurrect this PR: https://github.com/apache/iceberg/pull/12298 to ensure that we have as long a testing period as possible for it.
>>>>>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>>>>>> >> > To recap, here is what the PR does after the review rounds:
>>>>>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>>>>>> >> > Created 3 interface classes which are implemented by the file formats:
>>>>>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>>>>>> >> > ReadBuilder - Builder for reading data from data files
>>>>>>>>>>>>>>>>>>>>>>>> >> > AppenderBuilder - Builder for writing data to data files
>>>>>>>>>>>>>>>>>>>>>>>> >> > ObjectModel - Provides ReadBuilders and AppenderBuilders for the specific data file format and object model pair
>>>>>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>>>>>> >> > Updated the Parquet, Avro, ORC implementations for these interfaces, and deprecated the old reader/writer APIs
>>>>>>>>>>>>>>>>>>>>>>>> >> > Created interface classes which will be used by the actual readers/writers of the data files:
>>>>>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>>>>>> >> > AppenderBuilder - Builder for writing a file
>>>>>>>>>>>>>>>>>>>>>>>> >> > DataWriterBuilder - Builder for generating a data file
>>>>>>>>>>>>>>>>>>>>>>>> >> > PositionDeleteWriterBuilder - Builder for generating a position delete file
>>>>>>>>>>>>>>>>>>>>>>>> >> > EqualityDeleteWriterBuilder - Builder for generating an equality delete file
>>>>>>>>>>>>>>>>>>>>>>>> >> > No ReadBuilder here - the file format reader builder is reused
>>>>>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>>>>>> >> > Created a WriterBuilder class which implements the interfaces above
>>>>>>>>>>>>>>>>>>>>>>>> >> > (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) based on a provided file format specific AppenderBuilder
>>>>>>>>>>>>>>>>>>>>>>>> >> > Created an ObjectModelRegistry which stores the available ObjectModels, and from which engines and users can request the readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder)
>>>>>>>>>>>>>>>>>>>>>>>> >> > Created the appropriate ObjectModels:
>>>>>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>>>>>> >> > GenericObjectModels - for reading and writing Iceberg Records
>>>>>>>>>>>>>>>>>>>>>>>> >> > SparkObjectModels - for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objects
>>>>>>>>>>>>>>>>>>>>>>>> >> > FlinkObjectModels - for reading and writing Flink RowData objects
>>>>>>>>>>>>>>>>>>>>>>>> >> > An Arrow object model is also registered for vectorized reads of Parquet files into Arrow ColumnarBatch objects
>>>>>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>>>>>> >> > Updated the production code where the reading and writing happens to use the ObjectModelRegistry and the new reader/writer interfaces to access data files
>>>>>>>>>>>>>>>>>>>>>>>> >> > Kept the testing code intact to ensure that the new API/code is not breaking anything
>>>>>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>>>>>> >> > The original change was not small, and grew substantially during the review rounds. So if you have questions, or I can do anything to make the review easier, don't hesitate to ask. I am happy to do anything to move this forward.
>>>>>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>>>>>> >> > Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> >> > Peter
>>>>>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>>>>>> >> > Péter Váry <[email protected]> wrote (on Wed, Mar 26, 2025, 14:54):
>>>>>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> >> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> >> I have updated the File Format API PR (https://github.com/apache/iceberg/pull/12298) based on the answers and review comments.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> >> I would like to merge this only after the 1.9.0 release, so we have more time to find and solve any issues before this goes into a release for the users.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> >> For this I have updated the deprecation comments accordingly.
>>>>>>>>>>>>>>>>>>>>>>>> >> >> I would like to ask you to review the PR, so we can iron out any requested changes and be ready to merge as soon as possible after the 1.9.0 release.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> >> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> >> >> Peter
>>>>>>>>>>>>>>>>>>>>>>>> >> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> >> Péter Váry <[email protected]> wrote (on Fri, Mar 21, 2025, 14:32):
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>> Hi Renjie,
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>> > 1. File format filters
>>>>>>>>>>>>>>>>>>>>>>>> >> >>> >
>>>>>>>>>>>>>>>>>>>>>>>> >> >>> > Do the filters include filter expressions from both the user query and the delete filter?
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>> The current discussion is about the filters from the user query.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>> About the delete filter:
>>>>>>>>>>>>>>>>>>>>>>>> >> >>> Based on the suggestions on the PR, I have moved the delete filter out from the main API. Created a `SupportsDeleteFilter` interface for it, which would allow pushing the filter down to Parquet vectorized readers in Spark, as this is the only place where we currently implement this feature.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>> Renjie Liu <[email protected]> wrote (on Fri, Mar 21, 2025, 14:11):
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>> Hi, Peter:
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>> Thanks for the effort on this.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>> 1. File format filters
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>> Do the filters include filter expressions from both the user query and the delete filter?
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>> For filters from the user query, I agree with you that we should keep the current behavior.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>> For delete filters associated with data files, at first I thought file format readers should not care about this. But now I realize that maybe we need to also push them to the file reader; this is useful when the `IS_DELETED` metadata column is not necessary and we could use these filters (position deletes, etc.) to further prune data.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>> But anyway, I agree that we could postpone it to a follow-up PR.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>> 2. Batch size configuration
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>> I'm leaning toward option 2.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>> 3. Spark configuration
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>> I'm leaning towards using different configuration objects.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> Hi Team,
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> Thanks everyone for the reviews on https://github.com/apache/iceberg/pull/12298!
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> I have addressed most of the comments, but a few questions still remain which might merit a wider audience:
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> We should decide on the expected filtering behavior when the filters are pushed down to the readers. Currently the filters are applied as best effort by the file format readers. Some readers (Avro) just skip them altogether. There was a suggestion on the PR that we might enforce stricter requirements: the readers either reject part of the filters, or they apply them fully.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> Batch sizes are currently parameters for the reader builders, which could be set for non-vectorized readers too, which could be confusing.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> Currently the Spark batch reader uses different configuration objects for ParquetBatchReadConf and OrcBatchReadConf, as requested by the reviewers of the Comet PR. There was a suggestion on the current PR to use a common configuration instead.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> I would be interested in hearing your thoughts about these topics.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> My current take:
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> File format filters: I am leaning towards keeping the current lenient behavior. Especially since Bloom filters are not able to do a full filtering, and are often used as a way to filter out unwanted records. Another option would be to implement a secondary filtering inside the file formats themselves, which I think would cause extra complexity and possible code duplication. Whatever the decision here, I would suggest moving this out to a follow-up PR, as the current changeset is big enough as it is.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> Batch size configuration: Currently this is the only property which differs between the batch readers and the non-vectorized readers.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> I see 3 possible solutions:
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> Create different builders for vectorized and non-vectorized reads - I don't think the current solution is confusing enough to be worth the extra class.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> We could put this into the reader configuration property set - this could work, but it would "hide" a configuration option which is valid for both Parquet and ORC readers.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> We could keep things as they are now - I would choose this one, but I don't have a strong opinion here.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> Spark configuration: TBH, I'm open to both solutions and happy to move in the direction the community decides on.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> Peter
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>> Jean-Baptiste Onofré <[email protected]> wrote (on Fri, Mar 14, 2025, 16:31):
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> Hi Peter
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> Thanks for the update. I will do a new pass on the PR.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> JB
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> > Hi Team,
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> > I have rebased the File Format API proposal (https://github.com/apache/iceberg/pull/12298) to include the new changes needed for the Variant types. I would love to hear your feedback, especially Dan and Ryan, as you were the most active during our discussions. If I can help in any way to make the review easier, please let me know.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> > Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> > Peter
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> > Péter Váry <[email protected]> wrote (on Fri, Feb 28, 2025, 17:50):
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >> Thanks for all of the actionable, relevant feedback on the PR (https://github.com/apache/iceberg/pull/12298).
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >> Updated the code to address most of them. Please check if you agree with the general approach.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >> If there is a consensus about the general approach, I could
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >> separate out the PR into smaller pieces, so we have an easier time reviewing and merging those step-by-step.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >> Peter
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >> Jean-Baptiste Onofré <[email protected]> wrote (on Thu, Feb 20, 2025, 14:14):
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> Hi Peter
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> Sorry for the late reply on this.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> I did a pass on the proposal; it's very interesting and well written.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> I like the DataFile API, and it is definitely worth discussing all together.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> Maybe we can schedule a specific meeting to discuss the DataFile API?
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> Thoughts?
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> Regards
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> JB
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Hi Team,
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > As mentioned earlier at our Community Sync, I am exploring the possibility of defining a FileFormat API for accessing different file formats. I have put together a proposal based on my findings.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > -------------------
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Iceberg currently supports 3 different file formats: Avro, Parquet, ORC. With the introduction of the Iceberg V3 specification, many new features are being added to Iceberg. Some of these features, like new column types and default values, require changes at the file format level. The changes are added by individual developers with different focus on the different file formats. As a result, not all of the features are available for every supported file format.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Also, there are emerging file formats like Vortex [1] or Lance [2] which, either by specialization or by applying newer research results, could provide better alternatives for certain use-cases like random access for data, or storing ML models.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > -------------------
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Please check the detailed proposal [3] and the Google document [4], and comment there or reply on the dev list if you have any suggestions.
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Peter
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [2] - https://lancedb.github.io/lance/
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
>>>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [4] - https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds
