Re: [DISCUSS] Deprecate file_path field in column chunk

Micah Kornfield Fri, 19 Dec 2025 10:33:41 -0800

I also posted a PR updating the implementation status page to reflect that
it's not really file_path which is supported but _metadata files (
https://github.com/apache/parquet-site/pull/145).  I believe only
hyparquet might have support for actually reading external columns.


Thanks,
Micah

On Wed, Dec 10, 2025 at 11:49 PM Micah Kornfield <[email protected]>
wrote:

> Based on a conversation in the sync today, we thought not explicitly
> deprecating the field but providing guidance and documenting that new uses
> for the field should go through a feature addition process is probably a
> good path forward.
>
> I put up https://github.com/apache/parquet-format/pull/542 for a
> straw-man to capture this.
>
> On Tue, Dec 9, 2025 at 6:33 PM Julien Le Dem <[email protected]> wrote:
>
>> IMO Iceberg needs to be aware of Parquet files referencing others so that
>> it can prune older snapshots correctly and not delete parquet files
>> referenced by others when deleting old snapshots. Depending if this is a
>> cross table or within table file to file reference could make it more or
>> less complicated.
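
The snapshot-pruning concern above can be sketched in a few lines. This is only an illustration under stated assumptions: the `references` mapping (each Parquet file to the files its column chunks point at via file_path) and the `deletable_files` helper are hypothetical and are not part of Iceberg or Parquet. It shows that a file reachable from a live file has to be kept even when no live snapshot lists it directly.

```python
from typing import Dict, Set


def deletable_files(
    expired_files: Set[str],
    live_files: Set[str],
    references: Dict[str, Set[str]],
) -> Set[str]:
    """Return the expired files that no live file depends on.

    `references` is a hypothetical mapping: Parquet file -> files that its
    column chunks reference through file_path.
    """
    keep = set(live_files)
    stack = list(live_files)
    # Follow file-to-file references transitively from every live file.
    while stack:
        current = stack.pop()
        for target in references.get(current, set()):
            if target not in keep:
                keep.add(target)
                stack.append(target)
    return expired_files - keep
```

With `annotated.parquet` live and referencing `source.parquet`, expiring the snapshot that wrote `source.parquet` must not delete it; only files outside the reachable set are returned.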
>>
>> I could imagine starting with a simple implementation for the write path:
>> "Create table foo using parquet options (reference_original_columns = true)
>> as Select content, extract(content) as metadata from bar"
>> That would be constrained to simple plans that have the scan and the output
>> in the same step (map only) so that rows are in the same order per file.
>> Alternatively "alter table foo add column metadata OPTIONS (column_family
>> = 'bar')" with a subsequent "update table set metadata=extract(content)" to
>> create those files.
>>
>> (just some random thoughts, I'm sure others have spent more time thinking
>> about this)
>>
>> This doesn't seem that different from the mechanism creating a deletion
>> vector in Iceberg.
>>
>> It could also be seen as a view in iceberg joining on _row_id.
>>
>> This can be a topic in the meeting tomorrow.
>>
>> On Mon, Dec 8, 2025 at 9:02 AM Daniel Weeks <[email protected]> wrote:
>>
>> > Thanks for the context Kenny.  That example is very similar to some of the
>> > cases that come up in the multi-modal scenarios.
>> >
>> > I agree that we're in a little bit of a difficult situation due to lack of
>> > existing support, which also leads to Micah's concern that it's a point of
>> > confusion for implementers.
>> >
>> > I would be in favor of adding some additional context to the description
>> > because there are some basic things implementers should do (e.g. validate
>> > that the file path is either not set or set to the current file being read
>> > if they don't support disaggregated column data).  While older clients will
>> > likely break if they encounter files written this way, there's almost no
>> > risk that it would result in silent failures or corruption as I suspect
>> > most implementations will read the ranges from the referencing file and not
>> > be able to interpret it.
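
The validation suggested above could look roughly like the following. This is a minimal sketch, not any implementation's actual API: `ColumnChunkMeta`, `ExternalColumnChunkError`, and `check_file_paths` are hypothetical names, and comparing normalized path strings is an assumption about how "the current file being read" would be identified.

```python
import posixpath
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnChunkMeta:
    """Hypothetical stand-in for a column chunk entry in the footer."""
    column_name: str
    file_path: Optional[str] = None  # unset: data lives in the current file


class ExternalColumnChunkError(ValueError):
    """Raised when a chunk points at a file this reader cannot follow."""


def check_file_paths(current_file: str, chunks: List[ColumnChunkMeta]) -> None:
    """Fail loudly instead of reading byte ranges from the wrong file."""
    current = posixpath.normpath(current_file)
    for chunk in chunks:
        if chunk.file_path is None:
            continue  # not set: column data is in the file being read
        if posixpath.normpath(chunk.file_path) == current:
            continue  # explicitly set to the file being read
        raise ExternalColumnChunkError(
            f"column {chunk.column_name!r} references {chunk.file_path!r}; "
            "external column data is not supported by this reader"
        )
```

A reader without support for disaggregated column data would run a check like this once after parsing the footer, before issuing any reads.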
>> >
>> > Adding a read path is relatively straightforward (at least in the Java
>> > implementation for both stream and vectored IO reads), but the write path
>> > is where things get more complicated.
>> >
>> > I think we want to discuss some of these use cases in more detail and see
>> > if they are practical and reasonable.  Some cases may make more sense at a
>> > higher level (like table metadata) while others may make sense to handle at
>> > the file level (like asymmetric column sizes).
>> >
>> > -Dan
>> >
>> >
>> >
>> > On Sun, Dec 7, 2025 at 12:35 PM Kenny Daniel <[email protected]> wrote:
>> >
>> > > Since I was the one who brought up file_path at the sync a couple weeks
>> > > ago, I'll share my thoughts:
>> > >
>> > > I am interested in the file_path field for column chunks because it would
>> > > allow for some extremely efficient data engineering in specific cases
>> > > like *adding a column to existing data*.
>> > >
>> > > My use case is LLM data. LLM data is often huge piles of text in parquet
>> > > format (see: all of Hugging Face, or any LLM request/response logs). If I
>> > > have a 400 MB source.parquet file, how can I annotate each row with an
>> > > added "score" column efficiently? I would prefer to not have to copy all
>> > > 400 MB of data just to add a "score" column. It would be slick if I could
>> > > make a new annotated.parquet file that points to source.parquet for the
>> > > source columns, and then only includes the new "score" column in the
>> > > annotated.parquet file. The source.parquet would remain 400 MB, the
>> > > annotated.parquet could be ~10 KB and incorporate the source data by
>> > > reference.
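
To make the annotated.parquet idea concrete, here is a small sketch of how a reader might decide which file each column chunk is read from. It assumes file_path is interpreted relative to the referencing file; `resolve_chunk_location` and `plan_reads` are hypothetical helpers, not part of hyparquet or any other implementation.

```python
import posixpath
from typing import Dict, Optional


def resolve_chunk_location(referencing_file: str, file_path: Optional[str]) -> str:
    """Return the file that actually holds a column chunk's bytes."""
    if file_path is None:
        # Not set: the chunk is stored in the file that holds the footer.
        return referencing_file
    # Assumption: file_path is relative to the referencing file's directory.
    base_dir = posixpath.dirname(referencing_file)
    return posixpath.normpath(posixpath.join(base_dir, file_path))


def plan_reads(
    referencing_file: str, columns: Dict[str, Optional[str]]
) -> Dict[str, str]:
    """Map each column name to the file its data would be read from."""
    return {
        name: resolve_chunk_location(referencing_file, file_path)
        for name, file_path in columns.items()
    }


# The large text column stays in source.parquet; only "score" is new data.
plan = plan_reads(
    "data/annotated.parquet",
    {"text": "source.parquet", "score": None},
)
```

Here `plan` sends the "text" column to data/source.parquet and the "score" column to data/annotated.parquet, which is the by-reference layout described above.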
>> > >
>> > > As the implementor of hyparquet I have conflicting opinions on this
>> > > feature. On the one hand, it's a cool capability, already built into
>> > > parquet. On the other hand... none of the parquet implementations support
>> > > it. Hyparquet has a branch for reading/writing file_path that I used for
>> > > testing. It does work. But I don't want to ship it unless there's at least
>> > > ONE other implementation that supports it (there isn't).
>> > >
>> > > I agree that this would be better implemented at the table format level
>> > > (e.g. iceberg). BUT... *iceberg does not support my adding column use
>> > > case*! The problem is that, despite parquet being a column-oriented format,
>> > > iceberg has no support to efficiently zip a new column with existing data.
>> > > The only option for "add column" in iceberg would be to *add a column with
>> > > default values and then re-write every row* (including the heavy text
>> > > data). So iceberg fails to solve my problem at all.
>> > >
>> > > Anyway, I'm fine with deprecating, or not. But I did want to at least make
>> > > the case that it could serve a purpose that I don't see any other good way
>> > > of solving at the moment.
>> > >
>> > > Kenny
>> > >
>> > >
>> > >
>> > > On Fri, Dec 5, 2025 at 9:46 PM Micah Kornfield <[email protected]>
>> > > wrote:
>> > >
>> > > > Hi Dan,
>> > > >
>> > > > > However, there are ongoing discussions around multi-modal cases where
>> > > > > either separating large columns (e.g. inline blobs) or appending column
>> > > > > data without rewriting existing data may leverage this.
>> > > >
>> > > > Do you have any design docs or mailing list discussions you can point to?
>> > > >
>> > > > > I don't feel like leaving this for now while we explore those use cases
>> > > > > would cause any additional confusion/complexity.
>> > > >
>> > > > Agreed, it isn't urgent to clean this up. But having a more concrete
>> > > > timeline would be helpful; this does seem to be a semi-regular source of
>> > > > confusion for folks, so it would be nice to clean up the loose end.
>> > > >
>> > > > Thanks,
>> > > > Micah
>> > > >
>> > > > On Fri, Dec 5, 2025 at 4:07 PM Daniel Weeks <[email protected]> wrote:
>> > > >
>> > > > > I'd actually prefer that we don't deprecate this field (at least not
>> > > > > immediately).
>> > > > >
>> > > > > Recognizing that we've discussed separating column data into multiple
>> > > > > files for over a decade without any concrete implementations, there are
>> > > > > emerging use cases that may benefit from investing in this feature.
>> > > > >
>> > > > > Many of the use cases in the past have been misaligned (e.g. separating
>> > > > > column data for security/encryption) and better alternatives addressed
>> > > > > those scenarios.
>> > > > >
>> > > > > However, there are ongoing discussions around multi-modal cases where
>> > > > > either separating large columns (e.g. inline blobs) or appending column
>> > > > > data without rewriting existing data may leverage this.
>> > > > >
>> > > > > I don't feel like leaving this for now while we explore those use cases
>> > > > > would cause any additional confusion/complexity.
>> > > > >
>> > > > > -Dan
>> > > > >
>> > > > > On Thu, Dec 4, 2025 at 9:04 AM Micah Kornfield <[email protected]>
>> > > > > wrote:
>> > > > >
>> > > > > > > What does "deprecated" entail here? Do we plan to remove this field
>> > > > > > > from the format? Otherwise, is it just documentation?
>> > > > > >
>> > > > > > I was imagining just documentation, since we don't want to break the
>> > > > > > "_metadata file" use case.
>> > > > > >
>> > > > > > On Thu, Dec 4, 2025 at 8:18 AM Antoine Pitrou <[email protected]>
>> > > > > > wrote:
>> > > > > >
>> > > > > > >
>> > > > > > > What does "deprecated" entail here? Do we plan to remove this field
>> > > > > > > from the format? Otherwise, is it just documentation?
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On Mon, 1 Dec 2025 12:09:18 -0800
>> > > > > > > Micah Kornfield <[email protected]>
>> > > > > > > wrote:
>> > > > > > > > This has come up a few times in the sync and other forums.  I
>> > > > > > > > wanted to start the conversation about deprecating file_path [1]
>> > > > > > > > in the parquet footer.
>> > > > > > > >
>> > > > > > > > Outside of the "_metadata" file index use-case I don't think this
>> > > > > > > > is used or implemented in any reader (effectively a poor man's
>> > > > > > > > table format).
>> > > > > > > >
>> > > > > > > > With the rise of table formats, it seems like a reasonable design
>> > > > > > > > choice to push complexity of referencing columns across files to
>> > > > > > > > the table level and keep parquet focused on single file storage
>> > > > > > > > (encodings, indexing, etc).
>> > > > > > > >
>> > > > > > > > Implementing this at a file level also can be challenging in the
>> > > > > > > > context of knowing all credentials one might need to read from
>> > > > > > > > different objects on object storage?
>> > > > > > > >
>> > > > > > > > Thoughts/Objections?
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > > Micah
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > [1]
>> > > > > > > > https://github.com/apache/parquet-format/blob/3ab52ff2e4e1cbe4c52a3e25c0512803e860c454/src/main/thrift/parquet.thrift#L962
