Re: [DISCUSS] Deprecate file_path field in column chunk

Micah Kornfield Wed, 10 Dec 2025 23:49:36 -0800

Based on a conversation in the sync today, we think a good path forward is
probably not to deprecate the field explicitly, but instead to provide
guidance and document that new uses of the field should go through a
feature addition process.


I put up https://github.com/apache/parquet-format/pull/542 as a straw-man
to capture this.

On Tue, Dec 9, 2025 at 6:33 PM Julien Le Dem <[email protected]> wrote:

> IMO Iceberg needs to be aware of Parquet files referencing others so that
> it can prune older snapshots correctly and not delete Parquet files
> referenced by others when deleting old snapshots. Whether this is a
> cross-table or within-table file-to-file reference could make it more or
> less complicated.
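
To make the snapshot-pruning concern concrete, here is a minimal sketch of
reference-aware cleanup. It assumes the table format keeps a map from each
Parquet file to the files its column chunks point at; that map, the function
name, and the file names are all hypothetical, and neither Parquet nor
Iceberg provides this bookkeeping today.

```python
from typing import Dict, Iterable, Set


def files_safe_to_delete(
    expired_files: Iterable[str],
    live_files: Iterable[str],
    references: Dict[str, Set[str]],
) -> Set[str]:
    """Return the expired files that no live file still depends on.

    `references` maps a file to the files its column chunks point at
    (hypothetical bookkeeping, assumed to be tracked by the table format).
    """
    live = set(live_files)
    pinned: Set[str] = set()
    seen = set(live)
    stack = list(live)
    # Follow references transitively so a chain a -> b -> c keeps c alive.
    while stack:
        current = stack.pop()
        for target in references.get(current, ()):
            pinned.add(target)
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return {f for f in expired_files if f not in live and f not in pinned}


refs = {"annotated.parquet": {"source.parquet"}}
deletable = files_safe_to_delete(
    expired_files=["source.parquet", "old.parquet"],
    live_files=["annotated.parquet"],
    references=refs,
)
print(sorted(deletable))
```

Here source.parquet belongs to an expired snapshot but is still pinned by
annotated.parquet, so only old.parquet is reported as deletable.
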
>
> I could imagine starting with a simple implementation for the write path:
> "Create table foo using parquet options (reference_original_columns = true)
> as Select content, extract(content) as metadata from bar"
> That would be constrained to simple plans that have the scan and the output
> in the same step (map only) so that rows are in the same order per file.
> Alternatively "alter table foo add column metadata OPTIONS (column_family
> = 'bar')" with a subsequent "update table set metadata=extract(content)" to
> create those files.
>
> (just some random thoughts, I'm sure others have spent more time thinking
> about this)
>
> This doesn't seem that different from the mechanism creating a deletion
> vector in Iceberg.
>
> It could also be seen as a view in iceberg joining on _row_id.
>
> This can be a topic in the meeting tomorrow.
>
> On Mon, Dec 8, 2025 at 9:02 AM Daniel Weeks <[email protected]> wrote:
>
> > Thanks for the context Kenny.  That example is very similar to some of the
> > cases that come up in the multi-modal scenarios.
> >
> > I agree that we're in a little bit of a difficult situation due to lack of
> > existing support, which also leads to Micah's concern that it's a point of
> > confusion for implementers.
> >
> > I would be in favor of adding some additional context to the description
> > because there are some basic things implementers should do (e.g. validate
> > that the file path is either not set or set to the current file being read
> > if they don't support disaggregated column data).  While older clients will
> > likely break if they encounter files written this way, there's almost no
> > risk that it would result in silent failures or corruption as I suspect
> > most implementations will read the ranges from the referencing file and not
> > be able to interpret it.
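
The validation described above could look roughly like the sketch below.
The ColumnChunk class is a simplified stand-in for the Thrift struct, and
the exception and function names are made up for illustration. It also
assumes that "set to the current file" can be checked with a plain string
comparison; a real reader would probably need to normalize paths first.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnChunk:
    # Simplified stand-in for the Thrift ColumnChunk; only file_path is modeled.
    file_path: Optional[str] = None


class UnsupportedExternalChunk(Exception):
    """Raised when a chunk points at a file this reader cannot follow."""


def check_file_paths(chunks: List[ColumnChunk], current_file: str) -> None:
    """Fail loudly instead of reading byte ranges from the wrong file."""
    for index, chunk in enumerate(chunks):
        if chunk.file_path is None:
            continue  # not set: the data lives in the file being read
        if chunk.file_path == current_file:
            continue  # explicitly points back at the file being read
        raise UnsupportedExternalChunk(
            f"column chunk {index} references {chunk.file_path!r}, "
            f"but this reader only supports data stored in {current_file!r}"
        )


check_file_paths([ColumnChunk(), ColumnChunk("a.parquet")], "a.parquet")

try:
    check_file_paths([ColumnChunk("other.parquet")], "a.parquet")
    rejected = False
except UnsupportedExternalChunk:
    rejected = True
print(rejected)
```

Raising an error up front turns the "read the wrong ranges and fail to
interpret them" case into an explicit, diagnosable failure.
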
> >
> > Adding a read path is relatively straightforward (at least in the java
> > implementation for both stream and vectored IO reads), but the write path
> > is where things get more complicated.
> >
> > I think we want to discuss some of these use cases in more detail and see
> > if they are practical and reasonable.  Some cases may make more sense at a
> > higher-level (like table metadata) while others may make sense to handle at
> > the file level (like asymmetric column sizes).
> >
> > -Dan
> >
> >
> >
> > On Sun, Dec 7, 2025 at 12:35 PM Kenny Daniel <[email protected]> wrote:
> >
> > > Since I was the one who brought up file_path at the sync a couple weeks
> > > ago, I'll share my thoughts:
> > >
> > > I am interested in the file_path field for column chunks because it would
> > > allow for some extremely efficient data engineering in specific cases
> > > like *adding a column to existing data*.
> > >
> > > My use case is LLM data. LLM data is often huge piles of text in parquet
> > > format (see: all of huggingface, or any llm request/response logs). If I
> > > have a 400mb source.parquet file, how can I annotate each row with an added
> > > "score" column efficiently? I would prefer to not have to copy all 400mb of
> > > data just to add a "score" column. It would be slick if I could make a new
> > > annotated.parquet file that points to source.parquet for the source
> > > columns, and then only includes the new "score" column in the
> > > annotated.parquet file. The source.parquet would remain 400mb, the
> > > annotated parquet could be ~10kb and incorporate the source data by
> > > reference.
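
As a rough illustration of the layout described above, the sketch below
models the footer of annotated.parquet and works out which file holds the
bytes for each column. The ColumnChunk class, the "text" column name, the
"data/" directory, and the helper function are hypothetical. It assumes
file_path is resolved relative to the directory of the file containing the
footer.

```python
import posixpath
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ColumnChunk:
    column: str
    file_path: Optional[str] = None  # None means "same file as the footer"


def resolve_chunk_locations(
    footer_path: str, chunks: List[ColumnChunk]
) -> Dict[str, str]:
    """Map each column name to the file that actually stores its data."""
    base_dir = posixpath.dirname(footer_path)
    locations: Dict[str, str] = {}
    for chunk in chunks:
        if chunk.file_path is None:
            locations[chunk.column] = footer_path
        else:
            # Assumption: the reference is relative to the footer's directory.
            locations[chunk.column] = posixpath.normpath(
                posixpath.join(base_dir, chunk.file_path)
            )
    return locations


# annotated.parquet stores only "score"; "text" is incorporated by reference.
annotated_footer = [
    ColumnChunk(column="text", file_path="source.parquet"),
    ColumnChunk(column="score"),
]
locations = resolve_chunk_locations("data/annotated.parquet", annotated_footer)
print(locations)
```

A reader that supports this would open data/source.parquet for the "text"
column and data/annotated.parquet for "score", relying on both files
keeping rows in the same order.
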
> > >
> > > As the implementor of hyparquet I have conflicting opinions on this
> > > feature. On the one hand, it's a cool capability, already built into
> > > parquet. On the other hand... none of the parquet implementations support
> > > it. Hyparquet has a branch for reading/writing file_path that I used for
> > > testing. It does work. But I don't want to ship it unless there's at least
> > > ONE other implementation that supports it (there isn't).
> > >
> > > I agree that this would be better implemented at the table format level
> > > (e.g. iceberg). BUT... *iceberg does not support my adding column use case*!
> > > The problem is that, despite parquet being a column-oriented format,
> > > iceberg has no support to efficiently zip a new column with existing data.
> > > The only option for "add column" in iceberg would be to *add a column with
> > > default values and then re-write every row* (including the heavy text
> > > data). So iceberg fails to solve my problem at all.
> > >
> > > Anyway, I'm fine with deprecating, or not. But I did want to at least make
> > > the case that it could serve a purpose that I don't see any other good way
> > > of solving at the moment.
> > >
> > > Kenny
> > >
> > >
> > >
> > > On Fri, Dec 5, 2025 at 9:46 PM Micah Kornfield <[email protected]>
> > > wrote:
> > >
> > > > Hi Dan,
> > > >
> > > > > However, there are ongoing discussions around multi-modal cases where
> > > > > either separating large columns (e.g. inline blobs) or appending column
> > > > > data without rewriting existing data may leverage this.
> > > >
> > > >
> > > > Do you have any design docs or mailing list discussions you can point to?
> > > >
> > > > > I don't feel like leaving this for now while we explore those use cases
> > > > > would cause any additional confusion/complexity.
> > > >
> > > >
> > > > Agreed, it isn't urgent to clean this up. But having a more concrete
> > > > timeline would be helpful; this does seem to be a semi-regular source of
> > > > confusion for folks, so it would be nice to clean up the loose end.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > On Fri, Dec 5, 2025 at 4:07 PM Daniel Weeks <[email protected]> wrote:
> > > >
> > > > > I'd actually prefer that we don't deprecate this field (at least not
> > > > > immediately).
> > > > >
> > > > > Recognizing that we've discussed separating column data into multiple
> > > > > files for over a decade without any concrete implementations, there are
> > > > > emerging use cases that may benefit from investing in this feature.
> > > > >
> > > > > Many of the use cases in the past have been misaligned (e.g. separating
> > > > > column data for security/encryption) and better alternatives addressed
> > > > > those scenarios.
> > > > >
> > > > > However, there are ongoing discussions around multi-modal cases where
> > > > > either separating large columns (e.g. inline blobs) or appending column
> > > > > data without rewriting existing data may leverage this.
> > > > >
> > > > > I don't feel like leaving this for now while we explore those use cases
> > > > > would cause any additional confusion/complexity.
> > > > >
> > > > > -Dan
> > > > >
> > > > > On Thu, Dec 4, 2025 at 9:04 AM Micah Kornfield <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > > What does "deprecated" entail here? Do we plan to remove this field
> > > > > > > from the format? Otherwise, is it just documentation?
> > > > > >
> > > > > > I was imagining just documentation, since we don't want to break the
> > > > > > "_metadata file" use case.
> > > > > >
> > > > > > On Thu, Dec 4, 2025 at 8:18 AM Antoine Pitrou <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > >
> > > > > > > What does "deprecated" entail here? Do we plan to remove this field
> > > > > > > from the format? Otherwise, is it just documentation?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, 1 Dec 2025 12:09:18 -0800
> > > > > > > Micah Kornfield <[email protected]>
> > > > > > > wrote:
> > > > > > > > This has come up a few times in the sync and other forums.  I wanted to
> > > > > > > > start the conversation about deprecating file_path
> > > > > > > > <https://github.com/apache/parquet-format/blob/3ab52ff2e4e1cbe4c52a3e25c0512803e860c454/src/main/thrift/parquet.thrift#L962>
> > > > > > > > [1] in the parquet footer.
> > > > > > > >
> > > > > > > > Outside of the "_metadata" file index use-case I don't think this is used
> > > > > > > > or implemented in any reader (effectively a poor man's table format).
> > > > > > > >
> > > > > > > > With the rise of table formats, it seems like a reasonable design choice to
> > > > > > > > push complexity of referencing columns across files to the table level and
> > > > > > > > keep parquet focused on single file storage (encodings, indexing, etc).
> > > > > > > >
> > > > > > > > Implementing this at a file level also can be challenging in the context of
> > > > > > > > knowing all credentials one might need to read from different objects on
> > > > > > > > object storage.
> > > > > > > >
> > > > > > > > Thoughts/Objections?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Micah
> > > > > > > >
> > > > > > > >
> > > > > > > > [1]
> > > > > > > > https://github.com/apache/parquet-format/blob/3ab52ff2e4e1cbe4c52a3e25c0512803e860c454/src/main/thrift/parquet.thrift#L962
