Based on a conversation in the sync today, we thought that not explicitly deprecating the field, but instead providing guidance and documenting that new uses of the field should go through a feature-addition process, is probably a good path forward.
I put up https://github.com/apache/parquet-format/pull/542 for a straw-man to capture this.

On Tue, Dec 9, 2025 at 6:33 PM Julien Le Dem <[email protected]> wrote:

> IMO Iceberg needs to be aware of Parquet files referencing others so that
> it can prune older snapshots correctly and not delete parquet files
> referenced by others when deleting old snapshots. Depending if this is a
> cross table or within table file to file reference could make it more or
> less complicated.
>
> I could imagine starting with a simple implementation for the write path:
> "Create table foo using parquet options (reference_original_columns = true)
> as Select content, extract(content) as metadata from bar"
> That would be constrained to simple plans that have the scan and the output
> in the same step (map only) so that rows are in the same order per file.
> Alternatively "alter table foo add column metadata OPTIONS (column_familly
> = 'bar')" with a subsequent "update table set metadata=extract(content)" to
> create those files.
>
> (just some random thoughts, I'm sure others have spent more time thinking
> about this)
>
> This doesn't seem that different from the mechanism creating a deletion
> vector in Iceberg.
>
> It could also be seen as a view in iceberg joining on _row_id.
>
> This can be a topic in the meeting tomorrow.
>
> On Mon, Dec 8, 2025 at 9:02 AM Daniel Weeks <[email protected]> wrote:
>
> > Thanks for the context Kenny. That example is very similar to some of the
> > cases that come up in the multi-modal scenarios.
> >
> > I agree that we're in a little bit of a difficult situation due to lack of
> > existing support, which also leads to Micah's concern that it's a point of
> > confusion for implementers.
> >
> > I would be in favor of adding some additional context to the description
> > because there are some basic things implementers should do (e.g. validate
> > that the file path is either not set or set to the current file being read
> > if they don't support disaggregated column data). While older clients will
> > likely break if they encounter files written this way, there's almost no
> > risk that it would result in silent failures or corruption as I suspect
> > most implementations will read the ranges from the referencing file and not
> > be able to interpret it.
> >
> > Adding a read path is relatively straightforward (at least in the java
> > implementation for both stream and vectored IO reads), but the write path
> > is where things get more complicated.
> >
> > I think we want to discuss some of these use cases in more detail and see
> > if they are practical and reasonable. Some cases may make more sense at a
> > higher-level (like table metadata) while others may make sense to handle at
> > the file level (like asymmetric column sizes).
> >
> > -Dan
> >
> > On Sun, Dec 7, 2025 at 12:35 PM Kenny Daniel <[email protected]> wrote:
> >
> > > Since I was the one who brought up file_path at the sync a couple weeks
> > > ago, I'll share my thoughts:
> > >
> > > I am interested in the file_path field for column chunks because it would
> > > allow for some extremely efficient data engineering in specific cases
> > > like *adding a column to existing data*.
> > >
> > > My use case is LLM data. LLM data is often huge piles of text in parquet
> > > format (see: all of huggingface, or any llm request/response logs). If I
> > > have a 400mb source.parquet file, how can I annotate each row with an added
> > > "score" column efficiently? I would prefer to not have to copy all 400mb of
> > > data just to add a "score" column. It would be slick if I could make a new
> > > annotated.parquet file that points to source.parquet for the source
> > > columns, and then only includes the new "score" column in the
> > > annotated.parquet file. The source.parquet would remain 400mb, the
> > > annotated parquet could be ~10kb and incorporate the source data by
> > > reference.
> > >
> > > As the implementor of hyparquet I have conflicting opinions on this
> > > feature. On the one hand, it's a cool capability, already built into
> > > parquet. On the other hand... none of the parquet implementations support
> > > it. Hyparquet has a branch for reading/writing file_path that I used for
> > > testing. It does work. But I don't want to ship it unless theres at least
> > > ONE other implementation that supports it (there isn't).
> > >
> > > I agree that this would be better implemented at the table format level
> > > (eg- iceberg). BUT... *iceberg does not support my adding column use case*!
> > > The problem is that, despite parquet being a column-oriented format,
> > > iceberg has no support to efficiently zip a new column with existing data.
> > > The only option for "add column" in iceberg would be to *add a column with
> > > default values and then re-write every row* (including the heavy text
> > > data). So iceberg fails to solve my problem at all.
> > >
> > > Anyway, I'm fine with deprecating, or not. But I did want to at least make
> > > the case that it could serve a purpose that I don't see any other good way
> > > of solving at the moment.
> > >
> > > Kenny
> > >
> > > On Fri, Dec 5, 2025 at 9:46 PM Micah Kornfield <[email protected]> wrote:
> > >
> > > > Hi Dan,
> > > >
> > > > > However, there are ongoing discussions around multi-modal cases where
> > > > > either separating large columns (e.g. inline blobs) or appending column
> > > > > data without rewriting existing data may leverage this.
> > > >
> > > > Do you have any design docs or mailing list discussions you can point to?
> > > >
> > > > > I don't feel like leaving this for now while we explore those use cases
> > > > > would cause any additional confusion/complexity.
> > > >
> > > > Agreed, it isn't urgent to clean this up. But having a more concrete
> > > > timeline would be helpful, this does seem to be a semi-regular source of
> > > > confusion for folks, so it would be nice to clean up the loose end.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > On Fri, Dec 5, 2025 at 4:07 PM Daniel Weeks <[email protected]> wrote:
> > > >
> > > > > I'd actually prefer that we don't deprecate this field (at least not
> > > > > immediately).
> > > > >
> > > > > Recognizing that we've discussed separating column data into multiple files
> > > > > for over a decade without any concrete implementations, there are emerging
> > > > > use cases that may benefit from investing in this feature.
> > > > >
> > > > > Many of the use cases in the past have been misaligned (e.g. separating
> > > > > column data for security/encryption) and better alternatives addressed
> > > > > those scenarios.
> > > > >
> > > > > However, there are ongoing discussions around multi-modal cases where
> > > > > either separating large columns (e.g. inline blobs) or appending column
> > > > > data without rewriting existing data may leverage this.
> > > > >
> > > > > I don't feel like leaving this for now while we explore those use cases
> > > > > would cause any additional confusion/complexity.
> > > > >
> > > > > -Dan
> > > > >
> > > > > On Thu, Dec 4, 2025 at 9:04 AM Micah Kornfield <[email protected]> wrote:
> > > > >
> > > > > > > What does "deprecated" entail here? Do we plan to remove this field
> > > > > > > from the format? Otherwise, is it just documentation?
> > > > > >
> > > > > > I was imagining just documentation, since we don't want to break the
> > > > > > "_metadata file" use case.
> > > > > >
> > > > > > On Thu, Dec 4, 2025 at 8:18 AM Antoine Pitrou <[email protected]> wrote:
> > > > > >
> > > > > > > What does "deprecated" entail here? Do we plan to remove this field
> > > > > > > from the format? Otherwise, is it just documentation?
> > > > > > >
> > > > > > > On Mon, 1 Dec 2025 12:09:18 -0800
> > > > > > > Micah Kornfield <[email protected]> wrote:
> > > > > > >
> > > > > > > > This has come up a few times in the sync and other forums. I wanted to
> > > > > > > > start the conversation about deprecating file_path
> > > > > > > > <https://github.com/apache/parquet-format/blob/3ab52ff2e4e1cbe4c52a3e25c0512803e860c454/src/main/thrift/parquet.thrift#L962>
> > > > > > > > [1] in the parquet footer.
> > > > > > > >
> > > > > > > > Outside of the "_metadata" file index use-case I don't think this is used
> > > > > > > > or implemented in any reader (effectively a poor man's table format).
> > > > > > > >
> > > > > > > > With the rise of file formats, it seems like a reasonable design choice to
> > > > > > > > push complexity of referencing columns across files to the table level and
> > > > > > > > keep parquet focused on single file storage (encodings, indexing, etc).
> > > > > > > >
> > > > > > > > Implementing this at a file level also can be challenging in the context of
> > > > > > > > knowing all credentials one might need to read from different objects on
> > > > > > > > object storage?
> > > > > > > >
> > > > > > > > Thoughts/Objections?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Micah
> > > > > > > >
> > > > > > > > [1]
> > > > > > > > https://github.com/apache/parquet-format/blob/3ab52ff2e4e1cbe4c52a3e25c0512803e860c454/src/main/thrift/parquet.thrift#L962
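
For implementers following the thread, the reader-side check Dan describes (reject a file unless each column chunk's file_path is unset or points back at the file being read) might look roughly like the sketch below. It is illustrative only: `ColumnChunkRef`, `ExternalColumnChunkError`, and `check_column_chunks` are hypothetical names, not part of any Parquet library, and comparing normalized path strings is an assumption about how a reader would decide that a path "is the current file".

```python
import posixpath
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ColumnChunkRef:
    """Hypothetical stand-in for one column chunk entry in a Parquet footer."""
    column: str
    file_path: Optional[str]  # None or "" is treated as "data is in this file"


class ExternalColumnChunkError(ValueError):
    """Raised when a column chunk points at a file the reader cannot follow."""


def check_column_chunks(current_file: str, chunks: Iterable[ColumnChunkRef]) -> None:
    """Fail loudly instead of reading byte ranges from the wrong file.

    Assumption: two paths refer to the same file when their normalized
    string forms are equal. A real reader may need to resolve relative
    paths against the location of the file being read.
    """
    current = posixpath.normpath(current_file)
    for chunk in chunks:
        if not chunk.file_path:
            continue  # unset: column data lives in the file being read
        if posixpath.normpath(chunk.file_path) == current:
            continue  # explicitly set to the file being read
        raise ExternalColumnChunkError(
            f"column {chunk.column!r} references {chunk.file_path!r}; "
            "external column chunks are not supported by this reader"
        )


# Usage: the first call passes silently, the second raises.
check_column_chunks(
    "data/part-0.parquet",
    [ColumnChunkRef("text", None), ColumnChunkRef("score", "data/part-0.parquet")],
)
try:
    check_column_chunks("annotated.parquet", [ColumnChunkRef("text", "source.parquet")])
except ExternalColumnChunkError as err:
    print(err)
```

In a real reader the file_path values would come from the footer's column chunk metadata rather than from hand-built objects. The point of raising an error is to turn the "older clients will likely break" case into an explicit, explainable failure.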
