I also posted a PR (https://github.com/apache/parquet-site/pull/145) updating the implementation status page to reflect that what is actually supported is _metadata files, not file_path in general. I believe only hyparquet might have support for actually reading external columns.
Thanks,
Micah

On Wed, Dec 10, 2025 at 11:49 PM Micah Kornfield <[email protected]> wrote:

> Based on a conversation in the sync today, we thought not explicitly deprecating the field but providing guidance and documenting that new uses for the field should go through a feature addition process is probably a good path forward.
>
> I put up https://github.com/apache/parquet-format/pull/542 for a straw-man to capture this.
>
> On Tue, Dec 9, 2025 at 6:33 PM Julien Le Dem <[email protected]> wrote:
>
>> IMO Iceberg needs to be aware of Parquet files referencing others so that it can prune older snapshots correctly and not delete parquet files referenced by others when deleting old snapshots. Depending if this is a cross table or within table file to file reference could make it more or less complicated.
>>
>> I could imagine starting with a simple implementation for the write path:
>> "Create table foo using parquet options (reference_original_columns = true) as Select content, extract(content) as metadata from bar"
>> That would be constrained to simple plans that have the scan and the output in the same step (map only) so that rows are in the same order per file.
>> Alternatively "alter table foo add column metadata OPTIONS (column_familly = 'bar')" with a subsequent "update table set metadata=extract(content)" to create those files.
>>
>> (just some random thoughts, I'm sure others have spent more time thinking about this)
>>
>> This doesn't seem that different from the mechanism creating a deletion vector in Iceberg.
>>
>> It could also be seen as a view in iceberg joining on _row_id.
>>
>> This can be a topic in the meeting tomorrow.
>>
>> On Mon, Dec 8, 2025 at 9:02 AM Daniel Weeks <[email protected]> wrote:
>>
>> > Thanks for the context Kenny. That example is very similar to some of the cases that come up in the multi-modal scenarios.
>> >
>> > I agree that we're in a little bit of a difficult situation due to lack of existing support, which also leads to Micah's concern that it's a point of confusion for implementers.
>> >
>> > I would be in favor of adding some additional context to the description because there are some basic things implementers should do (e.g. validate that the file path is either not set or set to the current file being read if they don't support disaggregated column data). While older clients will likely break if they encounter files written this way, there's almost no risk that it would result in silent failures or corruption as I suspect most implementations will read the ranges from the referencing file and not be able to interpret it.
>> >
>> > Adding a read path is relatively straightforward (at least in the java implementation for both stream and vectored IO reads), but the write path is where things get more complicated.
>> >
>> > I think we want to discuss some of these use cases in more detail and see if they are practical and reasonable. Some cases may make more sense at a higher-level (like table metadata) while others may make sense to handle at the file level (like asymmetric column sizes).
>> >
>> > -Dan
>> >
>> > On Sun, Dec 7, 2025 at 12:35 PM Kenny Daniel <[email protected]> wrote:
>> >
>> > > Since I was the one who brought up file_path at the sync a couple weeks ago, I'll share my thoughts:
>> > >
>> > > I am interested in the file_path field for column chunks because it would allow for some extremely efficient data engineering in specific cases like *adding a column to existing data*.
>> > >
>> > > My use case is LLM data. LLM data is often huge piles of text in parquet format (see: all of huggingface, or any llm request/response logs).
>> > > If I have a 400mb source.parquet file, how can I annotate each row with an added "score" column efficiently? I would prefer to not have to copy all 400mb of data just to add a "score" column. It would be slick if I could make a new annotated.parquet file that points to source.parquet for the source columns, and then only includes the new "score" column in the annotated.parquet file. The source.parquet would remain 400mb, the annotated parquet could be ~10kb and incorporate the source data by reference.
>> > >
>> > > As the implementor of hyparquet I have conflicting opinions on this feature. On the one hand, it's a cool capability, already built into parquet. On the other hand... none of the parquet implementations support it. Hyparquet has a branch for reading/writing file_path that I used for testing. It does work. But I don't want to ship it unless there's at least ONE other implementation that supports it (there isn't).
>> > >
>> > > I agree that this would be better implemented at the table format level (e.g. iceberg). BUT... *iceberg does not support my adding column use case*! The problem is that, despite parquet being a column-oriented format, iceberg has no support to efficiently zip a new column with existing data. The only option for "add column" in iceberg would be to *add a column with default values and then re-write every row* (including the heavy text data). So iceberg fails to solve my problem at all.
>> > >
>> > > Anyway, I'm fine with deprecating, or not. But I did want to at least make the case that it could serve a purpose that I don't see any other good way of solving at the moment.
>> > >
>> > > Kenny
>> > >
>> > > On Fri, Dec 5, 2025 at 9:46 PM Micah Kornfield <[email protected]> wrote:
>> > >
>> > > > Hi Dan,
>> > > >
>> > > > > However, there are ongoing discussions around multi-modal cases where either separating large columns (e.g. inline blobs) or appending column data without rewriting existing data may leverage this.
>> > > >
>> > > > Do you have any design docs or mailing list discussions you can point to?
>> > > >
>> > > > > I don't feel like leaving this for now while we explore those use cases would cause any additional confusion/complexity.
>> > > >
>> > > > Agreed, it isn't urgent to clean this up. But having a more concrete timeline would be helpful, this does seem to be a semi-regular source of confusion for folks, so it would be nice to clean up the loose end.
>> > > >
>> > > > Thanks,
>> > > > Micah
>> > > >
>> > > > On Fri, Dec 5, 2025 at 4:07 PM Daniel Weeks <[email protected]> wrote:
>> > > >
>> > > > > I'd actually prefer that we don't deprecate this field (at least not immediately).
>> > > > >
>> > > > > Recognizing that we've discussed separating column data into multiple files for over a decade without any concrete implementations, there are emerging use cases that may benefit from investing in this feature.
>> > > > >
>> > > > > Many of the use cases in the past have been misaligned (e.g. separating column data for security/encryption) and better alternatives addressed those scenarios.
>> > > > >
>> > > > > However, there are ongoing discussions around multi-modal cases where either separating large columns (e.g. inline blobs) or appending column data without rewriting existing data may leverage this.
>> > > > >
>> > > > > I don't feel like leaving this for now while we explore those use cases would cause any additional confusion/complexity.
>> > > > >
>> > > > > -Dan
>> > > > >
>> > > > > On Thu, Dec 4, 2025 at 9:04 AM Micah Kornfield <[email protected]> wrote:
>> > > > >
>> > > > > > > What does "deprecated" entail here? Do we plan to remove this field from the format? Otherwise, is it just documentation?
>> > > > > >
>> > > > > > I was imagining just documentation, since we don't want to break the "_metadata file" use case.
>> > > > > >
>> > > > > > On Thu, Dec 4, 2025 at 8:18 AM Antoine Pitrou <[email protected]> wrote:
>> > > > > >
>> > > > > > > What does "deprecated" entail here? Do we plan to remove this field from the format? Otherwise, is it just documentation?
>> > > > > > >
>> > > > > > > On Mon, 1 Dec 2025 12:09:18 -0800 Micah Kornfield <[email protected]> wrote:
>> > > > > > >
>> > > > > > > > This has come up a few times in the sync and other forums. I wanted to start the conversation about deprecating file_path [1] in the parquet footer.
>> > > > > > > >
>> > > > > > > > Outside of the "_metadata" file index use-case I don't think this is used or implemented in any reader (effectively a poor man's table format).
>> > > > > > > >
>> > > > > > > > With the rise of file formats, it seems like a reasonable design choice to push complexity of referencing columns across files to the table level and keep parquet focused on single file storage (encodings, indexing, etc).
>> > > > > > > >
>> > > > > > > > Implementing this at a file level also can be challenging in the context of knowing all credentials one might need to read from different objects on object storage?
>> > > > > > > >
>> > > > > > > > Thoughts/Objections?
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > > Micah
>> > > > > > > >
>> > > > > > > > [1] https://github.com/apache/parquet-format/blob/3ab52ff2e4e1cbe4c52a3e25c0512803e860c454/src/main/thrift/parquet.thrift#L962
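
For reference, the reader-side check Dan describes in the thread (validate that the file path is either not set or set to the current file being read) could look roughly like the Python sketch below. This is only an illustration under stated assumptions: ColumnChunkMeta, ExternalColumnError and check_file_paths are hypothetical names, not part of any Parquet implementation, and the sketch assumes file_path is interpreted relative to the directory of the file being read.

```python
import posixpath
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ColumnChunkMeta:
    """Hypothetical stand-in for one decoded ColumnChunk entry from the footer."""
    column_name: str
    file_path: Optional[str] = None  # None models the optional field being unset


class ExternalColumnError(ValueError):
    """Raised when a column chunk points at a file this reader will not follow."""


def check_file_paths(chunks: Iterable[ColumnChunkMeta], current_file: str) -> None:
    """Reject footers that reference column data outside current_file.

    Assumption: file_path is relative to the directory of the file being
    read, so only that file's own base name counts as "the current file".
    """
    current_name = posixpath.basename(current_file)
    for chunk in chunks:
        if chunk.file_path is None:
            continue  # data lives in the same file as the footer
        if posixpath.normpath(chunk.file_path) == current_name:
            continue  # explicit self-reference
        raise ExternalColumnError(
            f"column {chunk.column_name!r} references {chunk.file_path!r}; "
            "external column data is not supported by this reader"
        )
```

With Kenny's example, a footer for annotated.parquet whose "text" chunk carries file_path "source.parquet" would raise ExternalColumnError here instead of having byte ranges read from the wrong file, while a footer with file_path unset everywhere passes unchanged.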
