Thanks for the context, Kenny. That example is very similar to some of the cases that come up in the multi-modal scenarios.

I agree that we're in a somewhat difficult situation due to the lack of existing support, which also leads to Micah's concern that it's a point of confusion for implementers.

I would be in favor of adding some additional context to the description, because there are some basic things implementers should do (e.g. validate that the file path is either not set or set to the current file being read if they don't support disaggregated column data).

While older clients will likely break if they encounter files written this way, there's almost no risk of silent failures or corruption, as I suspect most implementations will read the ranges from the referencing file and not be able to interpret the data.

Adding a read path is relatively straightforward (at least in the Java implementation, for both stream and vectored IO reads), but the write path is where things get more complicated.

I think we want to discuss some of these use cases in more detail and see if they are practical and reasonable. Some cases may make more sense at a higher level (like table metadata), while others may make sense to handle at the file level (like asymmetric column sizes).

-Dan

On Sun, Dec 7, 2025 at 12:35 PM Kenny Daniel <[email protected]> wrote:

> Since I was the one who brought up file_path at the sync a couple weeks
> ago, I'll share my thoughts:
>
> I am interested in the file_path field for column chunks because it would
> allow for some extremely efficient data engineering in specific cases
> like *adding a column to existing data*.
>
> My use case is LLM data. LLM data is often huge piles of text in parquet
> format (see: all of huggingface, or any llm request/response logs). If I
> have a 400mb source.parquet file, how can I annotate each row with an added
> "score" column efficiently? I would prefer to not have to copy all 400mb of
> data just to add a "score" column.
> It would be slick if I could make a new annotated.parquet file that
> points to source.parquet for the source columns, and then only includes
> the new "score" column in the annotated.parquet file. The source.parquet
> would remain 400mb, the annotated parquet could be ~10kb and incorporate
> the source data by reference.
>
> As the implementor of hyparquet I have conflicting opinions on this
> feature. On the one hand, it's a cool capability, already built into
> parquet. On the other hand... none of the parquet implementations support
> it. Hyparquet has a branch for reading/writing file_path that I used for
> testing. It does work. But I don't want to ship it unless there's at least
> ONE other implementation that supports it (there isn't).
>
> I agree that this would be better implemented at the table format level
> (e.g. iceberg). BUT... *iceberg does not support my adding column use case*!
> The problem is that, despite parquet being a column-oriented format,
> iceberg has no support to efficiently zip a new column with existing data.
> The only option for "add column" in iceberg would be to *add a column with
> default values and then re-write every row* (including the heavy text
> data). So iceberg fails to solve my problem at all.
>
> Anyway, I'm fine with deprecating, or not. But I did want to at least make
> the case that it could serve a purpose that I don't see any other good way
> of solving at the moment.
>
> Kenny
>
> On Fri, Dec 5, 2025 at 9:46 PM Micah Kornfield <[email protected]> wrote:
>
> > Hi Dan,
> >
> > > However, there are ongoing discussions around multi-modal cases where
> > > either separating large columns (e.g. inline blobs) or appending column
> > > data without rewriting existing data may leverage this.
> >
> > Do you have any design docs or mailing list discussions you can point to?
> >
> > > I don't feel like leaving this for now while we explore those use cases
> > > would cause any additional confusion/complexity.
> >
> > Agreed, it isn't urgent to clean this up. But having a more concrete
> > timeline would be helpful, this does seem to be a semi-regular source of
> > confusion for folks, so it would be nice to clean up the loose end.
> >
> > Thanks,
> > Micah
> >
> > On Fri, Dec 5, 2025 at 4:07 PM Daniel Weeks <[email protected]> wrote:
> >
> > > I'd actually prefer that we don't deprecate this field (at least not
> > > immediately).
> > >
> > > Recognizing that we've discussed separating column data into multiple
> > > files for over a decade without any concrete implementations, there are
> > > emerging use cases that may benefit from investing in this feature.
> > >
> > > Many of the use cases in the past have been misaligned (e.g. separating
> > > column data for security/encryption) and better alternatives addressed
> > > those scenarios.
> > >
> > > However, there are ongoing discussions around multi-modal cases where
> > > either separating large columns (e.g. inline blobs) or appending column
> > > data without rewriting existing data may leverage this.
> > >
> > > I don't feel like leaving this for now while we explore those use cases
> > > would cause any additional confusion/complexity.
> > >
> > > -Dan
> > >
> > > On Thu, Dec 4, 2025 at 9:04 AM Micah Kornfield <[email protected]> wrote:
> > >
> > > > > What does "deprecated" entail here? Do we plan to remove this field
> > > > > from the format? Otherwise, is it just documentation?
> > > >
> > > > I was imagining just documentation, since we don't want to break the
> > > > "_metadata file" use case.
> > > >
> > > > On Thu, Dec 4, 2025 at 8:18 AM Antoine Pitrou <[email protected]> wrote:
> > > >
> > > > > What does "deprecated" entail here? Do we plan to remove this field
> > > > > from the format? Otherwise, is it just documentation?
> > > > >
> > > > > On Mon, 1 Dec 2025 12:09:18 -0800
> > > > > Micah Kornfield <[email protected]> wrote:
> > > > > > This has come up a few times in the sync and other forums. I wanted
> > > > > > to start the conversation about deprecating file_path
> > > > > > <https://github.com/apache/parquet-format/blob/3ab52ff2e4e1cbe4c52a3e25c0512803e860c454/src/main/thrift/parquet.thrift#L962>
> > > > > > [1] in the parquet footer.
> > > > > >
> > > > > > Outside of the "_metadata" file index use-case I don't think this is
> > > > > > used or implemented in any reader (effectively a poor man's table
> > > > > > format).
> > > > > >
> > > > > > With the rise of file formats, it seems like a reasonable design
> > > > > > choice to push complexity of referencing columns across files to the
> > > > > > table level and keep parquet focused on single file storage
> > > > > > (encodings, indexing, etc).
> > > > > >
> > > > > > Implementing this at a file level also can be challenging in the
> > > > > > context of knowing all credentials one might need to read from
> > > > > > different objects on object storage?
> > > > > >
> > > > > > Thoughts/Objections?
> > > > > >
> > > > > > Thanks,
> > > > > > Micah
> > > > > >
> > > > > > [1]
> > > > > > https://github.com/apache/parquet-format/blob/3ab52ff2e4e1cbe4c52a3e25c0512803e860c454/src/main/thrift/parquet.thrift#L962
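
For reference, the reader-side check suggested at the top of this reply (accept a column chunk only if file_path is unset or points at the file being read) could look roughly like the Python sketch below. This is only an illustration: ColumnChunk, check_column_chunks and UnsupportedExternalColumnError are hypothetical stand-ins rather than any implementation's API, and the sketch assumes file_path is resolved relative to the directory of the file being read.

```python
import posixpath
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ColumnChunk:
    """Hypothetical stand-in for the footer's column chunk metadata."""
    column_name: str
    file_path: Optional[str] = None  # unset means "data lives in this file"


class UnsupportedExternalColumnError(Exception):
    """Raised when a chunk references data stored in another file."""


def check_column_chunks(current_file: str, chunks: Iterable[ColumnChunk]) -> None:
    """Fail loudly instead of reading byte ranges from the wrong file."""
    base_dir = posixpath.dirname(current_file)
    current = posixpath.normpath(current_file)
    for chunk in chunks:
        if chunk.file_path is None:
            continue  # the common case: column data is in the file being read
        # Assumption: file_path is relative to the directory of the current file.
        resolved = posixpath.normpath(posixpath.join(base_dir, chunk.file_path))
        if resolved != current:
            raise UnsupportedExternalColumnError(
                f"column {chunk.column_name!r} references {chunk.file_path!r}, "
                "but this reader does not support external column data"
            )
```

A reader that does not support disaggregated column data would run this once after parsing the footer, so a file like the annotated.parquet example is rejected with a clear error instead of producing uninterpretable reads.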
