Re: [DISCUSS] Deprecate file_offset in ColumnChunk struct

2024-06-27 Thread Micah Kornfield
https://github.com/apache/parquet-format/pull/440 has a few approvals already. I think unless there are objections we can aim to merge mid next week (I'm not sure if this warrants a vote as it captures the reality as it stands today)? Thanks, Micah On Tue, Jun 25, 2024 at 12:54 PM Ed Seidl wro

Re: [DISCUSS] Deprecate file_offset in ColumnChunk struct

2024-06-25 Thread Ed Seidl
FWIW, I've throw up an alternative strawman PR [1]. Ed [1] https://github.com/apache/parquet-format/pull/440 On 6/25/24 11:56 AM, Ed Seidl wrote: I've now tested three implemenations (parquet-java, pyarrow/parquet-cpp, arrow-rs) to see what they all do. For brevity, I'll refer to the ColumnMe

Re: [DISCUSS] Deprecate file_offset in ColumnChunk struct

2024-06-25 Thread Ed Seidl
I've now tested three implemenations (parquet-java, pyarrow/parquet-cpp, arrow-rs) to see what they all do. For brevity, I'll refer to the ColumnMetaData struct following the chunk data as IMD (inline metadata), and the copy in the footer as FMD (footer metadata). parquet-java, as reported, do

Re: [DISCUSS] Deprecate file_offset in ColumnChunk struct

2024-06-25 Thread Andrew Lamb
> > If the ColumnMetaData in the footer is de facto > required, then I think we should at a minimum change the thrift to make > it so. If the result of changing the thrift is that parquet files that can not be read by old readers, I disagree with this tradeoff. I think it is better for the spec

Re: [DISCUSS] Deprecate file_offset in ColumnChunk struct

2024-06-25 Thread Gang Wu
I think the main argument here is whether the behavioral change to this field will actually break any reader implementation. Considering the current state of parquet-java, I would guess no. But I agree that let's modify the comment of the spec to make it clear and do the right thing on the writer's

Re: [DISCUSS] Deprecate file_offset in ColumnChunk struct

2024-06-25 Thread Ed Seidl
The issue I have is that we're currently in a position where a file written to the letter of the specification will likely be readable by none of the major parquet implementations. (I'm going to test this hypothesis today). If the ColumnMetaData in the footer is de facto required, then I think

Re: [DISCUSS] Deprecate file_offset in ColumnChunk struct

2024-06-25 Thread Andrew Lamb
If the goal of this exercise is to avoid confusion, I agree with Michah that updating parquet.thrift is best. Here [3] is a PR to update the thrift file to clarify that the field is not written by all writers and is not read by many. In my opinion any backwards incompatible changes do nothing othe

Re: [DISCUSS] Deprecate file_offset in ColumnChunk struct

2024-06-25 Thread Alkis Evlogimenos
We need a mechanism to remove fields. Typically this would involve some time horizon. I suggest we establish a deprecation horizon now, say 3y, and start the clocks ticking. Plus some convention for marking deprecated fields because the thrift IDL lacks a way to do this in code. I propose the anno

Re: [DISCUSS] Deprecate file_offset in ColumnChunk struct

2024-06-25 Thread Micah Kornfield
> > but I'm not clear on how that will > impact existing parsers. This can break older parsers, that validate required fields are in fact present. I think it would be best to just update documentation on the current state of affairs, and let implementations update accordingly if necessary. On M

[DISCUSS] Deprecate file_offset in ColumnChunk struct

2024-06-24 Thread Ed Seidl
Resurrecting a thread from earlier in the month regarding inconsistent use of the file_offset field [1][2]. It seems like the preferred path forward is to deprecate this (AFAICT) unused field to prevent further confusion. If there are no violent objections, I'll submit a PR to do so in a few da