https://github.com/apache/parquet-format/pull/440 has a few approvals
already. I think unless there are objections we can aim to merge mid next
week (I'm not sure if this warrants a vote as it captures the reality as
it stands today)?
Thanks,
Micah
On Tue, Jun 25, 2024 at 12:54 PM Ed Seidl wro
FWIW, I've throw up an alternative strawman PR [1].
Ed
[1] https://github.com/apache/parquet-format/pull/440
On 6/25/24 11:56 AM, Ed Seidl wrote:
I've now tested three implemenations (parquet-java,
pyarrow/parquet-cpp, arrow-rs) to see what they all do. For brevity,
I'll refer to the ColumnMe
I've now tested three implemenations (parquet-java, pyarrow/parquet-cpp,
arrow-rs) to see what they all do. For brevity, I'll refer to the
ColumnMetaData struct following the chunk data as IMD (inline metadata),
and the copy in the footer as FMD (footer metadata).
parquet-java, as reported, do
>
> If the ColumnMetaData in the footer is de facto
> required, then I think we should at a minimum change the thrift to make
> it so.
If the result of changing the thrift is that parquet files that can not be
read by old readers, I disagree with this tradeoff. I think it is better
for the spec
I think the main argument here is whether the behavioral
change to this field will actually break any reader implementation.
Considering the current state of parquet-java, I would guess no.
But I agree that let's modify the comment of the spec to make it
clear and do the right thing on the writer's
The issue I have is that we're currently in a position where a file
written to the letter of the specification will likely be readable by
none of the major parquet implementations. (I'm going to test this
hypothesis today). If the ColumnMetaData in the footer is de facto
required, then I think
If the goal of this exercise is to avoid confusion, I agree with Michah
that updating parquet.thrift is best. Here [3] is a PR to update the thrift
file to clarify that the field is not written by all writers and is not
read by many.
In my opinion any backwards incompatible changes do nothing othe
We need a mechanism to remove fields. Typically this would involve some
time horizon.
I suggest we establish a deprecation horizon now, say 3y, and start the
clocks ticking. Plus some convention for marking deprecated fields because
the thrift IDL lacks a way to do this in code. I propose the anno
>
> but I'm not clear on how that will
> impact existing parsers.
This can break older parsers, that validate required fields are in fact
present. I think it would be best to just update documentation on the
current state of affairs, and let implementations update accordingly if
necessary.
On M
Resurrecting a thread from earlier in the month regarding inconsistent
use of the file_offset field [1][2]. It seems like the preferred path
forward is to deprecate this (AFAICT) unused field to prevent further
confusion. If there are no violent objections, I'll submit a PR to do so
in a few da
10 matches
Mail list logo