Re: [DISCUSS] Infrastructure/Documentation improvement in Parquet

2024-06-04 Thread Micah Kornfield
As part of this track I wrote up two draft PRs for what I think might be a workable release process for new features, along with concrete guidance on when they should be enabled by default in other implementations: https://github.com/apache/parquet-format/pull/258 https://github.com/apache/parquet-

[PR] DRAFT: PARQUET-2489: Strawman proposal for releases [parquet-site]

2024-06-04 Thread via GitHub
emkornfield opened a new pull request, #61: URL: https://github.com/apache/parquet-site/pull/61 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe,

Re: ColumnMetaData location

2024-06-04 Thread Ed Seidl
Julien, yes I'm referring to the diagram, as well as the wording that follows it: "The file metadata contains the locations of all the column metadata start locations. More details on what is contained in the metadata can be found in the Thrift definition. Metadata is written after the d
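For readers following the layout question, here is a minimal sketch of how a reader finds the file metadata at the end of a Parquet file. It assumes the standard trailer: a 4-byte little-endian metadata length followed by the `PAR1` magic. The function name `locate_footer` and the toy byte string are illustrative only and do not come from any Parquet library.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"

def locate_footer(buf: bytes) -> Tuple[int, int]:
    """Return (start, end) byte offsets of the serialized file metadata."""
    # Smallest possible file: leading magic + 4-byte length + trailing magic.
    if len(buf) < 12 or buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    # The metadata length sits just before the trailing magic.
    (meta_len,) = struct.unpack("<I", buf[-8:-4])
    end = len(buf) - 8
    start = end - meta_len
    if start < 4:
        raise ValueError("corrupt footer length")
    return start, end

# Toy file: magic, placeholder data pages, placeholder metadata, length, magic.
data_pages = b"\x00" * 10
fake_metadata = b"thrift-bytes"  # stands in for the Thrift-encoded metadata
toy_file = (
    MAGIC + data_pages + fake_metadata
    + struct.pack("<I", len(fake_metadata)) + MAGIC
)
start, end = locate_footer(toy_file)
```

Because the metadata is written after the data, a reader only needs the last 8 bytes to learn where it begins. Where each column chunk's own metadata lives is the point the thread is trying to settle, and this sketch takes no position on it.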

Re: [DISCUSS] Improvements to File Footer metadata (v3 discussion follow-up)

2024-06-04 Thread Julien Le Dem
I agree that flatbuffer is a good option if we are happy with the perf and it lets us access column metadata in O(1) without reading other columns. If we're going to make an incompatible metadata change, let's make it once with a transition path to easily move from PAR1 to PAR3 letting them coexist i

Re: ColumnMetaData location

2024-06-04 Thread Julien Le Dem
As far as I remember, we didn't intend to write the ColumnMetaData at the end of the Column Chunk. So this might be a case of the spec being ambiguous. Ed, are you referring to this illustration in the spec? I think here "Column 1 Chunk 1 + Column Metadata" I meant the chunk *and* its metadata but

Re: [DISCUSS] schema_index

2024-06-04 Thread Alkis Evlogimenos
The drawback with having the reverse mapping is that only columns that are empty in all row groups can be elided. Columns that are empty in only some row groups can't. I do not have good stats to decide either way. That said, if we assume there is some post-processing done after deserializing FileMetaData, one

Re: [PR] PARQUET-2310: implementation status [parquet-site]

2024-06-04 Thread via GitHub
alippai commented on PR #34: URL: https://github.com/apache/parquet-site/pull/34#issuecomment-2147981789 I like both!

Re: [PR] PARQUET-2310: implementation status [parquet-site]

2024-06-04 Thread via GitHub
alamb commented on PR #34: URL: https://github.com/apache/parquet-site/pull/34#issuecomment-2147973327 > Totally agree, thanks for the guide. What do you think about the non-Apache or other projects? (Duckdb, fastparquet, impala, cudf) Echoing what @pitrou said, I suggest we add a n

[DISCUSS] Patch release for parquet-java 1.14.1?

2024-06-04 Thread Gang Wu
Hi, It seems that we need a patch release 1.14.1 to fix [1]. All new commits in branch 1.14.x can be viewed at [2]. If there is any additional fix to be included, please let me know. If the community believes the release is necessary, I can volunteer to be the release manager. [1] https://issues.

Re: [DISCUSS] schema_index

2024-06-04 Thread Jan Finis
I would agree that at least for our use cases, this trade off would not be favorable, so we would rather always write some metadata for "empty" columns and therefore get random I/O into the columns array. If I understand the use case correctly though, then this is mostly meant for completely empty

Re: [DISCUSS] schema_index

2024-06-04 Thread Antoine Pitrou
On Tue, 4 Jun 2024 10:52:54 +0200 Alkis Evlogimenos wrote: > > Finally, one point I wanted to highlight here (I also mentioned it in the PR): If we want random access, we have to abolish the concept that the data in the columns array is in a different order than in the schema. Your PR

Re: [PR] PARQUET-2310: implementation status [parquet-site]

2024-06-04 Thread via GitHub
pitrou commented on PR #34: URL: https://github.com/apache/parquet-site/pull/34#issuecomment-2147379889 IMHO, any currently maintained open source implementation of Parquet deserves mentioning there. But that also requires involvement from their respective maintainers (we shouldn't expect u

Re: [PR] PARQUET-2310: implementation status [parquet-site]

2024-06-04 Thread via GitHub
alippai commented on PR #34: URL: https://github.com/apache/parquet-site/pull/34#issuecomment-2147369156 Totally agree, thanks for the guide. What do you think about the non-Apache or other projects? (Duckdb, fastparquet, impala, cudf)

Re: [DISCUSS] Improve Bloom Filter documentation?

2024-06-04 Thread Andrew Lamb
So it seems there are at least three ways forward: 1. Leave this email chain in the archives (nothing more) 2. Incorporate some of the content of these emails into the spec as a "Background" and "Recommendation" section 3. Write a separate blog post / other content I would personally be inclined f

Re: [DISCUSS] Improve Bloom Filter documentation?

2024-06-04 Thread Andrew Lamb
Thank you Jan, I have learned quite a bit. > Boom, we have just created a bloom filter that is 4 times larger than the data itself, ouch! I think for me this summarizes the core challenge very well. The whole point of bloom filters is to save resources during query processing, so if the bloom f
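The size concern can be estimated with a back-of-the-envelope calculation. The sketch below uses the classic Bloom filter sizing formula, m = -n ln(p) / (ln 2)^2 bits for n distinct values at false-positive probability p. This is an assumption made for illustration: Parquet's split-block Bloom filter may need somewhat more space for the same target, and the helper names and any numbers passed in are hypothetical rather than taken from the thread.

```python
import math

def bloom_filter_bytes(ndv: int, fpp: float) -> int:
    """Classic (non-blocked) Bloom filter size in bytes for `ndv` distinct
    values at false-positive probability `fpp`."""
    bits = -ndv * math.log(fpp) / (math.log(2) ** 2)
    return math.ceil(bits / 8)

def filter_to_data_ratio(ndv: int, fpp: float, data_bytes: int) -> float:
    """How large the filter is relative to the encoded column data."""
    return bloom_filter_bytes(ndv, fpp) / data_bytes

# Illustrative only: a filter sized for 1,000,000 distinct values at 1% FPP.
size = bloom_filter_bytes(1_000_000, 0.01)
```

A ratio above 1.0 from `filter_to_data_ratio` means the filter costs more bytes than the column chunk it is supposed to help skip. This can happen when the filter is sized for far more distinct values than the chunk holds, or when the data compresses well.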

Re: [DISCUSS] Improvements to File Footer metadata (v3 discussion follow-up)

2024-06-04 Thread Alkis Evlogimenos
> Finally, one point I wanted to highlight here (I also mentioned it in the PR): If we want random access, we have to abolish the concept that the data in the columns array is in a different order than in the schema. Your PR [1] even added a new field schema_index for matching between Col
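The trade-off being discussed can be sketched as follows. The code assumes, hypothetically, that each column chunk in a row group carries a `schema_index` naming the leaf column it belongs to, so that empty columns can be left out of the columns array. The `ColumnChunk` class and helper functions are illustrative and do not reflect the actual Thrift definitions or the PR.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class ColumnChunk:
    schema_index: int  # hypothetical: position of the leaf column in the schema
    num_values: int

def index_by_schema(columns: List[ColumnChunk]) -> Dict[int, ColumnChunk]:
    """Post-processing step: build a schema-position -> chunk lookup once,
    after the metadata has been deserialized."""
    return {c.schema_index: c for c in columns}

def lookup(index: Dict[int, ColumnChunk], schema_index: int) -> Optional[ColumnChunk]:
    """Return the chunk for a schema position, or None if it was elided."""
    return index.get(schema_index)

# A row group where the columns at schema positions 1 and 2 were elided as empty.
row_group_columns = [ColumnChunk(0, 100), ColumnChunk(3, 100)]
index = index_by_schema(row_group_columns)
```

With elision, position `i` in the columns array no longer corresponds to column `i` in the schema, so direct indexing is lost and a lookup structure like the dictionary above has to be built first. If the array is always written in schema order with no gaps, no such step is needed.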

Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-06-04 Thread Rok Mihevc
Corrected results with input from Julien and Antoine: Parquet: 3x +1 binding (Gang Wu, Wes McKinney, Julien Le Dem) 10x +1 non-binding (Micah Kornfield, Felipe Oliveira Carvalho, Fokko Driesprong, Antoine Pitrou, Alenka Frim, Andy Grove, Raúl Cumplido, Sutou Kouhei, Jiashen Zhang, Rok Mihevc) Arr

Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-06-04 Thread Antoine Pitrou
Correction: my vote is non-binding for Parquet. Regards Antoine. On 04/06/2024 at 02:23, Rok Mihevc wrote: Thanks all for voting. I tallied the votes (assuming simple +1 votes were meant as +1 Parquet, +1 Arrow) and the vote succeeded with the following results: Parquet: 3x +1 binding (G