As part of this track I wrote up two draft PRs for what I think might be a
workable release process for new features, with concrete guidance on
when they should be enabled by default in other implementations:
https://github.com/apache/parquet-format/pull/258
https://github.com/apache/parquet-
emkornfield opened a new pull request, #61:
URL: https://github.com/apache/parquet-site/pull/61
(no comment)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe,
Julien, yes I'm referring to the diagram, as well as the wording that
follows it:
"The file metadata contains the locations of all the column metadata
start locations. More details on what is contained in the metadata can
be found in the Thrift definition.
Metadata is written after the d
I agree that flatbuffer is a good option if we are happy with the perf and
it lets us access column metadata in O(1) without reading other columns.
If we're going to make an incompatible metadata change, let's make it once
with a transition path to easily move from PAR1 to PAR3 letting them
coexist i
As far as I remember, we didn't intend to write the ColumnMetaData at the
end of the Column Chunk.
So this might be a case of the spec being ambiguous.
Ed, are you referring to this illustration in the spec?
I think here "Column 1 Chunk 1 + Column Metadata" I meant the chunk *and*
its metadata but
The drawback of the reverse mapping is that only columns that are empty in
all row groups can be elided. Columns that are empty in only some row groups
can't. I do not have good stats to decide either way.
That said, if we assume there is some post-processing done after
deserializing FileMetaData, one
alippai commented on PR #34:
URL: https://github.com/apache/parquet-site/pull/34#issuecomment-2147981789
I like both!
alamb commented on PR #34:
URL: https://github.com/apache/parquet-site/pull/34#issuecomment-2147973327
> Totally agree, thanks for the guide. What do you think about the
non-Apache or other projects? (Duckdb, fastparquet, impala, cudf)
Echoing what @pitrou said, I suggest we add a n
Hi,
It seems that we need a patch release 1.14.1 to fix [1]. All new commits
in branch 1.14.x can be viewed at [2]. If there is any additional fix to be
included, please let me know. If the community believes the release is
necessary, I can volunteer to be the release manager.
[1] https://issues.
I would agree that, at least for our use cases, this trade-off would not be
favorable, so we would rather always write some metadata for "empty"
columns and therefore get random I/O into the columns array.
If I understand the use case correctly though, then this is mostly meant
for completely empty
On Tue, 4 Jun 2024 10:52:54 +0200
Alkis Evlogimenos
wrote:
> >
> > Finally, one point I wanted to highlight here (I also mentioned it in the
> > PR): If we want random access, we have to abolish the concept that the data
> > in the columns array is in a different order than in the schema. Your PR
pitrou commented on PR #34:
URL: https://github.com/apache/parquet-site/pull/34#issuecomment-2147379889
IMHO, any currently maintained open source implementation of Parquet
deserves mentioning there. But that also requires involvement from their
respective maintainers (we shouldn't expect u
alippai commented on PR #34:
URL: https://github.com/apache/parquet-site/pull/34#issuecomment-2147369156
Totally agree, thanks for the guide. What do you think about the non-Apache
or other projects? (Duckdb, fastparquet, impala, cudf)
So it seems there are at least three ways forward:
1. Leave this email chain in the archives (nothing more)
2. Incorporate some of the content of these emails into the spec as a
"Background" and "Recommendation" section
3. Write a separate blog post / other content
I would personally be inclined f
Thank you Jan,
I have learned quite a bit.
> Boom, we have just created a bloom filter that is 4 times larger than
the data itself, ouch!
I think for me this summarizes the core challenge very well. The whole
point of bloom filters is to save resources during query processing, so if
the bloom f
>
> Finally, one point I wanted to highlight here (I also mentioned it in the
> PR): If we want random access, we have to abolish the concept that the data
> in the columns array is in a different order than in the schema. Your PR
> [1] even added a new field schema_index for matching between
> Col
Corrected results with input from Julien and Antoine:
Parquet:
3x +1 binding (Gang Wu, Wes McKinney, Julien Le Dem)
10x +1 non-binding (Micah Kornfield, Felipe Oliveira Carvalho, Fokko
Driesprong, Antoine Pitrou, Alenka Frim, Andy Grove, Raúl Cumplido, Sutou
Kouhei, Jiashen Zhang, Rok Mihevc)
Arr
Correction: my vote is non-binding for Parquet.
Regards
Antoine.
On 04/06/2024 at 02:23, Rok Mihevc wrote:
Thanks all for voting. I tallied the votes (assuming simple +1 votes were
meant as +1 Parquet, +1 Arrow) and the vote succeeded with the following
results:
Parquet:
3x +1 binding (G