I agree on making the naming consistent. "Row" is a good choice.
I also agree that only the Thrift IDs are part of the spec, as that's what
ends up in the binary.
On Fri, May 31, 2024 at 13:41 Andrew Lamb wrote:
> I filed a JIRA[1] and a PR[2] to change parquet.thrift to use "row"
>
> Thanks,
> Andrew
>
> [1]
I think it would be a good idea to have an extension mechanism that allows
embedding extra information in the format.
Something akin to what Alkis is suggesting: having a reserved extension
point.
- The file can still be read by a standard parquet implementation without
extra libraries
- Vendors can
Micah, would it make sense to start a Google Doc specifically to discuss:
- the goals (there could be a few subsets)
- the candidate encodings
- the existing/future prototypes to validate candidates.
On Thu, May 30, 2024 at 3:14 AM Steve Loughran
wrote:
> be good for a benchmark to be targetable
When we first looked into Parquet bloom filters[1] it was hard to
understand how effective they would be for a given amount of space
overhead.
When we plugged our data's cardinality into the target ndv and fpp
parameters, it implied 2MB bloom filters *per column* per row group which
was unacceptab
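
To get a feel for how ndv and fpp translate into space, here is a rough
sketch using the textbook Bloom filter sizing formula,
bits = -ndv * ln(fpp) / (ln 2)^2. This is an assumption made for
illustration: Parquet's split-block Bloom filters are sized by each
implementation and will come out somewhat different, and the function
names and sample values below are hypothetical, not taken from the thread.

```python
import math


def classic_bloom_filter_bits(ndv: int, fpp: float) -> int:
    """Approximate size in bits of a classic Bloom filter.

    Uses the textbook formula, NOT Parquet's split-block sizing, so treat
    the result as a ballpark figure only.
    """
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    return math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))


def bits_to_mib(bits: int) -> float:
    """Convert a bit count to mebibytes."""
    return bits / 8 / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical cardinalities and false-positive targets.
    for ndv in (100_000, 1_000_000, 10_000_000):
        for fpp in (0.05, 0.01, 0.001):
            size = bits_to_mib(classic_bloom_filter_bits(ndv, fpp))
            print(f"ndv={ndv:>10,} fpp={fpp:<6} ~{size:.2f} MiB")
```

The size grows linearly with ndv but only logarithmically with 1/fpp, so
relaxing fpp saves far less space than lowering the expected cardinality.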
I filed a JIRA[1] and a PR[2] to change parquet.thrift to use "row"
Thanks,
Andrew
[1]: https://issues.apache.org/jira/browse/PARQUET-2488
[2]: https://github.com/apache/parquet-format/pull/256
On Wed, May 29, 2024 at 8:45 AM Antoine Pitrou wrote:
>
> I agree that "row" is a more widespread te
I think the names of classes in the code can differ from how the spec
refers to the concepts, if the maintainers don't mind. In my mind, changing
the parquet.thrift file to use consistent terminology doesn't change the
spec, nor will it require (or prevent) implementations from changing their
in
Thanks Weston for bringing this up. It's good to have everyone's attention.
As mentioned in the thread that Weston refers to [1], a few projects are
leaping to Java 17 (Spark 4), and the rest of them are looking at each other
to move first :)
This was also discussed back in February on the Iceberg D
now is your chance to improve them.
FWIW Hadoop has a separate module to automate as many of the operations
as can be done:
https://github.com/apache/hadoop-release-support
1. this includes things like patching the x86 tarball with the arm
binaries and generating new checksums
2. GPG
1. Hadoop doesn't use Arrow.
2. The Hadoop team would love to drop Java 8 and in the last release
said "will happen soon"
3. all the client stuff is happy on Java 17+, it's just some of the server
side stuff which is a bit of a pain point.
4. Hadoop 3.2.x is needed to move beyond Java 8.
Would we also want to add issue templates to encourage some structure? See
[1] for inspiration.
[1] https://github.com/apache/arrow/blob/main/.github/ISSUE_TEMPLATE
On Fri, May 31, 2024 at 3:50 AM Gang Wu wrote:
> Thanks Dewey!
>
> Please note that parquet-site has already enabled GitHub issues