[DISCUSS] Parquet metadata evolution proposal

2024-05-29 Thread Alkis Evlogimenos
Hi folks. It is great to see the community moving forward with changes to parquet metadata to make parquet work better in general and in particular with wider schemata. I have been looking at the current proposals: - https://github.com/apache/parquet-format/pull/242 - https://github.com/apache/pa

Re: [DISCUSS] Extensibility of Parquet

2024-05-30 Thread Alkis Evlogimenos
With the extension point described here: https://github.com/apache/parquet-format/pull/254 we can have vendor encodings without drawbacks. For example, suppose a vendor wants to add another encoding for integers. It extends ColumnChunk and embeds an additional location in the file where the alternative r

Re: [DISCUSS] Improvements to File Footer metadata (v3 discussion follow-up)

2024-05-30 Thread Alkis Evlogimenos
Thank you for summarizing Micah and thanks to everyone commenting on the proposal and PRs. After processing the comments I think we might want to discuss the extension point https://github.com/apache/parquet-format/pull/254 separately. The extension point will allow vendors to experiment on diffe

Re: [DISCUSS] Improvements to File Footer metadata (v3 discussion follow-up)

2024-06-04 Thread Alkis Evlogimenos
hat > > haven't > > > > abstracted out thrift structures an easy path to incorporating the > new > > > > footer (i.e. just do translation at the boundaries). > > > > 2. Do people see value in trying to do a Thrift only iteration which > > >

Re: [DISCUSS] schema_index

2024-06-04 Thread Alkis Evlogimenos
to metadata (not > vice versa!), the schema should point us to the correct metadata instead of > the metadata pointing us to the correct schema entry. > > (I'll post this suggestion also into the PR for reference) > > Cheers, > Jan > > > > > Am Di., 4. Juni 2024 u

Re: [DISCUSS] Improvements to File Footer metadata (v3 discussion follow-up)

2024-06-05 Thread Alkis Evlogimenos
pacted but haven't actively > contributed to this discussion so far to review and chime in. > This is a big change with large potential impact here. > > Do people prefer google doc or a PR with a .md for this? I personally like > google docs (we can copy it in the repo

Re: [DISCUSS] schema_index

2024-06-05 Thread Alkis Evlogimenos
column is all nulls. Combined with statistics if they are exact, it also > > allows one to determine if a column is entirely a single value. If the > > size overhead is reasonable for those two elements, then I think the main > > consideration is whether we should be changing th

Re: [DISCUSS] Improvements to File Footer metadata (v3 discussion follow-up)

2024-06-05 Thread Alkis Evlogimenos
> > > > > > > > I think that before voting on this, we should summarize in a doc the > > > whole > > > > PAR3 footer metadata discussion: > > > > 1) Goals: (O(1) random access, extensibility, ...) > > > > 2) preferred option > >

Re: [DISCUSS] schema_index

2024-06-05 Thread Alkis Evlogimenos
le, > but we want to skip the parsing altogether and make it lazy, so we cannot > even do post processing. > > Does that make sense, or did I misunderstand your point? > > Cheers > Jan > > Am Mi., 5. Juni 2024 um 21:09 Uhr schrieb Alkis Evlogimenos > : > > > >

Re: [DISCUSS] Improvements to File Footer metadata (v3 discussion follow-up)

2024-06-05 Thread Alkis Evlogimenos
tail. On Wed, Jun 5, 2024 at 9:38 PM Jan Finis wrote: > Got it, makes sense. > > How do we expect readers who want the FlatBuffer footer to get it? Would we > use [1], storing information about the location of the FlatBuffer footer at > the end of the file? Or would a reader just read t

Re: [DISCUSS] schema_index

2024-06-05 Thread Alkis Evlogimenos
so this isn't a problem in practice anymore. > > Cheers, > Jan > > Am Mi., 5. Juni 2024 um 21:43 Uhr schrieb Alkis Evlogimenos > : > > > (2) would take unduly long - if the metadata decoder is not performant > > enough. The speed of the decoder strongly dep

Re: [DISCUSS] schema_index

2024-06-06 Thread Alkis Evlogimenos
This is a guesstimate based on current parsing speeds I am measuring for 3k columns. Statistics are not optimized yet (which is the bulk of the metadata) and I am at 800us. On Thu, Jun 6, 2024 at 5:48 PM Antoine Pitrou wrote: > On Wed, 5 Jun 2024 21:41:39 +0200 > Alkis Evlogimenos >

flatbuffer footer stream

2024-06-06 Thread Alkis Evlogimenos
Hey folks. I have been asked to share the latest flatbuffer prototype. I will put the latest in this gist left with TODOs if folks want to collaborate. I am iterating in our internal C++ codebase, it would be nice if someone more k

Re: flatbuffer footer stream

2024-06-07 Thread Alkis Evlogimenos
lity that should probably be done > before we iterate on it is changing the License on the top of the gist to > the Apache 2.0 license (if I am reading it correctly it appears to be > marked as proprietary currently). > > > Thanks, > Micah > > > On Thu, Jun 6, 202

Re: [DISCUSS] schema_index

2024-06-12 Thread Alkis Evlogimenos
e: > On Wed, 5 Jun 2024 21:09:04 +0200 > Alkis Evlogimenos > > wrote: > > > > In practice what we want is things to be performant. Sometimes O(1) > > matters, sometimes not. > > +1, good point :-) > > > (3) doing a pass over the metadata to guarantee (4)

Re: flatbuffer footer stream

2024-06-17 Thread Alkis Evlogimenos
nating footers to the foundation to build a benchmark database. On Fri, Jun 7, 2024 at 9:58 AM Alkis Evlogimenos < alkis.evlogime...@databricks.com> wrote: > Absolutely, when we are ready to move to a shared repo I will start the > formal release process. > > > On Thu, Jun 6, 2024

Re: [DISCUSS] Can FIXED_LEN_BYTE_ARRAY be annotated with STRING?

2024-06-18 Thread Alkis Evlogimenos
I don't see why it shouldn't be supported. FLBA and String are orthogonal features. The first optimizes encoding by not storing lengths and the latter says the binary is valid UTF-8. On Tue, Jun 18, 2024 at 8:35 AM Gang Wu wrote: > FYI, both parquet-cpp [1] and parquet-java [2] do not allow FLBA.
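
The orthogonality argument above can be illustrated with a minimal Python sketch. It assumes the PLAIN encoding layouts from the Parquet format (BYTE_ARRAY values carry a 4-byte little-endian length prefix, FIXED_LEN_BYTE_ARRAY values do not). The helper names and the currency-code sample data are hypothetical and are not taken from parquet-cpp or parquet-java.

```python
import struct


def encode_byte_array_plain(values):
    """PLAIN BYTE_ARRAY: each value is a 4-byte little-endian length, then the bytes."""
    return b"".join(struct.pack("<I", len(v)) + v for v in values)


def encode_flba_plain(values, type_length):
    """PLAIN FIXED_LEN_BYTE_ARRAY: the bytes only; the length lives in the schema."""
    for v in values:
        if len(v) != type_length:
            raise ValueError(f"expected {type_length} bytes, got {len(v)}")
    return b"".join(values)


def is_valid_utf8(value):
    """The STRING annotation only promises that the bytes decode as UTF-8."""
    try:
        value.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample: three-letter currency codes, all exactly 3 bytes long.
raw = [code.encode("utf-8") for code in ["USD", "EUR", "JPY"]]

byte_array_bytes = encode_byte_array_plain(raw)  # 3 * (4 + 3) bytes
flba_bytes = encode_flba_plain(raw, type_length=3)  # 3 * 3 bytes
all_utf8 = all(is_valid_utf8(v) for v in raw)
```

The length saving comes purely from the physical type, while the UTF-8 check applies to the same bytes in either layout, which is the sense in which the two features do not interact.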

Re: [DISCUSS] Merge initial Implementation Status PR and incrementally improve it

2024-06-18 Thread Alkis Evlogimenos
+1. I would suggest you address the comments first? I went through the open ones and most of them make sense to me (and left a few additional comments). On Tue, Jun 18, 2024 at 12:42 PM Andrew Lamb wrote: > Thank you > > On Mon, Jun 17, 2024 at 11:40 PM Micah Kornfield > wrote: > > > Hi Andrew,

Re: [External] Re: [DISCUSS] Can FIXED_LEN_BYTE_ARRAY be annotated with STRING?

2024-06-21 Thread Alkis Evlogimenos
's why I've asked in the mentioned PR. > It seems FLBA is just a special case of BYTE_ARRAY. > > On Tue, Jun 18, 2024 at 10:16 PM Alkis Evlogimenos > wrote: > > > I don't see why it shouldn't be supported. FLBA and String are orthogonal > > features. The fir

[DISCUSS] Parquet extensions

2024-06-21 Thread Alkis Evlogimenos
Hey folks. I want to move the extension PR forward. Unfortunately the discussion was spread across the PR, other threads and documents making it slow to progress. To avoid further fragmentation I have put together a document

Re: [DISCUSS] Parquet extensions

2024-06-23 Thread Alkis Evlogimenos
Due to some sharing snafus with automation, please request access to comment. If you are just reading I've published this here: https://docs.google.com/document/d/e/2PACX-1vThXkhHNozn_p1ZZWF-nCzOtoP1lKmkaV4Legq2FaRiIgwyY2XC9AmKpBtpeF8jbBB4wfjmQ6UTg03k/pub On Fri, Jun 21, 2024 at 10:29 AM

Re: [DISCUSS] Parquet extensions

2024-06-24 Thread Alkis Evlogimenos
The snafus are fixed. The original should work now. On Sun, 23 Jun 2024, 17:58 Alkis Evlogimenos, < alkis.evlogime...@databricks.com> wrote: > Due to some sharing snafus with automation, please request access to > comment. If you are just reading I've published t

Re: [DISCUSS] Deprecate file_offset in ColumnChunk struct

2024-06-25 Thread Alkis Evlogimenos
We need a mechanism to remove fields. Typically this would involve some time horizon. I suggest we establish a deprecation horizon now, say 3 years, and start the clock ticking. We also need a convention for marking deprecated fields, because the thrift IDL lacks a way to do this in code. I propose the anno

Re: [DISCUSS] Merge initial Implementation Status PR and incrementally improve it

2024-06-26 Thread Alkis Evlogimenos
ll/5956 > > On Tue, Jun 18, 2024 at 10:39 AM Alkis Evlogimenos > wrote: > > > +1. > > > > I would suggest you address the comments first? I went through the open > > ones and most of them make sense to me (and left few additional > comments). >

Re: [DISCUSS] Parquet extensions

2024-06-26 Thread Alkis Evlogimenos
good > for users to always have proprietary extensions inside of Parquet. > > IMO, I think the next steps would be to add implementations to write out > the footer extension points. > > Thanks, > Micah > > On Mon, Jun 24, 2024 at 1:24 PM Alkis Evlogimenos > wrote: >

Re: [DISCUSS] Parquet extensions

2024-06-28 Thread Alkis Evlogimenos
nizations to experiment with > new footer designs (but possibly also in others). > > Thanks, > Micah > > > > > On Wed, Jun 26, 2024 at 9:33 AM Alkis Evlogimenos > wrote: > > > Thank you for taking a look Micah. > > > > On the topic of openness t

Re: [VOTE] Adopt proposal on new features for parquet-format and release for Parquet Java

2024-07-03 Thread Alkis Evlogimenos
+1 this is great, it puts a lot of clarity in the process. On Thu, Jul 4, 2024 at 4:26 AM Gang Wu wrote: > Generally +1 on the proposal. Thanks for finalizing it! > > I have left a comment regarding the next major release of parquet-java. > > Best, > Gang > > On Thu, Jul 4, 2024 at 1:55 AM Mica

Re: [DISCUSS] Parquet sync day and time

2024-07-09 Thread Alkis Evlogimenos
Thank you for opening the discussion Julien. I am in the CEST timezone (GMT+2) so the meeting time is always late for me. Luckily I have reserved Wed+Thu for late night meetings. Current availability is Wed 9am-11am PT and Thu 7am-11am. Cheers, On Tue, Jul 9, 2024 at 3:25 AM Julien Le Dem wrote: >

Re: [DISCUSS] Merge script for parquet-java and parquet-format

2024-07-17 Thread Alkis Evlogimenos
+1 on using stock GitHub. On Wed, Jul 17, 2024 at 10:00 AM Fokko Driesprong wrote: > Hey Micah, > > Thanks for bringing this up. I'm a big fan of just doing things through > Github. Of course, it is not as customizable as a script, but I think it > can do the job. > > 1. Being able to link each

Re: [DISCUSS] Parquet extensions

2024-07-18 Thread Alkis Evlogimenos
ttps://github.com/apache/parquet-format/pull/254>. On Fri, Jun 28, 2024 at 10:02 AM Alkis Evlogimenos < alkis.evlogime...@databricks.com> wrote: > > I think we can at least have wording to encourage people doing > extensions to post them publicly and as part of the "reservation"

Re: [ANNOUNCE] New Parquet PMC Member: Micah Kornfield

2024-07-19 Thread Alkis Evlogimenos
Congrats! On Thu, Jul 18, 2024 at 10:00 PM Julien Le Dem wrote: > Congrats Micah! > > On Thu, Jul 18, 2024 at 8:09 AM Ian Cook wrote: > > > Congratulations! > > > > On Thu, Jul 18, 2024 at 10:32 Gang Wu wrote: > > > > > On behalf of the Parquet PMC, I'm pleased to announce that Micah > > > has

Re: [ANNOUNCE] New Parquet PMC Member: Antoine Pitrou

2024-07-19 Thread Alkis Evlogimenos
Congrats! On Thu, Jul 18, 2024 at 9:59 PM Julien Le Dem wrote: > Congratulations Antoine! > Welcome. > > > On Thu, Jul 18, 2024 at 7:12 AM Micah Kornfield > wrote: > > > Congrats. Well deserved. > > > > On Thursday, July 18, 2024, Xinli shang wrote: > > > > > Congratulations! Well deserved! >

Re: [DISCUSS] Parquet extensions

2024-07-31 Thread Alkis Evlogimenos
t this is amazing and should be shared > >> with the world at large to advance Parquet. " => "At some point, the > >> community decides this extension is ready and proposed for inclusion." > >> > >> > >> On Mon, Jul 22, 2024 at 10:11 PM

Re: [DISCUSS] Parquet extensions

2024-08-06 Thread Alkis Evlogimenos
> > I made a couple belated comments but they are pretty minor. In general > > this looks good to me. Thank you! > > > > Regards > > > > Antoine. > > > > > > On Fri, 21 Jun 2024 10:29:27 +0200 > > Alkis Evlogimenos > > > > wr

Re: Parquet Sync Notes July 31th 2024

2024-08-06 Thread Alkis Evlogimenos
> > >=> Mostly done > > > > Jira -> github migration > > > >- > > > >Getting started with github. Will follow up on the mailing list. > >- > > > >=> mostly closed discussion. Some follow up async on the discussion. >

Re: Parquet Sync Notes July 31th 2024

2024-08-06 Thread Alkis Evlogimenos
he.org/release/17.0.0.html). > I would speculate 18.0.0 will be public mid September. > > On Tue, Aug 6, 2024 at 3:20 PM Alkis Evlogimenos > wrote: > > > Thank you Julien. When can we expect a new arrow package release so that > I > > can compile a doc for customers to

Re: [DISCUSS] new Parquet footer experiments

2024-08-15 Thread Alkis Evlogimenos
Hi Julien. Thank you for reconnecting the threads. I have broken down my experiments into a narrative, commit by commit, on how we can go from flatbuffers being ~2x larger than thrift to being smaller than thrift (at times even half the size). This is still on an internal branch; I will resume wor

Re: [DISCUSS] new Parquet footer experiments

2024-08-15 Thread Alkis Evlogimenos
> > Andrew > > [1]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/ > [2]: https://github.com/apache/arrow-rs/issues/5853 > > On Thu, Aug 15, 2024 at 4:26 AM Alkis Evlogimenos > wrote: > > > Hi Julien. > > > > Thank you for reconnecting the thread

Re: [DISCUSS] new Parquet footer experiments

2024-08-15 Thread Alkis Evlogimenos
> > speed by 4x. > > > > This is not 25x of course, but 4x is non trivial. > > > > The fact that no one yet has bothered to invest the time to get the 4x > yet > > in open source implementations of parquet suggests to me that the parsing > > time may not

Re: [DISCUSS] new Parquet footer experiments

2024-08-18 Thread Alkis Evlogimenos
the > results that have been discussed or easily run their own experiments when > trying alternatives. We hope this will help facilitate the discussion with > easily shareable experiments. > > On Thu, Aug 15, 2024, 9:21 PM Alkis Evlogimenos > wrote: > > > > Alkis, can

Re: Parquet Sync Notes July 31th 2024

2024-08-19 Thread Alkis Evlogimenos
s for me. > @Alkis Evlogimenos When you open a PR > on parquet-benchmark, just make it clear how this binary got there and that > it is an unofficial build from the arrow project waiting for an official > release. > > > > On Tue, Aug 6, 2024 at 7:52 AM Rok Mihevc wrote

Re: Parquet Sync Notes July 31th 2024

2024-08-20 Thread Alkis Evlogimenos
okko > > > > Op ma 19 aug 2024 om 20:52 schreef Alkis Evlogimenos > > : > > > > > Hello Julien. I finally got around compiling binaries for the > > benchmarking > > > repo. Can you add an empty README.md in > > > https://github.com/apache/pa

flatbuffer metadata: work-in-progress

2024-08-22 Thread Alkis Evlogimenos
Hey folks. As promised I pushed a PR to the main repo with my attempt to use flatbuffers for metadata for parquet: https://github.com/apache/arrow/pull/43793 The PR builds on top of the metadata extensions in parquet https://github.com/apache/parquet-format/pull/254 and tests how fast we can pars

Re: flatbuffer metadata: work-in-progress

2024-08-26 Thread Alkis Evlogimenos
I have added some initial simple comments on the PR > that > > may help others who want to take a look. > > > > On Thu, Aug 22, 2024 at 5:46 PM Julien Le Dem wrote: > > > > > this looks great, > > > thank you for sharing. > > >

[VOTE] Parquet binary protocol extensions

2024-08-26 Thread Alkis Evlogimenos
This vote is whether to adopt and merge the binary protocol extensions [1][2]. +1 Adopt and merge [1] +0 -1 Do not adopt because ... [1] https://github.com/apache/parquet-format/pull/254 [2] https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit

Re: flatbuffer metadata: work-in-progress

2024-08-28 Thread Alkis Evlogimenos
: > > Do you gain much from limiting row groups to 2^31 values and bytes? I > generally find 32-bit lengths to be a bit of an anti-pattern, as they require > dedicated logic in the writer to ensure sufficient chunking. > > Regards > > Antoine. > > > On Mon, 26 Aug 2024 1
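
To make the "dedicated logic in the writer" concrete, here is a minimal Python sketch of the kind of chunking a writer would need if row groups were capped by 32-bit limits. It is an illustration only: the function name `plan_row_groups`, the use of 2^31 - 1 as the cap, and the per-row size input are assumptions, not part of the flatbuffer prototype or of any Parquet writer.

```python
# Assumed cap: the largest value a signed 32-bit field can hold.
INT32_MAX = 2**31 - 1


def plan_row_groups(row_sizes, max_rows=INT32_MAX, max_bytes=INT32_MAX):
    """Greedily split rows into row groups that respect both caps.

    row_sizes is an iterable of encoded row sizes in bytes (hypothetical input).
    Returns a list of (row_count, byte_count) pairs, one per row group.
    """
    groups = []
    rows = size = 0
    for row_size in row_sizes:
        if row_size > max_bytes:
            raise ValueError("a single row exceeds the byte cap")
        # Close the current group before it would overflow either limit.
        if rows + 1 > max_rows or size + row_size > max_bytes:
            groups.append((rows, size))
            rows = size = 0
        rows += 1
        size += row_size
    if rows:
        groups.append((rows, size))
    return groups


# Small caps stand in for 2^31 - 1 so the behaviour is easy to follow.
small_plan = plan_row_groups([4, 4, 4, 4, 4], max_rows=10, max_bytes=10)
```

With 64-bit sizes the two overflow checks would effectively never fire, which is the trade-off the question above is probing.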

Re: [DISCUSS] new Parquet footer experiments

2024-08-28 Thread Alkis Evlogimenos
w this, but you can use public datasets as a > source of real-world Parquet footers. > > For example, the GeoParquet website lists a couple data providers: > https://geoparquet.org/ > > Regards > > Antoine. > > > On Sun, 18 Aug 2024 14:20:28 +0200 > Alkis Evlogimen

Re: flatbuffers metadata: 32-bit vs. 64-bit sizes

2024-08-29 Thread Alkis Evlogimenos
go after > the smallest possible footprint? > > (but if we do, I would suggest perhaps LZ4-compress the Flatbuffers > metadata :-)) > > Regards > > Antoine. > > > > On Wed, 28 Aug 2024 11:24:49 +0200 > Alkis Evlogimenos > > wrote: > > Yes the gains

Re: [DISCUSS] Clarify min-max truncation in Parquet statistics

2024-09-06 Thread Alkis Evlogimenos
If we were to do statistics again, could we specify that it is unspecified whether they are exact or not? In other words, the consumer needs to assume they are inexact. Has anyone done any experiments to see what this means in practice? How much filtering power do we lose? On Fri, Sep 6, 2024 at 2:08 PM wish mapl
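
As a rough illustration of the filtering-power question, the following Python sketch compares exact min/max bounds with truncated ones for byte-string values. The truncation rule (keep a prefix for the minimum, keep a prefix and increment its last non-0xFF byte for the maximum) is one common approach and is an assumption here, as are the helper names and the sample values; it is not a statement of what any Parquet implementation does.

```python
def truncate_min(value, n):
    """A prefix always sorts at or before the full value, so it is a valid lower bound."""
    return value[:n]


def truncate_max(value, n):
    """Shorten to at most n bytes while staying at or above the full value."""
    if len(value) <= n:
        return value
    prefix = bytearray(value[:n])
    # Drop trailing 0xFF bytes, which cannot be incremented.
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()
    if not prefix:
        return value  # no shorter upper bound exists; keep the exact value
    prefix[-1] += 1
    return bytes(prefix)


def may_contain(lo, hi, target):
    """Pruning check: the chunk can be skipped only when this returns False."""
    return lo <= target <= hi


values = [b"apple pie", b"banana split", b"cherry tart"]  # hypothetical column chunk
exact_lo, exact_hi = min(values), max(values)
lo, hi = truncate_min(exact_lo, 3), truncate_max(exact_hi, 3)

probe = b"cherry zzz"
pruned_with_exact = not may_contain(exact_lo, exact_hi, probe)
pruned_with_truncated = not may_contain(lo, hi, probe)
```

In this example the truncated bounds are still safe for pruning (they never exclude a value that is present), but the probe that exact statistics would skip is no longer skipped, which is the loss the question asks about.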

Re: [DISCUSS] Clarify min-max truncation in Parquet statistics

2024-09-09 Thread Alkis Evlogimenos
> > I think the point of knowing if they are exact has less to do with > > pruning/filtering power, and more to do with pushdown aggregates so one > can > > know if they can just use the statistics value provided or all values > need > > to be scanned. > > >