Re: [DISCUSS] Merge initial Implementation Status PR and incrementally improve it

2024-06-26 Thread Antoine Pitrou


IMHO, we should either start a dedicated discussion thread for
integration testing, or open a GH issue and discuss it there.

Regards

Antoine.



On Wed, 26 Jun 2024 09:21:33 +0200
Alkis Evlogimenos

wrote:
> It would be nice if the integration suite specifies how a driver can be
> executed. Then each implementation can provide a driver and the suite will
> use that for validation.
> 
> By specifying both reads and writes for the driver we get a lot more power.
> Given an input we can round-trip it through every combination of writer and
> reader and verify the results match.
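
A minimal sketch of what such a driver contract could look like, assuming
(hypothetically) that every implementation ships an executable with "write"
and "read" subcommands; the executable names and the JSON case format below
are invented for illustration:

    import itertools
    import subprocess
    import tempfile
    from pathlib import Path

    DRIVERS = {                          # hypothetical driver executables
        "cpp": "parquet-cpp-driver",
        "java": "parquet-java-driver",
        "rs": "parquet-rs-driver",
    }

    def roundtrip(writer, reader, case_json: Path):
        with tempfile.TemporaryDirectory() as tmp:
            parquet_file = Path(tmp) / "data.parquet"
            result_json = Path(tmp) / "result.json"
            # writer produces a Parquet file from a canonical JSON test case
            subprocess.run([DRIVERS[writer], "write", str(case_json),
                            str(parquet_file)], check=True)
            # reader dumps the Parquet file back to the canonical JSON form
            subprocess.run([DRIVERS[reader], "read", str(parquet_file),
                            str(result_json)], check=True)
            assert result_json.read_bytes() == case_json.read_bytes(), (writer, reader)

    for w, r in itertools.product(DRIVERS, DRIVERS):
        roundtrip(w, r, Path("cases/basic_types.json"))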
> 
> On Tue, Jun 25, 2024 at 6:42 PM Andrew Lamb 
>  wrote:
> 
> > FWIW I started hacking up a prototype[1] of what a parquet-testing
> > integration suite might look like if anyone is interested
> >
> >
> >
> > [1]: https://github.com/apache/arrow-rs/pull/5956
> >
> > On Tue, Jun 18, 2024 at 10:39 AM Alkis Evlogimenos
> >  wrote:
> >  
> > > +1.
> > >
> > > I would suggest you address the comments first? I went through the open
> > > ones and most of them make sense to me (and left a few additional
> > > comments).
> > >
> > > On Tue, Jun 18, 2024 at 12:42 PM Andrew Lamb 
> > > wrote:
> > >  
> > > > Thank you
> > > >
> > > > On Mon, Jun 17, 2024 at 11:40 PM Micah Kornfield <  
> > emkornfi...@gmail.com>  
> > > > wrote:
> > > >  
> > > > > Hi Andrew,
> > > > > I agree with this sentiment. I asked on the PR if there would be
> > > > > another pass, and then I can merge it.
> > > > >
> > > > > Cheers,
> > > > > Micah
> > > > >
> > > > > On Fri, Jun 14, 2024 at 3:20 AM Andrew Lamb 
> > > > > wrote:
> > > > >  
> > > > > > Hello Parquet Devs,
> > > > > >
> > > > > > I propose we merge the first (admittedly bare bones) "Implementation
> > > > > > Status" page PR [1] to the website soon. I think this page is vital to
> > > > > > the Parquet community (and to any attempt to extend the format) so the
> > > > > > sooner the better.
> > > > > >
> > > > > > The reason to merge the PR now is to have a base from which to build.
> > > > > > That PR is already over a year old and has so many comments it is hard
> > > > > > to follow or know what the path forward is. If we insist on sorting all
> > > > > > the details out before we merge it I fear it will never merge.
> > > > > >
> > > > > > Once we have a page, I think the next steps are to add a preamble
> > > > > > explaining what it is for and to start trying to fill out the chart for
> > > > > > an implementation (I am happy to try for parquet-rs). I suspect during
> > > > > > that process we will have to adjust some of the charts more.
> > > > > >
> > > > > > Thank you for your consideration (and thank you for all the comments so
> > > > > > far)
> > > > > >
> > > > > > Andrew
> > > > > >
> > > > > > [1]: https://github.com/apache/parquet-site/pull/34
> > > > > >  
> > > > >  
> > > >  
> > >  
> >  
> 





Re: [DISCUSS] schema_index

2024-06-11 Thread Antoine Pitrou
On Wed, 5 Jun 2024 21:09:04 +0200
Alkis Evlogimenos

wrote:
> 
> In practice what we want is things to be performant. Sometimes O(1)
> matters, sometimes not.

+1, good point :-)

> (3) doing a pass over the metadata to guarantee (4) is O(1) does not defeat
> the goal of being fast, as long as the cost of doing (3) is a lot smaller
> than (1) + (2). Suppose that in a future version we shrink footers by 2x and
> speed up parsing by 100x. Then the above would look like this:
> 
> 1. 30ms
> 2. 50us
> 3. 100us
> 4.  100ns/col
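
A quick back-of-the-envelope check of that argument, using the figures from
the list above (the column counts are picked arbitrarily):

    fetch_s       = 30e-3   # (1) fetch the footer
    parse_s       = 50e-6   # (2) parse the metadata
    postprocess_s = 100e-6  # (3) one pass to make (4) O(1)
    per_column_s  = 100e-9  # (4) per-column access

    for ncols in (1_000, 100_000):
        total = fetch_s + parse_s + postprocess_s + ncols * per_column_s
        print(f"{ncols:>7} cols: total={total * 1e3:.2f} ms, "
              f"postprocess share={postprocess_s / total:.2%}")

The post-processing pass stays well under 1% of the total either way, which is
the point being made: it is dwarfed by the fetch.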

By the way, when using Flatbuffers, I would suggest that you
optionally call Flatbuffers verification when benchmarking the parsing
routine. This is because, in many cases, it is important to ensure that
untrusted files cannot wreak havoc (we do fuzz the Parquet C++ reader
to look out for such issues).

> It still doesn't matter if we do some lightweight postprocessing (3) given
> that fetching is so slow.

Yet, please be aware that not all fetching would happen on an object
store. Processing Parquet files locally is quite common as well, and in
this context fetching the footer can be extremely fast (Parquet is
frequently used as an efficient exchange format for large tabular data
-- for many people, it is a binary CSV on steroids).

Regards

Antoine.




Re: [DISCUSS] Improvements to File Footer metadata (v3 discussion follow-up)

2024-06-11 Thread Antoine Pitrou


Hi Micah,

On Wed, 5 Jun 2024 09:48:11 -0700
Micah Kornfield 
wrote:
> > 1. ratify https://github.com/apache/parquet-format/pull/254 as the
> > extension mechanism for parquet. With this we can experiment on new footers
> > without having to specify anything else.  
> 
> I think we have probably reached a lazy consensus that is reasonable.

There has been a lot of activity on this ML lately and I don't think
all interested parties have had time to take a detailed look at this. I
would suggest letting it rest a bit, and perhaps post a wake-up call in
~2 weeks to make sure other people can chime in.

Regards

Antoine.




Re: [DISCUSS] schema_index

2024-06-06 Thread Antoine Pitrou
On Wed, 5 Jun 2024 21:41:39 +0200
Alkis Evlogimenos

wrote:
> (2) would take unduly long - if the metadata decoder is not performant
> enough. The speed of the decoder strongly depends on the encoding of
> choice. If we choose flatbuffers, 100'000 columns would parse in a few ms
> (with verification), or in significantly less time without it.

That's a few tens of nanoseconds per column, which sounds small even for
flatbuffers. Did you actually measure this, or is it a guesstimate?

Regards

Antoine.




Re: [DISCUSS] Improvements to File Footer metadata (v3 discussion follow-up)

2024-06-05 Thread Antoine Pitrou
> > >
> > > IMO, I think we should be doing 1,2, and 3.  I don't think we should be
> > > doing 4 (e.g. as a concrete example, see the discussion on
> > > PageEncodingStats [1]).
> > >
> > > > If we want random access, we have to abolish the concept that the data
> > > > in the columns array is in a different order than in the schema. Your PR
> > > > [1] even added a new field schema_index for matching between
> > > > ColumnMetaData and schema position, but this kills random access.
> > >
> > >
> > > I think this is a larger discussion that should be split off, as I don't
> > > think it should block the core work here.  This was adapted from another
> > > proposal that I think had different ideas on how to possibly rework column
> > > selection (it seems this would be on a per RowGroup basis).
> > >
> > > [1] https://github.com/apache/parquet-format/pull/250/files#r1620984136
> > >
> > >
> > > On Mon, Jun 3, 2024 at 8:20 AM Antoine Pitrou   
> > wrote:  
> > >  
> > > >
> > > > Everything Jan said below aligns closely with my opinion.
> > > >
> > > > * +1 for going directly to Flatbuffers for the new footer format *if*
> > > >   there is a general agreement that moving to Flatbuffers at some point
> > > >   is desirable (including from a software ecosystem point of view).
> > > >
> > > > * I don't think there is value in providing a 1-to-1 mapping from the
> > > >   old footer encoding to the new encoding. On the contrary, this is the
> > > >   opportunity to clean up and correct some of the oddities that have
> > > >   accumulated in the past.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On Mon, 3 Jun 2024 15:58:40 +0200
> > > > Jan Finis  wrote:  
> > > > > Interesting discussion so far, thanks for driving this Micah! A few  
> > > > points  
> > > > > from my side:
> > > > >
> > > > > When considering flatbuffers vs. lazy "binary" nested thrift, vs. own
> > > > > MetaDataPage format, let's also keep architectural simplicity in mind.
> > > > >
> > > > > For example, introducing flatbuffers might sound like a big change at
> > > > > first, but at least it is then *one format* for everything. In contrast,
> > > > > thrift + custom MetaDataPage is two formats. My gut feeling estimate
> > > > > would be that it is probably easier to just introduce a flatbuffers
> > > > > reader instead of special casing some thrift to instead need a custom
> > > > > MetaDataPage reader.
> > > > >
> > > > > The lazy thrift "hack" is something in between the two. It is probably
> > > > > the easiest to adopt, as no new reading logic needs to be written. The
> > > > > thrift decoder just has to be invoked recursively whenever such a lazy
> > > > > field is required. This is nice, but since it doesn't give us random
> > > > > access into lists, it's also only partially helpful.
> > > > >
> > > > > Given all this, from the implementation / architectural cleanliness
> > > > > side, I guess I would prefer just using flatbuffers, unless we find big
> > > > > disadvantages with this. This also brings us closer to Arrow, although
> > > > > that's not too important here.
> > > > >
> > > > >
> > > > >  
> > > > > > 1.  I think for an initial revision of metadata we should make it
> > > > > > possible to have a 1:1 mapping between PAR1 footers and whatever is
> > > > > > included in the new footer.  The rationale for this is to let
> > > > > > implementations that haven't abstracted out thrift structures an easy
> > > > > > path to incorporating the new footer (i.e. just do translation at the
> > > > > > boundaries).
> > > > > >
> > > > >
> > > > > I don't 

Re: [DISCUSS] schema_index

2024-06-04 Thread Antoine Pitrou
On Tue, 4 Jun 2024 10:52:54 +0200
Alkis Evlogimenos

wrote:
> >
> > Finally, one point I wanted to highlight here (I also mentioned it in the
> > PR): If we want random access, we have to abolish the concept that the data
> > in the columns array is in a different order than in the schema. Your PR
> > [1] even added a new field schema_index for matching between
> > ColumnMetaData and schema position, but this kills random access. If I want
> > to read the third column in the schema, then do a O(1) random access into
> > the third column chunk only to notice that its schema index is totally
> > different and therefore I need a full exhaustive search to find the column
> > that actually belongs to the third column in the schema, then all our
> > random access efforts are in vain.  
> 
> `schema_index` is useful to implement
> https://issues.apache.org/jira/browse/PARQUET-183 which is more and more
> prevalent as schemata become wider.

But this means a scan of all column chunk metadata in a row group is
required to know if a particular column exists there? Or am I missing
something?
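
As an illustration of the trade-off (a toy model, not actual Parquet reader
code): if the column-chunk list is not guaranteed to be in schema order, a
reader can still get O(1) per-column lookups, but only after one linear pass
that builds a schema_index -> chunk map.

    from dataclasses import dataclass

    @dataclass
    class ColumnChunk:
        schema_index: int   # which schema column this chunk stores
        offset: int         # file offset, illustrative only

    # Chunks stored in arbitrary (non-schema) order, possibly with columns
    # missing entirely (the sparse row groups of PARQUET-183).
    row_group = [ColumnChunk(7, 1_000), ColumnChunk(2, 5_000), ColumnChunk(0, 9_000)]

    # One O(n) pass over the metadata; afterwards projection is a dict lookup.
    by_schema_index = {c.schema_index: c for c in row_group}
    print(by_schema_index.get(2))   # ColumnChunk(schema_index=2, offset=5000)
    print(by_schema_index.get(3))   # None -> column absent from this row group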

Regards

Antoine.




Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-06-04 Thread Antoine Pitrou



Correction: my vote is non-binding for Parquet.

Regards

Antoine.


Le 04/06/2024 à 02:23, Rok Mihevc a écrit :

Thanks all for voting. I tallied the votes (assuming simple +1 votes were
meant as +1 Parquet, +1 Arrow) and the vote succeeded with the following
results:

Parquet:
3x +1 binding (Gang Wu, Antoine Pitrou, Wes McKinney)
9x +1 non-binding (Micah Kornfield, Felipe Oliveira Carvalho, Fokko
Driesprong, Alenka Frim, Andy Grove, Raúl Cumplido, Sutou Kouhei, Jiashen
Zhang, Rok Mihevc)

Arrow:
6x +1 binding (Micah Kornfield, Antoine Pitrou, Andy Grove, Raúl Cumplido,
Wes McKinney, Sutou Kouhei)
6x +1 non-binding (Felipe Oliveira Carvalho, Fokko Driesprong, Gang Wu,
Alenka Frim, Jiashen Zhang, Rok Mihevc)

I'm not sure about formalities here, but perhaps one PMC per project could
confirm my count?

I'll start making preparations for the move and hopefully execute it
later this week.

Best,
Rok

On Tue, Jun 4, 2024 at 1:55 AM Rok Mihevc  wrote:


+1 (non-binding)

On Thu, May 30, 2024 at 6:13 PM Jiashen Zhang 
wrote:


+1 (non-binding)

On Wed, May 29, 2024 at 3:29 PM Sutou Kouhei  wrote:


+1 (binding for Arrow)

In 
   "[VOTE] Migration of parquet-cpp issues to Arrow's issue tracker" on
Wed, 29 May 2024 16:14:44 +0200,
   Rok Mihevc  wrote:


# sending this to both dev@arrow and dev@parquet

Hi all,

Following the ML discussion [1] I would like to propose a vote for
parquet-cpp issues to be moved from Parquet Jira [2] to Arrow's issue
tracker [3].

[1] https://lists.apache.org/thread/zklp0lwcbcsdzgxoxy6wqjwrvt6y4s9p
[2] https://issues.apache.org/jira/projects/PARQUET/issues/
[3] https://github.com/apache/arrow/issues/

The vote will be open for at least 72 hours.

[ ] +1 Migrate parquet-cpp issues
[ ] +0
[ ] -1 Do not migrate parquet-cpp issues because...


Rok





--
Thanks,
Jiashen







Re: [DISCUSS] Unify Record / Row terminology (to Row)

2024-06-03 Thread Antoine Pitrou
On Fri, 31 May 2024 16:33:24 -0400
Andrew Lamb 
wrote:

> I think the names of classes in the code can be different from how the spec
> refers to the concepts, if the maintainers don't mind. In my mind, changing
> the parquet.thrift file to use consistent terminology doesn't change the
> spec, nor will it require (or prevent) implementations from changing their
> internal class names.

+1

Regards

Antoine.




Re: [DISCUSS] Improvements to File Footer metadata (v3 discussion follow-up)

2024-06-03 Thread Antoine Pitrou
hat Alkis Started:
> > >>
> > >>> 3 is important if we strongly believe that we can get the best design
> > >>> through testing prototypes on real data and measuring the effects vs
> > >>> designing changes in PRs. Along the same lines, I am requesting that you
> > >>> ask through your contacts/customers (I will do the same) for scrubbed
> > >>> footers of particular interest (wide, deep, etc) so that we can build a
> > >>> set of real footers on which we can run benchmarks and drive design
> > >>> decisions.
> > >>
> > >>
> > >> I agree with this sentiment. I think some others who have volunteered to
> > >> work on this have such data and I will see what I can do on my end.  I
> > >> think we should hold off more drastic changes/improvements until we can
> > >> get better metrics.  But I also don't think we should let the "best" be
> > >> the enemy of the "good".  I hope we can ship a PAR3 footer sooner that
> > >> gets us a large improvement over the status quo and have it adopted
> > >> fairly widely sooner rather than waiting for an optimal design.  I also
> > >> agree leaving room for experimentation is a good idea (I think this can
> > >> probably be done by combining the methods for embedding that have already
> > >> been discussed to allow potentially 2 embedded footers).
> > >>
> > >> I think another question that Alkis's proposals raised is how to handle
> > >> policies on deprecation of fields (especially ones that are currently
> > >> required in PAR1).  I think this is probably a better topic for another
> > >> thread; I'll try to write a PR formalizing a proposal on feature
> > >> evolution.
> > >>
> > >>
> > >>
> > >> [1]
> > >>  
> > https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit
> >   
> > >> [2] https://lists.apache.org/thread/zdpswrd4yxrj845rmoopqozhk0vrm6vo
> > >> [3] https://github.com/apache/parquet-format/pull/250
> > >>
> > >> On Tue, May 28, 2024 at 10:56 AM Micah Kornfield  > >  
> > >> wrote:
> > >>  
> > >>> Hi Antoine,
> > >>> Thanks for the great points.  Responses inline.
> > >>>
> > >>>  
> > >>>> I like your attempt to put the "new" file metadata after the legacy
> > >>>> one in https://github.com/apache/parquet-format/pull/250, and I hope it
> > >>>> can actually be made to work (it requires current Parquet readers to
> > >>>> allow/ignore arbitrary padding at the end of the v1 Thrift metadata).
> > >>>
> > >>>
> > >>> Thanks (I hope so too).  I think the idea is originally from Alkis.  If
> > >>> it doesn't work then there is always an option of doing a little more
> > >>> involved process of making the footer look like an unknown binary field
> > >>> (an approach I know you have objections to).
> > >>>
> > >>>> I'm biased, but I find it much cleaner to define new Thrift
> > >>>>   structures (FileMetadataV3, etc.), rather than painstakingly document
> > >>>>   which fields are to be omitted in V3. That would achieve three goals:
> > >>>>   1) make the spec easier to read (even though it would be physically
> > >>>>   longer); 2) make it easier to produce a conformant implementation
> > >>>>   (special rules increase the risks of misunderstandings and
> > >>>>   disagreements); 3) allow a later cleanup of the spec once we agree to
> > >>>>   get rid of V1 structs.
> > >>>
> > >>> There are trade-offs here.  I agree with the benefits you listed here.
> > >>> The benefits of reusing existing structs are:
> > >>> 1. Lowers the amount of boilerplate code mapping from one to the other
> > >>> (i.e. simpler initial implementation), since I expect it will be a while
> > >>> before we have standalone PAR3 files.
> > >>> 2. Allows for lower maintenance burden if there is useful new metadata
> > >>> that we would like to see added to both structures, original and "V3"
> > >

Re: [DISCUSS] Extensibility of Parquet

2024-05-30 Thread Antoine Pitrou
On Thu, 30 May 2024 00:07:35 -0700
Micah Kornfield 
wrote:
> > A "vendor" encoding would also allow candidate encodings to be shared
> > across the ecosystem before they are eventually enshrined as regular
> > encodings in the Thrift metadata.
> 
> 
> I'm not a huge fan of this for two reasons:
> 1.  I think it makes it much more complicated for end-users to get support
> if they happen to have a file with a custom encoding.  There are already
> enough rough edges in compatibility between implementations that this gives
> another degree of freedom where things could break.

Agreed, but how is this not a problem for "pluggable" encodings as well?

> 2.  From a software supply chain perspective I think this makes Parquet a
> lot riskier if it is going to arbitrarily load/invoke code from potentially
> unknown sources.

I'm not sure where that idea comes from. I did *not* suggest that
implementations load arbitrary code from third-party Github repositories
:-)

Regards

Antoine.




Re: [DISCUSS] Unify Record / Row terminology (to Row)

2024-05-29 Thread Antoine Pitrou


I agree that "row" is a more widespread terminology while "record" can
be a bit head-scratching.

Regards

Antoine.


On Wed, 29 May 2024 05:49:22 -0400
Andrew Lamb 
wrote:
> In the context of my PR trying to encode the consensus that records can't
> span page boundaries[1], Antoine brought up the excellent point[2] that the
> format[3] seems to use the terms "records" and "rows" to refer to the same
> concept.
> 
> I agree it would clarify the spec to use the same terminology throughout.
> Given there are several fields named `num_rows` I propose changing
> parquet.thrift to use the term "row" throughout.
> 
> I can make another PR to do so if this seems like a good idea.
> 
> Andrew
> (p.s the PR[1] is still waiting on some more review and merging :pray:)
> 
> [1] https://github.com/apache/parquet-format/pull/244
> [2] https://github.com/apache/parquet-format/pull/244#discussion_r1617320495
> [3]
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
> 





Re: [DISCUSS] Extensibility of Parquet

2024-05-29 Thread Antoine Pitrou


I'm not sure how people are envisioning 2) (pluggable encodings) to be
concretely represented in Thrift data, but perhaps an easy alternative
is to add a "vendor" encoding that would be described by a (name,
parameters) pair of arbitrary strings.

A "vendor" encoding would also allow candidate encodings to be shared
across the ecosystem before they are eventually enshrined as regular
encodings in the Thrift metadata.

Finally, I agree that allowing for pluggable encodings will not
reduce the burden for implementors who want to support a given encoding.

Regards

Antoine.


On Wed, 29 May 2024 09:57:47 +0800
Gang Wu  wrote:
> I'm supportive of most of the points in this thread.
> 
> For 2), making encodings pluggable does not eliminate the work on
> implementation and interoperability. If people are worried about the
> lengthy process to promote a new encoding to the spec, perhaps we
> can preserve an encoding type for each new candidate in the spec
> at its early stage and then officially add or remove it once the idea
> gets mature.
> 
> Best,
> Gang
> 
> On Wed, May 29, 2024 at 1:37 AM Micah Kornfield 
> wrote:
> 
> > As a follow-up to the "V3" Discussions [1][2] there were some open
> > questions around extensibility and how it might be handled, so that readers
> > could determine if they supported the necessary features.
> >
> > I think the areas discussed are:
> > 1.  New encodings (In spec)
> > 2.  Pluggable encodings
> > 3.  Extensible logical types.
> > 4.  New/additional metadata information in footer.
> >
> > For 1) these are already handled by existing mechanisms at the column level
> > (based on page encodings in column metadata).
> > For 2) the consensus I inferred from PMC members that commented on the doc
> > is that in general this was not a direction we wanted to take (I also
> > concur with this sentiment). But if people want to make a more public
> > argument on why it should be considered we can do it on the ML to make it
> > official
> > For 3) Antoine started a new thread on this [3]
> > For 4) I think any new footer will have a bitmap that will handle changes
> > and extensibility will likely be limited here.
> >
> > If this doesn't cover the use-cases people were thinking of this would be a
> > good place to bring it up.
> >
> > Thanks,
> > Micah
> >
> >
> > [1] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
> > [2]
> >
> > https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
> > [3] https://lists.apache.org/thread/9xo3mp4n23p8psmrhso5t9q899vxwfjt
> >  
> 





Re: [DISCUSS] Extension types in Parquet?

2024-05-29 Thread Antoine Pitrou
On Wed, 29 May 2024 10:27:02 +0800
Gang Wu  wrote:
> I think adding extension type support will make it easier for adding
> tensor or vector type, which is [1] trying to target.
> 
> However, the geometry type seems not easy to fit to the imagination
> of the extension type. It would be better to explicitly define geospatial
> statistics in the spec, otherwise we have to encode them like plain-encoded
> min/max values or even use thrift/protobuf to serialize them as binary data.

Let's remember here that PLAIN encoding for numeric scalars (such as
double or int64) is really a contiguous sequence of native
little-endian numbers, just like e.g. the Parquet footer length.
There's no need to explicitly invoke the PLAIN decoder, especially when
no def/rep levels are involved.
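
To make that concrete, here is a minimal sketch (plain Python, no Parquet
library involved) of what a PLAIN page of doubles actually contains, and of
how the footer length at the end of a file is read the same way:

    import struct

    values = [1.5, -2.25, 3.0]
    # A PLAIN page of doubles is just the values, contiguous, little-endian.
    plain = b"".join(struct.pack("<d", v) for v in values)
    decoded = [struct.unpack_from("<d", plain, 8 * i)[0] for i in range(len(values))]
    assert decoded == values

    # The 4-byte footer length preceding the trailing "PAR1" magic is likewise
    # a native little-endian integer.
    footer_len = struct.unpack("<I", b"\x10\x27\x00\x00")[0]   # -> 10000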

Regards

Antoine.




Re: [DISCUSS] Extension types in Parquet?

2024-05-28 Thread Antoine Pitrou


Hi Gabor,

Perhaps we can eschew this problem by having a separate "extension
statistics" field that does not mandate total ordering?

Regards

Antoine.


On Tue, 28 May 2024 16:54:49 +0200
Gábor Szádovszky  wrote:
> Hi Antoine,
> 
> One quick note about this. Parquet min/max statistics need a total ordering
> for each logical type. Without that we either use some default based on the
> primitive type (that might not be suitable for the related extension type)
> or we won't store min/max statistics for the related values. It means no
> min/max stats for the row group nor page indices.
> So, I guess, we would need a way to define total ordering for an extension
> type. Does not sound like an easy topic.
> 
> Cheers,
> Gabor
> 
> Antoine Pitrou  wrote (on Tue, 28 May 2024, at
> 16:45):
> 
> >
> > Hello,
> >
> > (NOTE: this comes in the context of
> > https://github.com/apache/parquet-format/pull/240 --
> > "PARQUET-2471: Add geometry logical type")
> >
> > I'd like to launch a discussion about the possible addition of
> > extension types in Parquet.
> >
> > Extension types are a concept borrowed from the Arrow type system [1].
> > They provide a standard way of conveying more precise information about
> > the intended type and usage of a given column, without requiring the
> > metadata format to have a dedicated serialization for that type.
> >
> > In Arrow, extension types are typically conveyed through two
> > string/binary parameters: 1) the extension type name; 2) the
> > type-specific serialization. The extension type name unambiguously
> > designates the abstract extension type (such as "Tensor"); the
> > serialization optionally encodes the extension type's parameters, if
> > it has any (such as the dimensionality for a "Tensor" type).
> >
> > Initially, Arrow extension types tended to be ad hoc and
> > application-specific, but there is a growing trend to standardize
> > "canonical extension types" to allow for better data interoperability
> > across widely-used data types [2].
> >
> > From my experience as an Arrow PMC member, if Arrow didn't have
> > extension types, the barrier to propose and standardize new data types
> > would be much higher, especially for complex proposals such as the
> > fixed-shape and variable-shape tensor types.
> >
> >
> > For Parquet, extension types would be an alternative to enshrining
> > additional logical types in the Thrift specification. I can see several
> > advantages to extension types over additional logical types:
> >
> > 1) extension types would make it easier to experiment in dedicated
> > communities, trying to find out the best possible representation for
> > some kinds of data (example: the Geoparquet work)
> >
> > 2) extension types would allow "soft standardization": an extension type
> > could first be formally defined by a dedicated community, then
> > optionally find an official place under the Parquet project.
> >
> > 3) extension types would allow defining complex data representations
> > and semantics without imposing a large burden on the developers of
> > Parquet implementations, who may not be competent in the target domain.
> > This includes non-trivial statistics such as bounding boxes for
> > geospatial data.
> >
> >
> > Technically, I can imagine two possible ways of adding extension types
> > to the Parquet format:
> >
> > 1) as an additional logical type;
> > 2) as a separate type determination, in addition to the logical type.
> >
> > We should also ensure it is possible to express extension-specific
> > statistics (such as bounding boxes for geospatial data).
> >
> > What do you think?
> >
> > Regards
> >
> > Antoine.
> >
> >
> > [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types
> >
> > [2]
> > https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html
> >
> >
> >
> >  
> 





[DISCUSS] Extension types in Parquet?

2024-05-28 Thread Antoine Pitrou


Hello,

(NOTE: this comes in the context of
https://github.com/apache/parquet-format/pull/240 --
"PARQUET-2471: Add geometry logical type")

I'd like to launch a discussion about the possible addition of
extension types in Parquet.

Extension types are a concept borrowed from the Arrow type system [1].
They provide a standard way of conveying more precise information about
the intended type and usage of a given column, without requiring the
metadata format to have a dedicated serialization for that type.

In Arrow, extension types are typically conveyed through two
string/binary parameters: 1) the extension type name; 2) the
type-specific serialization. The extension type name unambiguously
designates the abstract extension type (such as "Tensor"); the
serialization optionally encodes the extension type's parameters, if
it has any (such as the dimensionality for a "Tensor" type).
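
For readers less familiar with the Arrow side, here is roughly what that
(name, serialization) pair looks like in practice, sketched with pyarrow; the
extension name and the shape encoding below are made up for the example:

    import pyarrow as pa

    class FixedShapeTensorType(pa.ExtensionType):
        def __init__(self, shape):
            self._shape = tuple(shape)
            super().__init__(pa.list_(pa.float32()), "example.fixed_shape_tensor")

        def __arrow_ext_serialize__(self):
            # type-specific serialization: here, the tensor dimensionality
            return ",".join(str(d) for d in self._shape).encode()

        @classmethod
        def __arrow_ext_deserialize__(cls, storage_type, serialized):
            return cls(int(d) for d in serialized.decode().split(","))

    pa.register_extension_type(FixedShapeTensorType((2, 3)))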

Initially, Arrow extension types tended to be ad hoc and
application-specific, but there is a growing trend to standardize
"canonical extension types" to allow for better data interoperability
across widely-used data types [2].

From my experience as an Arrow PMC member, if Arrow didn't have
extension types, the barrier to propose and standardize new data types
would be much higher, especially for complex proposals such as the
fixed-shape and variable-shape tensor types.


For Parquet, extension types would be an alternative to enshrining
additional logical types in the Thrift specification. I can see several
advantages to extension types over additional logical types:

1) extension types would make it easier to experiment in dedicated
communities, trying to find out the best possible representation for
some kinds of data (example: the Geoparquet work)

2) extension types would allow "soft standardization": an extension type
could first be formally defined by a dedicated community, then
optionally find an official place under the Parquet project.

3) extension types would allow defining complex data representations
and semantics without imposing a large burden on the developers of
Parquet implementations, who may not be competent in the target domain.
This includes non-trivial statistics such as bounding boxes for
geospatial data.


Technically, I can imagine two possible ways of adding extension types
to the Parquet format:

1) as an additional logical type;
2) as a separate type determination, in addition to the logical type.

We should also ensure it is possible to express extension-specific
statistics (such as bounding boxes for geospatial data).

What do you think?

Regards

Antoine.


[1] https://arrow.apache.org/docs/format/Columnar.html#extension-types

[2]
https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html





Re: [DISCUSS] Improvements to File Footer metadata (v3 discussion follow-up)

2024-05-28 Thread Antoine Pitrou


Hello Micah,

First, kudos for doing this!

I like your attempt to put the "new" file metadata after the legacy
one in https://github.com/apache/parquet-format/pull/250, and I hope it
can actually be made to work (it requires current Parquet readers to
allow/ignore arbitrary padding at the end of the v1 Thrift metadata).
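
For context, this is the published PAR1 file layout the trick relies on; a
small sketch of how an existing reader locates the legacy footer (the bytes a
writer appends between the end of the Thrift structure and the footer-length
word are what a v1 reader would have to tolerate as padding):

    import struct

    def read_legacy_footer(path):
        with open(path, "rb") as f:
            f.seek(-8, 2)                         # 4-byte LE length + b"PAR1"
            footer_len, magic = struct.unpack("<I4s", f.read(8))
            assert magic == b"PAR1", "not a Parquet file"
            f.seek(-(8 + footer_len), 2)
            return f.read(footer_len)             # Thrift FileMetaData bytes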

Some assorted comments on other changes that PR is doing:

- I'm biased, but I find it much cleaner to define new Thrift
  structures (FileMetadataV3, etc.), rather than painstakingly document
  which fields are to be omitted in V3. That would achieve three goals:
  1) make the spec easier to read (even though it would be physically
  longer); 2) make it easier to produce a conformant implementation
  (special rules increase the risks of misunderstandings and
  disagreements); 3) allow a later cleanup of the spec once we agree to
  get rid of V1 structs.

- The new encoding in that PR seems like it should be moved to a
  separate PR and be discussed in the encodings thread?

- I'm a bit skeptical about moving Thrift lists into data pages, rather
  than, say, just embed the corresponding Thrift serialization as
  binary fields for lazy deserialization.

Regards

Antoine.



On Mon, 27 May 2024 23:06:37 -0700
Micah Kornfield 
wrote:
> As a follow-up to the "V3" Discussions [1][2] I wanted to start a thread on
> improvements to the footer metadata.
> 
> Based on conversation so far, there have been a few proposals [3][4][5] to
> help better support files with wide schemas and many row-groups.  I think
> there are a lot of interesting ideas in each. It would be good to get
> further feedback on these to make sure we aren't missing anything and
> define a minimal first iteration for doing experimental benchmarking to
> prove out an approach.
> 
> I think the next steps would ideally be:
> 1.  Come to a consensus on the overall approach.
> 2.  Prototypes to Benchmark/test to validate the approaches defined (if we
> can't come to consensus in item #1, this might help choose a direction).
> 3.  Divide up any final approach into as fine-grained features as possible.
> 4.  Implement across parquet-java, parquet-cpp, parquet-rs (and any other
> implementations that we can get volunteers for).  Additionally, if new APIs
> are needed to make use of the new structure, it would be good to try to
> prototype against consumers of Parquet.
> 
> Knowing that we have enough people interested in doing #3 is critical to
> success, so if you have time to devote, it would be helpful to chime in
> here (I know some people already noted they could help in the original
> thread).
> 
> I think it is likely we will need either an in person sync or another more
> focused design document could help. I am happy to try to facilitate this
> (once we have a better sense of who wants to be involved and what time
> zones they are in I can schedule a sync if necessary).
> 
> Thanks,
> Micah
> 
> [1] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
> [2]
> https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
> [3] https://github.com/apache/parquet-format/pull/242
> [4] https://github.com/apache/parquet-format/pull/248
> [5] https://github.com/apache/parquet-format/pull/250
> 





Re: [DISCUSS] Integration testing

2024-05-28 Thread Antoine Pitrou


Hello,

On Mon, 27 May 2024 22:46:45 -0700
Micah Kornfield 
wrote:
> 
> 2.  Is anybody interested in looking more deeply into developing
> integration tests between the different Parquet implementations and major
> down-stream consumers of Parquet?  I believe Apache arrow has a pretty good
> model [3][4] in a lot of respects with cross-language integration tests,
> and nightly (via crossbow) integration tests with other consumers, but
> there are a wide variety of things that would improve the current state.
> One other possible concern is the amount of CI resources this might
> consume, and if we will need contributions to fund it.

Caveat: Arrow has far fewer parameters to test for. The variability is
mostly one-dimensional and falls under the data type rubric. As a
matter of fact, other Arrow features such as compression or delta
dictionaries are less well-tested.

Testing Parquet interoperability could easily get into a combinatorial
explosion of optional features, encodings, etc.

I'm not saying that it shouldn't be done, but it may require a different
approach than Arrow's approach of building and testing all
implementations against each other in a single CI job.
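
To give a feel for the scale (the feature axes and their values below are
invented for illustration; the real matrix would be driven by the spec):

    import itertools

    implementations = ["java", "cpp", "rs"]
    features = {
        "compression": ["none", "snappy", "zstd"],
        "encoding": ["plain", "dict", "delta"],
        "page_index": [False, True],
        "bloom_filter": [False, True],
    }

    # Full cross-product of writer x reader x feature values.
    combos = list(itertools.product(implementations, implementations,
                                    *features.values()))
    print(len(combos))   # 3 * 3 * 3 * 3 * 2 * 2 = 324 cases already

Even this toy matrix suggests selecting cases per feature (or pairwise)
rather than exhaustively crossing everything in one CI job.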

Regards

Antoine.




Re: Typical data page size

2024-05-23 Thread Antoine Pitrou


Speaking of which and responding to my own question, parquet-java also
defaults to 1 MiB:
https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L49

Regards

Antoine.



On Thu, 23 May 2024 01:39:58 -1000
Jacques Nadeau  wrote:
> I've found that a variable page size based on the expected number of columns
> read back is necessary, since you'll need read-back memory equal to the number
> of columns times the page size times the number of concurrent files being read.
> So if one is reading back 1000 columns one may need 1 GB+ of memory per file
> for reads. This resulted in sizing things down as width went up to avoid
> spending excessive budget on read memory. This often resulted in pages
> closer to 64k - 128k. (in the work I did, we typically expected many files
> to be concurrently read across many requested ops.)
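
That sizing rule as plain arithmetic (the numbers are illustrative, matching
the 1000-column example above):

    columns = 1000
    page_size = 1 << 20            # 1 MiB pages
    concurrent_files = 1
    read_memory = columns * page_size * concurrent_files
    print(read_memory / 2**30)     # ~0.98 GiB per file, i.e. the "1 GB+" figure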
> 
> On Wed, May 22, 2024, 11:50 PM Andrew Lamb 
>  wrote:
> 
> > The Rust implementation uses 1MB pages by default[1]
> >
> > Andrew
> >
> > [1]:
> >
> > https://github.com/apache/arrow-rs/blob/bd5d4a59db5d6d0e1b3bdf00644dbaf317f3be03/parquet/src/file/properties.rs#L28-L29
> >
> > On Thu, May 23, 2024 at 4:10 AM Fokko Driesprong 
> >  wrote:
> >  
> > > Hey Antoine,
> > >
> > > Thanks for raising this. In Iceberg we also use the 1 MiB page size:
> > >
> > >
> > >  
> > https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L133
> >   
> > >
> > > Kind regards,
> > > Fokko
> > >
> > > On Thu, 23 May 2024 at 10:06, Antoine Pitrou
> > > wrote:
> > >  
> > > >
> > > > Hello,
> > > >
> > > > The Parquet format itself (or at least the README) recommends an 8 KiB
> > > > page size, suggesting that data pages are the unit of computation.
> > > >
> > > > However, Parquet C++ has long chosen a 1 MiB page size by default (*),
> > > > suggesting that data pages are considered as the unit of IO there.
> > > >
> > > > (*) even bumping it to 64 MiB at some point, perhaps by mistake:
> > > >
> > > >  
> > >  
> > https://github.com/apache/arrow/commit/4078b876e0cc7503f4da16693ce7901a6ae503d3
> >   
> > > >
> > > > What are the typical choices in other writers?
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > >  
> > >  
> >  
> 





Typical data page size

2024-05-23 Thread Antoine Pitrou


Hello,

The Parquet format itself (or at least the README) recommends an 8 KiB
page size, suggesting that data pages are the unit of computation.

However, Parquet C++ has long chosen a 1 MiB page size by default (*),
suggesting that data pages are considered as the unit of IO there.

(*) even bumping it to 64 MiB at some point, perhaps by mistake:
https://github.com/apache/arrow/commit/4078b876e0cc7503f4da16693ce7901a6ae503d3
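
For reference, the C++ default is overridable per writer. A quick sketch of
sizing pages down from Python (assumes pyarrow is installed; the 64 KiB value
is just an example):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": list(range(1_000_000))})
    # data_page_size is a soft limit on the encoded size of each data page
    pq.write_table(table, "small_pages.parquet", data_page_size=64 * 1024)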

What are the typical choices in other writers?

Regards

Antoine.




[DISCUSS] Parquet 3 "wide schema" metadata draft

2024-05-18 Thread Antoine Pitrou
On Fri, 17 May 2024 07:37:37 -0700
Julien Le Dem  wrote:
> This context should be added in the PR description itself.

Good point, I've added context in the PR description. Let me know if
that's sufficient.

> From a design process perspective, it makes more difficult to converge the
> discussion and build consensus if we start multiple threads rather than
> keeping the discussion on the original thread.

A single discussion thread won't be able to drive forward all the
potential changes that we're currently talking about (the Google doc is
enumerating *a lot* of potential changes).

However, I should have titled this discussion more appropriately.
The original title is misleading: my PR is only concerned with the "wide
schema" use case. Let me fix this here :-)

Regards

Antoine.




Re: Typical number of key-value metadata entries?

2024-05-17 Thread Antoine Pitrou


Hi Fokko,

So, if I understand correctly, you have a small number of key-value
metadata entries, but the values may be large?

Also, you actually need those metadata values to do anything with the
data (because they tell you the actual Iceberg schema), so on-demand
decoding of these values would probably not help for you?

(I'm not sure large string values are a problem with Thrift; I would
hope not)

Regards

Antoine.


On Thu, 16 May 2024 22:45:02 +0200
Fokko Driesprong  wrote:
> Hey Antoine,
> 
> First of all, love the recent uptake in activity on the Parquet side. I'm
> on holiday, but I'll definitely catch up when I return.
> 
> I wanted to respond to this particular mail since we do store various
> fields in the metadata for Apache Iceberg. For example:
> 
> - The JSON serialized Iceberg schema that was used when writing the file:
>   https://github.com/apache/iceberg/blob/bd046f844a1cbad6c98919d8ea63176aeae78d33/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L274
> - In the case of delete files, we write the kind of file (positional or
>   equality), and in the case of equality, also the field IDs:
>   https://github.com/apache/iceberg/blob/bd046f844a1cbad6c98919d8ea63176aeae78d33/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L905-L910
> 
> This is mostly for debugging purposes. The schema could become quite big as
> it is proportional to the number of columns. The metadata is mostly set for
> debugging purposes and is not part of the official Iceberg spec.
> 
> I hope this helps!
> 
> Kind regards,
> Fokko
> 
> On Thu, 16 May 2024 at 21:17, Antoine Pitrou wrote:
> 
> >
> > Hello,
> >
> > In https://github.com/apache/parquet-format/pull/242 the question came
> > of the size and overhead of key-value metadata entries in real world
> > Parquet files (basically, user-defined metadata attached either to the
> > entire file or to individual columns). Do people have insight to share
> > about the typical number of metadata entries in a file or column, and
> > their typical byte size?
> >
> > Regards
> >
> > Antoine.
> >
> >
> >  
> 





Re: [DISCUSS] Parquet 3 metadata draft / strawman proposal

2024-05-17 Thread Antoine Pitrou


Hi Julien,

Yes, I posted comments on Micah's document, and I referenced this PR in
those discussions. Personally, I feel more comfortable when I have some
concrete proposal to comment on, rather than abstract goals, and I
figured other people might be like me. Discussing actual Thrift
metadata makes it clearer to me where the friction points might reside,
and what the opportunities might be.

These changes might also later serve as an experimentation platform to
run crude benchmarks and try to validate what's really needed for the
wide-schema case to be handled efficiently.

They are not intended to be submitted for inclusion anytime soon, and
I'm not planning to push for them if someone comes up with something
better and more thought out.

All in all, this started as a personal investigation to understand
whether and how a "v3 schema" could be made backwards-compatible, and
when I saw that it seemed actually doable I decided it would be worth
posting the initial sketch instead of keeping it for myself.

Regards

Antoine.


On Thu, 16 May 2024 18:41:26 -0700
Julien Le Dem  wrote:
> Hi Antoine,
> 
> On the other thread Micah is collecting feedback in a document.
> https://lists.apache.org/thread/61z98xgq2f76jxfjgn5xfq1jhxwm3jwf
> 
> Would you mind putting your feedback there?
> We should collect the goals before jumping to solutions.
> It is a bit difficult to discuss those directly in the thrift metadata.
> 
> Thank you
> 
> 
> On Thu, May 16, 2024 at 4:13 AM Antoine Pitrou 
>  wrote:
> 
> >
> > Hello,
> >
> > In the light of recent discussions, I've put up a very rough proposal
> > of a Parquet 3 metadata format that allows both for light-weight
> > file-level metadata and backwards compatibility with legacy readers.
> >
> > For the sake of convenience and out of personal preference, I've made
> > this a PR to parquet-format rather than a Google Doc:
> > https://github.com/apache/parquet-format/pull/242
> >
> > Feel free to point any glaring mistakes or misunderstandings on my part,
> > or to comment on details.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >  
> 





Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-17 Thread Antoine Pitrou


+1 (non-binding :-)) on the idea of having a shortlist of "accredited"
implementations.

I would suggest to add a third implementation such as parquet-rs, since
its authors are active here; especially as the Parquet Java and C++
teams seem to have some overlap historically, and a third
implementation helps bring different perspectives.

Regards

Antoine.


On Thu, 16 May 2024 17:37:35 -0700
Julien Le Dem  wrote:
> I would support it as long as we maintain a list of the implementations
> that we consider "accredited" to be reference implementations (we being a
> PMC vote here).
> Not all implementations are created equal from an adoption point of view.
> Originally the Impala implementation was the second implementation for
> interop. Later on, the parquet-cpp implementation was added as a standalone
> implementation in the Parquet project. This is the implementation that
> lives in the arrow repository.
> The parquet java implementation and the parquet cpp implementation in the
> arrow repo are on top of that list IMO.
> 
> 
> On Thu, May 16, 2024 at 6:17 AM Rok Mihevc 
>  wrote:
> 
> > I would support a "two interoperable open source implementations"
> > requirement.
> >
> > Rok
> >
> > On Thu, May 16, 2024 at 10:06 AM Antoine Pitrou 
> > wrote:
> >  
> > >
> > > I'm in (non-binding) agreement with Ed here. I would just add that the
> > > requirement for two interoperable implementations should mandate that
> > > these are open source implementations.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On Tue, 14 May 2024 14:48:09 -0700
> > > Ed Seidl  wrote:  
> > > > Given the breadth of the parquet community at this point, I don't think
> > > > we should be singling out one or two "reference" implementations. Even
> > > > parquet-mr, AFAIK, still doesn't implement DELTA_LENGTH_BYTE_ARRAY
> > > > encoding in a user-accessible way (it's only available as part of the
> > > > DELTA_BYTE_ARRAY writer). There are many situations in which the
> > > > former would be the superior choice, and in fact the specification
> > > > documentation still lists DLBA as "always preferred over PLAIN for byte
> > > > array columns" [1]. Similarly, DELTA_BYTE_ARRAY encoding was only added
> > > > to parquet-cpp in the last year [2], and column indexes a few months
> > > > before that [3].
> > > >
> > > > Instead, I think we should leave out any mention of a reference
> > > > implementation,
> > > > and continue to require two, independent, interoperable implementations
> > > > before adopting a change to the spec. This, IMO, would go a long way  
> > > towards  
> > > > increasing excitement for Parquet outside the parquet-mr/arrow world.
> > > >
> > > > Just my (non-binding) two cents.
> > > >
> > > > Cheers,
> > > > Ed
> > > >
> > > > [1]
> > > >  
> > >  
> > https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> >   
> > > > [2] https://github.com/apache/arrow/pull/14341
> > > > [3] https://github.com/apache/arrow/pull/34054
> > > >
> > > > On 5/14/24 9:44 AM, Julien Le Dem wrote:  
> > > > > I agree that parquet-mr implementation is a requirement to evolve the 
> > > > >  
> > > spec.  
> > > > > It makes sense to me that we call parquet-mr the reference  
> > > implementation  
> > > > > and make it a requirement to evolve the spec.
> > > > > I would add the requirement to implement it in the parquet cpp
> > > > > implementation that lives in apache Arrow:
> > > > > https://github.com/apache/arrow/tree/main/cpp/src/parquet
> > > > > This code used to live in the parquet-cpp repo in the Parquet  
> > project.  
> > > > > Being language agnostic is an important feature of the format.
> > > > > Interoperability tests should also be included.
> > > > >
> > > > > On Tue, May 14, 2024 at 9:31 AM Antoine Pitrou <  
> > > antoine-+zn9apsxkcednm+yrofe0a-xmd5yjdbdmrexy1tmh2...@public.gmane.org> 
> > > wrote:  
> > > > >  
> > > > >> AFAIK, the only Parquet implementation under the Apache Parquet  
> > > project  
> > > > >> is parquet-mr :-)
> > > > &

Re: [C++] Parquet and Arrow overlap

2024-05-17 Thread Antoine Pitrou


Hi Julien,

On Thu, 16 May 2024 18:23:33 -0700
Julien Le Dem  wrote:
> 
> As discussed, that code was moved in the arrow repo for convenience:
> https://lists.apache.org/thread/gkvbm6yyly1r4cg3f6xtnqkjz6ogn6o2
> 
> To take an excerpt of that original decision:
> 
> 4) The Parquet and Arrow C++ communities will collaborate to provide
> development workflows to enable contributors working exclusively on the
> Parquet core functionality to be able to work unencumbered with unnecessary
> build or test dependencies from the rest of the Arrow codebase. Note that
> parquet-cpp already builds a significant portion of Apache Arrow en route
> to creating its libraries 5) The Parquet community can create scripts to
> "cut" Parquet C++ releases by packaging up the appropriate components and
> ensuring that they can be built and installed independently as now

Unfortunately, these two points haven't happened at all. On the
contrary, the Arrow C++ dependency has worked its way much deeper into Parquet
C++ (I was not there at the beginning of Parquet C++, but I get the
impression there was originally an effort to have an Arrow-independent
Parquet C++ core; that "core" doesn't exist anymore).

Note that this doesn't mean that Parquet C++ forces you to read Parquet
files as Arrow-formatted data (*). It's just that Parquet C++ uses a
large number of assorted utilities that live in the Arrow C++ codebase.

(*) though I would argue that it's better to do so, as it's probably
more efficient, especially for BYTE_ARRAY data

> The alternative is to live up to the part where we agreed that the two
> communities collaborate on making it easy for the Parquet community to
> govern its code base in the arrow repo.
> Would you agree?

Yep. I don't think there has been any problem in that regard, TBH. It's
just that the situation is difficult to understand for people.

Regards

Antoine.




Re: [C++] Parquet and Arrow overlap

2024-05-17 Thread Antoine Pitrou
On Fri, 17 May 2024 07:48:18 +0200
Jean-Baptiste Onofré  wrote:
> Hi
> 
> Technically speaking moving back to parquet would be challenging short
> term.
> 
> In terms of governance, why not having some parquet maintainer/PMC member
> invited to arrow ? It would simplify the review and governance.

The Arrow and Parquet PMCs already have several members in common
(though some of them might be relatively inactive), so that's not a
problem.

Regards

Antoine.




Typical number of key-value metadata entries?

2024-05-16 Thread Antoine Pitrou


Hello,

In https://github.com/apache/parquet-format/pull/242 the question came
of the size and overhead of key-value metadata entries in real world
Parquet files (basically, user-defined metadata attached either to the
entire file or to individual columns). Do people have insight to share
about the typical number of metadata entries in a file or column, and
their typical byte size?
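
If it helps anyone gather numbers, here is one way to pull file-level
key-value metadata sizes out of local files with pyarrow (file-level entries
only; the file name is a placeholder):

    import pyarrow.parquet as pq

    def kv_stats(path):
        kv = pq.read_metadata(path).metadata or {}
        sizes = [len(k) + len(v) for k, v in kv.items()]
        return {"entries": len(kv), "total_bytes": sum(sizes)}

    print(kv_stats("example.parquet"))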

Regards

Antoine.




[DISCUSS] Parquet 3 metadata draft / strawman proposal

2024-05-16 Thread Antoine Pitrou


Hello,

In the light of recent discussions, I've put up a very rough proposal
of a Parquet 3 metadata format that allows both for light-weight
file-level metadata and backwards compatibility with legacy readers.

For the sake of convenience and out of personal preference, I've made
this a PR to parquet-format rather than a Google Doc:
https://github.com/apache/parquet-format/pull/242

Feel free to point any glaring mistakes or misunderstandings on my part,
or to comment on details.

Regards

Antoine.




[DISCUSS] Parquet C++ under which PMC?

2024-05-16 Thread Antoine Pitrou
On Thu, 16 May 2024 10:08:42 +0200
"Uwe L. Korn"  wrote:
> On Tue, May 14, 2024, at 6:30 PM, Antoine Pitrou wrote:
> > AFAIK, the only Parquet implementation under the Apache Parquet project
> > is parquet-mr :-)  
> 
> This is not true. The parquet-cpp that resides in the arrow repository is 
> still controlled by the Apache Parquet PMC. Back then, we only merged the 
> codebases but kept control of it with the Apache Parquet project. I know, it 
> is hard to understand, but at least I have never seen a vote that would move 
> it out of the Apache Parquet project's "control".

Ahah. Unfortunately, this doesn't match actual community practices. For
example, when it is decided to give (Arrow) commit rights to a frequent
Parquet C++ contributor, that decision is made among the Arrow PMC, not
the Parquet PMC.

Perhaps there would be value in aligning the legal situation with the
_de facto_ situation?

Regards

Antoine.


> 
> Best
> Uwe
> >
> >
> > On Tue, 14 May 2024 10:58:58 +0200
> > Rok Mihevc  wrote:  
> >> Second Raphael's point.
> >> Would it be reasonable to say specification change requires implementation
> >> in two parquet implementations within Apache Parquet project?
> >> 
> >> Rok
> >> 
> >> On Tue, May 14, 2024 at 10:50 AM Gang Wu 
> >>  wrote:
> >>   
> >> > IMHO, it looks more reasonable if a reference implementation is required
> >> > to support most (not all) elements from the specification.
> >> >
> >> > Another question is: should we discuss (and vote for) each candidate
> >> > one by one? We can start with parquet-mr which is most well-known
> >> > implementation.
> >> >
> >> > Best,
> >> > Gang
> >> >
> >> > On Tue, May 14, 2024 at 4:41 PM Raphael Taylor-Davies
> >> >  wrote:
> >> >
> >> > > Potentially it would be helpful to flip the question around. As Andrew
> >> > > articulates, a reference implementation is required to implement all
> >> > > elements from the specification, and therefore the major consequence of
> >> > > labeling parquet-mr thusly would be that any specification change would
> >> > > have to be implemented within parquet-mr as part of the standardisation
> >> > > process. It would be insufficient for it to be implemented in, for
> >> > > example, two of the parquet implementations maintained by the arrow
> >> > > project. I personally think that would be a shame and likely exclude
> >> > > many people who would otherwise be interested in evolving the parquet
> >> > > specification, but think that is at the core of this question.
> >> > >
> >> > > Kind Regards,
> >> > >
> >> > > Raphael
> >> > >
> >> > > On 13/05/2024 20:55, Andrew Lamb wrote:
> >> > > > Question: Should we label parquet-mr or any other parquet
> >> > > > implementations "reference" implementations?
> >> > > >
> >> > > > This came up as part of Vinoo's great PR to list different parquet
> >> > > > reference implementations[1][2].
> >> > > >
> >> > > > The term "reference implementation" often has an official 
> >> > > > connotation.
> >> > > For
> >> > > > example the wikipedia definition is "a program that implements all
> >> > > > requirements from a corresponding specification. The reference
> >> > > > implementation ... should be considered the "correct" behavior of 
> >> > > > any
> >> > > other
> >> > > > implementation of it."[3]
> >> > > >
> >> > > > Given the close association of parquet-mr to the parquet standard, 
> >> > > > it
> >> > is
> >> > > a
> >> > > > natural candidate to label as "reference implementation." However, 
> >> > > > it
> >> > is
> >> > > > not clear to me if there is consensus that it should be thusly 
> >> > > > labeled.
> >> > > >
> >> > > > I have a strong opinion that a consensus on this question would be 
> >> > > > very
> >> > > > helpful. I don't actually have a strong opinion about the answer
> >> > > >
> >> > > > Andrew
> >> > > >
> >> > > >
> >> > > >
> >> > > > [1]:
> >> > > https://github.com/apache/parquet-site/pull/53#discussion_r1582882267  
> >> > >   
> >> > > > [2]:
> >> > > https://github.com/apache/parquet-site/pull/53#discussion_r1598283465  
> >> > >   
> >> > > > [3]:  https://en.wikipedia.org/wiki/Reference_implementation
> >> > > >
> >> > >
> >> >
> >>  
> 





Re: Interest in Parquet V3

2024-05-16 Thread Antoine Pitrou


Hi Wes,

On Wed, 15 May 2024 18:56:42 -0500
Wes McKinney  wrote:
> -- I am not sure how you fully make this problem go away in generality
> without doing away with Thrift at the footer level, but at that point you
> are making such a disruptive change that why not try to fix some other
> problems as well? If you go down that rabbit hole, you have created a new
> file format that is no longer Parquet, and so calling it ParquetV3 is
> probably misleading.

I agree that redesigning the metadata structure and encoding is
probably a new format entirely.

> - Parquet's data page format has worked well over time, but aside from
> fixing the metadata overhead issue, the data page itself needs to be
> extensible. There is DATA_PAGE_V2, but structurally it is the same as
> DATA_PAGE{_V1} with the repetition and definition levels kept outside of
> the compressed portion. You can kind of think of Parquet's data page
> structure as one possible choice of options in a general purpose nested
> encoding scheme (most implementations do dictionary+rle and falls back on
> plain encoding when the dictionary exceeds a certain size). We could create
> a DATA_PAGE_V3 that allows for an whole alternate -- and even pluggable --
> encoding scheme, without changing the metadata, and this would be valuable
> to the Parquet community, even if most mainstream Parquet users (e.g.
> Spark) will opt not to use it for a period of some years for compatibility
> reasons.

Do you mean allowing custom encodings just like Arrow has extension
types? It would indeed allow experimenting and slowly solidifying novel
encoding schemes.

A closely related thing that would be useful is extension types in
Parquet (instead of having all logical types reified in the Thrift
definitions). This was mentioned in the discussion for
https://github.com/apache/parquet-format/pull/240

> - Another problem that I haven't seen mentioned but maybe I just missed it
> is that Parquet is very painful to decode on accelerators like GPUs. RAPIDS
> has created a CUDA implementation of Parquet decoding (including decoding
> the Thrift data page headers on the GPU), but there are two primary
> problems 1) there is metadata that is necessary for control flow on the
> host side within the ColumnChunk in the row group and 2) there are not
> sufficient memory preallocation hints -- how much memory you need to
> allocate to fully decode a data page. This is also discussed in
> https://github.com/facebookincubator/nimble/discussions/50

The latest format additions should make this better. It would be good
to hear from GPU people if more metadata is needed:
https://github.com/apache/parquet-format/blob/079a2dff06e32b7d1ad8c9aa67f2e2128fb5ffa5/src/main/thrift/parquet.thrift#L194-L238

Regards

Antoine.




Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-16 Thread Antoine Pitrou


I'm in (non-binding) agreement with Ed here. I would just add that the
requirement for two interoperable implementations should mandate that
these are open source implementations.

Regards

Antoine.


On Tue, 14 May 2024 14:48:09 -0700
Ed Seidl  wrote:
> Given the breadth of the parquet community at this point, I don't think
> we should be singling out one or two "reference" implementations. Even
> parquet-mr, AFAIK, still doesn't implement DELTA_LENGTH_BYTE_ARRAY
> encoding in a user-accessible way (it's only available as part of the
> DELTA_BYTE_ARRAY writer). There are many situations in which the
> former would be the superior choice, and in fact the specification
> documentation still lists DLBA as "always preferred over PLAIN for byte
> array columns" [1]. Similarly, DELTA_BYTE_ARRAY encoding was only added
> to parquet-cpp in the last year [2], and column indexes a few months
> before that [3].
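
(For context, a rough Python sketch of the difference between the two encodings
mentioned above. This is an illustration only; in the actual format the integer
streams are DELTA_BINARY_PACKED, which is elided here.)

    # Conceptual sketch of DELTA_LENGTH_BYTE_ARRAY vs DELTA_BYTE_ARRAY
    values = [b"apple", b"applesauce", b"banana", b"bandana"]

    # DELTA_LENGTH_BYTE_ARRAY: one stream of lengths, then all values concatenated.
    dlba_lengths = [len(v) for v in values]
    dlba_data = b"".join(values)

    # DELTA_BYTE_ARRAY: shared-prefix lengths plus suffixes (suffixes stored as DLBA).
    def shared_prefix_len(a: bytes, b: bytes) -> int:
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    prefix_lengths, suffixes, prev = [], [], b""
    for v in values:
        p = shared_prefix_len(prev, v)
        prefix_lengths.append(p)
        suffixes.append(v[p:])
        prev = v

    print(dlba_lengths, dlba_data)   # [5, 10, 6, 7] b'appleapplesaucebananabandana'
    print(prefix_lengths, suffixes)  # [0, 5, 0, 3] [b'apple', b'sauce', b'banana', b'dana']

When consecutive values share little or no prefix, the prefix-length stream is pure
overhead, which is one situation where plain DELTA_LENGTH_BYTE_ARRAY can be the
better choice.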
> 
> Instead, I think we should leave out any mention of a reference 
> implementation,
> and continue to require two, independent, interoperable implementations
> before adopting a change to the spec. This, IMO, would go a long way towards
> increasing excitement for Parquet outside the parquet-mr/arrow world.
> 
> Just my (non-binding) two cents.
> 
> Cheers,
> Ed
> 
> [1] 
> https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
> [2] https://github.com/apache/arrow/pull/14341
> [3] https://github.com/apache/arrow/pull/34054
> 
> On 5/14/24 9:44 AM, Julien Le Dem wrote:
> > I agree that parquet-mr implementation is a requirement to evolve the spec.
> > It makes sense to me that we call parquet-mr the reference implementation
> > and make it a requirement to evolve the spec.
> > I would add the requirement to implement it in the parquet cpp
> > implementation that lives in apache Arrow:
> > https://github.com/apache/arrow/tree/main/cpp/src/parquet
> > This code used to live in the parquet-cpp repo in the Parquet project.
> > Being language agnostic is an important feature of the format.
> > Interoperability tests should also be included.
> >
> > On Tue, May 14, 2024 at 9:31 AM Antoine Pitrou 
> >  wrote:
> >  
> >> AFAIK, the only Parquet implementation under the Apache Parquet project
> >> is parquet-mr :-)
> >>
> >>
> >> On Tue, 14 May 2024 10:58:58 +0200
> >> Rok Mihevc  wrote:  
> >>> Second Raphael's point.
> >>> Would it be reasonable to say specification change requires  
> >> implementation  
> >>> in two parquet implementations within Apache Parquet project?
> >>>
> >>> Rok
> >>>
> >>> On Tue, May 14, 2024 at 10:50 AM Gang Wu <  
> >> ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:  
> >>>> IMHO, it looks more reasonable if a reference implementation is  
> >> required  
> >>>> to support most (not all) elements from the specification.
> >>>>
> >>>> Another question is: should we discuss (and vote for) each candidate
> >>>> one by one? We can start with parquet-mr which is most well-known
> >>>> implementation.
> >>>>
> >>>> Best,
> >>>> Gang
> >>>>
> >>>> On Tue, May 14, 2024 at 4:41 PM Raphael Taylor-Davies
> >>>>  wrote:
> >>>>  
> >>>>> Potentially it would be helpful to flip the question around. As  
> >> Andrew  
> >>>>> articulates, a reference implementation is required to implement all
> >>>>> elements from the specification, and therefore the major consequence  
> >> of  
> >>>>> labeling parquet-mr thusly would be that any specification change  
> >> would  
> >>>>> have to be implemented within parquet-mr as part of the  
> >> standardisation  
> >>>>> process. It would be insufficient for it to be implemented in, for
> >>>>> example, two of the parquet implementations maintained by the arrow
> >>>>> project. I personally think that would be a shame and likely exclude
> >>>>> many people who would otherwise be interested in evolving the parquet
> >>>>> specification, but think that is at the core of this question.
> >>>>>
> >>>>> Kind Regards,
> >>>>>
> >>>>> Raphael
> >>>>>
> >>>>> On 13/05/2024 20:55, Andrew Lamb wrote:  
> >>>>>>

Re: [C++] Parquet and Arrow overlap

2024-05-16 Thread Antoine Pitrou
On Tue, 14 May 2024 10:22:37 -0700
Julien Le Dem  wrote:
> 1. I think we should make it easy for people contributing to the C++
> codebase. (which is why I voted for the move at the time)
> 2. If merging repos removes the need to deal with the circular dependency
> between repos issue for the C++ code bases, it does it at the expense of
> making it easy to evolve the parquet spec and the java and c++
> implementations together.

Hmm... I'm not sure I understand your point here. The Parquet spec and
the Java implementation are already living in distinct repos and have
distinct versioning schemes. The main thing that they share in common is
the JIRA instance (while the C++ Parquet implementation mostly relies on
Arrow's GH issue tracker), but is that really important?

> parquet-cpp depends only on arrow-core that does not have to depend on
> parquet-cpp.

That is true.

> Other components like
> arrow-dataset and pyarrow can depend on parquet-cpp just like they depend
> on orc externally.

Ideally yes. In practice there are two problems:
1) it creates a circular dependency between *repositories*.
2) the C++ Arrow Datasets component is not built independently, it is an
optional component when building Arrow C++. So we would also have a
chicken-and-egg problem when building Arrow C++ and Parquet C++.

> I realize that would be work to make it happen, but the current location of
> the parquet-cpp codebase is a big trade-off of prioritizing quick iteration
> on the C++ implementations over iteration on the format.

Having recently worked on a format addition and its respective
implementations (in Java and C++), I haven't found the current setup
more difficult to work with for Parquet C++ than it was for Parquet
Java. Admittedly I'm biased, being a heavy contributor to Arrow C++,
but I'm curious why the current situation would be detrimental to
iteration on the format.

Regards

Antoine.




Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-14 Thread Antoine Pitrou


AFAIK, the only Parquet implementation under the Apache Parquet project
is parquet-mr :-)


On Tue, 14 May 2024 10:58:58 +0200
Rok Mihevc  wrote:
> Second Raphael's point.
> Would it be reasonable to say specification change requires implementation
> in two parquet implementations within Apache Parquet project?
> 
> Rok
> 
> On Tue, May 14, 2024 at 10:50 AM Gang Wu 
>  wrote:
> 
> > IMHO, it looks more reasonable if a reference implementation is required
> > to support most (not all) elements from the specification.
> >
> > Another question is: should we discuss (and vote for) each candidate
> > one by one? We can start with parquet-mr which is most well-known
> > implementation.
> >
> > Best,
> > Gang
> >
> > On Tue, May 14, 2024 at 4:41 PM Raphael Taylor-Davies
> >  wrote:
> >  
> > > Potentially it would be helpful to flip the question around. As Andrew
> > > articulates, a reference implementation is required to implement all
> > > elements from the specification, and therefore the major consequence of
> > > labeling parquet-mr thusly would be that any specification change would
> > > have to be implemented within parquet-mr as part of the standardisation
> > > process. It would be insufficient for it to be implemented in, for
> > > example, two of the parquet implementations maintained by the arrow
> > > project. I personally think that would be a shame and likely exclude
> > > many people who would otherwise be interested in evolving the parquet
> > > specification, but think that is at the core of this question.
> > >
> > > Kind Regards,
> > >
> > > Raphael
> > >
> > > On 13/05/2024 20:55, Andrew Lamb wrote:  
> > > > Question: Should we label parquet-mr or any other parquet  
> > implementations  
> > > > "reference" implications"?
> > > >
> > > > This came up as part of Vinoo's great PR to list different parquet
> > > > reference implementations[1][2].
> > > >
> > > > The term "reference implementation" often has an official connotation.  
> > > For  
> > > > example the wikipedia definition is "a program that implements all
> > > > requirements from a corresponding specification. The reference
> > > > implementation ... should be considered the "correct" behavior of any  
> > > other  
> > > > implementation of it."[3]
> > > >
> > > > Given the close association of parquet-mr to the parquet standard, it  
> > is  
> > > a  
> > > > natural candidate to label as "reference implementation." However, it  
> > is  
> > > > not clear to me if there is consensus that it should be thusly labeled.
> > > >
> > > > I have a strong opinion that a consensus on this question would be very
> > > > helpful. I don't actually have a strong opinion about the answer
> > > >
> > > > Andrew
> > > >
> > > >
> > > >
> > > > [1]:  
> > > https://github.com/apache/parquet-site/pull/53#discussion_r1582882267  
> > > > [2]:  
> > > https://github.com/apache/parquet-site/pull/53#discussion_r1598283465  
> > > > [3]:  https://en.wikipedia.org/wiki/Reference_implementation
> > > >  
> > >  
> >  
> 





Re: Interest in Parquet V3

2024-05-14 Thread Antoine Pitrou
On Mon, 13 May 2024 16:10:24 +0100
Raphael Taylor-Davies

wrote:
> 
> I guess I wonder if rather than having a parquet format version 2, or 
> even a parquet format version 3, we could just document what features a 
> given parquet implementation actually supports. I believe Andrew intends 
> to pick up on where previous efforts here left off.

I also believe documenting implementation status is strongly desirable,
regardless of whether the discussion on "V3" goes anywhere.

Regards

Antoine.




Re: [C++] Parquet and Arrow overlap

2024-05-14 Thread Antoine Pitrou
1, 2024 at 8:46 AM Jacob Wujciak <  
> > assignu...@apache.org  
> > > >  
> > > > > > wrote:
> > > > > >  
> > > > > > > Hello Everyone!
> > > > > > >
> > > > > > > It seems there is general agreement on this topic, it would be  
> > > great  
> > > > > if a  
> > > > > > > committer/PMC could start a (lazy consensus) procedural vote.
> > > > > > >
> > > > > > > I will inquire how to handle the parquet-cpp component in jira  
> > > > (ideally  
> > > > > > > disabling it, not removing).
> > > > > > > There are currently only ~70 open tickets for parquet-cpp, with  
> > the  
> > > > > > change  
> > > > > > > in repo it is probably easier to just move open tickets but I'll  
> > > > leave  
> > > > > > that  
> > > > > > > to Rok who managed the transition of Arrows 20k+ tickets too :D
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jacob
> > > > > > >
> > > > > > > Arrow committer
> > > > > > >
> > > > > > > On 2024/04/25 05:31:18 Gang Wu wrote:  
> > > > > > > > I know we have some non-Java committers and PMCs. But after the 
> > > > > > > >  
> > > > > > > parquet-cpp  
> > > > > > > > donation, it seems that no one worked on Parquet from arrow  
> > (cpp,  
> > > > > rust,  
> > > > > > > go,  
> > > > > > > > etc.)
> > > > > > > > and other projects are promoted as a Parquet committer. It  
> > would  
> > > be  
> > > > > > > > inconvenient
> > > > > > > > for non-Java Parquet developers to work with  
> > > apache/parquet-format  
> > > > > and  
> > > > > > > > apache/parquet-testing repositories. Furthermore, votes from  
> > > these  
> > > > > > > > developers
> > > > > > > > are not binding for a format change in the ML.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Gang
> > > > > > > >
> > > > > > > > On Wed, Apr 24, 2024 at 8:42 PM Uwe L. Korn   
> > > > > wrote:  
> > > > > > > >  
> > > > > > > > > > Should we consider
> > > > > > > > > > Parquet developers from other projects than parquet-mr as  
> > > > Parquet  
> > > > > > > > > commuters?
> > > > > > > > >
> > > > > > > > > We are doing this (speaking as a Parquet PMC who didn't work  
> > on  
> > > > > > > > > parquet-mr, but parquet-cpp).
> > > > > > > > >
> > > > > > > > > Best
> > > > > > > > > Uwe
> > > > > > > > >
> > > > > > > > > On Wed, Apr 24, 2024, at 2:38 PM, Gang Wu wrote:  
> > > > > > > > > > +1 for moving parquet-cpp issues from Apache Jira to  
> > Arrow's  
> > > > > GitHub  
> > > > > > > > > issue.  
> > > > > > > > > >
> > > > > > > > > > Besides, I want to echo Will's question in the thread.  
> > Should  
> > > > we  
> > > > > > > consider  
> > > > > > > > > > Parquet developers from other projects than parquet-mr as  
> > > > Parquet  
> > > > > > > > > commiters?  
> > > > > > > > > > Currently apache/parquet-format and apache/parquet-testing  
> > > > > > > repositories  
> > > > > > > > > are  
> > > > > > > > > > solely governed by Apache Parquet PMC. It would be better  
> > for  
> > > > the  
> > > > > > > entire  
> > > > > > > > > > Parquet community if developers with sufficient  
> > contributions  
> > > > to  
> > > > > > open  
> > > > > > > > > source  
> > > > > > > > > > Parquet project

Re: Interest in Parquet V3

2024-05-13 Thread Antoine Pitrou


Same as Andrew.

1) the "v3" messaging is intuitively a turn-off as it's already not
obvious whether Parquet "v2" is usable with implementations currently
found in the wild. Concretely, the "v2" branding is commonly confused
with the Parquet format version, and it's almost impossible to explain
how they relate and differ without diving into implementation minutiae.

2) the "v3" messaging doesn't say anything about compatibility or
features: is "v3" a functional superset of "v2"? is it a clean slate
redesign of the Parquet format? does it use different technologies (for
example Flatbuffers instead of Thrift)?

While I would be curious to see a list of proposed changes, I'm also not
very convinced that launching such an initiative is desirable nor
sustainable for the Parquet development community.

Regards

Antoine.


On Sun, 12 May 2024 05:30:57 -0400
Andrew Lamb 
wrote:
> My opinion is that most (if not all) of the proposed benefits from these
> new formats can be achieved using the current parquet format and improved
> implementations (possibly with some minor extensions such as user defined
> encoding schemes)[1]
> 
> Another reason people propose replacing parquet I think is the "what is V2
> and what supports it" confusion, along with a perception that the Apache
> Parquet community mostly focuses on parquet-mr and not the format or the
> myriad of other implementations. Thankfully this is starting to change[2]
> 
> Thus, I think the best response for the Parquet community to these new
> format proposals is to clarify the current implementation situation (which
> will indirectly lead to more investment in current implementations)
> 
> Note this doesn't preclude "v3" of parquet, but I think in order to
> drive V3 adoption we first need to get the existing communication in better
> working order
> 
> Andrew
> 
> [1] I realize I need some more data to back up that assertion, and I am
> working on it.
> [2] https://github.com/apache/parquet-site/pull/53
> 
> 
> 
> On Sun, May 12, 2024 at 4:48 AM Gang Wu 
>  wrote:
> 
> > Hi Micah,
> >
> > I have also noticed the emergence of these new file formats which are
> > challenging the popularity of Apache Parquet. It would always be good
> > to evolve Parquet to be competitive. Personally I'm +1 on this. I'm also
> > proposing adding a new geometry type to the specs: [1]. This seems
> > to align with the goal of V3 to some extent.
> >
> > On the other hand, I'm also concerned with some aspects:
> > 1. Are there sufficient developers to work on this? As a committer to both
> > parquet-cpp and parquet-mr, I can take part in the V3 but I'm not sure if
> > there are enough active contributors. It would be good if some companies
> > could have dedicated people to work on this and move things forward.
> > 2. Users may not be willing to adopt new formats if current businesses
> > do not have any issue. Especially for users from large enterprises. Think
> > about the current issues of V2 [2].
> >
> > All in all, I feel excited about V3.
> >
> > [1] https://lists.apache.org/thread/q20b8kjvs27ly0w2zzxld029nwkc5fhx
> > [2] https://lists.apache.org/thread/r8djjov7wyy8646qm2xzwn9p2olsk9wn
> >
> > Best,
> > Gang
> >
> > On Sun, May 12, 2024 at 6:59 AM Micah Kornfield 
> > wrote:
> >  
> > > Hi Parquet Dev,
> > > I wanted to start a conversation within the community about working on a
> > > new revision of Parquet.  For context there have been a bunch of new
> > > formats [1][2][3] that show there is decent room for improvement across
> > > data encodings and how metadata is organized.
> > >
> > > Specifically, in a new format revision I think we should be thinking  
> > about  
> > > the following areas for improvements:
> > > 1.  More efficient encodings that allow for data skipping and SIMD
> > > optimizations.
> > > 2.  More efficient metadata handling for deserialization and projection  
> > to  
> > > address areas when metadata deserialization time is not trivial [4].
> > > 3.  Possibly thinking about different encodings instead of
> > > repetition/definition for repeated and nested field
> > > 4.  Support for optimizing semi-structured data (e.g. JSON or Variant  
> > type)  
> > > that can shred elements into individual columns (a recent thread in  
> > Iceberg  
> > > mentions doing this at the metadata level [5])
> > >
> > > I think the goals of V3 would be to provide existing API compatibility as
> > > broadly as possible (possibly with some performance loss) and expose new
> > > API surface areas where appropriate to make use of new elements.  New
> > > encodings could be backported so they can be made use of without metadata
> > > changes.  I think unfortunately that for points 2 and 3 we would want to
> > > break file level compatibility.  More thought would be needed to consider
> > > whether 4 could be backported effectively.
> > >
> > > This is a non-trivial amount of work to get good coverage across
> > > implementations, so before putting together more 

[RESULT] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64

2024-03-18 Thread Antoine Pitrou


Hello,

With 3 +1 binding votes and 6 +1 non-binding, the vote passes.

The next steps will be to merge the format additions and the testing
data file. The C++ and Java PRs can then proceed with further reviewing
before being merged.

Regards

Antoine.


On Thu, 7 Mar 2024 14:15:18 +0100
Antoine Pitrou  wrote:

> Hello,
> 
> As discussed previously on this ML [1], I am proposing to expand
> the types supported by the BYTE_STREAM_SPLIT encoding. The currently
> supported types are FLOAT and DOUBLE. The proposal expands the
> supported types to INT32, INT64 and FIXED_LEN_BYTE_ARRAY.
> 
> The format addition is tracked on JIRA where some measurements on
> sample data are also published and discussed [2].
> 
> (please note that the original ML thread only discussed expanding
> to FIXED_LEN_BYTE_ARRAY; discussion on the JIRA issue led to the
> conclusion that it would also be beneficial to cover INT32 and INT64)
> 
> The format additions are submitted as a PR in [3].
> A data file for integration testing is submitted in [4].
> An implementation for Parquet C++ is submitted in [5].
> An implementation for parquet-mr is submitted in [6].
> 
> This vote will be open for at least 1 week.
> 
> +1: Accept the format additions
> +0: ...
> -1: Reject the format additions because ...
> 
> Regards
> 
> Antoine.
> 
> 
> [1] https://lists.apache.org/thread/5on7rnc141jnw2cdxtsfgm5xhhdmsb4q
> [2] https://issues.apache.org/jira/browse/PARQUET-2414
> [3] https://github.com/apache/parquet-format/pull/229
> [4] https://github.com/apache/parquet-testing/pull/46
> [5] https://github.com/apache/arrow/pull/40094
> [6] https://github.com/apache/parquet-mr/pull/1291
> 
> 
> 
> 





[VOTE] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64

2024-03-07 Thread Antoine Pitrou


Hello,

As discussed previously on this ML [1], I am proposing to expand
the types supported by the BYTE_STREAM_SPLIT encoding. The currently
supported types are FLOAT and DOUBLE. The proposal expands the
supported types to INT32, INT64 and FIXED_LEN_BYTE_ARRAY.
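
(For readers unfamiliar with the encoding, here is a minimal Python sketch of the
idea; it is an illustration only, not the encoder implementation. For N values of
fixed width W, byte k of every value is gathered into stream k, and the W streams
are concatenated before compression. The proposal simply widens which physical
types this byte scattering may be applied to.)

    import struct

    def byte_stream_split(values, fmt="<f"):
        raw = [struct.pack(fmt, v) for v in values]
        width = len(raw[0])
        # one stream per byte position
        return [bytes(r[k] for r in raw) for k in range(width)]

    def byte_stream_join(streams, fmt="<f"):
        width, n = len(streams), len(streams[0])
        raw = [bytes(streams[k][i] for k in range(width)) for i in range(n)]
        return [struct.unpack(fmt, r)[0] for r in raw]

    vals = [1.0, 1.5, 2.0, 2.5]
    streams = byte_stream_split(vals)
    assert byte_stream_join(streams) == vals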

The format addition is tracked on JIRA where some measurements on
sample data are also published and discussed [2].

(please note that the original ML thread only discussed expanding
to FIXED_LEN_BYTE_ARRAY; discussion on the JIRA issue led to the
conclusion that it would also be beneficial to cover INT32 and INT64)

The format additions are submitted as a PR in [3].
A data file for integration testing is submitted in [4].
An implementation for Parquet C++ is submitted in [5].
An implementation for parquet-mr is submitted in [6].

This vote will be open for at least 1 week.

+1: Accept the format additions
+0: ...
-1: Reject the format additions because ...

Regards

Antoine.


[1] https://lists.apache.org/thread/5on7rnc141jnw2cdxtsfgm5xhhdmsb4q
[2] https://issues.apache.org/jira/browse/PARQUET-2414
[3] https://github.com/apache/parquet-format/pull/229
[4] https://github.com/apache/parquet-testing/pull/46
[5] https://github.com/apache/arrow/pull/40094
[6] https://github.com/apache/parquet-mr/pull/1291





Re: parquet-format status

2024-03-07 Thread Antoine Pitrou


Hello,

I am surprised that this is suggesting to deprecate or delete a
repository just because a website building procedure isn't properly
set up to deal with it.

ISTM the "right" solution would be for the Parquet website to
automatically update its contents based on the latest released version
of parquet-format. Perhaps using a git submodule or something.

Regards

Antoine.


On Tue, 5 Mar 2024 21:30:45 -0500
Vinoo Ganesh 
wrote:
> Hi Parquet Dev -
> 
> There have been some conversations about content stored on the
> parquet-format github repo vs. the website. Doing a cursory pass of the
> parquet-format  repo, it looks
> like, other than the markdown documentation stored in the repo, most of the
> core code was marked as deprecated here:
> https://github.com/apache/parquet-format/pull/105, content was moved to
> parquet-mr, and that entire repo really only exists to host this file:
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift.
> It's possible I'm missing something, but is my understanding correct?
> 
> If so, would it make sense to just deprecate parquet-format as a repo, move
> the content to be exclusively hosted on parquet-site, and host the thrift
> file elsewhere? This would solve the content duplication problem between
> parquet format and the website, and would cut down on having to manage a
> separate repo. I know there is benefit to having comments/discussions on
> PRs or issues on the repo, but we could also pretty easily port this to the
> site.
> 
> I'm sure this proposal will elicit some strong responses, but wanted to see
> if anyone had insights here / if I'm missing anything.
> 
> Thanks, Vinoo
> 
> 
> 
> 





Error building with IntelliJ

2024-01-18 Thread Antoine Pitrou


Hello all,

Thank you for the suggestions. I am trying to build parquet-mr from
IntelliJ now ("Build" -> "Build Project"), but I get the following
error:

/home/antoine/parquet/mr/parquet-common/src/test/java/org/apache/parquet/VersionTest.java:46:24
java: cannot find symbol
  symbol:   variable Version
  location: class org.apache.parquet.VersionTest


Am I missing something obvious? Does a separate step need to be run
first?

Regards

Antoine.



On Thu, 11 Jan 2024 18:48:20 +0100
Antoine Pitrou  wrote:

> Hello,
> 
> I'm trying to build parquet-mr and I'm unsure how to make the
> experience smooth enough for development. This is what I observe:
> 
> 1) running the tests is extremely long (they have been running for 10
> minutes already, with no sign of nearing completion)
> 
> 2) the output logs are a true firehose; there's a ton of extremely
> detailed (and probably superfluous) information being output, such as:
> 
> 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
> 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
> 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> 2024-01-11 18:45:33 INFO ParquetRewriter - Finish rewriting input file: file:/tmp/test12306662267168473656/test.parquet
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - RecordReader initialized will read a total of 10 records.
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 0. reading next block
> 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.zstd]
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 1 ms. row count = 100
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 100 records from 6 columns in 0 ms: Infinity rec/ms, Infinity cell/ms
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 100% reading (1 ms) and 0% processing (0 ms)
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 100. reading next block
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 0 ms. row count = 100
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 200 records from 6 columns in 1 ms: 200.0 rec/ms, 1200.0 cell/ms
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 50% reading (1 ms) and 50% processing (1 ms)
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 200. reading next block
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 0 ms. row count = 100
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 300 records from 6 columns in 1 ms: 300.0 rec/ms, 1800.0 cell/ms
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 50% reading (1 ms) and 50% processing (1 ms)
> 
> [etc.]
> 
> 
> 3) it seems the tests are leaving a lot of generated data files behind
> in /tmp/test..., though of course they might ultimately clean up at the
> end?
> 
> 
> How do people typically develop on parquet-mr? Do they have dedicated
> shell scripts that only build and test parts of the project? Do they
> use an IDE and select specific options there?
> 
> Regards
> 
> Antoine.
> 
> 
> 





Re: Pitch for Pcodec Encoding in Parquet

2024-01-15 Thread Antoine Pitrou


My personal sentiment is: not only its newness, but the fact that it is

1) highly non-trivial (it seems much more complicated than all other
Parquet encodings);
2) maintained by a single person, both the spec and the implementation
(please correct me if I'm wrong?); and
3) has little to no adoption currently (again, please correct me if
I'm wrong?).

Of course the adoption issue is a chicken-and-egg problem, but given
that Parquet files are used for long-term storage (not just transient
data), it's probably not a good idea to be an early adopter here.

And of course, if the encoding was simpler, points 2 and 3 wouldn't
really hurt.

This is just my opinion!

Regards

Antoine.


On Thu, 11 Jan 2024 22:02:02 -0500
Martin Loncaric 
wrote:
> To reach a conclusion on this thread, I understand the overall sentiment as:
> 
> Pco could technically work as a Parquet encoding, but people are wary of
> its newness and weak FFI support. It seems there is no immediate action to
> take, but would be worthwhile to consider this again further in the future.
> 
> On Thu, Jan 11, 2024 at 9:47 PM Martin Loncaric 
> wrote:
> 
> > I must admit I'm a bit surprised by these results. The first thing is  
> >> that the Pcodec results were actually obtained using dictionary
> >> encoding. Then I don't understand what is Pcodec-encoded: the dictionary
> >> values or the dictionary indices?  
> >
> >
> > No, pco cannot be dictionary encoded; it only goes from vec -> Bytes
> > and back. Some of Parquet's existing encodings are like this as well.
> >
> > The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much  
> >> worse than the PLAIN + Zstd results, which is unexpected (though not
> >> impossible).  
> >
> >
> > I explained briefly in the blog post, but BYTE_STREAM_SPLIT does terribly
> > for this data because there is high correlation among each number's bytes.
> > For instance, if each double is a multiple of 0.1, then the 52 mantissa
> > bits will look like 011011011011011... (011 repeating). That means there
> > are only 3 possibilities (<2 bits of entropy) for the last 6+ bytes of each
> > number. BYTE_STREAM_SPLIT throws this away, requiring 6+ times as many bits
> > for them.
> >
> > On Mon, Jan 8, 2024 at 10:44 AM Antoine Pitrou 
> >  wrote:
> >  
> >>
> >> Hello Martin,
> >>
> >> On Sat, 6 Jan 2024 17:09:07 -0500
> >> Martin Loncaric 
> >> wrote:  
> >> > >
> >> > > It would be very interesting to expand the comparison against
> >> > > BYTE_STREAM_SPLIT + compression.  
> >> >
> >> > Antoine: I created one now, at the bottom of the post
> >> > <https://graphallthethings.com/posts/the-parquet-we-could-have>. In  
> >> this  
> >> > case, BYTE_STREAM_SPLIT did worse.  
> >>
> >> I must admit I'm a bit surprised by these results. The first thing is
> >> that the Pcodec results were actually obtained using dictionary
> >> encoding. Then I don't understand what is Pcodec-encoded: the dictionary
> >> values or the dictionary indices?
> >>
> >> The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
> >> worse than the PLAIN + Zstd results, which is unexpected (though not
> >> impossible).
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>  
> 





Re: Guidelines for working on parquet-mr?

2024-01-11 Thread Antoine Pitrou


Update: I finally Ctrl-C'ed the tests; they had left around 14 GB of
data in /tmp.

Regards

Antoine.


On Thu, 11 Jan 2024 18:48:20 +0100
Antoine Pitrou  wrote:

> Hello,
> 
> I'm trying to build parquet-mr and I'm unsure how to make the
> experience smooth enough for development. This is what I observe:
> 
> 1) running the tests is extremely long (they have been running for 10
> minutes already, with no sign of nearing completion)
> 
> 2) the output logs are a true firehose; there's a ton of extremely
> detailed (and probably superfluous) information being output, such as:
> 
> 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
> 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
> 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> 2024-01-11 18:45:33 INFO ParquetRewriter - Finish rewriting input file: file:/tmp/test12306662267168473656/test.parquet
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - RecordReader initialized will read a total of 10 records.
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 0. reading next block
> 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.zstd]
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 1 ms. row count = 100
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 100 records from 6 columns in 0 ms: Infinity rec/ms, Infinity cell/ms
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 100% reading (1 ms) and 0% processing (0 ms)
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 100. reading next block
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 0 ms. row count = 100
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 200 records from 6 columns in 1 ms: 200.0 rec/ms, 1200.0 cell/ms
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 50% reading (1 ms) and 50% processing (1 ms)
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 200. reading next block
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 0 ms. row count = 100
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 300 records from 6 columns in 1 ms: 300.0 rec/ms, 1800.0 cell/ms
> 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 50% reading (1 ms) and 50% processing (1 ms)
> 
> [etc.]
> 
> 
> 3) it seems the tests are leaving a lot of generated data files behind
> in /tmp/test..., though of course they might ultimately clean up at the
> end?
> 
> 
> How do people typically develop on parquet-mr? Do they have dedicated
> shell scripts that only build and test parts of the project? Do they
> use an IDE and select specific options there?
> 
> Regards
> 
> Antoine.
> 
> 
> 





Guidelines for working on parquet-mr?

2024-01-11 Thread Antoine Pitrou


Hello,

I'm trying to build parquet-mr and I'm unsure how to make the
experience smooth enough for development. This is what I observe:

1) running the tests is extremely long (they have been running for 10
minutes already, with no sign of nearing completion)

2) the output logs are a true firehose; there's a ton of extremely
detailed (and probably superfluous) information being output, such as:

2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
2024-01-11 18:45:33 INFO ParquetRewriter - Finish rewriting input file: file:/tmp/test12306662267168473656/test.parquet
2024-01-11 18:45:33 INFO InternalParquetRecordReader - RecordReader initialized will read a total of 10 records.
2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 0. reading next block
2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.zstd]
2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 1 ms. row count = 100
2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 100 records from 6 columns in 0 ms: Infinity rec/ms, Infinity cell/ms
2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 100% reading (1 ms) and 0% processing (0 ms)
2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 100. reading next block
2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 0 ms. row count = 100
2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 200 records from 6 columns in 1 ms: 200.0 rec/ms, 1200.0 cell/ms
2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 50% reading (1 ms) and 50% processing (1 ms)
2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 200. reading next block
2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 0 ms. row count = 100
2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 300 records from 6 columns in 1 ms: 300.0 rec/ms, 1800.0 cell/ms
2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 50% reading (1 ms) and 50% processing (1 ms)

[etc.]


3) it seems the tests are leaving a lot of generated data files behind
in /tmp/test..., though of course they might ultimately clean up at the
end?


How do people typically develop on parquet-mr? Do they have dedicated
shell scripts that only build and test parts of the project? Do they
use an IDE and select specific options there?

Regards

Antoine.




Re: [Format] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY

2024-01-08 Thread Antoine Pitrou


Hello all,

Based on the response received, it seems this addition is
non-controversial and generally considered beneficial.

What should be the way forward? Should I submit a format update
and then one or two implementations thereof?

Regards

Antoine.


On Sun, 7 Jan 2024 23:40:11 -0800
Micah Kornfield 
wrote:
> I responded there but generally, this doesn't seem like it imposes a lot of
> implementation burden and can be useful.
> 
> On Thu, Dec 14, 2023 at 12:59 PM Antoine Pitrou 
>  wrote:
> 
> >
> > Hello,
> >
> > Just a heads up here so as to reach a wider audience: I've posted a
> > format addition proposal in
> > https://issues.apache.org/jira/browse/PARQUET-2414
> >
> > Excerpt:
> > """
> > This issue proposed to widen the types supported by the
> > BYTE_STREAM_SPLIT. By allowing the BYTE_STREAM_SPLIT on any
> > FIXED_LEN_BYTE_ARRAY column, we can automatically improve compression
> > efficiency on various column types including:
> >
> > half-float data
> > fixed-width decimal data
> >
> > [etc.]
> > """
> >
> > Feel free to comment here or on the JIRA issue.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >  
> 





Re: Pitch for Pcodec Encoding in Parquet

2024-01-08 Thread Antoine Pitrou


Hello Martin,

On Sat, 6 Jan 2024 17:09:07 -0500
Martin Loncaric 
wrote:
> >
> > It would be very interesting to expand the comparison against
> > BYTE_STREAM_SPLIT + compression.  
> 
> Antoine: I created one now, at the bottom of the post
> . In this
> case, BYTE_STREAM_SPLIT did worse.

I must admit I'm a bit surprised by these results. The first thing is
that the Pcodec results were actually obtained using dictionary
encoding. Then I don't understand what is Pcodec-encoded: the dictionary
values or the dictionary indices?

The second thing is that the BYTE_STREAM_SPLIT + Zstd results are much
worse than the PLAIN + Zstd results, which is unexpected (though not
impossible).

Regards

Antoine.




Re: Pitch for Pcodec Encoding in Parquet

2024-01-05 Thread Antoine Pitrou


Hello,

It would be very interesting to expand the comparison against
BYTE_STREAM_SPLIT + compression.

See https://issues.apache.org/jira/browse/PARQUET-2414 for a proposal
to extend the range of types supporting BYTE_STREAM_SPLIT.

Regards

Antoine.


On Wed, 3 Jan 2024 00:10:14 -0500
Martin Loncaric 
wrote:
> I'd like to propose and get feedback on a new encoding for numerical
> columns: pco. I just did a blog post demonstrating how this would perform
> on various real-world datasets <https://graphallthethings.com/posts/the-parquet-we-could-have>. TL;DR: pco
> losslessly achieves much better compression ratio (44-158% higher) and
> slightly faster decompression speed than zstd-compressed Parquet. On the
> other hand, it compresses somewhat slower at default compression level, but
> I think this difference may disappear in future updates.
> 
> I think supporting this optional encoding would be an enormous win, but I'm
> not blind to the difficulties of implementing it:
> * Writing a good JVM implementation would be very difficult, so we'd
> probably have to make a JNI library.
> * Pco must be compressed one "chunk" (probably one per Parquet data page)
> at a time, with no way to estimate the encoded size until it has already
> done >50% of the compression work. I suspect the best solution is to split
> pco data pages based on unencoded size, which is different from existing
> encodings. I think this makes sense since pco fulfills the role usually
> played by compression in Parquet.
> 
> Please let me know what you think of this idea.
> 
> Thanks,
> Martin
> 





[Format] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY

2023-12-14 Thread Antoine Pitrou


Hello,

Just a heads up here so as to reach a wider audience: I've posted a
format addition proposal in
https://issues.apache.org/jira/browse/PARQUET-2414

Excerpt:
"""
This issue proposed to widen the types supported by the
BYTE_STREAM_SPLIT. By allowing the BYTE_STREAM_SPLIT on any
FIXED_LEN_BYTE_ARRAY column, we can automatically improve compression
efficiency on various column types including:

half-float data
fixed-width decimal data

[etc.]
"""

Feel free to comment here or on the JIRA issue.

Regards

Antoine.




[jira] [Updated] (PARQUET-2369) Clarify Support for Pages Compressed with Multiple GZIP Members

2023-11-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2369:

Fix Version/s: format-2.10.0

> Clarify Support for Pages Compressed with Multiple GZIP Members
> ---
>
> Key: PARQUET-2369
> URL: https://issues.apache.org/jira/browse/PARQUET-2369
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Raphael Taylor-Davies
>Priority: Major
> Fix For: format-2.10.0
>
>
> https://github.com/apache/parquet-testing/pull/41



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2369) Clarify Support for Pages Compressed with Multiple GZIP Members

2023-11-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2369:

Component/s: parquet-format

> Clarify Support for Pages Compressed with Multiple GZIP Members
> ---
>
> Key: PARQUET-2369
> URL: https://issues.apache.org/jira/browse/PARQUET-2369
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Raphael Taylor-Davies
>Priority: Major
>
> https://github.com/apache/parquet-testing/pull/41



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2369) Clarify Support for Pages Compressed with Multiple GZIP Members

2023-11-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2369:

Priority: Major  (was: Trivial)

> Clarify Support for Pages Compressed with Multiple GZIP Members
> ---
>
> Key: PARQUET-2369
> URL: https://issues.apache.org/jira/browse/PARQUET-2369
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Raphael Taylor-Davies
>Priority: Major
>
> https://github.com/apache/parquet-testing/pull/41



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Lossy compression of floating point data

2023-11-09 Thread Antoine Pitrou


Hello all,

I agree with the sentiments expressed by Micah.

* a lossy algorithm is more difficult to reason about for users;
* it cannot be enabled by default, for obvious reasons;
* the min/max statistics values should remain correct, that is: min
  should be a lower bound, max an upper bound;
* adding niche encodings does not seem particularly attractive for the
  Parquet ecosystem and the maintainers of the various Parquet
  implementations.

I would add that the encoding should either be very easy to understand
and implement (such as BYTE_STREAM_SPLIT), or already
well-established in the software ecosystem.

Given the above, I also think there should be a clear proof that this
encoding brings very significant benefits over the statu quo. I would
suggest a comparison between the following combinations:

* PLAIN encoding
* PLAIN encoding + lz4 (or snappy)
* PLAIN encoding + zstd
* BYTE_STREAM_SPLIT encoding + lz4 (or snappy)
* BYTE_STREAM_SPLIT encoding + zstd
* SZ encoding
* SZ encoding + lz4 (or snappy)
* SZ encoding + zstd

The comparison should show the compression ratio,
encoding+compression speed, and decompression+decoding speed.
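
As a starting point, a rough Python sketch of such a comparison for the
combinations that already exist in Parquet (it assumes pyarrow's `use_dictionary`,
`compression` and `use_byte_stream_split` writer options; SZ has no Parquet
encoding today, so those rows would need a separate harness):

    import os, time
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Replace with representative float data; random noise is a worst case.
    table = pa.table({"x": np.random.default_rng(0).normal(size=1_000_000)})

    configs = {
        "plain":      dict(use_dictionary=False, compression="NONE"),
        "plain+lz4":  dict(use_dictionary=False, compression="LZ4"),
        "plain+zstd": dict(use_dictionary=False, compression="ZSTD"),
        "bss+lz4":    dict(use_dictionary=False, compression="LZ4", use_byte_stream_split=True),
        "bss+zstd":   dict(use_dictionary=False, compression="ZSTD", use_byte_stream_split=True),
    }

    for name, opts in configs.items():
        path = f"/tmp/bench_{name}.parquet"
        t0 = time.perf_counter()
        pq.write_table(table, path, **opts)
        t1 = time.perf_counter()
        pq.read_table(path)
        t2 = time.perf_counter()
        ratio = os.path.getsize(path) / table.nbytes
        print(f"{name:12s} ratio={ratio:.3f}  write={t1 - t0:.2f}s  read={t2 - t1:.2f}s")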

Regards

Antoine.



On Fri, 3 Nov 2023 15:04:29 +
Michael Bernardi  wrote:
> Dear all,
> 
> Myself and others at the Technical University of Munich are interested in adding 
> a new lossy compression algorithm to the Parquet format to support the 
> compression of floating point data. This is a continuation of the work by 
> Martin Radev. Here are some related links:
> 
> Email thread: https://lists.apache.org/thread/5hz040d4dd4ctk51qy11wojp2v5k2kxn
> Report: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
> 
> This work ultimately resulted in the addition of the BYTE_STREAM_SPLIT 
> encoding, which allows the lossless compression algorithms to better compress 
> floating point data.
> 
> Martin's report also investigated lossy compressors which can be supplied 
> with an error bound and depending on this bound deliver much higher 
> compression ratios for similar computing time. The SZ compression library was 
> found to be quite promising, but it was discounted at the time due to issues 
> with thread safety and the API being immature. In the meantime these issues 
> have largely been resolved and it's now possible to use SZ with HDF5 (see the 
> link below). Therefore I'd like to reconsider adding it (or another similar 
> algorithm) to Parquet.
> 
> https://github.com/szcompressor/SZ3/tree/d2a03eae45730997be64126961d7abda0f950791/tools/H5Z-SZ3
> 
> Whatever lossy compression method we choose, it would probably have to be 
> implemented as a Parquet encoding rather than a compression for a couple 
> reasons:
> 
> 1) The algorithm can only compress a flat buffer of floating point data. It's 
> therefore not possible to use it for whole file compression and must be used 
> only on individual columns.
> 2) If it were implemented as a compression, it would conflict with underlying 
> encodings which would make the floating point values unreadable to the 
> algorithm.
> 
> Note that introducing lossy compression could introduce a situation where 
> values like min and max in the statistics page might not be found in the 
> decompressed data. There are probably other considerations here that I've 
> missed.
> 
> I look forward to reading your response.
> 
> Best regards,
> Michael Bernardi
> 
> 





[jira] [Updated] (PARQUET-1646) [C++] Use arrow::Buffer for buffered dictionary indices in DictEncoder instead of std::vector

2023-11-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1646:

Fix Version/s: cpp-15.0.0
   (was: cpp-14.0.0)

> [C++] Use arrow::Buffer for buffered dictionary indices in DictEncoder 
> instead of std::vector
> -
>
> Key: PARQUET-1646
> URL: https://issues.apache.org/jira/browse/PARQUET-1646
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-15.0.0
>
>
> Follow up to ARROW-6411



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2099) [C++] Statistics::num_values() is misleading

2023-11-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2099:

Fix Version/s: cpp-15.0.0
   (was: cpp-14.0.0)

> [C++] Statistics::num_values() is misleading 
> -
>
> Key: PARQUET-2099
> URL: https://issues.apache.org/jira/browse/PARQUET-2099
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Micah Kornfield
>Priority: Major
> Fix For: cpp-15.0.0
>
>
> num_values() in statistics seems to capture the number of encoded values.  
> This is misleading as everyplace else in parquet num_values() really 
> indicates all values (null and not-null, i.e. the number of levels).  
> We should likely remove this field, rename it or at the very least update the 
> documentation.
> CC [~zeroshade]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2321) allow customized buffer size when creating ArrowInputStream for a column PageReader

2023-11-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2321:

Fix Version/s: cpp-15.0.0
   (was: cpp-14.0.0)

> allow customized buffer size when creating ArrowInputStream for a column 
> PageReader
> ---
>
> Key: PARQUET-2321
> URL: https://issues.apache.org/jira/browse/PARQUET-2321
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Jinpeng Zhou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-15.0.0
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> When buffered stream is enabled, all column chunks, regardless of their 
> actual sizes, are currently sharing the same buffer size which is stored in 
> the shared [read 
> properties]([https://github.com/apache/arrow/blob/main/cpp/src/parquet/file_reader.cc#L213).]
>   
> Given a limited memory budget, one may want to customize buffer size for 
> different column chunks based on their actual size, i.e., smaller chunks will 
> consume less memory budget for their buffers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Support for Multiple GZIP Members in Page

2023-10-19 Thread Antoine Pitrou


To reiterate what I've already said on the GH issue page, I'm skeptical
about the use case presented by the original submitter (parallel
GZip compression of data pages).

1) GZip is technically obsolete compared to Zstd, Lz4 or even Snappy or
Brotli;

2) data pages are meant to be small (L1-cache sized, typically), so
splitting them in even smaller chunks for compression doesn't sound
like a terrific strategy;

3) systems using Parquet generally parallelize at a higher level
already (for example at the row group or column chunk level), so
probably wouldn't gain much by also parallelizing data compression.

I wouldn't mind the proposed spec addition, but for now this is
occurring because of a single person pushing for it on Github, so the
motivation seems rather weak.

Regards

Antoine.



On Thu, 19 Oct 2023 10:24:57 +0100
Raphael Taylor-Davies

wrote:
> Hi All,
> 
> Recently it was reported that many of the arrow parquet readers, 
> including arrow-cpp, pyarrow and arrow-rs, do not support GZIP 
> compressed pages containing multiple members [3]. It would also appear 
> other parquet implementations such as DuckDB have similar issues [4]. 
> This in turn led to some discussion as to whether this was permissible 
> according to the parquet specification [5], with the proposed compromise 
> to explicitly state that multiple members should be supported by 
> readers, but to recommend writers don't produce such pages by default 
> given the non-trivial install base where this will cause issues 
> including silent data corruption. I have tried to encode this in [6], 
> and welcome any feedback.
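
For implementers, a minimal sketch of the reader-side behaviour being asked for,
using only Python's standard-library gzip/zlib modules (an illustration, not the
code in any of the implementations above):

    import gzip
    import zlib

    def gunzip_all_members(buf: bytes) -> bytes:
        # Keep inflating until all concatenated gzip members are consumed,
        # instead of stopping after the first one.
        out = bytearray()
        while buf:
            d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # expect gzip framing
            out += d.decompress(buf)
            out += d.flush()
            buf = d.unused_data  # bytes remaining after the end of this member
        return bytes(out)

    # A page built from two members; a reader that stops at the first member
    # would silently return only b"hello ".
    page = gzip.compress(b"hello ") + gzip.compress(b"world")
    assert gunzip_all_members(page) == b"hello world"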
> 
> Kind Regards,
> 
> Raphael Taylor-Davies
> 
> [1]: https://github.com/apache/arrow/pull/38272
> [2]: https://github.com/apache/arrow-rs/pull/4951
> [3]: https://datatracker.ietf.org/doc/html/rfc1952
> [4]: 
> https://github.com/apache/parquet-testing/pull/41#issuecomment-1770410715
> [5]: https://github.com/apache/parquet-testing/pull/41
> [6]: https://github.com/apache/parquet-format/pull/218
> 
> 





Re: [VOTE][RESULT] Add Float16 type to specification

2023-10-19 Thread Antoine Pitrou


Thanks for doing this Ben!


On Fri, 13 Oct 2023 16:49:23 -0400
Ben Harkins 
wrote:
> With 5 +1 binding votes and 3 +1 non-binding, the vote passes.
> 
> Thank you to everyone who participated!
> 
> Votes:
> 
>- Antoine Pitrou - +1 (non-binding)
>- Xinli shang - +1
>- Ryan Blue - +1
>- Gábor Szádovszky - +1
>- Micah Kornfield - +1 (non-binding)
>- Gang Wu - +1 (non-binding)
>- Daniel Weeks - +1
>- Uwe L. Korn - +1
> 
> 





Re: [VOTE][Format] Add Float16 type to specification

2023-10-05 Thread Antoine Pitrou


Hello,

+1 from me (non-binding).

Regards

Antoine.


On Wed, 4 Oct 2023 16:14:00 -0400
Ben Harkins 
wrote:

> Hi everyone,
> 
> I would like to propose adding a half-precision floating point type to
> the Parquet format specification, in accordance with the active
> proposal here:
> 
> 
>- https://github.com/apache/parquet-format/pull/184
> 
> To summarize, the current proposal would introduce a Float16 logical
> type, represented by a little-endian 2-byte FixedLenByteArray. The
> value's encoding would adhere to the IEEE-754 standard [1].
> Furthermore, implementations should ensure that any value comparisons
> and ordering requirements (mainly for column statistics) emulate the
> behavior of native (i.e. physical) floating point types.
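
To make the representation concrete, a small Python sketch (Python's struct "e"
format is IEEE-754 binary16), illustrating why statistics must be compared with
float semantics rather than as raw FLBA bytes:

    import struct

    def to_flba(x: float) -> bytes:
        return struct.pack("<e", x)      # 2-byte little-endian half float

    def from_flba(b: bytes) -> float:
        return struct.unpack("<e", b)[0]

    vals = [1.5, -2.0, 0.25]
    encoded = [to_flba(v) for v in vals]
    assert all(len(b) == 2 for b in encoded)

    # Byte-wise (unsigned lexicographic) order of the encoded values is not the
    # numeric order, so min/max statistics must be computed on the decoded values:
    assert sorted(encoded) != [to_flba(v) for v in sorted(vals)]
    stat_min, stat_max = min(vals), max(vals)   # -2.0, 1.5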
> 
> As for how this would look in practice, there are currently several
> implementations of this proposal that are more or less complete:
> 
> 
>- C++ (and Python): https://github.com/apache/arrow/pull/36073
>- Java: https://github.com/apache/parquet-mr/pull/1142
>- Go: https://github.com/apache/arrow/pull/37599
> 
> Of course, we're prepared to make adjustments to the implementations as
> needed, since the format additions will need to be approved before those
> PRs are merged. I should also note that naming conventions haven't been
> extensively discussed, so feel free to chime in if you have a strong
> preference for HALF or HALF_FLOAT over FLOAT16!
> 
> 
> This vote will be open for at least 72 hours.
> 
> [ ] +1 Add this type to the format specification
> [ ] +0
> [ ] -1 Do not add this type to the format specification because...
> 
> Thanks!
> 
> Ben
> 
> [1]: https://en.wikipedia.org/wiki/Half-precision_floating-point_format
> 
> 





[jira] [Resolved] (PARQUET-2238) Spec and parquet-mr disagree on DELTA_BYTE_ARRAY encoding

2023-09-26 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2238.
-
Resolution: Duplicate

> Spec and parquet-mr disagree on DELTA_BYTE_ARRAY encoding
> -
>
> Key: PARQUET-2238
> URL: https://issues.apache.org/jira/browse/PARQUET-2238
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format, parquet-mr
>Reporter: Jan Finis
>Priority: Minor
>
> The spec in parquet-format specifies that [DELTA_BYTE_ARRAY is only supported 
> for the physical type 
> BYTE_ARRAY|https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array--6].
>  Yet, [parquet-mr also uses it to encode 
> FIXED_LEN_BYTE_ARRAY|https://github.com/apache/parquet-mr/blob/fd1326a8a56174320ea2f36d7c6c4e78105ab108/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L83].
> So, I guess the spec should be updated to include FIXED_LEN_BYTE_ARRAY in the 
> supported types of DELTA_BYTE_ARRAY encoding, or the code should be changed 
> to no longer write this encoding for FIXED_LEN_BYTE_ARRAY.
> I guess changing the spec is more prudent, given that 
> a) the encoding can make sense for FIXED_LEN_BYTE_ARRAY
> and
> b) there might already be countless files written with this encoding / type 
> combination.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-1646) [C++] Use arrow::Buffer for buffered dictionary indices in DictEncoder instead of std::vector

2023-08-24 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1646:

Fix Version/s: cpp-14.0.0
   (was: cpp-13.0.0)

> [C++] Use arrow::Buffer for buffered dictionary indices in DictEncoder 
> instead of std::vector
> -
>
> Key: PARQUET-1646
> URL: https://issues.apache.org/jira/browse/PARQUET-1646
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-14.0.0
>
>
> Follow up to ARROW-6411



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2321) allow customized buffer size when creating ArrowInputStream for a column PageReader

2023-08-24 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2321:

Fix Version/s: cpp-14.0.0
   (was: cpp-13.0.0)

> allow customized buffer size when creating ArrowInputStream for a column 
> PageReader
> ---
>
> Key: PARQUET-2321
> URL: https://issues.apache.org/jira/browse/PARQUET-2321
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Jinpeng Zhou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-14.0.0
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> When the buffered stream is enabled, all column chunks, regardless of their 
> actual sizes, currently share the same buffer size, which is stored in 
> the shared read properties 
> (https://github.com/apache/arrow/blob/main/cpp/src/parquet/file_reader.cc#L213).
>   
> Given a limited memory budget, one may want to customize the buffer size for 
> different column chunks based on their actual size, i.e., smaller chunks 
> would consume less of the memory budget for their buffers.
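
For context, a minimal PyArrow example of the existing knob, which maps to the single shared buffer size this issue wants to make configurable per column chunk (the path and 64 KiB value are illustrative only):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder file for the example.
pq.write_table(pa.table({"a": list(range(1000))}), "/tmp/example.parquet")

# Today one buffer size applies to every column chunk in the file.
pf = pq.ParquetFile("/tmp/example.parquet", buffer_size=64 * 1024)
print(pf.read(columns=["a"]).num_rows)
{code}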



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2099) [C++] Statistics::num_values() is misleading

2023-08-24 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2099:

Fix Version/s: cpp-14.0.0
   (was: cpp-13.0.0)

> [C++] Statistics::num_values() is misleading 
> -
>
> Key: PARQUET-2099
> URL: https://issues.apache.org/jira/browse/PARQUET-2099
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Micah Kornfield
>Priority: Major
> Fix For: cpp-14.0.0
>
>
> num_values() in statistics seems to capture the number of encoded values.  
> This is misleading as everyplace else in parquet num_values() really 
> indicates all values (null and not-null, i.e. the number of levels).  
> We should likely remove this field, rename it or at the very least update the 
> documentation.
> CC [~zeroshade]
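
To illustrate the ambiguity from the reader side, a small PyArrow snippet (placeholder file) contrasting the chunk-level value count with the statistics field in question:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

pq.write_table(pa.table({"x": [1, None, 3, None, 5]}), "/tmp/stats.parquet")
col = pq.ParquetFile("/tmp/stats.parquet").metadata.row_group(0).column(0)

# Number of values (levels) in the column chunk, nulls included.
print(col.num_values)              # 5

# Statistics.num_values: per this issue it reflects the encoded (non-null)
# values, which is easy to confuse with the figure above.
print(col.statistics.num_values)   # expected 3 here, which is the confusion
print(col.statistics.null_count)   # 2
{code}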



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[Request] Send automated notifications to a separate mailing-list

2023-08-21 Thread Antoine Pitrou


Hello,

I would like to request that automated notifications (from GitHub,
Jira... whatever) be sent to a separate mailing-list and GMane mirror.
Currently, the endless stream of automated notifications in this
mailing-list means that discussions between humans quickly get lost or
even unnoticed by other people.

For the record, we did this move in Apache Arrow and never came back.

Thanks in advance

Antoine.




[jira] [Updated] (PARQUET-2323) Use bit vector to store Prebuffered column chunk index

2023-07-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2323:

Fix Version/s: cpp-13.0.0
   (was: cpp-14.0.0)

> Use bit vector to store Prebuffered column chunk index
> --
>
> Key: PARQUET-2323
> URL: https://issues.apache.org/jira/browse/PARQUET-2323
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Jinpeng Zhou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-13.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> In https://issues.apache.org/jira/browse/PARQUET-2316 we allow partial buffering 
> in the parquet File Reader by storing prebuffered column chunk indices in a hash 
> set, and make a copy of this hash set for each rowgroup reader.
> In extreme conditions where numerous columns are prebuffered and multiple 
> rowgroup readers are created for the same row group, the hash set would 
> incur significant overhead. 
> Using a bit vector would be a reasonable mitigation, taking 4KB for 32K 
> columns.
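
A back-of-the-envelope sketch (not the actual C++ change) of the proposed bit vector and the quoted memory figure:

{code:python}
num_columns = 32 * 1024                       # 32K columns
bitmap = bytearray((num_columns + 7) // 8)    # one bit per column -> 4096 bytes (4KB)

def mark_prebuffered(column_index):
    bitmap[column_index // 8] |= 1 << (column_index % 8)

def is_prebuffered(column_index):
    return bool(bitmap[column_index // 8] & (1 << (column_index % 8)))

mark_prebuffered(12345)
print(len(bitmap), is_prebuffered(12345), is_prebuffered(12346))  # 4096 True False
{code}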



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-07-26 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17747306#comment-17747306
 ] 

Antoine Pitrou commented on PARQUET-:
-

bq. Should we just keep the specs as is and let the implementations decide 
which encoding to use for boolean values?

Makes sense. But can you please open an issue for these discussions? This is 
unrelated to the issue I originally reported, and which is fixed.

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-
> URL: https://issues.apache.org/jira/browse/PARQUET-
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>        Reporter: Antoine Pitrou
>Assignee: Xuwei Fu
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-07-25 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746834#comment-17746834
 ] 

Antoine Pitrou commented on PARQUET-:
-

There are other implementations around, so I would be a bit uneasy about 
changing the spec like this.
Perhaps we should simply switch to v2 data pages by default in parquet-cpp and 
parquet-mr at some point?

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-
> URL: https://issues.apache.org/jira/browse/PARQUET-
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>        Reporter: Antoine Pitrou
>Assignee: Xuwei Fu
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2323) Use bit vector to store Prebuffered column chunk index

2023-07-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2323:

Fix Version/s: cpp-14.0.0
   (was: cpp-13.0.0)

> Use bit vector to store Prebuffered column chunk index
> --
>
> Key: PARQUET-2323
> URL: https://issues.apache.org/jira/browse/PARQUET-2323
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Jinpeng Zhou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-14.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> In https://issues.apache.org/jira/browse/PARQUET-2316 we allow partial buffering 
> in the parquet File Reader by storing prebuffered column chunk indices in a hash 
> set, and make a copy of this hash set for each rowgroup reader.
> In extreme conditions where numerous columns are prebuffered and multiple 
> rowgroup readers are created for the same row group, the hash set would 
> incur significant overhead. 
> Using a bit vector would be a reasonable mitigation, taking 4KB for 32K 
> columns.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2323) Use bit vector to store Prebuffered column chunk index

2023-07-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2323.
-
Resolution: Fixed

Issue resolved by pull request 36649
https://github.com/apache/arrow/pull/36649

> Use bit vector to store Prebuffered column chunk index
> --
>
> Key: PARQUET-2323
> URL: https://issues.apache.org/jira/browse/PARQUET-2323
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Jinpeng Zhou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-13.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> In https://issues.apache.org/jira/browse/PARQUET-2316 we allow partial buffering 
> in the parquet File Reader by storing prebuffered column chunk indices in a hash 
> set, and make a copy of this hash set for each rowgroup reader.
> In extreme conditions where numerous columns are prebuffered and multiple 
> rowgroup readers are created for the same row group, the hash set would 
> incur significant overhead. 
> Using a bit vector would be a reasonable mitigation, taking 4KB for 32K 
> columns.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-06-15 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733129#comment-17733129
 ] 

Antoine Pitrou edited comment on PARQUET- at 6/15/23 3:32 PM:
--

Resolved in https://github.com/apache/parquet-format/pull/193


was (Author: pitrou):
Resolved in https://github.com/apache/parquet-format/pull/211

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-
> URL: https://issues.apache.org/jira/browse/PARQUET-
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>        Reporter: Antoine Pitrou
>Assignee: Gang Wu
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-06-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-.
-
Resolution: Fixed

Resolved in https://github.com/apache/parquet-format/pull/211

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-
> URL: https://issues.apache.org/jira/browse/PARQUET-
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>        Reporter: Antoine Pitrou
>Assignee: Gang Wu
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2310) [Doc] Add implementation status / matrix

2023-06-15 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733100#comment-17733100
 ] 

Antoine Pitrou commented on PARQUET-2310:
-

This was originally proposed in https://github.com/apache/arrow/pull/36027

> [Doc] Add implementation status / matrix
> 
>
> Key: PARQUET-2310
> URL: https://issues.apache.org/jira/browse/PARQUET-2310
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>        Reporter: Antoine Pitrou
>Priority: Major
>
> In Apache Arrow we have a documentation page listing the feature status for 
> various implementations of Arrow: https://arrow.apache.org/docs/status.html
> It could be nice to have a similar page for the main Parquet implementations 
> (at least Java, C++, Rust).
> The main downside is that it needs to be kept up to date.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2310) [Doc] Add implementation status / matrix

2023-06-15 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733099#comment-17733099
 ] 

Antoine Pitrou commented on PARQUET-2310:
-

cc [~wgtmac] [~gszadovszky] [~alippai]

> [Doc] Add implementation status / matrix
> 
>
> Key: PARQUET-2310
> URL: https://issues.apache.org/jira/browse/PARQUET-2310
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>        Reporter: Antoine Pitrou
>Priority: Major
>
> In Apache Arrow we have a documentation page listing the feature status for 
> various implementations of Arrow: https://arrow.apache.org/docs/status.html
> It could be nice to have a similar page for the main Parquet implementations 
> (at least Java, C++, Rust).
> The main downside is that it needs to be kept up to date.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2310) [Doc] Add implementation status / matrix

2023-06-15 Thread Antoine Pitrou (Jira)
Antoine Pitrou created PARQUET-2310:
---

 Summary: [Doc] Add implementation status / matrix
 Key: PARQUET-2310
 URL: https://issues.apache.org/jira/browse/PARQUET-2310
 Project: Parquet
  Issue Type: Task
  Components: parquet-format
Reporter: Antoine Pitrou


In Apache Arrow we have a documentation page listing the feature status for 
various implementations of Arrow: https://arrow.apache.org/docs/status.html

It could be nice to have a similar page for the main Parquet implementations 
(at least Java, C++, Rust).

The main downside is that it needs to be kept up to date.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Bloom filters for full-text search and predicate pushdown

2023-06-15 Thread Antoine Pitrou


Hi,

This would require standardizing on a specific tokenization algorithm,
right? I'm not sure it's a good idea to add such complexity to the
Parquet spec (the tokenization might need to be language-specific
and/or corpus-specific).

I wonder if it would be more productive to try and find ways to build
e.g. a Lucene index over Parquet columns (perhaps it's already
possible?).

Regards

Antoine.



On Wed, 7 Jun 2023 18:01:32 +0800
Gang Wu  wrote:
> Hi Marco,
> 
> That sounds interesting!
> 
> However, this requires the parquet implementation to be able to tokenize
> both
> strings to write and literals in the filters. The actual efficiency depends
> on the
> data distribution. I am also concerned with the possible explosion of
> distinct
> values introduced by splitting words, which may result in a large bloom
> filter.
> 
> Have you tried any PoC to get a rough estimate of benefits in your use case?
> 
> Best,
> Gang
> 
> 
> 
> On Tue, Jun 6, 2023 at 5:06 PM Marco Colli 
>  wrote:
> 
> > Hello,
> >
> > I see that Parquet already supports Bloom filters.
> >
> > As far as I understand, it currently uses them only on the entire value.
> >
> > For example, if I have a column "MovieTitle":
> >
> > - "The title of my movie"
> > - "Another movie title"
> > - "The best movie title"
> > - ...
> >
> > Then the current Bloom filters can be used to find only the column
> > chunks/pages that match an exact title. For example you can use the bloom
> > filter to search for "The best movie title".
> >
> > It would be interesting to have *a bloom filter on the specific words*,
> > instead of using the entire value: in this way you can search the word
> > "best" in the "MovieTitle" column and find the titles that contain that
> > specific word in an efficient way.
> >
> > It would enable a sort of full-text search of keywords inside text columns.
> > It would also allow predicate pushdown for searches based on keywords.
> >
> > Would make sense to have such an addition? Is there any strategy already
> > used by Parquet for fast keyword searches inside text columns?
> >
> >
> > Best regards,
> > Marco Colli
> > AbstractBrain srls
> >  
> 
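
To make the idea in this thread concrete, here is a toy sketch: tokenize each value into words and insert the words (rather than the whole string) into a per-chunk filter. A Python set stands in for the Bloom filter, and the tokenizer is exactly the kind of language- and corpus-specific choice the spec would have to pin down.

{code:python}
import re

titles = ["The title of my movie", "Another movie title", "The best movie title"]

def tokenize(text):
    # Naive word tokenizer -- the language/corpus-specific part.
    return re.findall(r"\w+", text.lower())

# A plain set stands in for a per-column-chunk Bloom filter of words.
word_filter = set()
for title in titles:
    word_filter.update(tokenize(title))

# Predicate pushdown sketch: skip the chunk only if the word cannot be present.
print("best" in word_filter)     # True  -> must read the chunk
print("horror" in word_filter)   # False -> chunk can be skipped
{code}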





[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-02-27 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693891#comment-17693891
 ] 

Antoine Pitrou commented on PARQUET-:
-

Yes, this is why I've filed this under parquet-format.

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-
> URL: https://issues.apache.org/jira/browse/PARQUET-
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>        Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-02-27 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17693870#comment-17693870
 ] 

Antoine Pitrou commented on PARQUET-:
-

> I don't understand. Isn't length the part of encoding in spec?

What do you mean?

> And seems that DataPageV2 in parquet-mr is not in-prod?

What is that supposed to mean?

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-
> URL: https://issues.apache.org/jira/browse/PARQUET-
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>    Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Fwd: [C++] Parquet and Arrow overlap

2023-02-02 Thread Antoine Pitrou




Hi Will,

On 01/02/2023 at 20:27, Will Jones wrote:


First, it's not obvious where issues are supposed to be open: In Parquet
Jira or Arrow GitHub issues. Looking back at some of the original
discussion, it looks like the intention was

* use PARQUET-XXX for issues relating to Parquet core

* use ARROW-XXX for issues relating to Arrow's consumption of Parquet
core (e.g. changes that are in parquet/arrow right now)


The README for the old parquet-cpp repo [3] states instead in its
migration note:

  JIRA issues should continue to be opened in the PARQUET JIRA project.

Either way, it doesn't seem like this process is obvious to people. Perhaps
we could clarify this and add notices to Arrow's GitHub issues template?


I agree we should clarify this. I have no personal preference, but I will note
that Github issues decrease friction as having a GH account is already necessary
for submitting PRs.


Second, committer status is a little unclear. I am a committer on Arrow,
but not on Parquet right now. Does that mean I should only merge Parquet
C++ PRs for code changes in parquet/arrow? Or that I shouldn't merge
Parquet changes at all?


Since Parquet C++ is part of Arrow C++, you are allowed to merge Parquet C++
changes. As always you should ensure you have sufficient understanding of the
contribution, and that it follows established practices:
https://arrow.apache.org/docs/dev/developers/reviewing.html


Also, are the contributions to Arrow C++ Parquet being actively reviewed
for potential new committers?


I would certainly do.

Regards

Antoine.



[jira] [Resolved] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY

2023-01-19 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2231.
-
Resolution: Fixed

Closed by PR https://github.com/apache/parquet-format/pull/189

> [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY
> -
>
> Key: PARQUET-2231
> URL: https://issues.apache.org/jira/browse/PARQUET-2231
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>        Reporter: Antoine Pitrou
>    Assignee: Antoine Pitrou
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec says that DELTA_BYTE_ARRAY is only supported for BYTE_ARRAY, but in 
> parquet-mr it has been allowed for FIXED_LEN_BYTE_ARRAY as well since 2015.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-152) Encoding issue with fixed length byte arrays

2023-01-16 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-152:
---
Component/s: parquet-mr

> Encoding issue with fixed length byte arrays
> 
>
> Key: PARQUET-152
> URL: https://issues.apache.org/jira/browse/PARQUET-152
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Nezih Yigitbasi
>Assignee: Sergio Peña
>Priority: Minor
> Fix For: 1.8.0
>
>
> While running some tests against the master branch I hit an encoding issue 
> that seemed like a bug to me.
> I noticed that when writing a fixed length byte array and the array's size is 
> > dictionaryPageSize (in my test it was 512), the encoding falls back to 
> DELTA_BYTE_ARRAY as seen below:
> {noformat}
> Dec 17, 2014 3:41:10 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: 
> written 12,125B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 1,710B 
> raw, 1,710B comp, 5 pages, encodings: [DELTA_BYTE_ARRAY]
> {noformat}
> But then read fails with the following exception:
> {noformat}
> Caused by: parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is 
> only supported for type BINARY
>   at parquet.column.Encoding$7.getValuesReader(Encoding.java:193)
>   at 
> parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:534)
>   at 
> parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:574)
>   at 
> parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:54)
>   at 
> parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:518)
>   at 
> parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:510)
>   at parquet.column.page.DataPageV2.accept(DataPageV2.java:123)
>   at 
> parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:510)
>   at 
> parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:502)
>   at 
> parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:604)
>   at 
> parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:348)
>   at 
> parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
>   at 
> parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
>   at 
> parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:267)
>   at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
>   at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
>   at 
> parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
>   at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
>   at 
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:129)
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:198)
>   ... 16 more
> {noformat}
> When the array's size is < dictionaryPageSize, RLE_DICTIONARY encoding is 
> used and read works fine:
> {noformat}
> Dec 17, 2014 3:39:50 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: 
> written 50B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 3B raw, 3B 
> comp, 1 pages, encodings: [RLE_DICTIONARY, PLAIN], dic { 1 entries, 8B raw, 
> 1B comp}
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY

2023-01-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677300#comment-17677300
 ] 

Antoine Pitrou commented on PARQUET-2231:
-

[~rok] [~shanhuang] [~muthunagappan] [~jinshang] FYI

> [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY
> -
>
> Key: PARQUET-2231
> URL: https://issues.apache.org/jira/browse/PARQUET-2231
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>        Reporter: Antoine Pitrou
>    Assignee: Antoine Pitrou
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec says that DELTA_BYTE_ARRAY is only supported for BYTE_ARRAY, but in 
> parquet-mr it has been allowed for FIXED_LEN_BYTE_ARRAY as well since 2015.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2231) [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY

2023-01-16 Thread Antoine Pitrou (Jira)
Antoine Pitrou created PARQUET-2231:
---

 Summary: [Format] Encoding spec incorrect for DELTA_BYTE_ARRAY
 Key: PARQUET-2231
 URL: https://issues.apache.org/jira/browse/PARQUET-2231
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: format-2.10.0


The spec says that DELTA_BYTE_ARRAY is only supported for BYTE_ARRAY, but in 
parquet-mr it has been allowed for FIXED_LEN_BYTE_ARRAY as well since 2015.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-152) Encoding issue with fixed length byte arrays

2023-01-16 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677297#comment-17677297
 ] 

Antoine Pitrou commented on PARQUET-152:


It would be nice if the encodings spec had been updated as well, because for 
now it mentions that DELTA_BYTE_ARRAY is only supported for BYTE_ARRAY columns, 
not FIXED_LEN_BYTE_ARRAY. See PARQUET-2231.

> Encoding issue with fixed length byte arrays
> 
>
> Key: PARQUET-152
> URL: https://issues.apache.org/jira/browse/PARQUET-152
> Project: Parquet
>  Issue Type: Bug
>Reporter: Nezih Yigitbasi
>Assignee: Sergio Peña
>Priority: Minor
> Fix For: 1.8.0
>
>
> While running some tests against the master branch I hit an encoding issue 
> that seemed like a bug to me.
> I noticed that when writing a fixed length byte array and the array's size is 
> > dictionaryPageSize (in my test it was 512), the encoding falls back to 
> DELTA_BYTE_ARRAY as seen below:
> {noformat}
> Dec 17, 2014 3:41:10 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: 
> written 12,125B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 1,710B 
> raw, 1,710B comp, 5 pages, encodings: [DELTA_BYTE_ARRAY]
> {noformat}
> But then read fails with the following exception:
> {noformat}
> Caused by: parquet.io.ParquetDecodingException: Encoding DELTA_BYTE_ARRAY is 
> only supported for type BINARY
>   at parquet.column.Encoding$7.getValuesReader(Encoding.java:193)
>   at 
> parquet.column.impl.ColumnReaderImpl.initDataReader(ColumnReaderImpl.java:534)
>   at 
> parquet.column.impl.ColumnReaderImpl.readPageV2(ColumnReaderImpl.java:574)
>   at 
> parquet.column.impl.ColumnReaderImpl.access$400(ColumnReaderImpl.java:54)
>   at 
> parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:518)
>   at 
> parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:510)
>   at parquet.column.page.DataPageV2.accept(DataPageV2.java:123)
>   at 
> parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:510)
>   at 
> parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:502)
>   at 
> parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:604)
>   at 
> parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:348)
>   at 
> parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
>   at 
> parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
>   at 
> parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:267)
>   at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
>   at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
>   at 
> parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
>   at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
>   at 
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:129)
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:198)
>   ... 16 more
> {noformat}
> When the array's size is < dictionaryPageSize, RLE_DICTIONARY encoding is 
> used and read works fine:
> {noformat}
> Dec 17, 2014 3:39:50 PM INFO: parquet.hadoop.ColumnChunkPageWriteStore: 
> written 50B for [flba_field] FIXED_LEN_BYTE_ARRAY: 5,000 values, 3B raw, 3B 
> comp, 1 pages, encodings: [RLE_DICTIONARY, PLAIN], dic { 1 entries, 8B raw, 
> 1B comp}
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


IMPORTANT: specification bugs around v2 data pages

2023-01-04 Thread Antoine Pitrou


Hello,

I would like to bring this list's attention to two alleged bugs in the
specification around v2 data pages:

- https://issues.apache.org/jira/browse/PARQUET-2221: Encoding spec
  incorrect for dictionary fallback

- https://issues.apache.org/jira/browse/PARQUET-: RLE encoding spec
  incorrect for v2 data pages

Regards

Antoine.




[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-01-04 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654524#comment-17654524
 ] 

Antoine Pitrou commented on PARQUET-:
-

cc [~julienledem] [~pnarang] [~rdblue] [~alexlevenson]

> [Format] RLE encoding spec incorrect for v2 data pages
> --
>
> Key: PARQUET-
> URL: https://issues.apache.org/jira/browse/PARQUET-
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>        Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec 
> (https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
>  has this:
> {code}
> rle-bit-packed-hybrid: <length> <encoded-data>
> length := length of the <encoded-data> in bytes stored as 4 bytes little 
> endian (unsigned int32)
> {code}
> But the length is actually prepended only in v1 data pages, not in v2 data 
> pages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-01-04 Thread Antoine Pitrou (Jira)
Antoine Pitrou created PARQUET-:
---

 Summary: [Format] RLE encoding spec incorrect for v2 data pages
 Key: PARQUET-
 URL: https://issues.apache.org/jira/browse/PARQUET-
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Antoine Pitrou
 Fix For: format-2.10.0


The spec 
(https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
 has this:
{code}
rle-bit-packed-hybrid: <length> <encoded-data>
length := length of the <encoded-data> in bytes stored as 4 bytes little endian 
(unsigned int32)
{code}

But the length is actually prepended only in v1 data pages, not in v2 data 
pages.
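
To spell out the difference, a schematic reader-side sketch (not taken from any implementation): for v1 data pages the RLE/bit-packed level data carries its own 4-byte little-endian length prefix, while for v2 data pages the lengths come from the page header fields and no prefix is written.

{code:python}
import struct

def rle_levels_slice(page_bytes, is_v2, header_levels_byte_length=None):
    """Return (levels_bytes, rest) for RLE/bit-packed hybrid level data.

    v1: the levels are prefixed with their byte length (4-byte LE uint32).
    v2: no prefix; the byte length comes from the DataPageHeaderV2 fields
        (repetition_levels_byte_length / definition_levels_byte_length)."""
    if is_v2:
        n = header_levels_byte_length
        return page_bytes[:n], page_bytes[n:]
    (n,) = struct.unpack_from("<I", page_bytes, 0)
    return page_bytes[4:4 + n], page_bytes[4 + n:]

# v1-style: the prefix says the levels occupy 3 bytes.
print(rle_levels_slice(b"\x03\x00\x00\x00LVLrest", is_v2=False))
# (b'LVL', b'rest')

# v2-style: same 3 bytes of levels, but the length came from the page header.
print(rle_levels_slice(b"LVLrest", is_v2=True, header_levels_byte_length=3))
# (b'LVL', b'rest')
{code}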





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2218) [Format] Clarify CRC computation

2023-01-03 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2218.
-
Resolution: Fixed

Fixed by PR https://github.com/apache/parquet-format/pull/188

> [Format] Clarify CRC computation
> 
>
> Key: PARQUET-2218
> URL: https://issues.apache.org/jira/browse/PARQUET-2218
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>        Reporter: Antoine Pitrou
>    Assignee: Antoine Pitrou
>Priority: Minor
> Fix For: format-2.10.0
>
>
> The format spec on CRC checksumming felt ambiguous when trying to implement 
> it in Parquet C++, so we should make the wording clearer.
> (see discussion on 
> https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137 and 
> below)
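
As I understand the clarified wording (treat the exact scope as an assumption and check the PR above), the checksum is the standard CRC-32 used by gzip/zlib, computed over the page bytes as they are laid out on disk after the page header. A minimal sketch:

{code:python}
import zlib

# Hypothetical page bytes as laid out on disk after the page header
# (whether that means compressed and/or encrypted bytes is exactly what the
# clarified spec wording pins down -- assumption here, see the linked PR).
page_data = b"\x00\x01\x02compressed-page-bytes"

crc = zlib.crc32(page_data) & 0xFFFFFFFF   # standard CRC-32 (gzip/zlib polynomial)
print(hex(crc))
{code}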



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2221) [Format] Encoding spec incorrect for dictionary fallback

2023-01-03 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654025#comment-17654025
 ] 

Antoine Pitrou commented on PARQUET-2221:
-

cc [~julienledem] [~pnarang] [~rdblue] [~alexlevenson]

> [Format] Encoding spec incorrect for dictionary fallback
> 
>
> Key: PARQUET-2221
> URL: https://issues.apache.org/jira/browse/PARQUET-2221
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>        Reporter: Antoine Pitrou
>Priority: Critical
> Fix For: format-2.10.0
>
>
> The spec for DICTIONARY_ENCODING states that:
> bq. If the dictionary grows too big, whether in size or number of distinct 
> values, the encoding will fall back to the plain encoding. 
> https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8
> However, the parquet-mr implementation was deliberately changed to a 
> different fallback mechanism in 
> https://issues.apache.org/jira/browse/PARQUET-52
> I'm assuming the parquet-mr implementation is authoritative here. But then 
> the spec is incorrect and should be fixed to reflect expected behavior.
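
For readers unfamiliar with the mechanism, a toy writer-side sketch of dictionary fallback in general (not the parquet-mr code, and deliberately agnostic about which encoding is fallen back to, since that is precisely what PARQUET-52 changed):

{code:python}
MAX_DICT_BYTES = 1024 * 1024   # illustrative threshold, not parquet-mr's default

class FallbackValuesWriter:
    """Toy dictionary writer: encode indices until the dictionary grows too
    big, then switch subsequent values to a fallback encoding (which encoding
    that should be is exactly what this issue is about)."""

    def __init__(self):
        self.dictionary = {}       # value -> index
        self.dict_bytes = 0
        self.fallen_back = False

    def write(self, value: bytes):
        if not self.fallen_back and value not in self.dictionary:
            if self.dict_bytes + len(value) > MAX_DICT_BYTES:
                self.fallen_back = True     # dictionary too big: fall back
            else:
                self.dictionary[value] = len(self.dictionary)
                self.dict_bytes += len(value)
        if self.fallen_back:
            return ("fallback", value)      # plain or another encoding
        return ("dict_index", self.dictionary[value])

w = FallbackValuesWriter()
print(w.write(b"spam"), w.write(b"eggs"))      # dictionary-encoded
print(w.write(b"x" * (2 * 1024 * 1024))[0])    # 'fallback' once the limit is hit
{code}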



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-52) Improve the encoding fall back mechanism for Parquet 2.0

2023-01-03 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-52?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-52:
--
Description: 
https://github.com/apache/incubator-parquet-mr/pull/74

-> moved to https://github.com/apache/parquet-mr/pull/74

  was:https://github.com/apache/incubator-parquet-mr/pull/74


> Improve the encoding fall back mechanism for Parquet 2.0
> 
>
> Key: PARQUET-52
> URL: https://issues.apache.org/jira/browse/PARQUET-52
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>Priority: Major
> Fix For: 1.6.0
>
>
> https://github.com/apache/incubator-parquet-mr/pull/74
> -> moved to https://github.com/apache/parquet-mr/pull/74



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2221) [Format] Encoding spec incorrect for dictionary fallback

2023-01-03 Thread Antoine Pitrou (Jira)
Antoine Pitrou created PARQUET-2221:
---

 Summary: [Format] Encoding spec incorrect for dictionary fallback
 Key: PARQUET-2221
 URL: https://issues.apache.org/jira/browse/PARQUET-2221
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Antoine Pitrou
 Fix For: format-2.10.0


The spec for DICTIONARY_ENCODING states that:

bq. If the dictionary grows too big, whether in size or number of distinct 
values, the encoding will fall back to the plain encoding. 

https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8

However, the parquet-mr implementation was deliberately changed to a different 
fallback mechanism in https://issues.apache.org/jira/browse/PARQUET-52

I'm assuming the parquet-mr implementation is authoritative here. But then the 
spec is incorrect and should be fixed to reflect expected behavior.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-796) Delta Encoding is not used when dictionary enabled

2023-01-03 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-796:
---
Priority: Major  (was: Critical)

> Delta Encoding is not used when dictionary enabled
> --
>
> Key: PARQUET-796
> URL: https://issues.apache.org/jira/browse/PARQUET-796
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Jakub Liska
>Priority: Major
>
> Current code doesn't enable using both Delta Encoding and Dictionary 
> Encoding. If I instantiate ParquetWriter like this : 
> {code}
> val writer = new ParquetWriter[Group](outFile, new GroupWriteSupport, codec, 
> blockSize, pageSize, dictPageSize, enableDictionary = true, true, 
> ParquetProperties.WriterVersion.PARQUET_2_0, configuration)
> {code}
> Then this piece of code : 
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java#L78-L86
> Causes that DictionaryValuesWriter is used instead of the inferred 
> DeltaLongEncodingWriter. 
> The original issue is here : 
> https://github.com/apache/parquet-mr/pull/154#issuecomment-266489768



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2218) [Format] Clarify CRC computation

2022-12-13 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2218:

Description: 
The format spec on CRC checksumming felt ambiguous when trying to implement it 
in Parquet C++, so we should make the wording clearer.

(see discussion on 
https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137 and 
below)

  was:The format spec on CRC checksumming felt ambiguous when trying to 
implement it in Parquet C++, so we should make the wording clearer.


> [Format] Clarify CRC computation
> 
>
> Key: PARQUET-2218
> URL: https://issues.apache.org/jira/browse/PARQUET-2218
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>        Reporter: Antoine Pitrou
>    Assignee: Antoine Pitrou
>Priority: Minor
> Fix For: format-2.10.0
>
>
> The format spec on CRC checksumming felt ambiguous when trying to implement 
> it in Parquet C++, so we should make the wording clearer.
> (see discussion on 
> https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137 and 
> below)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2218) [Format] Clarify CRC computation

2022-12-13 Thread Antoine Pitrou (Jira)
Antoine Pitrou created PARQUET-2218:
---

 Summary: [Format] Clarify CRC computation
 Key: PARQUET-2218
 URL: https://issues.apache.org/jira/browse/PARQUET-2218
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: format-2.10.0


The format spec on CRC checksumming felt ambiguous when trying to implement it 
in Parquet C++, so we should make the wording clearer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-1629) Page-level CRC checksum verification for DataPageV2

2022-12-13 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17646612#comment-17646612
 ] 

Antoine Pitrou commented on PARQUET-1629:
-

[~mwish] for the record. Perhaps you would be interested in doing this, if you 
can do some Java.

> Page-level CRC checksum verification for DataPageV2
> ---
>
> Key: PARQUET-1629
> URL: https://issues.apache.org/jira/browse/PARQUET-1629
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Boudewijn Braams
>Priority: Major
>
> In https://jira.apache.org/jira/browse/PARQUET-1580 (Github PR: 
> https://github.com/apache/parquet-mr/pull/647) we implemented page level CRC 
> checksum verification for DataPageV1. As a follow up, we should add support 
> for DataPageV2 that follows the spec (see see 
> https://jira.apache.org/jira/browse/PARQUET-1539).
> What needs to be done:
> * Add writing out checksums for DataPageV2
> * Add checksum verification for DataPageV2
> * Create new test suite
> * Create new benchmarks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2204) TypedColumnReaderImpl::Skip should reuse scratch space

2022-12-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2204.
-
Fix Version/s: cpp-11.0.0
   Resolution: Fixed

Issue resolved by pull request 14509
https://github.com/apache/arrow/pull/14509

> TypedColumnReaderImpl::Skip should reuse scratch space
> --
>
> Key: PARQUET-2204
> URL: https://issues.apache.org/jira/browse/PARQUET-2204
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: fatemah
>Assignee: fatemah
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-11.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> TypedColumnReaderImpl::Skip allocates scratch space on every call. The 
> scratch space is used to read rep/def levels and values and throw them away. 
> The memory allocation slows down the skip based on microbenchmarks. The 
> scratch space can be allocated once and re-used.
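
As a general illustration of the pattern (a sketch, not the Arrow C++ change itself): allocate the throwaway buffer once and reuse it across Skip calls instead of reallocating per call.

{code:python}
class ColumnSkipper:
    """Sketch of reusing scratch space when skipping values: levels/values
    read during a skip are thrown away, so one preallocated buffer can serve
    every call instead of being allocated anew each time."""

    def __init__(self, batch_size=1024):
        self._scratch = bytearray(batch_size)   # allocated once, reused

    def skip(self, reader, num_values):
        remaining = num_values
        while remaining > 0:
            n = min(remaining, len(self._scratch))
            reader.read_into(memoryview(self._scratch)[:n])
            remaining -= n

class DummyReader:
    def read_into(self, buf):   # stand-in for the real decode-into-buffer call
        buf[:] = b"\x00" * len(buf)

ColumnSkipper().skip(DummyReader(), 5000)
{code}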



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2204) TypedColumnReaderImpl::Skip should reuse scratch space

2022-12-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned PARQUET-2204:
---

Assignee: fatemah

> TypedColumnReaderImpl::Skip should reuse scratch space
> --
>
> Key: PARQUET-2204
> URL: https://issues.apache.org/jira/browse/PARQUET-2204
> Project: Parquet
>  Issue Type: Improvement
>Reporter: fatemah
>Assignee: fatemah
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> TypedColumnReaderImpl::Skip allocates scratch space on every call. The 
> scratch space is used to read rep/def levels and values and throw them away. 
> The memory allocation slows down the skip based on microbenchmarks. The 
> scratch space can be allocated once and re-used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2204) TypedColumnReaderImpl::Skip should reuse scratch space

2022-12-08 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2204:

Component/s: parquet-cpp

> TypedColumnReaderImpl::Skip should reuse scratch space
> --
>
> Key: PARQUET-2204
> URL: https://issues.apache.org/jira/browse/PARQUET-2204
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: fatemah
>Assignee: fatemah
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> TypedColumnReaderImpl::Skip allocates scratch space on every call. The 
> scratch space is used to read rep/def levels and values and throw them away. 
> The memory allocation slows down the skip based on microbenchmarks. The 
> scratch space can be allocated once and re-used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-12-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-1222:

Fix Version/s: format-2.10.0

> Specify a well-defined sorting order for float and double types
> ---
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Assignee: Micah Kornfield
>Priority: Critical
> Fix For: format-2.10.0
>
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.
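
For readers wondering what a total order over IEEE 754 values can look like, here is the classic bit-twiddling trick (an illustration only, not necessarily the ordering the resolution adopted): map each double to an unsigned key so that the keys sort as -NaN < -Inf < negative < -0 < +0 < positive < +Inf < +NaN.

{code:python}
import struct

def total_order_key(x: float) -> int:
    """Map a double to a uint64 key whose natural ordering is a total order."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    if bits & (1 << 63):           # negative: flip all bits so magnitudes reverse
        return bits ^ 0xFFFFFFFFFFFFFFFF
    return bits | (1 << 63)        # non-negative: set the sign bit

values = [float("nan"), float("inf"), 1.0, 0.0, -0.0, -1.0, float("-inf")]
print(sorted(values, key=total_order_key))
# [-inf, -1.0, -0.0, 0.0, 1.0, inf, nan]
{code}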



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-12-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-1222.
-
Resolution: Fixed

> Specify a well-defined sorting order for float and double types
> ---
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Assignee: Micah Kornfield
>Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-12-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned PARQUET-1222:
---

Assignee: Micah Kornfield

> Specify a well-defined sorting order for float and double types
> ---
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Assignee: Micah Kornfield
>Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2215) Document how DELTA_BINARY_PACKED handles overflow for deltas

2022-11-23 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned PARQUET-2215:
---

Assignee: Antoine Pitrou

> Document how DELTA_BINARY_PACKED handles overflow for deltas
> 
>
> Key: PARQUET-2215
> URL: https://issues.apache.org/jira/browse/PARQUET-2215
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-format
>Reporter: Rok Mihevc
>    Assignee: Antoine Pitrou
>Priority: Major
>  Labels: docs
>
> [Current 
> docs|https://github.com/apache/parquet-format/blob/master/Encodings.md?plain=1#L160]
>  do not explicitly state how overflow is handled.
> [See 
> discussion|https://github.com/apache/arrow/pull/14191#discussion_r1028298973] 
> for more details.
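
To make the concern concrete, a small sketch of the corner case. The behavior shown (wrap-around in 64-bit two's complement so the round trip stays exact) is an assumption about one reasonable resolution; the linked discussion and the eventual doc change are authoritative.

{code:python}
MASK = (1 << 64) - 1

def to_signed64(u):
    return u - (1 << 64) if u >= (1 << 63) else u

def wrapping_delta(prev, cur):
    """Delta computed modulo 2^64 so INT64_MIN - INT64_MAX cannot overflow."""
    return to_signed64((cur - prev) & MASK)

INT64_MAX = (1 << 63) - 1
INT64_MIN = -(1 << 63)

# The naive delta (INT64_MIN - INT64_MAX) does not fit in int64;
# wrapping arithmetic keeps the round trip exact.
d = wrapping_delta(INT64_MAX, INT64_MIN)
print(d)                                                  # 1 (wrapped)
print(to_signed64((INT64_MAX + d) & MASK) == INT64_MIN)   # True
{code}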



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2206) Microbenchmark for ColumnReadaer ReadBatch and Skip

2022-11-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved PARQUET-2206.
-
Fix Version/s: cpp-11.0.0
   Resolution: Fixed

Issue resolved by pull request 14523
[https://github.com/apache/arrow/pull/14523]

> Microbenchmark for ColumnReadaer ReadBatch and Skip
> ---
>
> Key: PARQUET-2206
> URL: https://issues.apache.org/jira/browse/PARQUET-2206
> Project: Parquet
>  Issue Type: Improvement
>Reporter: fatemah
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-11.0.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
>  Adding a micro benchmark for column reader ReadBatch and Skip. Later, I will 
> add benchmarks for RecordReader's ReadRecords and SkipRecords.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2206) Microbenchmark for ColumnReadaer ReadBatch and Skip

2022-11-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated PARQUET-2206:

Component/s: parquet-cpp

> Microbenchmark for ColumnReadaer ReadBatch and Skip
> ---
>
> Key: PARQUET-2206
> URL: https://issues.apache.org/jira/browse/PARQUET-2206
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: fatemah
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-11.0.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
>  Adding a micro benchmark for column reader ReadBatch and Skip. Later, I will 
> add benchmarks for RecordReader's ReadRecords and SkipRecords.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

