Re: [VOTE][FORMAT] Add repetition, definition and variable length size metadata statistics

2023-11-06 Thread Ed Seidl
+1 (non-binding) Thanks! Ed

Re: [VOTE] Release Apache Parquet Format 2.10.0 RC0

2023-11-17 Thread Ed Seidl
+1 (non-binding) Thanks for pushing this Gang, can't wait for the new features! Ed

Parquet feature matrix

2024-05-06 Thread Ed Seidl
Hi all, Given the recent confusion on this list concerning Parquet V1 vs V2, I was wondering if there was any interest in the community to create a feature matrix that users could consult to see which implementations would work with features they consider important. My group is compiling our o

Re: Parquet feature matrix

2024-05-06 Thread Ed Seidl
On 5/6/24 6:21 PM, Gang Wu wrote: Hi, There was an effort on this: https://github.com/apache/parquet-site/pull/34 Thanks Gang for the added context! The PR is still open, but is it abandoned? It looks like a good starting point. It would be good if we can have something like what Apache Arro

Re: Repeated fields spec clarification

2024-05-10 Thread Ed Seidl
Fun stuff...have felt the pain ;-) Given that the glossary defines a row group as "[a] logical horizontal partitioning of the data into rows", emphasis "rows" and not "records", I think that pretty strongly implies that row groups, at least, must start on a row boundary. I too would be in su

Re: [ANNOUNCE] New Parquet PMC Member: Gang Wu

2024-05-11 Thread Ed Seidl
+1 :-)  Congrats, Gang! On 5/11/24 4:05 PM, Micah Kornfield wrote: Congrats Gang! On Sat, May 11, 2024 at 12:15 PM Vinoo Ganesh wrote: Congrats, Gang!! On Sat, May 11, 2024 at 8:45 PM Claire McGinty Congrats Gang!! Well deserved! - Claire On Sat, May 11, 2024 at 6:22 PM Fokko Driespr

Re: Interest in Parquet V3

2024-05-13 Thread Ed Seidl
I think the whole "V1" vs "V2" mess is unfortunate. IMO there is only one version of the Parquet file format. At its core, the data layout (row groups composed of column chunks composed of Dremel encoded pages) has never changed. Encodings/codecs/structures have been added to that core, but always

Re: [DISCUSS] Parquet Reference Implementation ?

2024-05-14 Thread Ed Seidl
Given the breadth of the parquet community at this point, I don't think we should be singling out one or two "reference" implementations. Even parquet-mr, AFAIK, still doesn't implement DELTA_LENGTH_BYTE_ARRAY encoding in a user-accessible way (it's only available as part of the DELTA_BYTE_ARRAY w

Re: Typical data page size

2024-05-23 Thread Ed Seidl
I haven't seen it mentioned in this thread, but for the curious the 2 row limit appears to come from a 2020 blog post by Cloudera [1] (in the section "Testing with Parquet-MR"). Cheers, Ed [1] https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/ On 5/23/24 6:

Re: Typical data page size

2024-05-23 Thread Ed Seidl
t, so that also pages in nested columns abide by the same rules. I don't see a big argument why you would not want these limits for nested columns. Cheers, Jan Am Do., 23. Mai 2024 um 17:28 Uhr schrieb Ed Seidl : I haven't seen it mentioned in this thread, but for the curious the 2

BYTE_ARRAY vs binary in Parquet specification

2024-05-23 Thread Ed Seidl
Hi all, A question came up in the discussion of PARQUET-2474 [1] about the use of 'binary' in the LogicalTypes.md file [2]. Is the use of 'binary' where 'BYTE_ARRAY' is the physical type a holdover from an earlier version of the spec? How would you all feel about cleaning this up and using 'B

Re: [DISCUSS] Extension types in Parquet?

2024-05-28 Thread Ed Seidl
I like the idea of an EXTENSION logical type (Antoine's option 1). Perhaps the stats ordering could be left as an implementation detail...those implementations that understand the new type will implicitly know the proper ordering. Once the type graduates to full logical type status, the ColumnO

Re: [DISCUSS] Encoding improvements (follow-up from Parquet "V3" discussion)

2024-05-29 Thread Ed Seidl
Maybe this is putting the cart too far in front of the horse, but I'd be willing to implement an encoding like this to see if it is a better alternative to PLAIN and DELTA_LENGTH_BYTE_ARRAY as a dictionary fallback for byte arrays, at least for GPU decoding. We might want to change the name since

ColumnMetaData location

2024-06-03 Thread Ed Seidl
Hi all, While investigating a parquet-java issue with the file_offset field in ColumnChunk [1] I discovered that it appears parquet-java does not (and perhaps never did?) write a copy of the ColumnMetaData following the column chunk data. This IMO violates the specification [2]. Instead, parque

Re: ColumnMetaData location

2024-06-04 Thread Ed Seidl
rote: modifying the spec to state that the ColumnMetaData following the chunk data is also optional +1 on this adding language to the effect that if the value of file_offset is 0, then no such metadata is present in the file. What about marking this as deprecated and discouraged to use it?

Re: [VOTE] Migration of parquet-* issues from Jira to GitHub

2024-06-13 Thread Ed Seidl
+1 (non-binding) Thanks! Ed On 6/13/24 11:20 AM, Micah Kornfield wrote: +1 (non-binding) On Thu, Jun 13, 2024 at 11:14 AM Rok Mihevc wrote: Hi all, Following the ML discussion [1] I would like to propose a vote for parquet-java, parquet-format, parquet-testing and parquet-site issues to be

[DISCUSS] Can FIXED_LEN_BYTE_ARRAY be annotated with STRING?

2024-06-17 Thread Ed Seidl
Hi all, While discussing PARQUET-2485 a question was raised about the STRING annotation [1]. The current wording in the specification is "|STRING| may only be used to annotate the binary primitive type"; PARQUET-2485 would change that to "|STRING| may only be used to annotate the |BYTE_ARRAY|

Re: [External] Re: [DISCUSS] Can FIXED_LEN_BYTE_ARRAY be annotated with STRING?

2024-06-18 Thread Ed Seidl
tion in Parquet-java and parquet-cpp to see if they are in agreement on the matter and then make a decision from there. It doesn't seem too onerous to support FLBA as a String though if necessary? Cheers, Micah On Mon, Jun 17, 2024 at 12:15 PM Ed Seidl wrote: Hi all, While discussing PA

[DISCUSS] Deprecate file_offset in ColumnChunk struct

2024-06-24 Thread Ed Seidl
Resurrecting a thread from earlier in the month regarding inconsistent use of the file_offset field [1][2]. It seems like the preferred path forward is to deprecate this (AFAICT) unused field to prevent further confusion. If there are no violent objections, I'll submit a PR to do so in a few da

Re: [DISCUSS] Deprecate file_offset in ColumnChunk struct

2024-06-25 Thread Ed Seidl
y. On Mon, Jun 24, 2024 at 3:21 PM Ed Seidl wrote: Resurrecting a thread from earlier in the month regarding inconsistent use of the file_offset field [1][2]. It seems like the preferred path forward is to deprecate this (AFAICT) unused field to prevent further confusion. If there are no violent

Re: [DISCUSS] Deprecate file_offset in ColumnChunk struct

2024-06-25 Thread Ed Seidl
moved from the diagram altogether. Cheers, Ed On 6/25/24 7:29 AM, Ed Seidl wrote: The issue I have is that we're currently in a position where a file written to the letter of the specification will likely be readable by none of the major parquet implementations. (I'm going to test th

Re: [DISCUSS] Deprecate file_offset in ColumnChunk struct

2024-06-25 Thread Ed Seidl
FWIW, I've thrown up an alternative strawman PR [1]. Ed [1] https://github.com/apache/parquet-format/pull/440 On 6/25/24 11:56 AM, Ed Seidl wrote: I've now tested three implementations (parquet-java, pyarrow/parquet-cpp, arrow-rs) to see what they all do. For brevity, I'

Re: [ANNOUNCE] New Parquet Committer: Xuwei Fu

2024-07-11 Thread Ed Seidl
Congrats! And thanks! Ed From: Micah Kornfield Sent: Thursday, July 11, 2024 10:33:02 AM To: dev@parquet.apache.org Subject: Re: [ANNOUNCE] New Parquet Committer: Xuwei Fu Congrats! On Thursday, July 11, 2024, Julien L