[DISCUSS] Encoding improvements (follow-up from Parquet "V3" discussion)

2024-05-27 Thread Micah Kornfield
As a follow-up to the "V3" Discussions [1][2] I wanted to start a thread on improvements to encodings. There are several areas to pursue here: 1. Curating a standard set of benchmarks and criteria for determining if a new encoding is worth adding. 2. Developing new encodings 3. Better implement

Re: [DISCUSS] Encoding improvements (follow-up from Parquet "V3" discussion)

2024-05-28 Thread Micah Kornfield
BTW, I did propose a new RANDOM_ACCESS_BYTE_ARRAY encoding (effectively Arrow's representation) as part of the footer improvements [1] to help allow for O(1) access to particular column metadata, once a column is identified. [1] https://github.com/apache/parquet-format/pull/250 On Mon, May 27, 2024 at 1

Re: [DISCUSS] Encoding improvements (follow-up from Parquet "V3" discussion)

2024-05-29 Thread Ed Seidl
Maybe this is putting the cart too far in front of the horse, but I'd be willing to implement an encoding like this to see if it is a better alternative to PLAIN and DELTA_LENGTH_BYTE_ARRAY as a dictionary fallback for byte arrays, at least for GPU decoding. We might want to change the name since

Re: [DISCUSS] Encoding improvements (follow-up from Parquet "V3" discussion)

2024-05-29 Thread Gang Wu
I'm interested in experimenting and implementing new encodings. Will follow up with concrete proposals or findings. Best, Gang On Thu, May 30, 2024 at 3:29 AM Ed Seidl wrote: > Maybe this is putting the cart too far in front of the horse, but I'd be > willing to implement an encoding like this

Re: [DISCUSS] Encoding improvements (follow-up from Parquet "V3" discussion)

2024-05-30 Thread Micah Kornfield
Great. BTW, I removed the encoding I referenced above from the PR to avoid putting too much into it at once. I'm pasting the description below for posterity. /** Encoding for variable length binary data that allows random access of values. * * This encoding is designed for random access of B
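
The Arrow-style layout mentioned in this thread (an offsets buffer plus one contiguous data buffer) can be illustrated with a minimal sketch. This is not the encoding from the PR, whose description is truncated above; the function names `encode` and `get`, and the choice of little-endian int32 offsets, are assumptions made only for illustration.

```python
import struct


def encode(values):
    """Pack byte strings into (offsets_buf, data).

    offsets_buf holds len(values) + 1 little-endian int32 offsets
    (an assumed width); data is the values concatenated back to back.
    """
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    data = b"".join(values)
    return struct.pack(f"<{len(offsets)}i", *offsets), data


def get(offsets_buf, data, i):
    """Return value i by reading two adjacent offsets and slicing.

    No earlier value is scanned, so the lookup cost does not depend on i.
    """
    start, end = struct.unpack_from("<2i", offsets_buf, 4 * i)
    return data[start:end]


offsets_buf, data = encode([b"parquet", b"", b"arrow"])
print(get(offsets_buf, data, 2))
```

By contrast, a layout that interleaves each length with its value generally requires walking the preceding lengths to locate value `i`, which is the cost a random-access layout is meant to avoid.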

Re: [DISCUSS] Encoding improvements (follow-up from Parquet "V3" discussion)

2024-05-30 Thread Steve Loughran
It would be good for a benchmark to be targetable at cloud storage; local stores, especially those with SSDs, hide a lot of the costs of data lakes. On Tue, 28 May 2024 at 07:17, Micah Kornfield wrote: > As a follow-up to the "V3" Discussions [1][2] I wanted to start a thread on > improvements to encodings.

Re: [DISCUSS] Encoding improvements (follow-up from Parquet "V3" discussion)

2024-05-31 Thread Julien Le Dem
Micah, would it make sense to start a google doc specifically to discuss: - the goals (there could be a few subsets) - the candidate encodings - the existing/future prototypes to validate candidates. On Thu, May 30, 2024 at 3:14 AM Steve Loughran wrote: > be good for a benchmark to be targetable

Re: [DISCUSS] Encoding improvements (follow-up from Parquet "V3" discussion)

2024-06-01 Thread Micah Kornfield
Hi Julien, Yes, a doc would be good. I didn't mean to officially propose the encoding here but wanted it kept for posterity. As I said above, I have my hands a little bit full at the moment on the other tracks and can't really devote meaningful time here. Hopefully, Gang and others can drive thi