Hi Jan,

> However, this whole discussion is moot if we in the end decide we do not want extensibility in the first place. So, do we want it? And if so, what extension points do we want? Should we create a separate discussion thread for this as well?
Great questions. I started a new thread under "[DISCUSS] Extensibility of Parquet". In my mind, as you mention below, the discussion seemed mostly moot, since PMC members appeared to want to limit extensibility for core functionality, and I think the other aspects of extensibility are reasonably covered. But let's move the conversation to that thread.

Thanks,
Micah

On Tue, May 28, 2024 at 1:33 AM Jan Finis <jpfi...@gmail.com> wrote:

> Thanks Micah for driving the effort!
>
> One thing that I didn't see in your 3 points is a discussion about extensibility and, depending on how extensible we decide to be, a way to specify which features are used by a Parquet file.
>
> Somewhere down the discussion, it was mentioned that we can just use a bit set of all possible features and types, so any reader can do a quick comparison to check whether a file might use features it does not support.
>
> This whole approach would break down if we enabled extensibility, as the list of all possible features would then not be known statically. Therefore, we would need to introduce a mechanism by which new extended features can be marked without clashing with each other (e.g. a UUID per feature).
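>
> To make this concrete, here is a rough sketch of the kind of check I have in mind (all names are illustrative; none of this is actual Parquet metadata):
>
>     // Hypothetical feature-set metadata: a fixed bit set for core
>     // features plus a list of UUIDs for extension features.
>     use std::collections::HashSet;
>
>     // Bit positions for core features would be fixed by the spec.
>     const FEATURE_DELTA_ENCODING: u64 = 1 << 0;
>     const FEATURE_BLOOM_FILTERS: u64 = 1 << 1;
>
>     struct FileFeatures {
>         core: u64,                 // bit set of core features used
>         extensions: Vec<[u8; 16]>, // one UUID per extension feature used
>     }
>
>     // A reader can safely read the file iff every core bit and every
>     // extension UUID used by the file is one the reader supports.
>     fn can_read(file: &FileFeatures, reader_core: u64,
>                 reader_exts: &HashSet<[u8; 16]>) -> bool {
>         (file.core & !reader_core) == 0
>             && file.extensions.iter().all(|u| reader_exts.contains(u))
>     }
>
> The nice property of the plain bit set (a single O(1) mask check) is kept for core features, while extensions only cost a set lookup per UUID.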
>
> However, this whole discussion is moot if we in the end decide we do not want extensibility in the first place. So, do we want it? And if so, what extension points do we want? Should we create a separate discussion thread for this as well?
>
> Cheers,
> Jan
>
> On Tue, May 28, 2024 at 07:32, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > Hi Everyone,
> >
> > Just to follow up, conversations on the summary doc [1] have largely slowed down. In my mind I think of roughly three different tracks, and I'll start threads to get a sense of who is interested (please be on the lookout for discussion threads). As those conversations branch naturally, I think the different tracks can figure out a way to schedule in-person syncs as necessary.
> >
> > 1. Infrastructure/release issues. Goals of this track:
> >    - Define how we can get newer features adopted (hopefully some variant of what is proposed in the doc). One main blocker here seems to be having volunteers to do releases at the right cadence.
> >    - Improved testing infrastructure, both cross-implementation and for downstream consumers. I think Apache Arrow has done a lot of quality work in this space, so hopefully that can provide some inspiration.
> >    - Improved documentation on the feature matrix.
> >
> > 2. Improved footer metadata for wide schemas and many row groups. There have already been a few proposals (linked in the doc) and I will post them in the new thread, but won't do so here, to try to unify the discussion in one place in the new thread.
> >
> > 3. Improved encodings. I think there are a few areas to think about here:
> >    - Improved heuristics for selecting encodings (most people still only use V1 encodings).
> >    - Improved encodings themselves (an evaluation framework is necessary).
> >    - Better support for indexed/point lookups.
> >
> > Thanks everyone for a great conversation so far. Hopefully, we can continue to make progress.
> >
> > -Micah
> >
> > [1] https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
> >
> > On Wed, May 22, 2024 at 2:32 PM Weston Pace <weston.p...@gmail.com> wrote:
> >
> > > > *A row group can consist of one or many column chunks per column* (while before it could consist of only one).
> > >
> > > Yes, that would work. It's not a breaking change and makes good sense as an evolution.
> > >
> > > I was talking with Jacques Nadeau earlier and he mentioned another solution, which is to have support for a 2-pass writer. First you write the data with row groups. Then, in the second pass, you read the data one column at a time and write a file with only 1 row group. This works too, but I prefer your solution, as a two-pass write might be an inconvenience for some.
> > >
> > > > Multiple concurrent decoders can share the same page
> > >
> > > Yes, this is pretty much required. Coming from Arrow I'm pretty used to this kind of reference counting concept :). That being said, your overall point (a Lance reader requires a lot of complexity to get parallelism) is very true, but I don't know that this is a bad thing.
> > >
> > > > Page sharing assumes that you have a fast (hopefully O(1), at worst I'd say O(log n)) way to only read parts of a page. Whether this is possible depends on the encoding. E.g., Parquet has encodings where this is possible (PLAIN) and ones where this is not possible (RLE, DELTA_*, etc.). With these encodings, two parallel read threads would need to do duplicate work to read only parts of a page.
> > >
> > > For plain / bit-packed / ... pages this is easy, as you mention.
> > >
> > > RLE is a bit trickier (disclaimer: I have not implemented it yet) but should be doable. Lance has a decode loop that goes through each column and figures out which pages (or parts of pages) are needed for the next batch (more about this in [1]). The page decoders are stateful (they are created "full" by the scheduler and then slowly drained), so they keep a pointer into the page (e.g. a pointer into the runs, so we know which run we are currently decoding and how many values we've already taken from that run), and so it should still be O(1) in the RLE case.
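> > >
> > > A minimal sketch of that stateful decoder idea (illustrative only, not the actual Lance code):
> > >
> > >     // Stateful RLE page decoder that is drained incrementally.
> > >     // Resuming is O(1) because we remember which run we are in and
> > >     // how many values of it we have already emitted.
> > >     struct RleRun { value: u64, length: usize }
> > >
> > >     struct RlePageDecoder {
> > >         runs: Vec<RleRun>,   // decoded run headers of this page
> > >         run_idx: usize,      // run we are currently draining
> > >         taken_in_run: usize, // values already taken from that run
> > >     }
> > >
> > >     impl RlePageDecoder {
> > >         fn take(&mut self, mut n: usize, out: &mut Vec<u64>) {
> > >             while n > 0 && self.run_idx < self.runs.len() {
> > >                 let run = &self.runs[self.run_idx];
> > >                 let take = (run.length - self.taken_in_run).min(n);
> > >                 out.extend(std::iter::repeat(run.value).take(take));
> > >                 n -= take;
> > >                 self.taken_in_run += take;
> > >                 if self.taken_in_run == run.length {
> > >                     self.run_idx += 1;
> > >                     self.taken_in_run = 0;
> > >                 }
> > >             }
> > >         }
> > >     }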
> > >
> > > General compression / block compression, on the other hand, is pretty annoying :). I can't easily split the decode of a compressed buffer into two different tasks (shout-out to niyue, who's been looking into compression and helped me realize this). My current thinking for compression (and encryption) is to make them a layer in between the I/O and the decoder. So basically, I have a "decompression / decrypting" thread pool, and the flow is `schedule -> I/O -> decrypt/decompress -> decode`. One benefit here is that we effectively parallelize decompression / decryption across columns. One downside is that the thread that does the decryption / decompression is probably not the same thread that is going to do the decoding, and so we can't fuse decompression, decryption, decode, and first-query-pipeline into a single pipeline (we would have decrypt/decompress as one pipeline and decode + first-query-pipeline as the second).
> > >
> > > [1] https://blog.lancedb.com/splitting-scheduling-from-decoding/
> > >
> > > On Wed, May 22, 2024 at 12:42 PM Jan Finis <jpfi...@gmail.com> wrote:
> > >
> > > > Thanks Weston for the answers, very insightful! I appreciate your input very much.
> > > >
> > > > Some follow-ups :):
> > > >
> > > > > My point is that one of these (compression chunk size) is a file format concern and the other one (encoding size) is an encoding concern.
> > > >
> > > > The fact that Lance V2 can still have multiple grains is interesting and so far was not clear in our discussion, thanks for the clarification! As you describe, Lance V2 allows for further "sub-pages" inside a page. Parquet does not have that. Thus, if we now just drop the restriction that Parquet pages (which - at least as of today - do not allow nesting of smaller sub-pages) need to be contiguous, we would end up with something that is less powerful than Lance V2. I'm wondering how we can best arrive at something that is as powerful as the concept in Lance V2 without throwing all Parquet concepts out of the window. Here is my first draft of how this could look:
> > > >
> > > > When you think it through in detail, a Lance V2 page is actually the equivalent of a Parquet column chunk, *not* of a Parquet page: as you describe, a Lance V2 page is the (preferred) unit of I/O, not the unit of encoding, and in Parquet the column chunk is the unit of I/O while the page is the unit of encoding. That is an important point to keep in mind in our discussion, as it might otherwise lead to a lot of misunderstandings.
> > > >
> > > > Thus, to make Parquet column chunks as powerful as Lance V2 pages, we must allow *multiple column chunks* per column inside a Parquet row group. This should give us the desired equivalent of the advantages of Lance V2. Thus, if we want this (and I think we do), we should make the following change for Parquet V3:
> > > >
> > > > *A row group can consist of one or many column chunks per column* (while before it could consist of only one).
> > > >
> > > > We should not say that Parquet pages do not need to be contiguous, as Parquet pages are *not* the grain of I/O but the grain of encoding.
> > > >
> > > > Do you agree with this, or did I get something wrong?
> > > >
> > > > (I see that Lance V2 is even more flexible than that. But I guess this is a good in-between solution that lets us leverage the advantages of "non-contiguity" without needing to make all of Parquet's concepts pluggable, which could be too big of a change.)
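> > > >
> > > > To make the proposed change concrete, a rough sketch of the metadata shapes (illustrative Rust structs, not the actual Parquet Thrift definitions):
> > > >
> > > >     // Illustrative only; the real metadata is defined in Thrift
> > > >     // and carries many more fields.
> > > >     struct ColumnChunkMeta {
> > > >         file_offset: u64, // where this chunk's pages start
> > > >         total_bytes: u64,
> > > >         num_values: u64,
> > > >     }
> > > >
> > > >     // Today: exactly one chunk per column per row group.
> > > >     struct RowGroupToday {
> > > >         columns: Vec<ColumnChunkMeta>, // columns[i] = chunk of column i
> > > >     }
> > > >
> > > >     // Proposed: one or many chunks per column per row group, so a
> > > >     // column can be split into several independent units of I/O.
> > > >     struct RowGroupProposed {
> > > >         columns: Vec<Vec<ColumnChunkMeta>>, // columns[i] = chunks of column i
> > > >     }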
> > > >
> > > > > We do parallelize inside a file.
> > > >
> > > > Interesting, but this comes with challenges. How do you solve these?
> > > >
> > > > As page boundaries are not necessarily aligned between columns, mini batches can start and end in the middle of pages. This comes with the following challenges:
> > > >
> > > > - Multiple concurrent decoders can share the same page (e.g., one decodes the first part of it and the other the last part). Thus, either we request it two times from storage, wasting I/O bandwidth, or we make readers share I/O requests. So we would need some kind of reference counting or a similar technique (see the sketch after this list) to check how many decoders still need a page that has been fetched from I/O before we can free the read memory. This is possible, but it increases the complexity of the I/O + decode stack considerably. Does the Lance V2 format presuppose a reader that is this clever? (This is not a real downside, as I am more than happy to implement complex I/O sharing code if it leads to a more efficient format; I just wanted to mention this trade-off.)
> > > > - Page sharing assumes that you have a fast (hopefully O(1), at worst I'd say O(log n)) way to read only parts of a page. Whether this is possible depends on the encoding. E.g., Parquet has encodings where this is possible (PLAIN) and ones where this is not possible (RLE, DELTA_*, etc.). With these encodings, two parallel read threads would need to do duplicate work to read only parts of a page. Also, black-box compression (LZ4, etc.) can only decompress a page as a whole. How do you solve this challenge in LanceDB?
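> > > >
> > > > Here is the sketch promised above, just to illustrate the kind of sharing I mean (in Rust, where Arc gives us the reference counting for free; the page memory is released when the last decoder drops its handle):
> > > >
> > > >     use std::sync::Arc;
> > > >
> > > >     struct FetchedPage { bytes: Vec<u8> }
> > > >
> > > >     // Each decoder gets a handle to the shared page plus the
> > > >     // half-open byte range it is responsible for decoding.
> > > >     struct DecoderTask { page: Arc<FetchedPage>, from: usize, to: usize }
> > > >
> > > >     // One I/O request, two consumers: split a fetched page between
> > > >     // two decoders at byte offset `mid`.
> > > >     fn split_page(page: FetchedPage, mid: usize) -> (DecoderTask, DecoderTask) {
> > > >         let len = page.bytes.len();
> > > >         let shared = Arc::new(page);
> > > >         (
> > > >             DecoderTask { page: Arc::clone(&shared), from: 0, to: mid },
> > > >             DecoderTask { page: shared, from: mid, to: len },
> > > >         )
> > > >     }
> > > >
> > > > The harder part is a cache of in-flight I/O requests, so that two schedulers asking for overlapping ranges wait on the same request instead of issuing it twice.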
> > > >
> > > > @Steve
> > > >
> > > > > Jan, assuming you are one of the authors of "Get Real: How Benchmarks Fail to Represent the Real World", can I get a preprint copy? As it's relevant here. My ACM library access is gone until I upload another 10 Banksy mural photos to wikimedia.
> > > >
> > > > This link should work without any library access: https://dl.acm.org/doi/pdf/10.1145/3209950.3209952 (I tried from my phone, which definitely doesn't have any ACM access). You might need to press the "PDF" button. Let me know if it doesn't work, then I'll try to find a pre-print.
> > > >
> > > > Cheers,
> > > > Jan
> > > >
> > > > On Wed, May 22, 2024 at 16:10, Steve Loughran <ste...@cloudera.com.invalid> wrote:
> > > >
> > > > > On Tue, 21 May 2024 at 22:40, Jan Finis <jpfi...@gmail.com> wrote:
> > > > >
> > > > > > Thanks Weston for posting here!
> > > > > >
> > > > > > I appreciate this a lot, as it gives us the opportunity to discuss modern formats in depth with the authors themselves, who probably know best the design trade-offs they took and can thus give us a deeper understanding of what certain features would mean for Parquet.
> > > > >
> > > > > This is becoming an interesting question.
> > > > >
> > > > > > I read both your linked posts. I read them with the mindset as if they were the documentation for a file format that I myself would need to add to our engine, so I always double-checked whether I would agree with your reasoning and where I would see problems in the implementation.
> > > > > >
> > > > > > I ended up with some points where I cannot yet follow your reasoning, or where I feel clarification would be good. It would be nice if you could go a bit into detail here:
> > > > > >
> > > > > > Regarding your "parallelism without row groups" post [2]:
> > > > > >
> > > > > > 1. Do I understand correctly that you basically replace row groups with files? Thus, the task of reading row groups in parallel boils down to reading files in parallel. Your post does *not* claim that the new format would be able to parallelize *inside* a row group/file, correct?
> > > > > >
> > > > > > 2. I do not fully understand what the proposed parallelism has to do with the file format. As you mention yourself, files and row groups are basically the same thing. As such, couldn't you do the same "Decode Based Parallelism" with Parquet as it is today? E.g., the file reader in our engine looks basically exactly like what you propose, employing what you call Mini Batches and not reading a row group as a whole (which could lead to out-of-memory in case a row group contains an insane number of rows, so it is a big no-no anyway for us). It seems that the shortcomings of the code listed in "Our First Parallel File Reader" are solely shortcomings of that code, not of the underlying format.
> > > > > >
> > > > > > Regarding [1]:
> > > > > >
> > > > > > 3. This one is mostly about understanding your rationales:
> > > > > >
> > > > > > As one main argument for abolishing row groups, you mention that sizing them well is hard (I fully agree!). But since you replace row groups with files, don't you have the same problem for the file again? Small row groups/files are bad due to small I/O requests and metadata explosion, agreed! So let's use bigger ones. Here you argue that Parquet readers will load the whole row group into memory and therefore suffer memory issues. This is a strawman IMHO, as it is just a shortcoming of the reader, not of the format. Nothing in the Parquet spec forces a reader to read a row group at once (and in fact, our implementation doesn't, for exactly the reasons you mentioned). Just like in Lance V2, Parquet readers can opt to read only a few pages ahead of the decoding.
> > > > >
> > > > > Parquet-mr now supports parallel GET requests on different ranges on S3; it'd be even better if object stores supported multiple range requests in a single GET. Doing all this reading in a single file is more efficient client-side than having different files open.
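> > > > >
> > > > > As a toy illustration of the pattern (parallel reads of distinct byte ranges of one open file; against S3 these would be concurrent ranged GETs on the same key):
> > > > >
> > > > >     use std::fs::File;
> > > > >     use std::os::unix::fs::FileExt; // read_exact_at, i.e. pread
> > > > >     use std::sync::Arc;
> > > > >     use std::thread;
> > > > >
> > > > >     // Fetch several byte ranges of one file in parallel.
> > > > >     fn read_ranges(path: &str, ranges: &[(u64, usize)])
> > > > >                    -> std::io::Result<Vec<Vec<u8>>> {
> > > > >         let file = Arc::new(File::open(path)?);
> > > > >         let handles: Vec<_> = ranges.iter().map(|&(offset, len)| {
> > > > >             let f = Arc::clone(&file);
> > > > >             thread::spawn(move || {
> > > > >                 let mut buf = vec![0u8; len];
> > > > >                 f.read_exact_at(&mut buf, offset)?; // Unix-only sketch
> > > > >                 Ok::<_, std::io::Error>(buf)
> > > > >             })
> > > > >         }).collect();
> > > > >         handles.into_iter().map(|h| h.join().unwrap()).collect()
> > > > >     }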
> > > > >
> > > > > > On the writing side, I see your point that a Lance V2 writer never has to buffer more than a page, and this is great! However, this seems to be just a result of allowing pages to not be contiguous, not of the fact that row groups were abolished. You could still support multiple row groups with non-contiguous pages and reap all the benefits you mention. Your post intermingles the two design choices "contiguous pages yes/no" and "row groups as horizontal partitions within a file yes/no". I would argue that the two features are basically fully orthogonal: you can have one without the other and vice versa.
> > > > > >
> > > > > > So, all in all, do I see correctly that your main argument here basically is "don't force pages to be contiguous!"? Doing away with row groups is just an added bonus for easier maintenance, as you can then just use files instead of row groups.
> > > > > >
> > > > > > 4. Considering contiguous pages and I/O granularity:
> > > > > >
> > > > > > The format basically proposes to have pages as the only granularity below a file (+ metadata & footer), while Parquet has two granularities: the row group, or rather the column chunk, and the page. You argue that a page in Lance V2 should basically be as big as is necessary for good I/O performance (say, 8 MiB for Amazon S3).
> > > > >
> > > > > Way bigger, please. Small files are so expensive on S3 whenever you start doing any directory operations, bulk copies, etc. Same for the other stores.
> > > > >
> > > > > Jan, assuming you are one of the authors of "Get Real: How Benchmarks Fail to Represent the Real World", can I get a preprint copy? As it's relevant here. My ACM library access is gone until I upload another 10 Banksy mural photos to wikimedia.
> > > > >
> > > > > Steve