> *A row group can consist of one or many column chunks per column* (while
> before it could consist of only one).

Yes, that would work.  It's not a breaking change and makes good sense as
an evolution.

I was talking with Jacques Nadeau earlier and he mentioned another
solution, which is to support a two-pass writer.  First you write
the data with row groups.  Then, in the second pass, you read the data one
column at a time, and write a file with only 1 row group.  This works too,
but I prefer your solution, as a two-pass write might be an inconvenience
for some.
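For anyone curious, here is a minimal sketch of what that second pass could
look like with pyarrow.  The helper function and its structure are my own
illustration (not Jacques's actual approach), and note the caveat in the
comments:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def rewrite_single_row_group(src: str, dst: str) -> None:
    # Pass 2: read the multi-row-group file one column at a time,
    # then rewrite it with a single row group.
    pf = pq.ParquetFile(src)
    names = pf.schema_arrow.names
    # Each read materializes one column across all row groups.
    columns = [pf.read(columns=[name]).column(0) for name in names]
    table = pa.table(dict(zip(names, columns)))
    # row_group_size >= num_rows forces a single row group.
    # (This naive sketch still buffers the whole table for the final
    # write; a production writer would stream each column chunk with a
    # lower-level API instead.)
    pq.write_table(table, dst, row_group_size=table.num_rows)
```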

> Multiple concurrent decoders can share the same page

Yes, this is pretty much required.  Coming from Arrow I'm pretty used to
this kind of reference counting concept :).  That being said, your overall
point (a Lance reader requires a lot of complexity to get parallelism) is
very true, but I don't know that this is a bad thing.
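As a rough illustration of the bookkeeping (my own hypothetical sketch, not
the actual Lance or Arrow code):

```python
import threading

class SharedPage:
    # Hypothetical refcounted page buffer: every decoder that needs a
    # slice of the page holds a reference, and the bytes are freed only
    # when the last decoder releases its reference.
    def __init__(self, buf: bytes):
        self._buf = buf
        self._refs = 1
        self._lock = threading.Lock()

    def retain(self) -> "SharedPage":
        with self._lock:
            self._refs += 1
        return self

    def release(self) -> None:
        with self._lock:
            self._refs -= 1
            if self._refs == 0:
                self._buf = None  # last reader done; drop the memory
```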

> Page sharing assumes that you have a fast (hopefully O(1), at worst
> I'd say O(log n)) way to only read parts of a page. Whether this is
> possible depends on the encoding. E.g., Parquet has encodings where this
> is possible (PLAIN) and ones where this is not possible (RLE, DELTA_*,
> etc.). With these encodings, two parallel read threads would need to do
> duplicate work to read only parts of a page.

For PLAIN / bit-packed / ... pages this is easy, as you mention.

RLE is a bit trickier (disclaimer: I have not implemented it yet) but
should be doable.  Lance has a decode loop that goes through each column
and figures out which pages (or parts of pages) are needed for the next
batch (more about this in [1]).  The page decoders are stateful (they are
created "full" by the scheduler and then slowly drained), so they keep a
pointer into the page (e.g. a pointer into the runs, so we know which run
we are currently decoding and how many values we've already taken from
that run).  Resuming should therefore still be O(1) in the RLE case.
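To make that concrete, here is a hypothetical sketch of the state such a
decoder carries (names and structure are mine, not Lance's actual
implementation):

```python
from dataclasses import dataclass

@dataclass
class RleDecoderState:
    # Decoded view of an RLE page: (value, run_length) pairs.
    runs: list          # e.g. [("a", 100), ("b", 50), ...]
    run_idx: int = 0    # which run we are currently decoding
    consumed: int = 0   # values already taken from that run

    def drain(self, n: int) -> list:
        # Resuming is O(1): we pick up exactly where we left off
        # instead of re-scanning the page from the start.
        out = []
        while n > 0 and self.run_idx < len(self.runs):
            value, length = self.runs[self.run_idx]
            take = min(n, length - self.consumed)
            out.extend([value] * take)
            self.consumed += take
            n -= take
            if self.consumed == length:  # run exhausted; advance
                self.run_idx += 1
                self.consumed = 0
        return out
```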

General compression / block compression, on the other hand, is pretty
annoying :).  I can't easily split the decode of a compressed buffer into
two different tasks (shoutout to niyue who's been looking into compression
and helped me realize this).  My current thinking for compression (and
encryption) is to make them a layer between the I/O and the decoder.  So
basically, I have a "decompression / decrypting" thread pool and the flow
is `schedule -> I/O -> decrypt/decompress -> decode`.  One benefit here is
that we effectively parallelize decompression / decryption across columns.
One downside is that the thread that does the decryption / decompression is
probably not the same thread that is going to do the decoding, so we
can't fuse decompression, decryption, decode, and first-query-pipeline into
a single pipeline (we would have decryption / decompression as one pipeline
and decode + first-query-pipeline as the second).
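Sketched very roughly (this is my own illustration of the staging; the
range-read helper and zlib stand in for real object-store I/O and block
compression, and a real reader would chain callbacks rather than block):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

io_pool = ThreadPoolExecutor(max_workers=8)      # issues range reads
decomp_pool = ThreadPoolExecutor(max_workers=4)  # decrypt/decompress stage

def read_range(f, offset: int, length: int) -> bytes:
    # Stand-in for an object-store range GET.
    f.seek(offset)
    return f.read(length)

def schedule_page(f, offset: int, length: int):
    # schedule -> I/O -> decrypt/decompress -> decode
    raw = io_pool.submit(read_range, f, offset, length)
    # Hand the compressed buffer to the decompression pool once the
    # read completes.  (Blocking on raw.result() in a pool thread is
    # fine for a sketch; a real reader would chain callbacks instead.)
    plain = decomp_pool.submit(lambda: zlib.decompress(raw.result()))
    # The decoder consumes plain.result() later, likely on yet another
    # thread -- which is why decompress and decode can't be fused into
    # one pipeline with the first-query stage.
    return plain
```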

[1] https://blog.lancedb.com/splitting-scheduling-from-decoding/

On Wed, May 22, 2024 at 12:42 PM Jan Finis <jpfi...@gmail.com> wrote:

> Thanks Weston for the answers, very insightful! I appreciate your input
> very much.
>
> Some follow-ups :):
>
> > My point is that one of
> > these (compression chunk size) is a file format concern and the other one
> > (encoding size) is an encoding concern.
>
>
> The fact that Lance V2 can still have multiple grains is interesting and
> so far was not clear in our discussion, thanks for the clarification! As
> you describe, Lance V2 allows for further "sub-pages" inside a page.
> Parquet does not have that. Thus, if we now just dropped the restriction
> that Parquet pages (which - at least as of today - do not allow nesting of
> smaller sub-pages) need to be contiguous, we would end up with something
> that is less powerful than Lance V2. I'm wondering how we can best arrive
> at something that is as powerful as the concept in Lance V2 without
> throwing all Parquet concepts out of the window. Here is my first draft of
> what this could look like:
>
> When you think it through in detail, a Lance V2 page is actually the
> equivalent of a Parquet column chunk, *not of a Parquet page*, as the
> former is the (preferred) unit of I/O while the latter is the unit of
> encoding. And as you describe, a Lance V2 page is the unit of I/O, not the
> unit of encoding. That is an important point to keep in mind in our
> discussion, as this might otherwise lead to a lot of misunderstandings.
>
> Thus, to make Parquet column chunks as powerful as Lance V2 pages, we must
> allow *multiple column chunks* per column inside a Parquet row group. This
> should give us the desired equivalent of the advantages of Lance V2. Thus,
> if we want this (and I think we do), we should do the following change for
> Parquet V3:
>
> *A row group can consist of one or many column chunks per column* (while
> before it could consist of only one).
>
> We should not say that Parquet pages do not need to be contiguous, as
> Parquet pages are *not* the grain of I/O but the grain of encoding.
>
> Do you agree with this, or did I get something wrong?
>
> (I see that Lance V2 is even more flexible than that. But I guess this is
> a good in-between solution that allows us to leverage the advantages of
> "non-contiguity" without needing to make all of Parquet's concepts
> pluggable, which could be too big of a change.)
>
>
> > We do parallelize inside a file.
>
>
> Interesting, but this comes with challenges. How do you solve these?
>
> As page boundaries are not necessarily aligned between columns, this means
> that mini-batches can start and end in the middle of pages. This comes with
> the following challenges:
>
>    - Multiple concurrent decoders can share the same page (e.g., one
>    decodes the first part of it and the other the last part). Thus, either
>    we request it two times from storage, wasting I/O bandwidth, or we make
>    readers share I/O requests. So we would need some kind of reference
>    counting or a similar technique to check how many decoders still need a
>    page that has been fetched from I/O before we can free the read memory.
>    This is possible, but it increases the complexity of the I/O + decode
>    stack considerably. Does the Lance V2 format assume a reader that is
>    this clever? (This is not a real downside, as I am more than happy to
>    implement complex I/O sharing code if it leads to a more efficient
>    format; I just wanted to mention this trade-off)
>    - Page sharing assumes that you have a fast (hopefully O(1), at worst
>    I'd say O(log n)) way to only read parts of a page. Whether this is
>    possible depends on the encoding. E.g., Parquet has encodings where
>    this is possible (PLAIN) and ones where this is not possible (RLE,
>    DELTA_*, etc.). With these encodings, two parallel read threads would
>    need to do duplicate work to read only parts of a page. Also,
>    black-box compression (LZ4, etc.) can only decompress a page as a
>    whole. How do you solve this challenge in LanceDB?
>
>
> @Steve
>
> > Jan, assuming you are one of the authors of "Get Real: How Benchmarks
> > Fail to Represent the Real World", can I get a preprint copy?  As it's
> > relevant here. My acm library access is gone until I upload another 10
> > Banksy mural photos to wikimedia.
> >
>
> This link should work without any library access:
> https://dl.acm.org/doi/pdf/10.1145/3209950.3209952 (I tried from my phone,
> which definitely doesn't have any ACM access). You might need to press the
> "PDF" button. Let me know if it doesn't work, then I'll try to find a
> pre-print.
>
> Cheers,
> Jan
>
> On Wed., May 22, 2024 at 16:10, Steve Loughran
> <ste...@cloudera.com.invalid> wrote:
>
> > On Tue, 21 May 2024 at 22:40, Jan Finis <jpfi...@gmail.com> wrote:
> >
> > > Thanks Weston for posting here!
> > >
> > > I appreciate this a lot, as it gives us the opportunity to discuss
> > > modern formats in depth with the authors themselves, who probably know
> > > the design trade-offs they took best and thus can give us a deeper
> > > understanding of what certain features would mean for Parquet.
> > >
> >
> > This is becoming an interesting question.
> >
> > >
> > > I read both your linked posts. I read them with the mindset as if they
> > > were the documentation for a file format that I myself would need to
> > > add to our engine, so I always double-checked whether I would agree
> > > with your reasoning and where I would see problems in the
> > > implementation.
> > >
> > > I ended up with some points where I cannot follow your reasoning yet,
> > > or where I feel clarification would be good. It would be nice if you
> > > could go a bit into detail here:
> > >
> > > Regarding your "parallelism without row groups" post [2]:
> > >
> > > 1. Do I understand correctly that you basically replace row groups
> > > with files? Thus, the task for reading row groups in parallel boils
> > > down to reading files in parallel. Your post does *not* claim that the
> > > new format would be able to parallelize *inside* a row group/file,
> > > correct?
> > >
> > > 2. I do not fully understand what the proposed parallelism has to do
> > > with the file format. As you mention yourself, files and row groups
> > > are basically the same thing. As such, couldn't you do the same
> > > "Decode Based Parallelism" also with Parquet as it is today? E.g., the
> > > file reader in our engine looks basically exactly like what you
> > > propose, employing what you call Mini Batches and not reading a row
> > > group as a whole (which could lead to out-of-memory in case a row
> > > group contains an insane amount of rows, so it is a big no-no anyway
> > > for us). It seems that the shortcomings of the code listed in "Our
> > > First Parallel File Reader" are solely shortcomings of that code, not
> > > of the underlying format.
> > >
> > > Regarding [1]:
> > >
> > > 3. This one is mostly about understanding your rationales:
> > >
> > > As one main argument for abolishing row groups, you mention that
> > > sizing them well is hard (I fully agree!). But since you replace row
> > > groups with files, don't you have the same problem for the file again?
> > > Small row groups/files are bad due to small I/O requests and metadata
> > > explosion, agree! So let's use bigger ones. Here you argue that
> > > Parquet readers will load the whole row group into memory and
> > > therefore suffer memory issues. This is a strawman IMHO, as this is
> > > just a shortcoming of the reader, not of the format. Nothing in the
> > > Parquet spec forces a reader to read a row group at once (and in fact,
> > > our implementation doesn't do this for exactly the reasons you
> > > mentioned). Just like in Lance V2, Parquet readers can opt to read
> > > only a few pages ahead of the decoding.
> > >
> >
> > Parquet-mr now supports parallel GET requests on different ranges on S3;
> > it'd be even better if object stores supported multiple range requests
> > in a single GET. Doing all this reading in a single file is more
> > efficient client-side than having different files open.
> >
> >
> > > On the writing side, I see your point that a Lance V2 writer never
> > > has to buffer more than a page, and this is great! However, this seems
> > > to be just a result of allowing pages to not be contiguous, not of the
> > > fact that row groups were abolished. You could still support multiple
> > > row groups with non-contiguous pages and reap all the benefits you
> > > mention. Your post intermingles the two design choices "contiguous
> > > pages yes/no" and "row groups as horizontal partitions within a file
> > > yes/no". I would argue that the two features are basically fully
> > > orthogonal. You can have one without the other and vice versa.
> > >
> > > So all in all, do I see correctly that your main argument here
> > > basically is "don't force pages to be contiguous!"? Doing away with
> > > row groups is just an added bonus for easier maintenance, as you can
> > > just use files instead of row groups.
> > >
> > >
> > > 4. Considering contiguous pages and I/O granularity:
> > >
> > > The format basically proposes to have pages as the only granularity
> > > below a file (+ metadata & footer), while Parquet has two
> > > granularities: Row Group, or rather Column Chunk, and Page. You argue
> > > that a page in Lance V2 should basically be as big as is necessary for
> > > good I/O performance (say, 8 MiB for Amazon S3).
> >
> >
> > Way bigger, please. Small files are so expensive on S3, whenever you
> > start doing any directory operations, bulk copies, etc. Same for the
> > other stores.
> >
> > Jan, assuming you are one of the authors of "Get Real: How Benchmarks
> > Fail to Represent the Real World", can I get a preprint copy?  As it's
> > relevant here. My acm library access is gone until I upload another 10
> > Banksy mural photos to wikimedia.
> >
> > Steve
> >
>
