Thank you for your questions!  I think your understanding is very solid.

> Do I understand correctly that you basically replace row groups with
> files. Thus, the task for reading row groups in parallel boils down to
> reading files in parallel.

Partly.  I recommend files for inter-process parallelism (e.g. distributing
a query across multiple worker processes).

> Your post does *not* claim that the new format
> would be able to parallelize *inside* a row group/file, correct?

We do parallelize inside a file.  Imagine a simple file with 3 int32
columns, 2 million rows per column, and one 8 MiB page per column.

First we would issue 3 parallel I/O requests to read those 3 pages.  Once
those three requests complete, we create decode tasks.  I tend to think
about things in "query engine" terms.  Query engines often process data in
relatively small batches, and 10k rows is a common batch size (probably
because 10k TPC-H rows fit nicely inside CPU caches).  For example, DuckDB
uses either 10k or 2k (I think 10k for Parquet and 2k for its internal
format, or maybe it breaks the 10k into 2k chunks).  DataFusion uses 8k
chunks.  Acero uses 32k, but that is probably too big.

Since we have 2 million rows we can create 200 thread tasks, each covering
10k rows, which gives us plenty of parallelism.  The first thing each of
these tasks does is decode its slice of the data.  Larger files, with more
complex columns and many pages, follow pretty much the same strategy (but
it gets quite complicated).

> As such, couldn't you do the same "Decode Based
> Parallelism" also with Parquet as it is today?

Yes, absolutely, and I would recommend it.

> I do not fully understand what the proposed parallelism has to do with
> the file format.

Nothing, you are correct.  In my blog posts I am generally discussing
"designing a high performing file reader" and not "designing a new file
format".  My concerns with the parquet format are only the A/B/C points I
mentioned in my previous email.

> So all in all, do I see correctly that your main argument here basically is
> "don't force pages to be contiguous!". Doing away with row groups is just
> added bonus for easier maintenance, as you can just use files instead of
> row groups.

Absolutely.  As long as pages don't have to be contiguous, I am happy.
I'd simply always create Parquet files with 1 row group.

> As such, it seems that the two grains that Parquet has benefit us, as they
> give us flexibility of both being able to scan with large requests and
> doing point accesses without too much read amplification by using small
> single-page requests.

Again, you are spot on.  Both grains are needed.  My point is that one of
these (the I/O size) is a file format concern and the other one (the
compression chunk / encoding size) is an encoding concern.  In fact, there
is a third grain, which is the interval used for zone maps in pushdown
indices.  There may be many other grains too (e.g. run-end encoded arrays
need skip tables for point lookups).  Parquet forces all of these to be
the same value (the page size).

The solution to this problem in Lance (if you want to do some kind of block
compression) is to create a single 8MB page.  That page would consist of
many compressed chunks stored contiguously.  If the compression uses
variable-sized blocks (I'm not sure if this is a thing; I don't know
compression well), then you need to store the block sizes in a column
metadata buffer.  If the blocks are a fixed size, then you need to store
how many rows are in each block in a column metadata buffer.  A compression
encoder can then pick whatever compression chunk size makes the most sense.

So, for example, you could have 64KB compression chunks, zone maps for
every 1024 rows, and 8MB pages for I/O.
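
If it helps, here is a tiny sketch of how a reader could use such a column
metadata buffer.  This is plain Python with made-up numbers, not the actual
Lance metadata layout; it just shows that a point lookup only needs to read
and decompress one small chunk out of a large page:

    from bisect import bisect_right
    from itertools import accumulate

    # Hypothetical metadata for one 8MB page: the compressed byte size and
    # the row count of each chunk (roughly 64KB chunks, variable sizes).
    chunk_byte_sizes = [65536, 64012, 66120, 65800]
    chunk_row_counts = [10000, 9800, 10100, 10050]

    # Prefix sums map a row number to a chunk and a byte range in the page.
    row_offsets = list(accumulate(chunk_row_counts))
    byte_offsets = [0] + list(accumulate(chunk_byte_sizes))

    def locate(row: int, page_start: int) -> tuple[int, int, int]:
        """Return (chunk index, file offset, compressed length) for `row`."""
        chunk = bisect_right(row_offsets, row)
        return chunk, page_start + byte_offsets[chunk], chunk_byte_sizes[chunk]

    # A point lookup reads ~64KB instead of the whole 8MB page.
    print(locate(row=25_000, page_start=4096))

The zone map grain is then just another per-column constant (1024 rows in
the example above), independent of both the compression chunk size and the
page size.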



On Tue, May 21, 2024 at 2:40 PM Jan Finis <jpfi...@gmail.com> wrote:

> Thanks Weston for posting here!
>
> I appreciate this a lot, as it gives us the opportunity to discuss modern
> formats in depth with the authors themselves, who probably know the design
> trade-offs they took best and thus can give us a deeper understanding what
> certain features would mean for Parquet.
>
> I read both your linked posts. I read them with the mindset as if they were
> the documentation for a file format that I myself would need to add to our
> engine, so I always double checked whether I would agree with your
> reasoning and where I would see problems in the implementation.
>
> I ended up with some points where I cannot follow your reasoning, yet, or
> where I feel clarification would be good. It would be nice if you could go
> a bit into detail here:
>
> Regarding your "parallelism without row groups" post [2]:
>
> 1. Do I understand correctly that you basically replace row groups with
> files. Thus, the task for reading row groups in parallel boils down to
> reading files in parallel. Your post does *not* claim that the new format
> would be able to parallelize *inside* a row group/file, correct?
>
> 2. I do not fully understand what the proposed parallelism has to do with
> the file format. As you mention yourself, files and row groups are
> basically the same thing. As such, couldn't you do the same "Decode Based
> Parallelism" also with Parquet as it is today? E.g., the file reader in our
> engine looks basically exactly like what you propose, employing what you
> call Mini Batches and not reading a whole row group as a whole (which could
> lead to out of memory in case a row group contains an insane amount of
> rows, so it is a big no no anyway for us). It seems that the shortcomings
> of the code listed in "Our First Parallel File Reader" is solely a
> shortcoming of that code, not of the underlying format.
>
> Regarding [1]:
>
> 3. This one is mostly about understanding your rationales:
>
> As one main argument for abolishing row groups, you mention that sizing
> them well is hard (I fully agree!). But since you replace row groups with
> files, don't you have the same problem for the file again? Small row
> groups/files are bad due to small I/O requests and metadata explosion,
> agree! So let's use bigger ones. Here you argue that Parquet readers will
> load the whole row group into memory and therefore suffer memory issues.
> This is a strawman IMHO, as this is just a shortcoming of the reader, not
> of the format. Nothing in the Parquet spec forces a reader to read a row
> group at once (and in fact, our implementation doesn't do this for exactly
> the reasons you mentioned). Just like in LanceV2, Parquet readers can opt
> to read only a few pages ahead of the decoding.
>
> On the writing side, I see your point that a Lance V2 writer never has to
> buffer more than a page and this is great! However, this seems to be just a
> result of allowing pages to not be contiguous, not of the fact that row
> groups were abolished. You could still support multiple row groups with
> non-contiguous pages and reap all the benefits you mention. Your post
> intermingles the two design choices "contiguous pages yes/no" and "row
> groups as horizontal partitions within a file yes/no". I would argue that
> the two features are basically fully orthogonal. You can have one without
> the other and vice versa.
>
> So all in all, do I see correctly that your main argument here basically is
> "don't force pages to be contiguous!". Doing away with row groups is just
> added bonus for easier maintenance, as you can just use files instead of
> row groups.
>
>
> 4. Considering contiguous pages and I/O granularity:
>
> The format basically proposes to have pages as the only granularity below a
> file (+ metadata & footer), while Parquet has two granularities: Row group,
> or rather Column Chunk, and Page. You argue that a page in Lance V2 should
> basically be as big as is necessary for good I/O performance (say, 8 MiB
> for Amazon S3). Thus, the Parquet counterpart of a Lance v2 page would
> actually be - at least in terms of I/O efficiency - a Parquet Column Chunk.
> A Parquet page can instead be quite small, as it does not need to be the
> grain of the I/O but just the grain of the encoding.
>
> The fact that Parquet has these two grains has advantages when considering
> a scan vs. a point look-up. When doing a scan, we can load whole column
> chunks at once, having large I/O requests to not overwhelm the I/O with too
> many requests. When doing a point access, we can use the page & offset
> index to find and load only the one page (per column) in which the row we
> are looking for is located.
>
> As such, it seems that the two grains that Parquet has benefit us, as they
> give us flexibility of both being able to scan with large requests and
> doing point accesses without too much read amplification by using small
> single-page requests. With Lance V2, either I make large pages to make
> scans take fewer I/O requests (e.g., 8 MiB), but then I will have large
> read amplification for point accesses, or I make my pages quite small to
> benefit point accesses, but then scans will need to emit tons of I/O
> operations, which is what you are trying to avoid. How does Lance V2 solve
> this challenge? Or did I understand the format wrong here?
>
> Cheers,
> Jan
>
> On Tue, May 21, 2024 at 6:07 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> > As the author of one of these new formats I'll chime in.  The main
> > issues I have with parquet are:
> >
> > A. Pages in a column chunk must be contiguous (this is Lance's biggest
> > issue with parquet)
> > B. Encodings should be extensible
> > C. Flexibility in what is considered data / metadata
> >
> > I outline my reasoning for these in [1] and so I'll avoid repeating that
> > here.  I think B has been discussed pretty thoroughly in this thread.
> >
> > As for C, a format should be flexible, and then it is pretty
> > straightforward.  If a file is likely to be used in "search" (very
> > selective filters, ability to cache, etc.) then lots of data should be
> > put in the column metadata.  If the file is mostly for cold full scans
> > then almost nothing should go in column metadata (either don't write the
> > metadata at all or, I guess, you can put it in the data pages).  The
> > format shouldn't force a choice.
> >
> > Personally, I am more excited about A than I am about B & C (though I do
> > think both B & C should be addressed if going through the trouble of a
> > new format).  Addressing A lets us get rid of row groups, allows for APIs
> > such as "array-at-a-time writing", lets us make large data pages, and
> > generally leads to more foolproof files.
> >
> > I agree with Andrew that any discussion of B & C is usually based on
> > assumptions rather than concrete measurements of reader performance.  In
> > the scattered profiling I've done of parquet-cpp and parquet-rs I've
> > found that poor parquet reader performance typically has very little to
> > do with B & C.  Actually, I would guess that the most widespread (though
> > not necessarily most important) obstacle to parquet has been user
> > knowledge.  To get the best performance from a reader users need to be
> > familiar not just with the format but also with the features available
> > in a particular reader.  I think simplifying the user experience should
> > be a secondary goal for any new changes.
> >
> > At the risk of arrogant self-promotion I would recommend people read [1]
> > for inspiration if nothing else.  I'm also hoping to detail design
> > decisions and tradeoffs that we come across (starting in [2] and
> > continuing throughout the summer).
> >
> > [1] https://blog.lancedb.com/lance-v2/
> > [2]
> > https://blog.lancedb.com/file-readers-in-depth-parallelism-without-row-groups/
> >
> > On Mon, May 20, 2024 at 11:06 AM Parth Chandra <par...@apache.org>
> > wrote:
> >
> > > Hi Parquet team,
> > >
> > >  It is very exciting to see this effort. Thanks Micah for starting
> > > this.
> > >
> > >  For most use case that our team sees the broad areas for improvement
> > > appear to be -
> > >    1) Optimizing for cloud storage (latency is high, seeks are
> > > expensive)
> > >    2) Optimized metadata reading - we've seen 30% (sometimes more) of
> > > Spark's scan operator time spent in reading footers.
> > >    3) Anything that improves support for data lakes.
> > >
> > >   Also I'll be happy to help wherever I can.
> > >
> > > Parth
> > >
> > > On Sun, May 19, 2024 at 10:59 AM Xinli shang <sha...@uber.com.invalid>
> > > wrote:
> > >
> > > > Sorry I am late to the party! It's great to see this discussion!
> > > > Thank you everyone for the many good points and thank you, Micah, for
> > > > starting the discussion and putting it together into a document, which
> > > > is very helpful! I agree with most of the points we discussed above,
> > > > and we need to improve Parquet and sometimes even speed up to catch up
> > > > with industry changes.
> > > >
> > > > With that said, we need people to work on it, as Julien mentioned.
> > > > The document
> > > > <https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit>
> > > > that Micah created covers pretty much everything we discussed here. I
> > > > encourage all of us to contribute by raising questions, providing
> > > > suggestions, adding missing functionality, etc. Once we reach a
> > > > consensus on each topic, we can create different tracks and working
> > > > streams to kick off the implementations.
> > > >
> > > > I believe continuously improving Parquet would benefit the industry
> > > > more than creating a new format, which could add friction. These
> > > > improvement ideas are exciting opportunities. If you, your team
> > > > members, or friends have time and interest, please encourage them to
> > > > contribute.
> > > >
> > > > Our Parquet community meeting is next week, on May 28, 2024. We can
> > > > have discussions there if you can join. Currently, it is scheduled
> > > > for 7:00 am PDT, but I can change it according to the majority's
> > > > availability.
> > > >
> > > > On Fri, May 17, 2024 at 3:58 PM Rok Mihevc <rok.mih...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I've discussed with my colleagues and we would dedicate two
> > > > > engineers for 4-6 months on tasks related to implementing the format
> > > > > changes. We're already active in design discussions and can help
> > > > > with C++, Rust and C# implementations. I thought it'd be good to
> > > > > state this explicitly FWIW.
> > > > >
> > > > > Our main areas of interest are efficient reads for tables with
> > > > > wide schemas and faster random rowgroup access [1].
> > > > >
> > > > > To workaround the wide schemas issue we actually implemented an
> > > > > internal tool [2] for storing index information into a separate
> > > > > file which allows for reading only the necessary subset of
> > > > > metadata. We would offer this approach for consideration as a
> > > > > possible approach to solve the wide schema problem.
> > > > >
> > > > > [1] https://github.com/apache/arrow/issues/39676
> > > > > [2] https://github.com/G-Research/PalletJack
> > > > >
> > > > > Rok
> > > > >
> > > > > On Sun, May 12, 2024 at 12:59 AM Micah Kornfield <emkornfi...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Parquet Dev,
> > > > > > I wanted to start a conversation within the community about
> > > > > > working on a new revision of Parquet.  For context there have been
> > > > > > a bunch of new formats [1][2][3] that show there is decent room
> > > > > > for improvement across data encodings and how metadata is
> > > > > > organized.
> > > > > >
> > > > > > Specifically, in a new format revision I think we should be
> > > > > > thinking about the following areas for improvements:
> > > > > > 1.  More efficient encodings that allow for data skipping and SIMD
> > > > > > optimizations.
> > > > > > 2.  More efficient metadata handling for deserialization and
> > > > > > projection to address areas when metadata deserialization time is
> > > > > > not trivial [4].
> > > > > > 3.  Possibly thinking about different encodings instead of
> > > > > > repetition/definition for repeated and nested field
> > > > > > 4.  Support for optimizing semi-structured data (e.g. JSON or
> > > > > > Variant type) that can shred elements into individual columns (a
> > > > > > recent thread in Iceberg mentions doing this at the metadata level
> > > > > > [5])
> > > > > >
> > > > > > I think the goals of V3 would be to provide existing API
> > > > > > compatibility as broadly as possible (possibly with some
> > > > > > performance loss) and expose new API surface areas where
> > > > > > appropriate to make use of new elements.  New encodings could be
> > > > > > backported so they can be made use of without metadata changes.  I
> > > > > > think unfortunately that for points 2 and 3 we would want to break
> > > > > > file level compatibility.  More thought would be needed to
> > > > > > consider whether 4 could be backported effectively.
> > > > > >
> > > > > > This is a non-trivial amount of work to get good coverage across
> > > > > > implementations, so before putting together more formal proposal
> > > > > > it would be nice to know if:
> > > > > >
> > > > > > 1.  If there is an appetite in the general community to consider
> > > > > > these changes
> > > > > > 2.  If anybody from the community is interested in collaborating
> > > > > > on proposals/implementation in this area.
> > > > > >
> > > > > > Thanks,
> > > > > > Micah
> > > > > >
> > > > > > [1] https://github.com/maxi-k/btrblocks
> > > > > > [2] https://github.com/facebookincubator/nimble
> > > > > > [3] https://blog.lancedb.com/lance-v2/
> > > > > > [4] https://github.com/apache/arrow/issues/39676
> > > > > > [5]
> > > > > > https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Xinli Shang
> > > >
> > >
> >
>
