Thanks for bringing up this topic!
This is an important subject for me and my team, as we maintain a
proprietary implementation of Parquet in addition to our own proprietary
format [1] that was designed around the same time as Parquet, so we have
always compared the two. I have also talked to other big lakehouse players
from industry and to folks from DuckDB, and have had personal conversations
with the author of BtrBlocks. All in all, I have spent countless hours
discussing Parquet with others and asking: what would we do differently if
we were to redesign a format like Parquet today?
I agree with the sentiment that Parquet has to innovate, or it will be
replaced by some successor format in the future. For now, it is still the
main format used by table formats like Iceberg and Delta Lake, and it has
the big advantage of being ubiquitous, so any new format has to be
considerably better to stand a chance. However, given the many new formats
and encodings popping up lately, such a format will eventually be released,
and once it takes off, data lake vendors will adopt it rapidly if the
advantages over Parquet are large enough. In fact, I have already heard in
personal conversations that a player in the data lake space is working on
proposing such a format. We ourselves struggle with the disadvantages of
Parquet and would quickly adopt any format that fixes them.
Concerning the "It didn't already work with V2" argument, I believe the
failure of some future V3 format cannot be deduced from the lack of
adoption of the V2 format. The V2 format simply had some birth defects that
dug its grave from the start. Among others:
- It is not even prominently documented what V2 actually is. When
implementing our proprietary Parquet reader, we often asked ourselves what
v2 is. We had to dig into implementation code to get any clues, and even
then it wasn't fully clear whether what the implementation does is the
general rule or just a choice of that implementation. E.g., is DataPageV2
part of Parquet V2? Apparently it is not. Is it just some new encodings? Or
is there more? There should simply have been a clear document in the
parquet-format repo that outlines what Parquet v2 is.
- The encodings introduced in v2 just aren't that good. DELTA_BYTE_ARRAY
is horribly slow to decode and makes any random access impossible.
DELTA_BINARY_PACKED at least allows some vectorization, but also makes
random access hard (see the sketch right after this list). All in all,
there is just no clear advantage of v2 over v1. We consciously decided
against using v2 in our lake even though our engines can read it, since
the encodings are just too slow to decode. There isn't even any
documentation with experimental numbers comparing the encodings. Why
should people use these more complex encodings when the benefit is
unclear (and often non-existent)?
- It seems that Parquet itself discourages the use of v2, as "it is not
supported by many implementations". This somewhat defeatist stance is of
course not helpful to the cause.
- We have a chicken-and-egg problem here: Since the format doesn't show
large benefits, almost no one writes v2 files. And since almost no v2 files
exist, no one feels the need to support v2.
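To illustrate the random-access point about the delta encodings, here is a
deliberately simplified C++ sketch. It is not the actual DELTA_BINARY_PACKED
wire format (which bit-packs deltas in blocks and mini-blocks); it only
shows why value i of a delta-encoded column cannot be read without touching
all preceding deltas, while a flat PLAIN array is a single indexed load:

    // Simplified model: a delta-encoded int64 column stores a base value
    // plus one delta per subsequent value (the real format additionally
    // bit-packs the deltas). Reading value i is O(i); PLAIN is O(1).
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    int64_t DeltaAt(int64_t base, const std::vector<int64_t>& deltas,
                    size_t i) {
      int64_t value = base;
      for (size_t k = 0; k < i; ++k) value += deltas[k];  // fold all deltas
      return value;
    }

    int64_t PlainAt(const std::vector<int64_t>& values, size_t i) {
      return values[i];  // one load, no dependency on preceding values
    }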
Concerning the "any new feature will not be implemented anyway" argument, I
don't think this is true. I have seen this stance on this mailing list and
in the Parquet community a lot in the past years, and even if there is a
grain of truth to it, it is again a defeatist stance that in the end hurts
Parquet. It might have been true for v2, because of the problems mentioned
above, but any new feature that shows tangible improvements will be adopted
rather quickly by implementations. My company would implement, with high
priority, new encodings that promise better compression without slowing
down decoding, and so would other data lake vendors. With this, the
chicken-and-egg problem mentioned above would be resolved: the more vendors
use new encodings in their lakes, the more pressure there is on all
implementations to support them.
One valid argument against v3 that has already been brought up repeatedly
is that if we need to completely gut Parquet and replace many aspects of it
to reach the goals of v3, then the resulting format just isn't Parquet
anymore. So maybe we do need to move on to a completely different format
one day, but until then I would love to see improvements in Parquet. The
good thing about improving Parquet instead of switching to a totally
different format is that we can mix and match and still retain the
countless optimizations we have implemented for Parquet over the years.
So, what are the shortcomings that should be fixed? A lot of good points
have already been mentioned. As yet another data point, here are the issues
we struggle with, ordered by severity:
- Missing random access. Parquet isn't made for random access, and while
this is okay for most queries that just scan the whole file, there are many
scenarios where it hurts. If a query filters out most rows but the format
still forces us to decode a lot of data, that is a problem. Things like
secondary indexes are also hard to build without random access. For
example, extracting a single row with a known row index from a Parquet
file requires an insane amount of work. In contrast, in our own format [1]
we have made sure that all encodings we use allow O(1) random access. This
means that we cannot use some nice encodings (e.g., RLE), but in return we
can access any value with just a few assembly instructions. The good thing
about Parquet is that it gives choices to the user: not all encodings need
to allow fast random access, but there should be some for all data types,
so that users who require fast random access can pick them. Here are the
top missing pieces IMHO:
- PLAIN encoding for strings doesn't allow random access, as it
interleaves string lengths with string data. This is just unnecessary, as
it is simple to design an encoding without this property and without any
real drawbacks (e.g., see how Arrow does it with an offset array and
separate string data; a sketch follows after this list). We should propose
such a new string PLAIN encoding and deprecate the current one. Not only
does the current one not allow random access, it is also slow to decode:
because of the interleaved lengths, reading a value has a data dependency
on the preceding length, so the CPU cannot execute the scan out of order.
- Metadata decoding is all-or-nothing, as already discussed. This
exacerbates the random I/O problem.
- To randomly access a column with NULL values, we first need prefix sums
over the D-levels to know which encoded value is the one we're looking for
(also sketched below). There should be a way to encode a column with NULLs
where NULL values are represented explicitly in the data. This increases
memory consumption but allows fast random access. It's a trade-off, but one
that we would like to have in Parquet.
- A lot of new encodings have been proposed lately that offer good
compression while allowing fast, vectorized decompression. Many of them
also allow random access. The hard part is picking a good set of encodings
to add, so that we gain most of the benefits without bloating the number of
encodings, which would put an undue burden on every implementation.
- As discussed, a simple feature bitmap instead of a version would be
amazing, as it would allow us to quickly check with a binary OR whether our
engine has all the features necessary to read a given Parquet file (see the
last sketch below). I agree that having a compatibility matrix in a
prominent spot is important.
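To make the string PLAIN point concrete, here is a rough C++ sketch (the
buffer layouts are only illustrative, not a proposal for the wire format):
with the current interleaved layout, finding string i means decoding every
preceding length prefix, while an Arrow-style offsets-plus-data layout
yields string i with two array loads.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <string_view>
    #include <vector>

    // Current PLAIN layout for BYTE_ARRAY: [len0][bytes0][len1][bytes1]...
    // To reach string i we must decode all preceding lengths, and each step
    // depends on the length just read, which also serializes a forward scan.
    std::string_view PlainStringAt(const uint8_t* buf, size_t i) {
      const uint8_t* p = buf;
      uint32_t len;
      for (size_t k = 0; k < i; ++k) {
        std::memcpy(&len, p, sizeof(len));  // 4-byte length prefix
        p += sizeof(len) + len;             // can't advance without len
      }
      std::memcpy(&len, p, sizeof(len));
      return {reinterpret_cast<const char*>(p + sizeof(len)), len};
    }

    // Arrow-style layout: offsets[i] and offsets[i+1] delimit string i in a
    // separate data buffer, so access is O(1) and a scan has no cross-value
    // data dependency.
    std::string_view OffsetStringAt(const std::vector<int32_t>& offsets,
                                    const char* data, size_t i) {
      return {data + offsets[i],
              static_cast<size_t>(offsets[i + 1] - offsets[i])};
    }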
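Similarly, a sketch of the NULL trade-off, simplified to a flat,
non-nested column (the function names and layouts are just for
illustration): today the values of a nullable column are stored densely, so
reaching the value of row i needs a prefix sum over the definition levels,
whereas a layout with explicit NULL slots trades space for O(1) access.

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <vector>

    // Parquet today: non-NULL values are stored densely, so the value of
    // `row` sits at index "number of non-NULL rows before it", which needs
    // a prefix sum over the definition levels (O(row) unless precomputed).
    std::optional<int64_t> DenseValueAt(const std::vector<int16_t>& def_levels,
                                        int16_t max_def,
                                        const std::vector<int64_t>& values,
                                        size_t row) {
      if (def_levels[row] != max_def) return std::nullopt;  // row is NULL
      size_t value_idx = 0;
      for (size_t k = 0; k < row; ++k)
        value_idx += (def_levels[k] == max_def);
      return values[value_idx];
    }

    // Hypothetical alternative: NULL slots are materialized in the value
    // array, so row i maps directly to slot i. Larger, but O(1) access.
    std::optional<int64_t> ExplicitValueAt(const std::vector<uint8_t>& valid,
                                           const std::vector<int64_t>& values,
                                           size_t row) {
      if (!valid[row]) return std::nullopt;
      return values[row];
    }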
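And finally, a sketch of the feature-bitmap check (the flag names are made
up; the actual feature set would have to be defined in parquet-format): the
file advertises the features it uses, the reader advertises the features it
implements, and one OR plus one comparison tells us whether the file is
readable.

    #include <cstdint>

    // Hypothetical feature flags -- the real set would live in parquet-format.
    enum FeatureBits : uint64_t {
      kDeltaEncodings  = 1ull << 0,
      kByteStreamSplit = 1ull << 1,
      kPageIndex       = 1ull << 2,
      // ...
    };

    // True if every feature the file uses is also implemented by the reader:
    // OR-ing the file's bits into the reader's bits must not add new bits.
    bool CanRead(uint64_t file_features, uint64_t reader_features) {
      return (file_features | reader_features) == reader_features;
    }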
Thanks in advance to anyone willing to drive this! I'm happy to give more
input and collect further sentiments from our data lake folks.
Cheers,
Jan
[1] https://db.in.tum.de/downloads/publications/datablocks.pdf
On Tue, May 14, 2024 at 6:48 PM Julien Le Dem <[email protected]> wrote:
> +1 on Micah starting a doc and following up by commenting in it.
>
> @Raphael, Wish Maple: agreed that changing the metadata representation is
> less important. Most engines can externalize and index metadata in some
> way. It is an option to propose a standard way to do it without changing
> the format. Adding new encodings or making existing encodings more
> parallelizable is something that needs to be in the format and is more
> useful.
>
> On Tue, May 14, 2024 at 9:26 AM Antoine Pitrou <[email protected]> wrote:
>
> > On Mon, 13 May 2024 16:10:24 +0100
> > Raphael Taylor-Davies
> > <[email protected]>
> > wrote:
> > >
> > > I guess I wonder if rather than having a parquet format version 2, or
> > > even a parquet format version 3, we could just document what features a
> > > given parquet implementation actually supports. I believe Andrew
> > > intends to pick up on where previous efforts here left off.
> >
> > I also believe documenting implementation status is strongly desirable,
> > regardless of whether the discussion on "V3" goes anywhere.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>