Thank you Wes for the great summary and Jan for the thoughtful reply.

I think those are very valid points and areas for improvement.
There is a clear pattern in a few areas where IMO we can work on building
consensus independently:
 - metadata: an easier way to read metadata that doesn't require
deserializing it as a whole, whether in the footer or not (page format,
etc.); a strawman sketch follows this list.
 - more modern encodings: for better parallelization and/or random access;
multiple recent papers propose better encodings.
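
To make the metadata bullet concrete: one strawman shape for "readable
without deserializing it as a whole" is a small directory of
per-column-chunk offsets, so a reader fetches just the metadata it needs
with two small range reads. A minimal C++ sketch; the layout and all names
are hypothetical, not a proposal for the actual wire format:

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Hypothetical footer directory: one fixed-size entry per column
    // chunk, each pointing at an independently decodable metadata block.
    struct ColumnMetaRef {
        uint64_t offset;  // absolute file offset of the metadata block
        uint32_t length;  // size of the block in bytes
    };

    // Byte range to fetch for column i's metadata: read the directory,
    // then this one block, instead of decoding the entire footer.
    std::pair<uint64_t, uint32_t>
    meta_range(const std::vector<ColumnMetaRef>& dir, size_t i) {
        return {dir[i].offset, dir[i].length};
    }

Whether this lives in the footer or at page level, and how the blocks are
encoded, would be separate discussions.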

Parquet is an open source project that defines a format. As long as the
community is motivated to advance the project and evolve the format, we
can tackle these.
To me the distinction between calling it a major revision with incompatible
changes vs a new format is a bit of a moot point. What matters is that the
Parquet community wants to improve it.

As there's momentum building around this idea, I think we should gather a
list of people interested in giving input or feedback, together with the
implementations/engines in which they would want to build support for
these changes. That would give us an idea of how to plan validating the
changes and confirming they work for their use cases.



On Wed, May 15, 2024 at 7:27 PM Jan Finis <jpfi...@gmail.com> wrote:

> Thanks for bringing up this topic!
>
> This is an important topic to me and my team, as we maintain a proprietary
> implementation of Parquet in addition to our own proprietary format [1]
> that was designed around the same time as Parquet, so we always had
> comparisons between formats. I have also talked with other big lakehouse
> players from industry, with folks from DuckDB, and personally with the
> author of BtrBlocks. All in all, I have spent countless hours discussing
> Parquet with others, asking: what would we do differently if we
> redesigned a format like Parquet today?
>
> I agree with the sentiment that Parquet has to innovate, or it will be
> replaced by some successor format in the future. For now, it is still the
> main format used by table formats like Iceberg and Delta Lake, and it has
> the big advantage of being ubiquitous, so any new format has to be
> considerably better to stand a chance. However, given the many new
> formats and encodings popping up lately, such a format will eventually be
> released, and once it takes off, data lake vendors will adopt it rapidly
> if the advantages over Parquet are large enough. In fact, I have already
> heard in personal conversations that a player in the data lake space is
> working on proposing such a format. We ourselves struggle with the
> disadvantages of Parquet and would quickly adopt any format that fixes
> them.
>
> Concerning the "It didn't already work with V2" argument, I believe the
> failure of some future V3 format cannot be deduced from the missing
> adoption of the V2 format. The V2 format just had some birth defects that
> basically dug its own grave. Among others:
>
>    - It is not even prominently documented what V2 actually is. When
>    implementing our proprietary Parquet reader, we often asked ourselves
>    what v2 is. We had to dig into implementation code to get any clues,
>    and even then it wasn't fully clear whether what the implementation
>    does is the general rule or just a choice of that implementation.
>    E.g., is DataPageV2 part of Parquet V2? Apparently it is not. Is it
>    just some new encodings? Or is there more? There should simply have
>    been a clear document in the parquet-format repo outlining what
>    Parquet v2 is.
>    - The encodings introduced in v2 just aren't that good.
>    DELTA_BYTE_ARRAY is horribly slow to decode and makes any random
>    access impossible. DELTA_BINARY_PACKED at least allows some
>    vectorization, but also makes random access hard. All in all, there is
>    just no clear advantage of v2 over v1. We consciously decided against
>    using v2 in our lake even though our engines can read it, since the
>    encodings are just too slow to decode. There isn't even any
>    documentation showing experimental numbers that compare the encodings.
>    Why should people use these more complex encodings when the benefit is
>    unclear (and often non-existent)?
>    - It seems that Parquet itself discourages the use of v2, as "it is
>    not supported by many implementations". This somewhat defeatist stance
>    is of course not helpful to the cause.
>    - We have a chicken-and-egg problem here: since the format doesn't
>    show large benefits, almost no one writes v2 files. And since
>    therefore almost no v2 files exist, no one feels the need to support
>    v2.
>
> Concerning the "any new feature will not be implemented anyway" argument, I
> don't think this is true. I have seen this stance on this mailing list and
> in the Parquet community a lot in the past years, and even if there might
> be a speck of truth to it, it is again a defeatist stance that in the end
> hurts Parquet. This might be true for v2, for the problems mentioned above.
> But any new feature that displays tangible improvements will be adopted
> rather quickly by implementations. My company would implement new encodings
> that promise more compression while not making decoding slower with high
> priority. And so would other data lake vendors. With this, the chicken egg
> problem mentioned above would be resolved: The more vendors use new
> encodings in their lakes, the more pressure to support these is put onto
> all implementations.
>
> One valid argument against v3, already brought up repeatedly, is that if
> we need to completely gut Parquet and replace many aspects of it to reach
> the goals of v3, then the resulting format just isn't Parquet anymore. So
> maybe we will have to move on to a completely different format one day,
> but until then, I would love to see improvements in Parquet. The good
> thing about improving Parquet instead of switching to a totally different
> format is that we can mix and match, and still retain the countless
> optimizations we have implemented for Parquet over the years.
>
> So, what are the shortcomings that should be fixed? A lot of good points
> have already been mentioned. As yet another data point, here are the
> issues that we struggle with, ordered by severity:
>
>
>    - Missing random access. Parquet isn't made for random access, and
>    while this is okay for most queries that just scan the whole file,
>    there are many scenarios where it is a problem. Queries can filter out
>    many rows, and if the format then still requires doing a lot of work,
>    that is a problem. Also, things like secondary indexes are hard if you
>    do not have random access. For example, extracting a single row with a
>    known row index from a Parquet file requires an insane amount of work.
>    In contrast, in our own format [1], we have made sure that all
>    encodings we use allow O(1) random access. This means that we cannot
>    use some nice encodings (e.g. RLE), but in return we can access any
>    value with just a few assembly instructions. The good thing about
>    Parquet is that it gives choices to the user. Not all encodings need
>    to allow fast random access, but there should be some for all data
>    types, so that users who require fast random access can use those.
>    Here are the top missing pieces IMHO:
>       - PLAIN encoding for strings doesn't allow random access, as it
>       interleaves string lengths with string data. This is just
>       unnecessary, as it is simple to design an encoding without this
>       flaw and without any real drawbacks (e.g., see how Arrow does it
>       with an offset array and separate string data; there is a rough
>       sketch after this list). We should propose such a new string PLAIN
>       encoding and deprecate the current one. Not only does the current
>       one not allow random access, it is also slow to decode: due to the
>       interleaved lengths, reading a value has a data dependency on the
>       preceding length, so the CPU cannot execute a scan out of order.
>       - Metadata decoding is all-or-nothing, as already discussed. This
>       exacerbates the random I/O problem.
>       - To randomly access a column with NULL values, we first need
>       prefix sums over the D-levels to know which encoded value is the
>       one we're looking for. There should be a way to encode a column
>       with NULLs where NULL values are represented explicitly in the data
>       (see the second sketch below). This increases memory consumption,
>       but allows fast random access. It's a trade-off, but one that we
>       would like to have in Parquet.
>    - A lot of new encodings have been proposed lately that achieve good
>    compression while allowing fast, vectorized decompression. Many of
>    them also allow random access. It is hard to pick a good set of
>    encodings to add, so that we gain most of the benefits without
>    bloating the number of encodings, which would put an undue
>    implementation burden on every implementation.
>    - As discussed, a simple feature bitmap instead of a version would be
>    amazing, as it would allow us to do a quick bitwise check to see
>    whether our engine has all the features necessary to read a given
>    Parquet file (see the last sketch below). I agree that having a
>    compatibility matrix in a prominent spot is an important thing to
>    have.
>
> Thanks in advance to anyone willing to drive this! I'm happy to give more
> input and collect further sentiments from our data lake folks.
>
> Cheers,
> Jan
>
> [1] https://db.in.tum.de/downloads/publications/datablocks.pdf
>
> On Tue, May 14, 2024 at 18:48, Julien Le Dem <jul...@apache.org> wrote:
>
> > +1 on Micah starting a doc and following up by commenting in it.
> >
> > @Raphael, Wish Maple: agreed that changing the metadata representation
> > is less important. Most engines can externalize and index metadata in
> > some way. One option is to propose a standard way to do that without
> > changing the format. Adding new encodings or making existing encodings
> > more parallelizable is something that needs to be in the format, and is
> > more useful.
> >
> > On Tue, May 14, 2024 at 9:26 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > > On Mon, 13 May 2024 16:10:24 +0100
> > > Raphael Taylor-Davies
> > > <r.taylordav...@googlemail.com.INVALID>
> > > wrote:
> > > >
> > > > I guess I wonder if, rather than having a parquet format version 2,
> > > > or even a parquet format version 3, we could just document which
> > > > features a given parquet implementation actually supports. I believe
> > > > Andrew intends to pick up where previous efforts here left off.
> > >
> > > I also believe documenting implementation status is strongly desirable,
> > > regardless of whether the discussion on "V3" goes anywhere.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> >
>
