> That is what I would expect -- and I would expect an error if some
> subsequent variant instance had a different data type.
IIUC (and I mostly agree with this behavior), the shredding spec as currently
proposed does not fail in this case; it just loses the performance benefits of
shredding. I think this is a reasonable compromise: we expect schemas to be
mostly identical in practice, but type conflicts sometimes can't be helped,
and we want to be resilient to them. (A rough sketch of the first-occurrence
heuristic is at the end of this message.)

On Mon, Dec 16, 2024 at 3:10 AM Andrew Lamb <[email protected]> wrote:

> > This seems reasonable. I'd therefore expect that allowing reference
> > implementations to shred data by taking the schema of a field the first
> > time it appears as a reasonable heuristic?
>
> That is what I would expect -- and I would expect an error if some
> subsequent variant instance had a different data type.
>
> This is the same behavior I observed when trying to save json data into a
> parquet struct column using pyarrow. If some subsequent record contains a
> different schema than the first, a runtime error is thrown.
>
> Andrew
>
> On Mon, Dec 16, 2024 at 12:07 AM Micah Kornfield <[email protected]> wrote:
>
> > Hi Ryan,
> >
> > > In addition to being an important and basic guarantee of the format, I
> > > think there are a few other good reasons for this. Normalizing in the
> > > engine keeps the spec small while remaining flexible and expressive.
> > > For example, the value 12.00 (decimal(4,2)) is equivalent to the 12
> > > (int8) for some use cases, but not in others. If Parquet requires that
> > > 12.00 is always equivalent to 12, then values can't be trusted for the
> > > cases that use decimals for exact precision. Even if normalization is
> > > optional, you can't trust that it wasn't normalized at write time. In
> > > addition, the spec would need a lot more detail because Parquet would
> > > need to document rules for normalization. For instance, when 12 is
> > > stored as an int16, should it be normalized at read time to an int8?
> > > What about storing 12 as 12.00 (decimal(4,2))?
> >
> > Could you clarify your concerns here? The specification appears to
> > already at least partially do exactly this via "Type equivalence class"
> > (formerly known as Logical type) [1] of "exact numeric". If we don't
> > believe Parquet should be making this determination, maybe it should be
> > removed from the spec? I'm OK with the consensus expressed here with no
> > normalization and no extra metadata. These can always be added in a
> > follow-up revision if we find the existing modelling needs to be
> > improved.
> >
> > > But even if we were to allow Parquet to do this, we've already decided
> > > not to add similar optimizations that preserve types on the basis that
> > > they are not very useful. Earlier in our discussions, I suggested
> > > allowing multiple shredded types for a given field name. For instance,
> > > shredding to columns with different decimal scales. Other people
> > > pointed out that while this would be useful in theory, data tends to be
> > > fairly uniformly typed in practice and it wasn't worth the complexity.
> >
> > This seems reasonable. I'd therefore expect that allowing reference
> > implementations to shred data by taking the schema of a field the first
> > time it appears as a reasonable heuristic? More generally it might be
> > good to start discussing what API changes we expect are needed to
> > support shredding in reference implementations?
> >
> > Thanks,
> > Micah
> >
> > [1]
> > https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types
> >
> > On Wed, Dec 11, 2024 at 9:18 AM Russell Spitzer <[email protected]>
> > wrote:
> >
> > > For normalization I agree with Ryan. I was part of those other
> > > discussions and I think it does seem like this is an engine concern
> > > and not a storage one.
> > >
> > > I'm also ok with basically getting no value from min/max of
> > > non-shredded fields.
> > >
> > > On Wed, Dec 11, 2024 at 4:35 AM Antoine Pitrou <[email protected]> wrote:
> > >
> > > > On Mon, 9 Dec 2024 16:33:51 -0800
> > > > "[email protected]" <[email protected]> wrote:
> > > > > I think that Parquet should exactly reproduce the data that is
> > > > > written to files, rather than either allowing or requiring Parquet
> > > > > implementations to normalize types. To me, that's a fundamental
> > > > > guarantee of the storage layer. The compute layer can decide to
> > > > > normalize types and take actions to make storage more efficient,
> > > > > but storage should not modify the data that is passed to it.
> > > >
> > > > FWIW, I agree with this.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
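
To make the fallback concrete, here is a minimal sketch of the
first-occurrence heuristic mentioned above. It is not taken from the spec or
from any reference implementation, and the names in it (shred_batch,
typed_value, fallback_value) are hypothetical; it only illustrates taking a
field's shredded type from the first record that contains it and routing
conflicting values to the un-shredded value column instead of raising an
error.

from collections import defaultdict


def shred_batch(records):
    """records: list of dicts mapping field name -> (python_type, value).

    For brevity, this assumes every record contains every field. Returns the
    chosen shred type per field plus two parallel columns: the shredded
    typed_value column and the residual (variant-encoded) value column.
    """
    shred_types = {}                    # field -> type taken from first occurrence
    typed_value = defaultdict(list)     # field -> shredded column
    fallback_value = defaultdict(list)  # field -> un-shredded fallback column

    for record in records:
        for field, (typ, val) in record.items():
            chosen = shred_types.setdefault(field, typ)  # first occurrence wins
            if typ == chosen:
                typed_value[field].append(val)       # shredded fast path
                fallback_value[field].append(None)
            else:
                typed_value[field].append(None)      # type conflict: no error,
                fallback_value[field].append(val)    # just lose the shredding benefit

    return shred_types, typed_value, fallback_value


rows = [{"a": (int, 1)}, {"a": (int, 2)}, {"a": (str, "3")}]
types, typed, fallback = shred_batch(rows)
# types == {"a": int}
# typed["a"] == [1, 2, None]          (the str value is not shredded)
# fallback["a"] == [None, None, "3"]  (no error is raised)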
