> That is what I would expect -- and I would expect an error if some
> subsequent variant instance had a different data type.
IIUC (and I mostly agree with this behavior), the shredding spec as currently
proposed does not fail in this case; it just loses the performance benefits of
shredding. I think this is a reasonable compromise: we expect schemas to be
mostly identical in practice, but type conflicts sometimes can't be helped,
and we want to be resilient to them. (A rough sketch of the first-occurrence
heuristic is at the end of this message.)

On Mon, Dec 16, 2024 at 3:10 AM Andrew Lamb <[email protected]> wrote:

> > This seems reasonable. I'd therefore expect that allowing reference
> > implementations to shred data by taking the schema of a field the first
> > time it appears as a reasonable heuristic?
>
> That is what I would expect -- and I would expect an error if some
> subsequent variant instance had a different data type.
>
> This is the same behavior I observed when trying to save json data into a
> parquet struct column using pyarrow. If some subsequent record contains a
> different schema than the first, a runtime error is thrown.
>
> Andrew
>
> On Mon, Dec 16, 2024 at 12:07 AM Micah Kornfield <[email protected]> wrote:
>
> > Hi Ryan,
> >
> > > In addition to being an important and basic guarantee of the format, I
> > > think there are a few other good reasons for this. Normalizing in the
> > > engine keeps the spec small while remaining flexible and expressive.
> > > For example, the value 12.00 (decimal(4,2)) is equivalent to the 12
> > > (int8) for some use cases, but not in others. If Parquet requires that
> > > 12.00 is always equivalent to 12, then values can't be trusted for the
> > > cases that use decimals for exact precision. Even if normalization is
> > > optional, you can't trust that it wasn't normalized at write time. In
> > > addition, the spec would need a lot more detail because Parquet would
> > > need to document rules for normalization. For instance, when 12 is
> > > stored as an int16, should it be normalized at read time to an int8?
> > > What about storing 12 as 12.00 (decimal(4,2))?
> >
> > Could you clarify your concerns here? The specification appears to
> > already at least partially do exactly this via "Type equivalence class"
> > (formerly known as Logical type) [1] of "exact numeric". If we don't
> > believe Parquet should be making this determination, maybe it should be
> > removed from the spec? I'm OK with the consensus expressed here with no
> > normalization and no extra metadata. These can always be added in a
> > follow-up revision if we find the existing modelling needs to be
> > improved.
> >
> > > But even if we were to allow Parquet to do this, we've already decided
> > > not to add similar optimizations that preserve types on the basis that
> > > they are not very useful. Earlier in our discussions, I suggested
> > > allowing multiple shredded types for a given field name. For instance,
> > > shredding to columns with different decimal scales. Other people
> > > pointed out that while this would be useful in theory, data tends to be
> > > fairly uniformly typed in practice and it wasn't worth the complexity.
> >
> > This seems reasonable. I'd therefore expect that allowing reference
> > implementations to shred data by taking the schema of a field the first
> > time it appears as a reasonable heuristic? More generally it might be
> > good to start discussing what API changes we expect are needed to
> > support shredding in reference implementations?
> >
> > Thanks,
> > Micah
> >
> > [1]
> > https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types
> >
> > On Wed, Dec 11, 2024 at 9:18 AM Russell Spitzer <[email protected]>
> > wrote:
> >
> > > For normalization I agree with Ryan. I was part of those other
> > > discussions and I think it does seem like this is an engine concern
> > > and not a storage one.
> > >
> > > I'm also ok with basically getting no value from min/max of
> > > non-shredded fields.
> > >
> > > On Wed, Dec 11, 2024 at 4:35 AM Antoine Pitrou <[email protected]> wrote:
> > >
> > > > On Mon, 9 Dec 2024 16:33:51 -0800
> > > > "[email protected]" <[email protected]> wrote:
> > > > > I think that Parquet should exactly reproduce the data that is
> > > > > written to files, rather than either allowing or requiring Parquet
> > > > > implementations to normalize types. To me, that's a fundamental
> > > > > guarantee of the storage layer. The compute layer can decide to
> > > > > normalize types and take actions to make storage more efficient,
> > > > > but storage should not modify the data that is passed to it.
> > > >
> > > > FWIW, I agree with this.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
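
To make the fallback concrete, here is a minimal sketch of the
first-occurrence heuristic mentioned above. It is not taken from the spec or
from any reference implementation, and the names in it (shred_batch,
typed_value, fallback_value) are hypothetical; it only illustrates taking a
field's shredded type from the first record that contains it and routing
conflicting values to the un-shredded value column instead of raising an
error.

from collections import defaultdict


def shred_batch(records):
    """records: list of dicts mapping field name -> (python_type, value).

    For brevity, this assumes every record contains every field. Returns the
    chosen shred type per field plus two parallel columns: the shredded
    typed_value column and the residual (variant-encoded) value column.
    """
    shred_types = {}                    # field -> type taken from first occurrence
    typed_value = defaultdict(list)     # field -> shredded column
    fallback_value = defaultdict(list)  # field -> un-shredded fallback column

    for record in records:
        for field, (typ, val) in record.items():
            chosen = shred_types.setdefault(field, typ)  # first occurrence wins
            if typ == chosen:
                typed_value[field].append(val)       # shredded fast path
                fallback_value[field].append(None)
            else:
                typed_value[field].append(None)      # type conflict: no error,
                fallback_value[field].append(val)    # just lose the shredding benefit

    return shred_types, typed_value, fallback_value


rows = [{"a": (int, 1)}, {"a": (int, 2)}, {"a": (str, "3")}]
types, typed, fallback = shred_batch(rows)
# types == {"a": int}
# typed["a"] == [1, 2, None]          (the str value is not shredded)
# fallback["a"] == [None, None, "3"]  (no error is raised)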
