Hey all,

> There might be an easy/light way to add this new metadata: we can persist
schema_id in the DataFile. It still adds some extra size to the manifest
file but should be negligible?

I do think the size is probably negligible (at least relative to the value
we'd get out of having that field), but the main downside of that approach
is that it's still an additional spec change. If we later go down the route
of Parquet manifests, that field largely becomes moot/throw-away work,
since the type information would already exist in Parquet.

> And I think there’s also another aspect to consider: whether the new type
promotion is compatible with partition transforms. Currently all the
partition transforms produce the same result for promoted types: int ->
long, float -> double. If we are adding a new type promotion, current
partition transforms will produce different results for type promotion such
as int/long -> string, so the partition() of DataFile will not hold for
promoted types. One possible way to fix that would be evolving the
PartitionSpec with a new one?

Yes, an important part of type promotion is validating that whatever
evolution is being attempted can actually happen when the column being
evolved is part of a partition transform! I was working on an
implementation for this, and so far it's just a strict check that the field
being promoted is not part of the partition spec. The proposed way of
evolving the spec sounds plausible, specifically for int/long -> string,
although I'd need to double-check and think through the implications.

Thanks,

Amogh Jahagirdar



On Mon, Aug 19, 2024 at 8:43 AM Xianjin YE <xian...@apache.org> wrote:

> Hey Fokko,
>
> > Distribute all the schemas to the executors, and we have to do the
> lookup and comparison there.
>
> I don’t think this would be a problem: the schema id in the DataFile
> should only be used in the driver’s planning phase to determine the
> lower/upper bounds, so no schema other than the current one should be
> distributed to the executors. Even if all schemas are required, they can
> be retrieved from SerializableTable’s lazyTable() method.
>
> > Not being able to prune old schema until they are not used anymore
> (including all historical snapshots).
>
> That’s indeed a problem. However, we haven’t added the ability to prune
> unused schemas yet (which I would like to add too, after
> RemoveUnusedSpecs), so we can consider that when implementing. BTW, I
> think it might be worthwhile, as we can always rewrite old data files to
> use the latest schema.
>
> On Aug 19, 2024, at 22:19, Fokko Driesprong <fo...@apache.org> wrote:
>
> Thanks Ryan for bringing this up. That's an interesting problem; let me
> think about it.
>
> we can persist schema_id in the DataFile
>
>
> This was also my first thought. The two drawbacks are:
>
>    - Distribute all the schemas to the executors, and we have to do the
>    lookup and comparison there.
>    - Not being able to prune old schema until they are not used anymore
>    (including all historical snapshots).
>
> If we are adding a new type promotion, current partition transforms will
>> produce different results for type promotions such as int/long -> string, so
>> the partition() of the DataFile will not hold for promoted types. One possible
>> way to fix that would be evolving the PartitionSpec with a new one?
>
>
> That's a good call! Currently, we create partition spec evaluators based
> on the partition-spec-id. Evolving the partition spec would fix it. When we
> decide to include the schema-id, we would be able to create the evaluator
> based on the (partition-spec-id, schema-id) tuple when evaluating the
> partitions.
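As a rough illustration of that idea (purely a sketch; `build_evaluator` and `evaluator_for` are stand-ins, not Iceberg's actual classes), the evaluator cache would move from a single-key to a composite-key lookup:

```python
from functools import lru_cache

# Illustrative sketch: instead of caching partition evaluators by
# partition-spec-id alone, key the cache by the (partition-spec-id,
# schema-id) pair, so a file written under an older schema gets an
# evaluator that matches how its partition values were produced.

def build_evaluator(spec_id, schema_id):
    # In a real implementation this would bind the spec's transforms to the
    # column types from the given schema; here it just returns a marker.
    return f"evaluator(spec={spec_id}, schema={schema_id})"

@lru_cache(maxsize=None)
def evaluator_for(spec_id, schema_id):
    return build_evaluator(spec_id, schema_id)

# Two files sharing a spec but written under different schemas get
# distinct evaluators, while repeated lookups hit the cache:
e1 = evaluator_for(0, 1)
e2 = evaluator_for(0, 2)
assert e1 != e2
assert evaluator_for(0, 1) is e1
```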
>
> Kind regards,
> Fokko
>
>
>
> Op ma 19 aug 2024 om 15:59 schreef Xianjin YE <xian...@apache.org>:
>
>> Thanks Ryan for bringing this up.
>>
>> > int and long to string
>>
>>
>> Could you elaborate a bit on how we can support type promotion for `int`
>> and `long` to `string` if the upper and lower bounds are already encoded as
>> 4/8-byte binary? It seems that we cannot add promotions to string, as Piotr
>> pointed out?
>>
>>
>> > My rationale for not adding new information to track the bound types
>> at the time that the data file metadata is created is that it would inflate
>> the size of manifests and push out the timeline for getting v3 done.
>>
>> There might be an easy/light way to add this new metadata: we can persist
>> schema_id in the DataFile. It still adds some extra size to the manifest
>> file but should be negligible?
>>
>> And I think there’s also another aspect to consider: whether the new type
>> promotion is compatible with partition transforms. Currently all the
>> partition transforms produce the same result for promoted types: int ->
>> long, float -> double. If we are adding a new type promotion, current
>> partition transforms will produce different results for type promotions
>> such as int/long -> string, so the partition() of the DataFile will not
>> hold for promoted types. One possible way to fix that would be evolving
>> the PartitionSpec with a new one?
>>
>>
>> On Aug 17, 2024, at 07:00, Ryan Blue <b...@apache.org> wrote:
>>
>> I’ve recently been working on updating the spec for new types and type
>> promotion cases in v3.
>> I was talking to Micah and he pointed out an issue with type promotion:
>> the upper and lower bounds for data file columns that are kept in Avro
>> manifests don’t have any information about the type that was used to encode
>> the bounds.
>> For example, when writing to a table with a float column, 4: f, the
>> manifest’s lower_bounds and upper_bounds maps will have an entry with the
>> field ID (4) as the key and a 4-byte encoded float for the value. If column
>> f were later promoted to double, those maps aren’t changed. The way we
>> currently detect that the type was promoted is to check the binary value
>> and read it as a float if there are 4 bytes instead of 8. This prevents us
>> from adding int to double type promotion because when there are 4 bytes we
>> would not know whether the value was originally an int or a float.
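To illustrate why the byte-width check breaks down (a standalone sketch, not Iceberg code; the little-endian layout mirrors Iceberg's single-value serialization): int and float bounds are both serialized as 4 bytes, so once int -> double promotion were allowed, the width alone could not tell a reader how to decode an old bound.

```python
import struct

# An int32 bound and a float32 bound are both 4 bytes on the wire. With only
# float -> double promotion, a 4-byte bound under a double column must be a
# float. Allowing int -> double would make the 4-byte case ambiguous.
int_bound = struct.pack("<i", 1234)      # 4 bytes, originally an int
float_bound = struct.pack("<f", 1234.0)  # 4 bytes, originally a float

# The payloads are the same length; a width check cannot distinguish them.
assert len(int_bound) == len(float_bound) == 4

# Decoding with the wrong type silently produces a wrong value:
print(struct.unpack("<f", int_bound)[0])    # int bits read as float: garbage
print(struct.unpack("<i", float_bound)[0])  # float bits read as int: garbage
```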
>> Several of the type promotion cases from my previous email hit this
>> problem. Date/time types to string, int and long to string, and long to
>> timestamp are all affected. I think the best path forward is to add fewer
>> type promotion cases to v3 and support only these new cases:
>> • int and long to string
>> • date to timestamp
>> • null/unknown to any
>> • any to variant (if supported by the Variant spec)
>> That list would allow us to keep using the current strategy and not add
>> new metadata to track the type to our manifests. My rationale for not
>> adding new information to track the bound types at the time that the data
>> file metadata is created is that it would inflate the size of manifests and
>> push out the timeline for getting v3 done. Many of us would like to get v3
>> released to get the timestamp_ns and variant types out. And if we can get
>> at least some of the promotion cases out that’s better.
>> To address type promotion in the long term, I think that we should
>> consider moving to Parquet manifests. This has been suggested a few times
>> so that we can project just the lower and upper bounds that are needed for
>> scan planning. That would also fix type promotion because the manifest file
>> schema would include full type information for the stats columns. Given the
>> complexity of releasing Parquet manifests, I think it makes more sense to
>> get a few promotion cases done now in v3 and follow up with the rest in v4.
>> Ryan
>>
>> --
>> Ryan Blue
>>
>>
>>
>
