Hey Fokko,

> Distributing all the schemas to the executors, where we would have to do the
> lookup and comparison.

I don’t think this would be a problem: the schema id in the DataFile should
only be used in the driver’s planning phase to determine the lower/upper
bounds, so no schema other than the current one needs to be distributed to the
executors. Even if all schemas were required, they could be retrieved via
SerializableTable’s lazyTable() method.
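
For example, a minimal sketch of that driver-side lookup (not an actual
implementation; DataFile#schemaId() is the proposed new field, everything else
is the existing API):

    import java.nio.ByteBuffer;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.types.Conversions;
    import org.apache.iceberg.types.Type;

    class BoundPlanning {
      // Runs on the driver during planning: decode a bound with the schema the
      // file was written with, looked up by the persisted schema id.
      static Object lowerBound(Table table, DataFile file, int fieldId) {
        Schema writeSchema = table.schemas().get(file.schemaId()); // proposed field
        Type boundType = writeSchema.findType(fieldId); // type used to encode the bound
        ByteBuffer buffer = file.lowerBounds().get(fieldId);
        return Conversions.fromByteBuffer(boundType, buffer);
      }
    }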

> Not being able to prune old schemas until they are no longer used anywhere
> (including all historical snapshots).

That’s indeed a problem. However, we haven’t added the ability to prune unused
schemas yet (which I would like to add as well, after RemoveUnusedSpecs), so we
can consider it when implementing this. BTW, I think it would be worthwhile,
since we can always rewrite old data files to use the latest schema.
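
For example, a rough sketch with the existing Spark rewrite action (this
assumes the Spark runtime is on the classpath; the filter is illustrative, and
targeting only files written with old schema ids would need the proposed
schema_id field):

    import org.apache.iceberg.Table;
    import org.apache.iceberg.expressions.Expressions;
    import org.apache.iceberg.spark.actions.SparkActions;

    class RewriteOldFiles {
      // Rewritten files are produced with the table's current schema, so their
      // metadata no longer references old schemas.
      static void rewriteWithLatestSchema(Table table) {
        SparkActions.get()
            .rewriteDataFiles(table)
            .filter(Expressions.alwaysTrue()) // illustrative: rewrite everything
            .execute();
      }
    }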

> On Aug 19, 2024, at 22:19, Fokko Driesprong <fo...@apache.org> wrote:
> 
> Thanks Ryan for bringing this up, that's an interesting problem, let me think 
> about this.
> 
>> we can persist schema_id in the DataFile
> 
> This was also my first thought. The two drawbacks are:
>     • Distributing all the schemas to the executors, where we would have to
>       do the lookup and comparison.
>     • Not being able to prune old schemas until they are no longer used
>       anywhere (including all historical snapshots).
> 
>> If we add the new type promotions, the current partition transforms will
>> produce different results for promotions such as int/long -> string, so the
>> partition() of a DataFile will no longer hold for promoted types. One
>> possible way to fix that would be to evolve the PartitionSpec into a new one?
> 
> That's a good call! Currently, we create partition spec evaluators based on 
> the partition-spec-id. Evolving the partition spec would fix it. When we 
> decide to include the schema-id, we would be able to create the evaluator 
> based on the (partition-spec-id, schema-id) tuple when evaluating the 
> partitions.
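> 
> A rough sketch of that tuple-keyed cache (the names here are illustrative,
> and schemaId() is the proposed field):
> 
>     import java.util.HashMap;
>     import java.util.Map;
>     import org.apache.iceberg.DataFile;
>     import org.apache.iceberg.expressions.Evaluator;
>     import org.apache.iceberg.util.Pair;
> 
>     class EvaluatorCache {
>       private final Map<Pair<Integer, Integer>, Evaluator> evaluators = new HashMap<>();
> 
>       // Cache one evaluator per (partition-spec-id, schema-id) so files written
>       // with different schemas under the same spec get distinct evaluators.
>       Evaluator evaluatorFor(DataFile file) {
>         return evaluators.computeIfAbsent(
>             Pair.of(file.specId(), file.schemaId()), // schemaId() is hypothetical
>             key -> createEvaluator(key.first(), key.second()));
>       }
> 
>       private Evaluator createEvaluator(int specId, int schemaId) {
>         throw new UnsupportedOperationException("illustrative only");
>       }
>     }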
> 
> Kind regards,
> Fokko
> 
> 
> 
> On Mon, Aug 19, 2024 at 15:59, Xianjin YE <xian...@apache.org> wrote:
>> Thanks Ryan for bringing this up.
>> 
>> > int and long to string
>> 
>> Could you elaborate a bit on how we can support type promotion for `int` and
>> `long` to `string` if the upper and lower bounds are already encoded as 4/8
>> bytes of binary? It seems that we cannot add promotions to string, as Piotr
>> pointed out?
>> 
>> 
>> > My rationale for not adding new information to track the bound types at 
>> > the time that the data file metadata is created is that it would inflate 
>> > the size of manifests and push out the timeline for getting v3 done.
>> 
>> There might be an easy/lightweight way to add this new metadata: we can
>> persist the schema_id in the DataFile. It still adds some extra size to the
>> manifest file, but that should be negligible?
>> 
>> And I think there’s another aspect to consider: whether the new type
>> promotions are compatible with partition transforms. Currently, all the
>> partition transforms produce the same result for promoted types: int ->
>> long, float -> double. If we add the new type promotions, the current
>> partition transforms will produce different results for promotions such as
>> int/long -> string, so the partition() of a DataFile will no longer hold for
>> promoted types. One possible way to fix that would be to evolve the
>> PartitionSpec into a new one?
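>> 
>> For example, bucketing hashes an int/long and its string form differently; a
>> minimal sketch with the Murmur3 32-bit hash that the bucket transform spec
>> uses (via Guava, assuming a value 34 bucketed before and after promotion):
>> 
>>     import java.nio.charset.StandardCharsets;
>>     import com.google.common.hash.Hashing;
>> 
>>     class BucketHashes {
>>       public static void main(String[] args) {
>>         // int/long buckets hash the value as a little-endian long ...
>>         int hashAsLong = Hashing.murmur3_32_fixed().hashLong(34L).asInt();
>>         // ... while string buckets hash the UTF-8 bytes, producing a
>>         // different hash, so bucket(N) would place the same logical value in
>>         // different partitions after an int/long -> string promotion.
>>         int hashAsString =
>>             Hashing.murmur3_32_fixed().hashString("34", StandardCharsets.UTF_8).asInt();
>>         System.out.println(hashAsLong != hashAsString); // true
>>       }
>>     }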
>> 
>> 
>>> On Aug 17, 2024, at 07:00, Ryan Blue <b...@apache.org> wrote:
>>> 
>>> I’ve recently been working on updating the spec for new types and type
>>> promotion cases in v3.
>>> 
>>> I was talking to Micah and he pointed out an issue with type promotion: the
>>> upper and lower bounds for data file columns that are kept in Avro manifests
>>> don’t have any information about the type that was used to encode the
>>> bounds.
>>> 
>>> For example, when writing to a table with a float column, 4: f, the
>>> manifest’s lower_bounds and upper_bounds maps will have an entry with the
>>> field ID (4) as the key and a 4-byte encoded float for the value. If column
>>> f were later promoted to double, those maps aren’t changed. The way we
>>> currently detect that the type was promoted is to check the binary value and
>>> read it as a float if there are 4 bytes instead of 8. This prevents us from
>>> adding int to double type promotion, because when there are 4 bytes we would
>>> not know whether the value was originally an int or a float.
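>>> 
>>> A rough illustration of that length-based check (not the exact Iceberg code;
>>> bounds use Iceberg's little-endian single-value serialization):
>>> 
>>>     import java.nio.ByteBuffer;
>>>     import java.nio.ByteOrder;
>>> 
>>>     class BoundDecoding {
>>>       // float -> double promotion is detectable because the widths differ ...
>>>       static double readDoubleBound(ByteBuffer bound) {
>>>         ByteBuffer buffer = bound.duplicate().order(ByteOrder.LITTLE_ENDIAN);
>>>         if (buffer.remaining() == 4) {
>>>           return buffer.getFloat(0); // written before promotion, as a float
>>>         }
>>>         return buffer.getDouble(0); // written as a double
>>>       }
>>>       // ... but if int -> double were also allowed, a 4-byte bound could be
>>>       // an int or a float, and there would be no way to tell them apart.
>>>     }
>>> 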
>>> Several of the type promotion cases from my previous email hit this problem.
>>> Date/time types to string, int and long to string, and long to timestamp are
>>> all affected. I think the best path forward is to add fewer type promotion
>>> cases to v3 and support only these new cases:
>>>     • int and long to string
>>>     • date to timestamp
>>>     • null/unknown to any
>>>     • any to variant (if supported by the Variant spec)
>>> 
>>> That list would allow us to keep using the current strategy and not add new
>>> metadata to track the type in our manifests. My rationale for not adding new
>>> information to track the bound types at the time that the data file metadata
>>> is created is that it would inflate the size of manifests and push out the
>>> timeline for getting v3 done. Many of us would like to get v3 released to
>>> get the timestamp_ns and variant types out. And if we can get at least some
>>> of the promotion cases out, that’s better.
>>> 
>>> To address type promotion in the long term, I think that we should consider
>>> moving to Parquet manifests. This has been suggested a few times so that we
>>> can project just the lower and upper bounds that are needed for scan
>>> planning. That would also fix type promotion because the manifest file
>>> schema would include full type information for the stats columns. Given the
>>> complexity of releasing Parquet manifests, I think it makes more sense to
>>> get a few promotion cases done now in v3 and follow up with the rest in v4.
>>> 
>>> Ryan
>>> 
>>> -- 
>>> Ryan Blue
>> 
