Hi Fokko,
> In this case, we still need to keep the schemas. As an example:

The example you gave is close to what I was imagining (if we get to details I might have a slightly different organization). This might be semantics, but I don't see this as keeping "schemas", since all the data is present in the current schema and not in historical schemas (i.e. schemas with different IDs).

> My main concern is that this can explode in complexity, especially when we end up with cycles: date → timestamp → string → datetime. With custom formatting it is also easy to go: timestamp → string → timestamptz.

To be clear, I'm not advocating for introducing all of these elements, but I think leaving the door open for them is useful. On the topic of complexity, here is an example of where having explicit lineage modelled is useful: long->timestamp->string (timestamp to string is not proposed yet but has come up in private conversations). I assume we want timestamp->string to produce a human-readable timestamp? If we don't store explicit lineage, what will the process be to ensure we get "2024-08-20T09:51:51-07:00" instead of "1724172711000" when we encounter a file written while the data was still a "long"? (There is a rough sketch of this at the end of this message.)

> I agree, and this is also mentioned in the proposal. But each granularity has different types, e.g. timestamp, timestamp_ns. So this is not really a problem.

Sorry, I don't think I've seen the proposal (https://github.com/apache/iceberg/issues/9065 is light on details and maybe my mailing list searches are failing to turn up anything on dev@). Given that TIMESTAMP_NS didn't exist prior to V3, I'd expect at least some people have been storing this data as long and might want to convert it to a proper timestamp column; assuming the long is in milliseconds appears to preclude this. Similarly, it seems like we would be ruling out casting from seconds.

> I think that's fair <https://github.com/apache/iceberg/pull/5151>, but believe we should be opinionated about how we store data. Would we be able to write hex-encoded strings?

The issue here is that binary->string (UTF-8) can fail and should ideally be handled; hex-encoded strings cannot fail. Right now I believe there is again ambiguity over whether we would apply bytes->string for default values, since we don't have enough metadata at the moment to distinguish between these two interpretations in the JSON-serialized initial value (currently bytes are written as hex-encoded values, so if we read a file that did not have the existing column present, it would surface the hex-encoded bytes and not Avro's/an in-place cast interpretation of the data).

> I'm hesitant to introduce this complexity since I think for example the formatting or parsing of custom formats should be done in the engine and not the format.

Again, I'm not suggesting we need to introduce complexity. Given the examples raised above, I think there is already an issue with the transforms in scope. The proposed solution of new field IDs and explicit lineage solves the short-term problems and, if we wish, gives more flexibility in the long run.
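To make the long -> timestamp -> string example above concrete, here is a rough sketch of the resolution I have in mind (plain Python, purely for illustration; the epoch-milliseconds assumption and function names are mine, not anything from the spec):

from datetime import datetime, timezone

def resolve_with_lineage(raw: int) -> str:
    # Lineage tells us the old long held epoch milliseconds, so we promote
    # through timestamp before rendering the string.
    ts = datetime.fromtimestamp(raw / 1000, tz=timezone.utc)
    return ts.isoformat()

def resolve_without_lineage(raw: int) -> str:
    # With only "the column is now a string", the obvious cast is str(long).
    return str(raw)

print(resolve_with_lineage(1724172711000))     # 2024-08-20T16:51:51+00:00
print(resolve_without_lineage(1724172711000))  # 1724172711000

With explicit lineage a reader knows to apply the intermediate cast; without it, both answers look plausible.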
Thanks,
Micah

On Tue, Aug 20, 2024 at 4:32 AM Fokko Driesprong <fo...@apache.org> wrote:

>> Yes, I was thinking it would be a recursive structure that tracked each change. Cleanup could be achieved by also tracking the schema ID of the last time the field was present along with the schema ID of the written data files in manifests (as discussed on the other threads), and cleaning up lineage when no data files were written in the schema.
>
> In this case, we still need to keep the schemas.
>
> As an example:
>
> {"name": "col_1", "field-id": 1, "type": int}
>
> Schema after type alteration ("alter col_1 set type long"):
>
> {"name": "col_1", "field-id": 2, "initial-default": {
>   "function_name": to_long?
>   "input_argument": {
>     "column_id": 1,
>     "column_type": int
>   }
> }}
>
> Schema after type alteration ("alter col_1 set type string"):
>
> {"name": "col_1", "field-id": 3, "initial-default": {
>   "function_name": to_string
>   "input_argument": {
>     "column_id": 2,
>     "column_type": long
>   }
> }}
>
> If I understand correctly, you're suggesting:
>
> {"name": "col_1", "field-id": 3, "initial-default": {
>   "function_name": to_string
>   "input_argument": {
>     "column_id": 2,
>     "column_type": {
>       "function_name": to_long?
>       "input_argument": {
>         "column_id": 1,
>         "column_type": int
>       }
>     }
>   }
> }}
>
> My main concern is that this can explode in complexity, especially when we end up with cycles: date → timestamp → string → datetime. With custom formatting it is also easy to go: timestamp → string → timestamptz.
>
>> * There is already an existing proposal <https://github.com/apache/iceberg/issues/9065> to promote from Long->Timestamp [2], which assumes milliseconds. This seems like an arbitrary choice, where one could reasonably have other time granularities stored in Long values.
>
> I agree, and this is also mentioned in the proposal. But each granularity has different types, e.g. timestamp, timestamp_ns. So this is not really a problem.
>
>> * Will "bytes to string" be a candidate? There are two reasonable approaches for the underlying data: either using a hex-encoded value or doing an in-place conversion assuming all data is UTF-8.
>
> I think that's fair <https://github.com/apache/iceberg/pull/5151>, but believe we should be opinionated about how we store data. Would we be able to write hex-encoded strings?
>
>> I'd argue that once a schema is going from "any type"->"string", something was fairly wrong with the data modelling initially; providing more tools to help users fix these types of issues seems beneficial in the long run (again, not something that needs to be done now, but laying the ground-work is useful).
>
> Similar to the point above, we should be opinionated about this. For example, historically we've been parsing dates strictly; as an example, see DateTimeUtil <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/util/DateTimeUtil.java>.
>
> I'm hesitant to introduce this complexity since I think for example the formatting or parsing of custom formats should be done in the engine and not the format.
>
> Kind regards,
> Fokko Driesprong
>
>
> On Tue, Aug 20, 2024 at 05:58, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> Hi Xianjin,
>>
>>> Could you elaborate a bit more on how the Parquet manifest would fix the type promotion problem? If the lower and upper bounds are still a map of <int, binary>, I don't think we can perform column pruning on that, and the type information of the stat column is still missing.
>>
>> I think the idea with Parquet files is one would no longer use a map to track these statistics but instead have a column per field-id/statistics pair. So for each column in the original schema, one would want to have either:
>>
>> struct col1_stats {
>>   int min_value;
>>   int max_value;
>>   long null_count;
>>   etc.
>> }
>>
>> or a struct per statistic with each column:
>>
>> struct min_stats {
>>   int col1;
>>   string col2;
>>   etc.
>> }
>>
>> This is similar to how partition values are stored today in Avro. And I don't think there is anything stopping us from doing this in Avro either, except it is potentially less useful because you can't save much by selecting only columns in Avro.
>>
>> Cheers,
>> Micah
>>
>> On Mon, Aug 19, 2024 at 7:50 PM xianjin <xian...@apache.org> wrote:
>>
>>> Hey Ryan,
>>>
>>> Thanks for the reply, it clears most things up. Some responses inline:
>>>
>>> > This ends up being a little different because we can detect more cases when the bounds must have been strings — any time when the length of the upper and lower bound is different. Because strings tend to have longer values (we truncate to 16 chars by default) and are not typically a fixed number of bytes, we don't expect very many cases in which you'd have to discard the string bounds.
>>>
>>> > That said, if we want to be safe we could also wait to introduce int and long to string promotion, or put it behind a flag to opt in.
>>>
>>> Thanks for the clarification. I would prefer to put it behind a flag to opt in, considering the complexity it involves and the potential interest from users.
>>>
>>> > We've discussed this topic before and the consensus has been to allow type promotion if there is no partition field that would be affected. Here's the wording I've added in the PR:
>>>
>>> > Type promotion is not allowed for a field that is referenced by source-id or source-ids of a partition field if the partition transform would produce a different value after promoting the type.
>>>
>>> > I think that covers the cases.
>>>
>>> Yes, I think it's indeed a good solution.
>>>
>>> > There might be an easy/light way to add this new metadata: we can persist schema_id in the DataFile.
>>>
>>> > Fokko already covered some of the challenges here and I agree with those. I think that we want to be able to discard older schemas or not send them to clients. And I think that since we can fix this in the next rev of the spec, adding metadata just for this purpose is not a good strategy. I'd much rather work on the long-term fix than add metadata and complicated logic to pass schemas through and detect the type.
>>>
>>> > I think it might be worthy as we can always rewrite old data files to use the latest schema.
>>>
>>> > I agree, but this doesn't solve the problem with the current manifest files. This is something I would introduce when we add Parquet manifest files.
>>>
>>> Could you elaborate a bit more on how the Parquet manifest would fix the type promotion problem? If the lower and upper bounds are still a map of <int, binary>, I don't think we can perform column pruning on that, and the type information of the stat column is still missing.
>>>
>>> On Tue, Aug 20, 2024 at 1:02 AM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>
>>>> > If the reader logic depends on the length of data in bytes, will this prevent us from adding any type promotions to string?
>>>>
>>>> This ends up being a little different because we can detect more cases when the bounds must have been strings — any time when the length of the upper and lower bound is different. Because strings tend to have longer values (we truncate to 16 chars by default) and are not typically a fixed number of bytes, we don't expect very many cases in which you'd have to discard the string bounds.
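A minimal sketch of the length-based detection described above (illustrative Python, not Iceberg code; it ignores corner cases such as truncation splitting a multi-byte character):

def read_bounds_for_string_promoted_column(lower: bytes, upper: bytes):
    # Bounds are stored as untagged bytes. int/long bounds are fixed-width
    # (4 or 8 bytes) and both bounds share that width, so bounds with
    # differing lengths, or a length other than 4 or 8, must already be
    # (possibly truncated) UTF-8 string bounds.
    if len(lower) != len(upper) or len(lower) not in (4, 8):
        return lower.decode("utf-8"), upper.decode("utf-8")
    # Equal 4- or 8-byte bounds are ambiguous (numeric values or two
    # equal-length strings), so a careful reader would discard them.
    return None, None

The same length trick is what makes float -> double work today (a 4-byte bound must be a float), and it is what breaks down as soon as two promotable source types share a width.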
>>>> That said, if we want to be safe we could also wait to introduce int and long to string promotion, or put it behind a flag to opt in.
>>>>
>>>> > There might be an easy/light way to add this new metadata: we can persist schema_id in the DataFile.
>>>>
>>>> Fokko already covered some of the challenges here and I agree with those. I think that we want to be able to discard older schemas or not send them to clients. And I think that since we can fix this in the next rev of the spec, adding metadata just for this purpose is not a good strategy. I'd much rather work on the long-term fix than add metadata and complicated logic to pass schemas through and detect the type.
>>>>
>>>> > I think there's also another aspect to consider: whether the new type promotion is compatible with partition transforms.
>>>>
>>>> We've discussed this topic before and the consensus has been to allow type promotion if there is no partition field that would be affected. Here's the wording I've added in the PR:
>>>>
>>>> Type promotion is not allowed for a field that is referenced by source-id or source-ids of a partition field if the partition transform would produce a different value after promoting the type.
>>>>
>>>> I think that covers the cases.
>>>>
>>>> > I think it might be worthy as we can always rewrite old data files to use the latest schema.
>>>>
>>>> I agree, but this doesn't solve the problem with the current manifest files. This is something I would introduce when we add Parquet manifest files.
>>>>
>>>> On Mon, Aug 19, 2024 at 9:26 AM Amogh Jahagirdar <2am...@gmail.com> wrote:
>>>>
>>>>> Hey all,
>>>>>
>>>>> > There might be an easy/light way to add this new metadata: we can persist schema_id in the DataFile. It still adds some extra size to the manifest file but should be negligible?
>>>>>
>>>>> I do think it's probably negligible in terms of size (at least relative to the value we would get out of having that field), but I think the main downside of that approach is that it's still an additional spec change which, if we go down the route of Parquet manifests, largely becomes moot/throw-away work since the type information would exist in Parquet.
>>>>>
>>>>> > And I think there's also another aspect to consider: whether the new type promotion is compatible with partition transforms. Currently all the partition transforms produce the same result for promoted types: int -> long, float -> double. If we are adding a new type promotion, current partition transforms will produce different results for type promotions such as int/long -> string, so the partition() of DataFile will not hold for promoted types. One possible way to fix that would be evolving the PartitionSpec with a new one?
>>>>>
>>>>> Yes, an important part of type promotion is validation that whatever evolution is being attempted can actually happen if the column being evolved is part of a partition transform! I was working on an implementation for this, and so far it's just a strict check that the field being promoted is not part of the partition spec. The proposed way of evolving the spec sounds plausible specifically for int/long -> string, although I'd need to double-check/think through the implications.
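A small illustration of why the same transform gives different partition values after an int/long -> string promotion, using the spec's truncate definitions (illustrative Python only, not code from this thread):

def truncate_long(v: int, width: int) -> int:
    # truncate[W] for int/long keeps v - (v % W)
    return v - (v % width)

def truncate_string(v: str, length: int) -> str:
    # truncate[L] for string keeps the first L characters
    return v[:length]

print(truncate_long(1234, 2))      # 1234
print(truncate_string("1234", 2))  # 12

The same logical value lands in a different partition before and after the promotion, which is what the spec wording above rules out and what the strict check rejects.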
>>>>> Thanks,
>>>>>
>>>>> Amogh Jahagirdar
>>>>>
>>>>> On Mon, Aug 19, 2024 at 8:43 AM Xianjin YE <xian...@apache.org> wrote:
>>>>>
>>>>>> Hey Fokko,
>>>>>>
>>>>>> > Distribute all the schemas to the executors, and we have to do the lookup and comparison there.
>>>>>>
>>>>>> I don't think this would be a problem: the schema id in the DataFile should only be used in the driver's planning phase to determine the lower/upper bounds, so no extra schema except the current one should be distributed to the executors. Even if all schemas are required, they can be retrieved from SerializableTable's lazyTable() method.
>>>>>>
>>>>>> > Not being able to prune old schemas until they are not used anymore (including all historical snapshots).
>>>>>>
>>>>>> That's indeed a problem. However, we haven't added the ability to prune unused schemas yet (which I would like to add too, after RemoveUnusedSpecs); we can consider that when implementing. BTW, I think it might be worthy as we can always rewrite old data files to use the latest schema.
>>>>>>
>>>>>> On Aug 19, 2024, at 22:19, Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>
>>>>>> Thanks Ryan for bringing this up, that's an interesting problem; let me think about this.
>>>>>>
>>>>>> > we can persist schema_id in the DataFile
>>>>>>
>>>>>> This was also my first thought. The two drawbacks are:
>>>>>>
>>>>>> - Distribute all the schemas to the executors, and we have to do the lookup and comparison there.
>>>>>> - Not being able to prune old schemas until they are not used anymore (including all historical snapshots).
>>>>>>
>>>>>>> If we are adding a new type promotion, current partition transforms will produce different results for type promotions such as int/long -> string, so the partition() of DataFile will not hold for promoted types. One possible way to fix that would be evolving the PartitionSpec with a new one?
>>>>>>
>>>>>> That's a good call! Currently, we create partition spec evaluators based on the partition-spec-id. Evolving the partition spec would fix it. When we decide to include the schema-id, we would be able to create the evaluator based on the (partition-spec-id, schema-id) tuple when evaluating the partitions.
>>>>>>
>>>>>> Kind regards,
>>>>>> Fokko
>>>>>>
>>>>>> On Mon, Aug 19, 2024 at 15:59, Xianjin YE <xian...@apache.org> wrote:
>>>>>>
>>>>>>> Thanks Ryan for bringing this up.
>>>>>>>
>>>>>>> > int and long to string
>>>>>>>
>>>>>>> Could you elaborate a bit on how we can support type promotion for `int` and `long` to `string` if the upper and lower bounds are already encoded in 4/8-byte binary? It seems that we cannot add promotions to string, as Piotr pointed out?
>>>>>>>
>>>>>>> > My rationale for not adding new information to track the bound types at the time that the data file metadata is created is that it would inflate the size of manifests and push out the timeline for getting v3 done.
>>>>>>>
>>>>>>> There might be an easy/light way to add this new metadata: we can persist schema_id in the DataFile.
>>>>>>> It still adds some extra size to the manifest file but should be negligible?
>>>>>>>
>>>>>>> And I think there's also another aspect to consider: whether the new type promotion is compatible with partition transforms. Currently all the partition transforms produce the same result for promoted types: int -> long, float -> double. If we are adding a new type promotion, current partition transforms will produce different results for type promotions such as int/long -> string, so the partition() of DataFile will not hold for promoted types. One possible way to fix that would be evolving the PartitionSpec with a new one?
>>>>>>>
>>>>>>> On Aug 17, 2024, at 07:00, Ryan Blue <b...@apache.org> wrote:
>>>>>>>
>>>>>>> I've recently been working on updating the spec for new types and type promotion cases in v3.
>>>>>>>
>>>>>>> I was talking to Micah and he pointed out an issue with type promotion: the upper and lower bounds for data file columns that are kept in Avro manifests don't have any information about the type that was used to encode the bounds.
>>>>>>>
>>>>>>> For example, when writing to a table with a float column, 4: f, the manifest's lower_bounds and upper_bounds maps will have an entry with the field ID (4) as the key and a 4-byte encoded float for the value. If column f were later promoted to double, those maps aren't changed. The way we currently detect that the type was promoted is to check the binary value and read it as a float if there are 4 bytes instead of 8. This prevents us from adding int to double type promotion because when there are 4 bytes we would not know whether the value was originally an int or a float.
>>>>>>>
>>>>>>> Several of the type promotion cases from my previous email hit this problem. Date/time types to string, int and long to string, and long to timestamp are all affected. I think the best path forward is to add fewer type promotion cases to v3 and support only these new cases:
>>>>>>>
>>>>>>> • int and long to string
>>>>>>> • date to timestamp
>>>>>>> • null/unknown to any
>>>>>>> • any to variant (if supported by the Variant spec)
>>>>>>>
>>>>>>> That list would allow us to keep using the current strategy and not add new metadata to track the type in our manifests. My rationale for not adding new information to track the bound types at the time that the data file metadata is created is that it would inflate the size of manifests and push out the timeline for getting v3 done. Many of us would like to get v3 released to get the timestamp_ns and variant types out. And if we can get at least some of the promotion cases out, that's better.
>>>>>>>
>>>>>>> To address type promotion in the long term, I think that we should consider moving to Parquet manifests. This has been suggested a few times so that we can project just the lower and upper bounds that are needed for scan planning. That would also fix type promotion because the manifest file schema would include full type information for the stats columns. Given the complexity of releasing Parquet manifests, I think it makes more sense to get a few promotion cases done now in v3 and follow up with the rest in v4.
>>>>>>> Ryan
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>
>>>> --
>>>> Ryan Blue
>>>> Databricks