Thanks, Micah!

On Thu, Nov 27, 2025 at 4:48 AM Micah Kornfield <[email protected]> wrote:

> Hi Joana,
> Here are my thoughts, which are by no means the definitive answer here.
>
>> 1. Given that variant can store any data type (both structured and
>> primitive), I'm unclear when unknown would be preferred as similar
>> behavior could be achieved by adding nullable variant columns? It seems
>> like variant could handle most schema evolution scenarios. Are there
>> specific situations where unknown is the better choice?
>
> I think the point of the type is to not impose on a system the need to
> use a nullable variant column if it can't infer the type. The variant
> type has more overhead and can't easily be narrowed solely based on a
> metadata operation to other types (but a NullType can easily be widened to
> any type as a metadata operation).
>
> The null type is generally meant for moving from schema-less systems to
> ones with a schema, e.g., a CSV file that has an empty value for every
> field in a particular column. I think Parquet's description of its
> analogous type [1] is a good illustration:
>
> "Sometimes when discovering the schema of existing data, values are always
> null and the physical type can't be determined. This annotation signals the
> case where the physical type was guessed from all null values."
>
> That being said, I don't think it is necessarily a bad idea if a system
> wants to use nullable variants for this use case.
>
>> 2. Also, is unknown intended for explicit use in DDL? Meaning, should
>> users write DDL like:
>
> In general, I don't think there is much of a use case for allowing users
> to set this through DDL, other than perhaps cloning it from an existing
> table. As you pointed out, someone wishing to keep their options open is
> likely better off using variant, or a type that can be widened later.
>
> There are probably multiple ways of handling evolution, but two possible
> workable alternatives (I don't think these belong in the Iceberg spec):
> 1. Automatically evolve the schema based on the first inserted non-null
> value for the column.
> 2. Block insertions that try to insert a non-null value in the column
> until the user explicitly alters the column to a specific type.
>
> Cheers,
> Micah
>
> [1]
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L330
>
> On Tue, Nov 18, 2025 at 4:45 AM Joana Hrotkó
> <[email protected]> wrote:
>
>> Hi Iceberg Community,
>>
>> I'm working with Iceberg v3 and trying to understand the practical use
>> cases for the unknown type, especially in relation to the variant type.
>>
>> The variant type handles both semi-structured data (JSON, nested
>> objects/arrays) and primitive types (strings, integers, booleans, dates,
>> timestamps, etc.) with efficient binary encoding. It supports schema
>> evolution and provides good query performance.
>>
>> The unknown type is described as being for "evolving schemas without
>> forcing immediate resolution" and must always default to null.
>>
>> 1. Given that variant can store any data type (both structured and
>> primitive), I'm unclear when unknown would be preferred as similar
>> behavior could be achieved by adding nullable variant columns? It seems
>> like variant could handle most schema evolution scenarios. Are there
>> specific situations where unknown is the better choice?
>>
>> 2. Also, is unknown intended for explicit use in DDL? Meaning, should
>> users write DDL like:
>>
>> CREATE TABLE foo (col1 unknown)
>> ALTER TABLE foo ADD COLUMN col2 unknown
>>
>> Or is unknown an internal type that engines use automatically during
>> schema evolution?
>>
>> Cheers,
>>
>> Joana Hrotkó
>>
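
For illustration only, the two evolution alternatives listed in the thread above (evolve automatically on the first non-null value, or block non-null inserts until an explicit alter) could be sketched roughly as follows. This is a toy in-memory model, not Iceberg's or any engine's API: `Table`, `infer_type`, the `policy` flag, and the type-name strings are all hypothetical names made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

UNKNOWN = "unknown"  # stand-in for the unknown type; hypothetical constant


def infer_type(value: Any) -> str:
    """Map a Python value to an illustrative type name (assumed mapping)."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"no type mapping for {type(value).__name__}")


@dataclass
class Table:
    schema: Dict[str, str]   # column name -> type name
    policy: str = "auto"     # "auto" (alternative 1) or "block" (alternative 2)
    rows: List[Dict[str, Any]] = field(default_factory=list)

    def alter_column_type(self, name: str, new_type: str) -> None:
        # Widening unknown to a concrete type touches only the schema;
        # existing rows (all null in that column) are left as they are.
        if self.schema[name] != UNKNOWN:
            raise ValueError(f"column {name!r} already has a concrete type")
        self.schema[name] = new_type

    def insert(self, row: Dict[str, Any]) -> None:
        # Only the unknown-column handling is modeled; values in columns
        # with a concrete type are not validated in this sketch.
        for name, value in row.items():
            if self.schema[name] == UNKNOWN and value is not None:
                if self.policy == "block":
                    raise ValueError(
                        f"column {name!r} is unknown; alter it to a "
                        "specific type before inserting non-null values"
                    )
                self.alter_column_type(name, infer_type(value))
        self.rows.append(row)
```

With `policy="auto"`, inserting `{"col1": "x"}` into a table whose `col1` is unknown changes the schema entry to `"string"`; with `policy="block"`, the same insert raises until `alter_column_type` has been called for that column. Null inserts succeed under both policies.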
