Thanks, Micah!

On Thu, Nov 27, 2025 at 4:48 AM Micah Kornfield <[email protected]>
wrote:

> Hi Joana,
> Here are my thoughts, which are by no means the definitive answer here.
>
>
>> 1. Given that variant can store any data type (both structured and
>> primitive), I'm unclear when unknown would be preferred as similar
>> behavior could be achieved by adding nullable variant columns? It seems
>> like variant could handle most schema evolution scenarios. Are there
>> specific situations where unknown is the better choice?
>
>
> I think the point of the type is to not impose on a system the need to
> use a nullable variant column if it can't infer the type. The variant
> type has more overhead and can't easily be narrowed solely based on a
> metadata operation to other types (but a NullType can easily be widened to
> any type as a metadata operation).
>
> The null type is generally meant for moving from schema-less systems to
> ones with a schema, e.g., a CSV file that has an empty value for every
> field in a particular column.  I think Parquet's description of its
> analogous type [1] is a good illustration:
>
> "Sometimes when discovering the schema of existing data, values are always
> null and the physical type can't be determined. This annotation signals the
> case where the physical type was guessed from all null values."
>
> That being said, I don't think it is necessarily a bad idea if a system
> wants to use nullable variants for this use case.
>
>> 2. Also, is unknown intended for explicit use in DDL? Meaning, should
>> users write DDL like:
>
>
> In general, I don't think there is much of a use-case for allowing users
> to set this through DDL, other than perhaps cloning it from an existing
> table. As you pointed out, someone wishing to keep their options open is
> likely better off using variant, or a type that can be widened later.
>
> There are probably multiple ways of handling evolution, but here are two
> workable alternatives (I don't think these belong in the Iceberg spec):
> 1.  Automatically evolve the schema based on the first inserted non-null
> value for the column.
> 2.  Block insertions that try to insert non-null values in the column
> until the user explicitly alters the column to a specific type.
>
> Cheers,
> Micah
>
> [1]
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L330
>
> On Tue, Nov 18, 2025 at 4:45 AM Joana Hrotkó
> <[email protected]> wrote:
>
>> Hi Iceberg Community,
>>
>> I'm working with Iceberg v3 and trying to understand the practical use
>> cases for the unknown type, especially in relation to the variant type.
>>
>> The variant type handles both semi-structured data (JSON, nested
>> objects/arrays) and primitive types (strings, integers, booleans, dates,
>> timestamps, etc.) with efficient binary encoding. It supports schema
>> evolution and provides good query performance.
>>
>> The unknown type is described as being for "evolving schemas without
>> forcing immediate resolution" and must always default to null.
>>
>> 1. Given that variant can store any data type (both structured and
>> primitive), I'm unclear when unknown would be preferred as similar
>> behavior could be achieved by adding nullable variant columns? It seems
>> like variant could handle most schema evolution scenarios. Are there
>> specific situations where unknown is the better choice?
>>
>> 2. Also, is unknown intended for explicit use in DDL? Meaning, should
>> users write DDL like:
>>
>> CREATE TABLE foo (col1 unknown)
>> ALTER TABLE foo ADD COLUMN col2 unknown
>>
>> Or is unknown an internal type that engines use automatically during
>> schema evolution?
>>
>> Cheers,
>>
>> Joana Hrotkó
>>
>
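
For anyone following along, the two evolution alternatives Micah lists could be sketched roughly as below. This is only a toy in-memory model to make the difference concrete: `Table`, `infer_type`, `alter_column_type`, and the `auto_evolve` flag are hypothetical names, not Iceberg or engine APIs, and the type mapping is an assumption for illustration.

```python
from dataclasses import dataclass, field

# Toy model only: every name here is hypothetical, not an Iceberg API.
UNKNOWN = "unknown"


def infer_type(value):
    """Map a Python value to a toy column type name (assumed mapping)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"no toy type for {type(value).__name__}")


@dataclass
class Table:
    schema: dict       # column name -> type name
    auto_evolve: bool  # True: alternative 1, False: alternative 2
    rows: list = field(default_factory=list)

    def insert(self, row):
        # First pass: find unknown columns that receive a non-null value.
        pending = {}
        for col, value in row.items():
            if self.schema[col] == UNKNOWN and value is not None:
                if not self.auto_evolve:
                    # Alternative 2: reject until the user picks a type.
                    raise ValueError(
                        f"column {col!r} is unknown; alter it before "
                        "inserting non-null values"
                    )
                # Alternative 1: take the type of the first non-null value.
                pending[col] = infer_type(value)
        # Second pass: apply schema changes only once the whole row is valid.
        # No stored data is rewritten, because every earlier value is null.
        self.schema.update(pending)
        self.rows.append(row)

    def alter_column_type(self, col, new_type):
        """Explicit user-driven change from unknown to a concrete type."""
        if self.schema[col] != UNKNOWN:
            raise ValueError(f"column {col!r} already has a concrete type")
        self.schema[col] = new_type
```

With `auto_evolve=True`, the first non-null insert decides the column type; with `auto_evolve=False`, the same insert fails until `alter_column_type` is called. The sketch does not validate values for columns that already have a concrete type.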
