Hi Andrei,

Thanks for putting together the spec PR and the detailed write-up! The approach mostly looks solid to me. I have a few questions/initial thoughts regarding the changes you proposed (compared to the original proposal):

> 1 - Bounds store original values, not sort keys, tagged with a per-file collation version. ICU/CLDR sort keys aren't stable across versions, so storing keys ties every reader to one exact version; original values plus a per-file version (readers prune only on an exact match) degrade gracefully instead of breaking. The schema keeps the collation name unversioned so anyone can read.

If I understand correctly, we’re talking about two separate things here:

1. Tagging a column vs a single file with a certain ICU version
2. Using collation keys vs original strings (the ones that will produce
   the min/max collation keys)


For 1, it comes down to a tradeoff:

 * If we tag the column with the ICU version, we’d force engines to
   support one agreed-on ICU version if they want to prune files.
   Engines would be able to prune every file (if they support the
   specific ICU version), or none at all, so there is more incentive to
   support a specific version.
 * If we tag individual files with ICU versions, we gain the big
   advantage that engines do not need to agree on a single ICU version.
   However, if I understand correctly, this comes at the cost of
   “fractured” pruning - engines will only be able to prune files that
   were written by themselves (or rather, with the same ICU version).
   As a consequence, performance might not really be interoperable
   between different engines.

Regardless of the approach we choose, all engines should be able to read the data - they might just not be able to prune files.
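
To make the per-file variant concrete, here is a minimal sketch of the reader-side check as I understand it. Everything in it is an assumption for illustration: `CollationBounds`, `can_skip_file`, and the version strings are hypothetical names, not taken from the spec PR, and `str.casefold` only stands in for a real ICU collator.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CollationBounds:
    """Hypothetical per-file bounds: original strings plus the version that ordered them."""
    lower: str
    upper: str
    collation_version: str  # the collation version the writer used (illustrative values below)


def can_skip_file(
    bounds: Optional[CollationBounds],
    literal: str,
    reader_version: str,
    sort_key: Callable[[str], object],
) -> bool:
    """Return True only if the file provably holds no row equal to `literal`."""
    if bounds is None:
        return False  # no collation bounds: the file must be read
    if bounds.collation_version != reader_version:
        return False  # different version: the bounds may not hold, so keep the file
    key = sort_key(literal)
    return key < sort_key(bounds.lower) or key > sort_key(bounds.upper)


# Stand-in for a real collator: str.casefold gives a case-insensitive ordering.
written_by_a = CollationBounds(lower="apple", upper="Mango", collation_version="76.1")
written_by_b = CollationBounds(lower="apple", upper="Mango", collation_version="77.1")

print(can_skip_file(written_by_a, "ZEBRA", "76.1", str.casefold))   # True
print(can_skip_file(written_by_b, "ZEBRA", "76.1", str.casefold))   # False
print(can_skip_file(written_by_a, "Banana", "76.1", str.casefold))  # False
```

The second call is the "fractured" case: the data is readable, but the file written with another version can never be skipped.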

For 2, I’m not sure if I understand the advantages of original strings yet. As you already pointed out, the collation keys depend on the ICU version. However, if I understand correctly, the same limitation would apply to the original strings: the sort order may (and does) change between different ICU versions, too. As a consequence, we can’t assume the original lower/upper bound strings we stored for version X will also be lower/upper bounds for version Y - at least in the general case. So if I understand correctly, we would not gain additional pruning opportunities compared to using collation keys. Or am I missing something here?

At the same time, sort keys do have advantages:

 * Iceberg allows the truncation of upper and lower bounds. This is
   trivial for binary collation keys. For original strings, the task
   becomes significantly harder: truncating at a character boundary,
   for example, would lead to wrong results, as there are some
   context-sensitive sequences: e.g., with the CLDR root locale, abcเก
   < abcเ. I think it might be doable with ICU’s
   CollationElementIterator, but it will be tricky to get right.
 * Lower/upper bounds are computed once, but will be compared many
   times. With original strings, we would need to either convert to the
   collation key on the fly, or use ICU’s collator for a direct
   comparison. Both options will be slower than a raw byte-sequence
   comparison.
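
For illustration, a minimal sketch of why truncation is simple on binary keys. The function names are hypothetical, and this is the usual prefix/increment trick under unsigned byte-wise comparison, not code taken from any Iceberg implementation.

```python
from typing import Optional


def truncate_lower(key: bytes, length: int) -> bytes:
    """A prefix of a byte string never sorts after it, so a prefix is a valid lower bound."""
    return key[:length]


def truncate_upper(key: bytes, length: int) -> Optional[bytes]:
    """Truncate, then increment the last byte that is not 0xFF to stay above the original."""
    if len(key) <= length:
        return key  # nothing to truncate
    prefix = bytearray(key[:length])
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()  # 0xFF cannot be incremented; drop it and carry into the previous byte
    if not prefix:
        return None  # no shorter upper bound exists
    prefix[-1] += 1
    return bytes(prefix)


print(truncate_lower(b"abcdef", 3))        # b'abc'
print(truncate_upper(b"abcdef", 3))        # b'abd'
print(truncate_upper(b"ab\xff\xff", 3))    # b'ac'
print(truncate_upper(b"\xff\xff\xff", 2))  # None
```

Both results can be compared against other keys with a plain byte comparison; no collator is involved at any point.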


There is one scenario where original values would shine, though. I analyzed the order changes between various ICU versions: in many cases, only a small range of code points changes or is moved. If we had additional metadata about which code point ranges a file contains (e.g., whether it is ASCII only), engines might be able to prove that the original string bounds for version X are still valid for version Y. If I'm not mistaken, this could allow pruning across different ICU versions in certain situations, which I’d consider a point in favor of original values (and file-level ICU versions).
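
A rough sketch of that check, under loud assumptions: the per-file code point ranges and the list of ranges whose order changed between two versions are both hypothetical metadata, the example range is a placeholder and not real ICU data, and whether non-overlap is actually sufficient (contractions, tailorings) would still need to be verified.

```python
from typing import Iterable, List, Tuple

CodePointRange = Tuple[int, int]  # inclusive (first, last)


def code_point_ranges(values: Iterable[str]) -> List[CodePointRange]:
    """Writer side: summarize which code points occur, here as one coarse min/max range."""
    points = [ord(ch) for value in values for ch in value]
    return [(min(points), max(points))] if points else []


def bounds_survive_upgrade(
    file_ranges: List[CodePointRange],
    changed_ranges: List[CodePointRange],
) -> bool:
    """True if no code point range of the file overlaps a range whose order changed."""
    return not any(
        f_lo <= c_hi and c_lo <= f_hi
        for f_lo, f_hi in file_ranges
        for c_lo, c_hi in changed_ranges
    )


# Placeholder only: pretend the order changed somewhere in U+0E00..U+0E7F.
changed = [(0x0E00, 0x0E7F)]

ascii_file = code_point_ranges(["apple", "Mango"])
thai_file = code_point_ranges(["abcเก"])

print(bounds_survive_upgrade(ascii_file, changed))  # True
print(bounds_survive_upgrade(thai_file, changed))   # False
```

A single min/max range is deliberately coarse; a real design would have to decide how fine-grained this metadata should be.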



> 2 - A provider-qualified identifier (icu.en_US-ci), leaving room for non-ICU collations like Spark's UTF8_LCASE, rather than assuming ICU as the sole provider.

Adding a provider mechanism sounds like a good approach to keep the spec open for future collations :) I wonder whether we should restrict the set of allowed providers, though, similar to how it was done with geo <https://lists.apache.org/thread/r5x0do8f241bpf565rx8s5s3wc9ogp0f>. My main motivation for this proposal is interoperability. I worry that interoperability might suffer or vanish completely if every engine can come up with its own definitions.
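
As a sketch of what a restricted provider set could look like on the reader side: the `<provider>.<name>` split follows the icu.en_US-ci example, while the allow-list contents, `CollationId`, and `parse_collation` are hypothetical and only meant for illustration.

```python
from typing import NamedTuple


class CollationId(NamedTuple):
    provider: str
    name: str


ALLOWED_PROVIDERS = frozenset({"icu"})  # hypothetical allow-list a spec might define


def parse_collation(identifier: str, allowed=ALLOWED_PROVIDERS) -> CollationId:
    """Split '<provider>.<name>' at the first dot and reject unknown providers."""
    provider, sep, name = identifier.partition(".")
    if not sep or not provider or not name:
        raise ValueError(f"expected '<provider>.<name>', got {identifier!r}")
    if provider not in allowed:
        raise ValueError(f"unknown collation provider {provider!r}")
    return CollationId(provider, name)


print(parse_collation("icu.en_US-ci"))  # CollationId(provider='icu', name='en_US-ci')
```

With a closed set, an identifier from an unlisted provider fails loudly at parse time instead of being silently interpreted differently by each engine.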



Happy to hear your thoughts on this!

Best, Alex


On 6/27/26 01:37, Szehon Ho wrote:
Very nice direction, left some comments on the spec proposal.

Thanks to you folks for working on it!
Szehon

On Fri, Jun 26, 2026 at 3:29 AM Andrei Tserakhau via dev <[email protected]> wrote:

    Hi all,

    I've spent some cycles on the collation discussion and made
    something more concrete to react to: a spec-change PR plus
    reference implementations (Go and Java).

    - Spec change (apache/iceberg#16972): a "collation" annotation on
    string fields, and a data_file.collation_bounds field so collated
    columns stay prunable.
    - Reference implementation in iceberg-go (apache/iceberg-go#1318):
    the full path end to end - schema annotation, collation-aware
    comparison (CLDR/UCA), collation bounds in the manifest, and
    version-gated data-file pruning, with an Avro round-trip and
    pruning tests.
    - A lightweight Java POC (link below): the schema annotation plus
    a Collator-backed comparator, to match where the discussion is. I
    deliberately left the manifest/bounds side out of Java for now.

    The design follows the original proposal but takes a few different
    turns, mostly to adopt what we learned in Delta. The ones I'd most
    like input on:

    1 - Bounds store original values, not sort keys, tagged with a
    per-file collation version. ICU/CLDR sort keys aren't stable
    across versions, so storing keys ties every reader to one exact
    version; original values plus a per-file version (readers prune
    only on an exact match) degrade gracefully instead of breaking.
    The schema keeps the collation name unversioned so anyone can read.

    2 - A provider-qualified identifier (icu.en_US-ci), leaving room
    for non-ICU collations like Spark's UTF8_LCASE, rather than
    assuming ICU as the sole provider.

    3 - One structural question I don't have a strong opinion on yet:
    I put collation_bounds on data_file as a standalone v3 field, but
    field id 146 is already the v4 content_stats struct, and collation
    bounds might belong inside that typed-stats framework instead.
    Worth settling before we fix field ids.

    The full set of differences and the reader/writer rules are in the
    PR description and the write-up. Comments very welcome — both on
    the calls above and on whether the standalone-field vs
    content_stats direction is the right one.

    Best, Andrei

    - original proposal:
    https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0
    - spec change: https://github.com/apache/iceberg/pull/16972
    - POC in go: https://github.com/apache/iceberg-go/pull/1318
    - java POC:
    https://github.com/laskoviymishka/iceberg/tree/prototype/collation-support

    On Mon, Mar 30, 2026 at 10:54 PM Alexander Löser
    <[email protected]> wrote:

        Hi Andrei,

        I'm glad you're interested. Looking forward to collaborating
        with you!
        Thanks for all the feedback here and in the doc. I only had a
        quick glance, but I think you raised some good points. I'll
        address/respond to your comments as soon as I get the chance,
        hopefully tomorrow.
        I think you also left some comments in this mail that are not
        yet in the doc - I'll move those to a dedicated section at the
        end of the doc, so we can use the doc as a single source of
        truth/discussion.

        > Happy to share our Delta design doc and implementation
        learnings in more detail.

        Sure, sounds good :)

        Best,
        Alex

        On 3/29/26 01:25, Andrei Tserakhau via dev wrote:
        Hi Alexander,

        This looks really interesting. We've been working on
        collation support in Delta and have shipped it in production
        for some time, so this is an area we care about a lot. If
        this proposal moves forward we'd be happy to collaborate on
        the design and implementation.

        The pseudo-field approach for collation metrics is clean and
        composes well with existing Iceberg infrastructure. The
        specifier coverage is comprehensive.

        A few areas worth discussing as this evolves:

        1 - Sort key stability and versioning

        ICU sort keys are not stable across versions, so a pinned ICU
        version bump in a future Iceberg release would invalidate all
        existing collation metrics. In multi-engine environments,
        requiring all engines to converge on one ICU version is
        unrealistic.

        We store original string values instead of sort keys and
        allow per-file version annotations -- worth discussing
        whether something similar could work here.

        2 - Provider abstraction

        The proposal assumes ICU as the sole provider, but Spark
        ships non-ICU collations like UTF8_LCASE that are widely
        used. A provider or namespace layer would prevent name
        collisions and support engine-specific collations without
        future spec changes.

        3 - Operational surface

        A few things that turned out correctness-critical in our
        implementation: partition transforms on collated columns
        (collation-equal but byte-distinct values in different
        directories), sort order semantics, equality deletes under
        collation, and Parquet filter pushdown (must be disabled
        since Parquet has no collation concept).

        These don't all need to be solved in v1, but it would help to
        scope them.

        4 - Smaller items (nits)

        UTF-8 bounds for the original field id should be "must write"
        not "should" -- otherwise backward compat breaks for
        non-aware engines. Engine fallback behavior (case-sensitive
        vs older ICU vs fail) could use a recommended preference
        order to avoid divergent results across engines. The
        collation specifier syntax would benefit from a formal grammar.

        ---

        Happy to share our Delta design doc and implementation
        learnings in more detail. Looking forward to the discussion.

        Best,
        Andrei

        On Sat, Mar 28, 2026 at 11:49 PM Alexander Löser
        <[email protected]> wrote:

            Hi everyone,

            this is my first interaction with the Iceberg community,
            so here are a few words about myself:
            - I'm Alex, a Berlin-based software engineer
            - I've been working at Snowflake for 4 years now
            - I spend most of my time on data types, particularly
            binary, strings and collations.

            I'd like to start a discussion about adding collations to
            the Iceberg spec.

            Conceptually, collations are an annotation on the string
            data type. By default, most engines perform string
            operations case-sensitively.
            Collations allow specifying alternative comparison rules.
            This is useful for achieving, e.g., case- or
            accent-insensitive string operations, or
            language-specific string sorting.
            Collations are supported by many engines: Databricks
            <https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-collation>,
            Spark
            <https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.collate.html>,
            Snowflake
            <https://docs.snowflake.com/en/sql-reference/collation>,
            Oracle
            <https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/COLLATION.html>
            - to name just a few - this list is not complete.

            In Snowflake, we see heavy use of the collation feature.
            Several users have approached us, mentioning they want to
            migrate to Iceberg tables, but are currently blocked by
            Iceberg's lack of collation support.

            Given the widespread support for collations across
            different engines, I believe introducing collations to
            Iceberg will increase interoperability and boost its
            adoption.
            I'd be curious about your thoughts.

            *Goal of the proposal*
            - Support collation specifications for columns
            - Define how collation bounds should be stored - UTF-8
            based bounds are not useful for collated columns

            *Required Changes*
            - Extend the schema to let (string) fields be annotated
            with a collation

            More details can be found in this doc
            <https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0#heading=h.y1ant4w2163k>.

            I'm also hoping to present the idea in the next community
            sync.

            Best, Alex
