Re: [Discussion] Collation Support

Alexander Löser Mon, 30 Mar 2026 13:54:21 -0700

Hi Andrei,

I'm glad you're interested. Looking forward to collaborating with you!

Thanks for all the feedback here and in the doc. I only had a quick glance, but I think you raised some good points. I'll address/respond to your comments as soon as I get the chance, hopefully tomorrow. I think you also left some comments in this mail that are not yet in the doc - I'll move those to a dedicated section at the end of the doc, so we can use the doc as a single source of truth/discussion.

> Happy to share our Delta design doc and implementation learnings in more detail.


Sure, sounds good :)

Best,
Alex

On 3/29/26 01:25, Andrei Tserakhau via dev wrote:

Hi Alexander,
This looks really interesting. We've been working on collation support in Delta and have shipped it in production for some time, so this is an area we care about a lot. If this proposal moves forward we'd be happy to collaborate on the design and implementation.
The pseudo-field approach for collation metrics is clean and composes well with existing Iceberg infrastructure. The specifier coverage is comprehensive.
A few areas worth discussing as this evolves:

1 - Sort key stability and versioning
ICU sort keys are not stable across versions, so a pinned ICU version bump in a future Iceberg release would invalidate all existing collation metrics. In multi-engine environments, requiring all engines to converge on one ICU version is unrealistic.
We store original string values instead of sort keys and allow per-file version annotations -- worth discussing whether something similar could work here.
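
As a minimal sketch of the idea above (not Delta's or Iceberg's actual format): bounds keep the original strings together with the collation version that ordered them, and a reader trusts them for file skipping only when its own version matches. The names `CollatedBounds`, `collation_version`, `can_use_for_pruning` and `might_contain` are hypothetical, the version strings are placeholders, and `str.casefold` only stands in for a real collator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollatedBounds:
    """Hypothetical per-file bounds: original strings, not sort keys."""
    lower: str
    upper: str
    collation: str          # illustrative collation name
    collation_version: str  # version of the library that ordered the values


def can_use_for_pruning(bounds, collation, reader_version):
    # Bounds are only trustworthy if the reader orders strings
    # the same way the writer did.
    return (bounds.collation == collation
            and bounds.collation_version == reader_version)


def might_contain(bounds, value, collation, reader_version, key=str.casefold):
    # `key` stands in for a real collator's sort-key function (assumption).
    if not can_use_for_pruning(bounds, collation, reader_version):
        return True  # unknown ordering: never skip the file
    return key(bounds.lower) <= key(value) <= key(bounds.upper)


bounds = CollatedBounds("apple", "Mango", "ci", "76.1")
same_version = might_contain(bounds, "zebra", "ci", "76.1")   # file can be skipped
other_version = might_contain(bounds, "zebra", "ci", "77.1")  # file must be read
```

With a mismatched version the file is simply read, so a version bump costs pruning efficiency rather than correctness.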
2 - Provider abstraction
The proposal assumes ICU as the sole provider, but Spark ships non-ICU collations like UTF8_LCASE that are widely used. A provider or namespace layer would prevent name collisions and support engine-specific collations without future spec changes.
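
To make the namespace idea concrete, here is a small sketch of a provider-qualified specifier. The `provider.name` syntax, the `icu` default, and the names `CollationRef` and `parse_collation` are all assumptions for illustration; neither the proposal nor Spark defines this syntax.

```python
from typing import NamedTuple


class CollationRef(NamedTuple):
    provider: str
    name: str


def parse_collation(spec, default_provider="icu"):
    """Split a hypothetical 'provider.name' specifier into its parts."""
    provider, sep, name = spec.partition(".")
    if not sep:
        # No namespace given: fall back to the default provider.
        return CollationRef(default_provider, spec)
    if not provider or not name:
        raise ValueError(f"malformed collation specifier: {spec!r}")
    return CollationRef(provider.lower(), name)


spark_lcase = parse_collation("spark.UTF8_LCASE")
plain_german = parse_collation("de")
```

Because the provider is part of the identity, the same collation name registered by two providers never compares equal.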
3 - Operational surface
A few things that turned out to be correctness-critical in our implementation: partition transforms on collated columns (collation-equal but byte-distinct values in different directories), sort order semantics, equality deletes under collation, and Parquet filter pushdown (must be disabled since Parquet has no collation concept).
These don't all need to be solved in v1, but it would help to scope them.
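
The partition-transform issue can be illustrated with a toy sketch. `partition_dirs` is a hypothetical helper that mimics an identity partition transform, and `str.casefold` stands in for a case-insensitive collation; a real engine would use its collator instead.

```python
from collections import defaultdict


def partition_dirs(values, key=lambda s: s):
    """Group values by partition key, as an identity transform would."""
    dirs = defaultdict(list)
    for v in values:
        dirs[key(v)].append(v)
    return dict(dirs)


values = ["Berlin", "berlin", "BERLIN", "Paris"]

byte_wise = partition_dirs(values)                      # one directory per byte-distinct value
collation_aware = partition_dirs(values, str.casefold)  # casefold as a stand-in collator
```

Under the byte-wise layout, a case-insensitive filter on `berlin` has to visit three directories, and partition pruning on the literal value alone would miss two of them.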

4 - Smaller items (nits)
UTF-8 bounds for the original field id should be "must write" not "should" -- otherwise backward compat breaks for non-aware engines. Engine fallback behavior (case-sensitive vs older ICU vs fail) could use a recommended preference order to avoid divergent results across engines. The collation specifier syntax would benefit from a formal grammar.
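
A recommended preference order could be expressed as a simple ordered lookup. The sketch below is only an illustration: the `Fallback` enum, `choose_fallback`, and in particular the order in `PREFERENCE` are hypothetical, and the thread does not settle on any order.

```python
from enum import Enum


class Fallback(Enum):
    EXACT = "exact collation and version"
    OLDER_VERSION = "same collation, different ICU version"
    BINARY = "plain UTF-8 (case-sensitive) comparison"
    FAIL = "refuse to read"


# Hypothetical order, chosen only for illustration.
PREFERENCE = [Fallback.EXACT, Fallback.OLDER_VERSION, Fallback.BINARY, Fallback.FAIL]


def choose_fallback(supported, preference=PREFERENCE):
    """Return the first option in `preference` that the engine supports."""
    for option in preference:
        if option in supported:
            return option
    return Fallback.FAIL  # nothing usable: failing is the safe default


picked = choose_fallback({Fallback.BINARY, Fallback.OLDER_VERSION})
```

If every engine walks the same list, two engines with the same capabilities produce the same behavior instead of diverging silently.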
---
Happy to share our Delta design doc and implementation learnings in more detail. Looking forward to the discussion.
Best,
Andrei
On Sat, Mar 28, 2026 at 11:49 PM Alexander Löser <[email protected]> wrote:
    Hi everyone,

    this is my first interaction with the Iceberg community, so here are
    a few words about myself:
    - I'm Alex, a Berlin-based software engineer
    - I've been working at Snowflake for 4 years now
    - I spend most of my time on data types, particularly binary,
    strings and collations.

    I'd like to start a discussion about adding collations to the
    Iceberg spec.

    Conceptually, collations are an annotation on the string data
    type. By default, most engines perform string operations
    case-sensitively.
    Collations allow specifying alternative comparison rules. This is
    useful for achieving, e.g., case- or accent-insensitive string
    operations, or language-specific string sorting.
    Collations are supported by many engines: Databricks
    <https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-collation>,
    Spark
    <https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.collate.html>,
    Snowflake <https://docs.snowflake.com/en/sql-reference/collation>,
    Oracle
    <https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/COLLATION.html>
    - to name just a few - this list is not complete.

    In Snowflake, we see heavy use of the collation feature. Several
    users have approached us, mentioning they want to migrate to
    Iceberg tables, but are currently blocked by Iceberg's lack of
    collation support.

    Given the widespread support for collations across different
    engines, I believe introducing collations to Iceberg will increase
    interoperability and boost its adoption.
    I'd be curious about your thoughts.

    *Goal of the proposal*
    - Support collation specifications for columns
    - Define how collation bounds should be stored - UTF-8 based
    bounds are not useful for collated columns

    *Required Changes*
    - Extend the schema to let (string) fields be annotated with a
    collation

    More details can be found in this doc
    <https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0#heading=h.y1ant4w2163k>.

    I'm also hoping to present the idea in the next community sync.

    Best, Alex
