Hi Andrei,
I'm glad you're interested. Looking forward to collaborating with you!
Thanks for all the feedback here and in the doc. I only had a quick
glance, but I think you raised some good points. I'll address/respond
to your comments as soon as I get the chance, hopefully tomorrow.
I think you also left some comments in this mail that are not yet in the
doc - I'll move those to a dedicated section at the end of the doc, so
we can use the doc as a single source of truth/discussion.
> Happy to share our Delta design doc and implementation learnings in
> more detail.
Sure, sounds good :)
Best,
Alex
On 3/29/26 01:25, Andrei Tserakhau via dev wrote:
Hi Alexander,
This looks really interesting. We've been working on collation support
in Delta and have shipped it in production for some time, so this is
an area we care about a lot. If this proposal moves forward we'd be
happy to collaborate on the design and implementation.
The pseudo-field approach for collation metrics is clean and composes
well with existing Iceberg infrastructure. The specifier coverage is
comprehensive.
A few areas worth discussing as this evolves:
1 - Sort key stability and versioning
ICU sort keys are not stable across versions, so a pinned ICU version
bump in a future Iceberg release would invalidate all existing
collation metrics. In multi-engine environments, requiring all engines
to converge on one ICU version is unrealistic.
We store original string values instead of sort keys and allow
per-file version annotations -- worth discussing whether something
similar could work here.
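As a rough illustration of the idea in point 1 (not Delta's actual implementation), the sketch below keeps original string values as bounds together with a per-file version annotation. The names CollatedBounds, collation_version and may_contain are hypothetical, and str.casefold stands in for a real ICU collator. On a version mismatch the reader simply declines to prune, on the assumption that the ordering may have changed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CollatedBounds:
    """Hypothetical per-file bounds holding original strings, not sort keys."""
    lower: str
    upper: str
    collation: str
    collation_version: str  # hypothetical per-file version annotation


def may_contain(bounds: CollatedBounds, value: str,
                sort_key: Callable[[str], str], reader_version: str) -> bool:
    """Return False only when the file can safely be skipped."""
    if bounds.collation_version != reader_version:
        # Assumption: ordering may differ between versions, so the stored
        # bounds are not trusted for pruning and the file must be scanned.
        return True
    # Sort keys are derived at read time with the reader's own collator.
    return sort_key(bounds.lower) <= sort_key(value) <= sort_key(bounds.upper)


# str.casefold is only a stand-in for a case-insensitive collator.
bounds = CollatedBounds("apple", "Mango", "ci", "v1")
same_version_hit = may_contain(bounds, "BANANA", str.casefold, "v1")
same_version_miss = may_contain(bounds, "zebra", str.casefold, "v1")
other_version = may_contain(bounds, "zebra", str.casefold, "v2")
```

Because the bounds are plain strings, a version bump only costs pruning on older files; nothing stored becomes unreadable.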
2 - Provider abstraction
The proposal assumes ICU as the sole provider, but Spark ships non-ICU
collations like UTF8_LCASE that are widely used. A provider or
namespace layer would prevent name collisions and support
engine-specific collations without future spec changes.
3 - Operational surface
A few things that turned out correctness-critical in our
implementation: partition transforms on collated columns
(collation-equal but byte-distinct values in different directories),
sort order semantics, equality deletes under collation, and Parquet
filter pushdown (must be disabled since Parquet has no collation
concept).
These don't all need to be solved in v1, but it would help to scope them.
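The partition issue in point 3 can be shown with a small sketch. It assumes an identity-style partition transform and uses str.casefold as a stand-in for a case-insensitive collation; the partition_paths helper and the "name=" directory layout are made up for illustration.

```python
from collections import defaultdict


def partition_paths(values, key=lambda v: v):
    """Group values into hypothetical partition directories by `key`."""
    dirs = defaultdict(list)
    for v in values:
        dirs[f"name={key(v)}"].append(v)
    return dict(dirs)


# Equal under a case-insensitive collation, but byte-distinct.
values = ["Berlin", "berlin", "BERLIN"]

# Partitioning on raw values: one directory per byte-distinct spelling.
byte_partitions = partition_paths(values)

# Partitioning on a collation-aware key: all spellings share a directory.
collated_partitions = partition_paths(values, key=str.casefold)
```

With raw values, a collation-aware filter for one spelling would have to look in all three directories to find every collation-equal row.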
4 - Smaller items (nits)
UTF-8 bounds for the original field id should be "must write" not
"should" -- otherwise backward compat breaks for non-aware engines.
Engine fallback behavior (case-sensitive vs older ICU vs fail) could
use a recommended preference order to avoid divergent results across
engines. The collation specifier syntax would benefit from a formal
grammar.
---
Happy to share our Delta design doc and implementation learnings in
more detail. Looking forward to the discussion.
Best,
Andrei
On Sat, Mar 28, 2026 at 11:49 PM Alexander Löser
<[email protected]> wrote:
Hi everyone,
this is my first interaction with the Iceberg community, so here are a
few words about myself:
- I'm Alex, a Berlin-based software engineer
- I've been working at Snowflake for 4 years now
- I spend most of my time on data types, particularly binary,
strings and collations.
I'd like to start a discussion about adding collations to the
Iceberg spec.
Conceptually, collations are an annotation on the string data
type. By default, most engines perform string operations
case-sensitively.
Collations allow specifying alternative comparison rules. This is
useful for achieving, e.g., case- or accent-insensitive string
operations, or language-specific string sorting.
Collations are supported by many engines: Databricks
<https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-collation>,
Spark
<https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.collate.html>,
Snowflake <https://docs.snowflake.com/en/sql-reference/collation>,
Oracle
<https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/COLLATION.html>,
to name just a few (this list is not complete).
In Snowflake, we see heavy use of the collation feature. Several
users have approached us, mentioning they want to migrate to
Iceberg tables, but are currently blocked by Iceberg's lack of
collation support.
Given the widespread support for collations across different
engines, I believe introducing collations to Iceberg will increase
interoperability and boost its adoption.
I'd be curious about your thoughts.
*Goal of the proposal*
- Support collation specifications for columns
- Define how collation bounds should be stored - UTF-8 based
bounds are not useful for collated columns
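To make the second goal concrete, here is a minimal sketch of why UTF-8 bounds can mislead on a collated column. It assumes a case-insensitive collation, approximated with str.casefold; the helper names are hypothetical and not part of any Iceberg API.

```python
def utf8_bounds(values):
    """Lower/upper bounds by UTF-8 byte order."""
    encoded = [v.encode("utf-8") for v in values]
    return min(encoded), max(encoded)


def collated_bounds(values, key=str.casefold):
    """Lower/upper bounds by a collation key (casefold as a stand-in)."""
    keys = [key(v) for v in values]
    return min(keys), max(keys)


file_values = ["apple", "banana"]
probe = "BANANA"  # equals "banana" under a case-insensitive collation

# Byte order puts uppercase before lowercase, so the probe falls below
# the lower bound and the file would be skipped despite holding a match.
lo, hi = utf8_bounds(file_values)
pruned_by_utf8 = not (lo <= probe.encode("utf-8") <= hi)

# Collation-aware bounds keep the file in scope.
clo, chi = collated_bounds(file_values)
pruned_by_collation = not (clo <= probe.casefold() <= chi)
```

An engine that pruned on the UTF-8 bounds here would return no rows for a predicate that should match "banana".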
*Required Changes*
- Extend the schema to let (string) fields be annotated with a
collation
More details can be found in this doc
<https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0#heading=h.y1ant4w2163k>.
I'm also hoping to present the idea in the next community sync.
Best, Alex