Re: [Discussion] Collation Support

Alexander Löser Thu, 02 Jul 2026 15:22:38 -0700

Hi Andrei,

Thanks for putting together the spec PR and the detailed write-up! The approach mostly looks solid to me. I have a few questions/initial thoughts regarding the changes you proposed (compared to the original proposal):

> 1 - Bounds store original values, not sort keys, tagged with a per-file collation version. ICU/CLDR sort keys aren't stable across versions, so storing keys ties every reader to one exact version; original values plus a per-file version (readers prune only on an exact match) degrade gracefully instead of breaking. The schema keeps the collation name unversioned so anyone can read.


If I understand correctly, we’re talking about two separate things here:

1. Tagging a column vs a single file with a certain ICU version
2. Using collation keys vs original strings (the ones that will produce
   the min/max collation keys)


For 1, it comes down to a tradeoff:

 * If we tag the column with the ICU version, we’d force engines to
   support one agreed-on ICU version if they want to prune files.
   Engines would be able to prune every file (if they support the
   specific ICU version), or none at all, so there is more incentive to
   support a specific version
 * If we tag individual files with ICU versions, we gain the big
   advantage that engines do not need to agree on a single ICU version.
   However, if I understand correctly, this comes at the cost of
   “fractured” pruning - engines will only be able to prune files that
   were written by themselves (or rather, with the same ICU version).
   As a consequence, performance might not really be interoperable
   between different engines.

Regardless of the approach we choose, all engines should be able to read the data - they might just not be able to prune files.
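
To make the file-level option concrete, here is a minimal Python sketch of version-gated pruning. The `FileBounds` record, its field names, and the version strings "v1"/"v2" are made up for illustration, and `str.casefold` merely stands in for a real case-insensitive collation key; none of this is taken from the spec PR.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FileBounds:
    """Hypothetical per-file collation bounds (not the spec's field layout)."""
    path: str
    collation_version: str  # version the writer used to compute the bounds
    lower: str
    upper: str


def can_skip(f: FileBounds, reader_version: str,
             key: Callable[[str], str], lookup: str) -> bool:
    """Return True only if the file provably contains no value equal to lookup."""
    if f.collation_version != reader_version:
        # The bounds were computed under a different ordering: read the file.
        return False
    k = key(lookup)
    return k < key(f.lower) or k > key(f.upper)


files = [
    FileBounds("a.parquet", "v1", "Apple", "Mango"),
    FileBounds("b.parquet", "v2", "Apple", "Mango"),
]
# A reader on "v2" looking for "zebra" can skip only the file it can interpret.
skipped = [f.path for f in files if can_skip(f, "v2", str.casefold, "zebra")]
print(skipped)  # ['b.parquet']
```

Both files are still readable by the "v2" engine; the "v1" file is simply scanned instead of skipped, which is the "fractured" pruning described above.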

For 2, I’m not sure if I understand the advantages of original strings yet. As you already pointed out, the collation keys depend on the ICU version. However, if I understand correctly, the same limitation would apply to the original strings: the sort order may (and does) change between different ICU versions, too. As a consequence, we can’t assume the original lower/upper bound strings we stored for version X will also be lower/upper bounds for version Y - at least in the general case. So if I understand correctly, we would not gain additional pruning opportunities compared to using collation keys. Or am I missing something here?
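
A toy Python illustration of that concern, assuming two invented orderings `ORDER_X` and `ORDER_Y` that differ only in where "c" sorts (these are not real ICU orderings): bounds chosen as original strings under X stop covering the file's values once they are interpreted under Y.

```python
# Two hypothetical collation versions that disagree on where "c" sorts.
ORDER_X = {"a": 0, "b": 1, "c": 2}
ORDER_Y = {"c": 0, "a": 1, "b": 2}  # "c" moved in front of "a"


def bounds(values, order):
    """Pick the original values with the smallest and largest rank."""
    return min(values, key=order.__getitem__), max(values, key=order.__getitem__)


def covers(lo, hi, value, order):
    """True if value lies inside [lo, hi] under the given ordering."""
    return order[lo] <= order[value] <= order[hi]


file_values = ["a", "b", "c"]
lo, hi = bounds(file_values, ORDER_X)  # bounds written with version X

ok_under_x = all(covers(lo, hi, v, ORDER_X) for v in file_values)
ok_under_y = all(covers(lo, hi, v, ORDER_Y) for v in file_values)
print(lo, hi, ok_under_x, ok_under_y)  # a c True False
```

A reader on Y that trusted these bounds would wrongly skip the file when looking for "b".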


At the same time, sort keys do have advantages:

 * Iceberg allows the truncation of upper- and lower bounds. This is
   trivial for binary collation keys. For original strings, the task
   becomes significantly harder: truncating at a character boundary,
   for example, would lead to wrong results, as there are some
   context-sensitive sequences: e.g., with the CLDR root locale, abcเก
   < abcเ. I think it might be doable with ICU’s
   CollationElementIterator, but it will be tricky to get right.
 * Lower/upper bounds are computed once, but will be compared many
   times. With original strings, we would need to either convert to the
   collation key on the fly, or use ICU’s collator for a direct
   comparison. Both options will be slower than a raw byte-sequence
   comparison
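
A short Python sketch of the truncation point, assuming sort keys compare as unsigned byte strings. The helper names are mine and the byte values are arbitrary placeholders, not real ICU sort keys.

```python
from typing import Optional


def truncate_lower(key: bytes, length: int) -> bytes:
    """A prefix never sorts after the full key, so it is a valid lower bound."""
    return key[:length]


def truncate_upper(key: bytes, length: int) -> Optional[bytes]:
    """Shorten key to at most `length` bytes while staying >= the full key."""
    if len(key) <= length:
        return key
    prefix = bytearray(key[:length])
    # 0xFF cannot be incremented: drop it and bump the byte before it.
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()
    if not prefix:
        return None  # no finite upper bound fits in `length` bytes
    prefix[-1] += 1
    return bytes(prefix)


key = bytes([0x2A, 0xFF, 0x31, 0x07])
lower = truncate_lower(key, 2)
upper = truncate_upper(key, 2)
print(lower.hex(), upper.hex())  # 2aff 2b
```

Checking a lookup key against such bounds is then a plain `lower <= k <= upper` on bytes, with no collator involved at read time.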

There is one scenario where original values would shine, though. I analyzed the order-changes between various ICU versions: in many cases, only a small range of code points changes/is moved. If we had additional metadata about which code point ranges a file contains (e.g., whether it is ASCII only), engines might be able to prove that the original string bounds for version X are still valid for version Y. If I'm not mistaken, this could allow pruning across different ICU versions in certain situations, which I’d consider a point in favor of original values (and file-level ICU versions).
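
Roughly what I have in mind, as a Python sketch. The `CHANGED_X_TO_Y` list is a placeholder I invented, not derived from actual ICU release data, and the per-file summary is deliberately coarse (a single min/max range). Context-sensitive sequences would likely need extra care beyond this check.

```python
from typing import Iterable, List, Tuple

Range = Tuple[int, int]  # inclusive code point range

# Placeholder: code point ranges whose ordering differs between X and Y.
CHANGED_X_TO_Y: List[Range] = [(0x0E00, 0x0E7F)]


def code_point_ranges(values: Iterable[str]) -> List[Range]:
    """Summarise the code points in a file as one coarse [min, max] range."""
    points = [ord(ch) for v in values for ch in v]
    return [(min(points), max(points))] if points else []


def bounds_survive(file_ranges: List[Range], changed: List[Range]) -> bool:
    """True if no code point in the file falls in a range that was reordered."""
    return not any(f_lo <= c_hi and c_lo <= f_hi
                   for f_lo, f_hi in file_ranges
                   for c_lo, c_hi in changed)


ascii_file = code_point_ranges(["apple", "Mango"])
thai_file = code_point_ranges(["abcเก"])
print(bounds_survive(ascii_file, CHANGED_X_TO_Y))  # True
print(bounds_survive(thai_file, CHANGED_X_TO_Y))   # False
```

The check is conservative: a single min/max range can report an overlap even when the file contains no affected code point, which only costs pruning, not correctness.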

> 2 - A provider-qualified identifier (icu.en_US-ci), leaving room for non-ICU collations like Spark's UTF8_LCASE, rather than assuming ICU as the sole provider.

Adding a provider-mechanism sounds like a good approach to keep the spec open for future collations :) I wonder whether we should restrict the set of allowed providers, though, similar to how it was done with geo <https://lists.apache.org/thread/r5x0do8f241bpf565rx8s5s3wc9ogp0f>. My main motivation for this proposal is interoperability. I worry that interoperability might suffer or vanish completely if every engine can come up with their own definitions.
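
As a sketch of what a restricted provider set could look like in Python: the allowlist contents and the "provider, then the first dot, then the name" split are my assumptions for illustration, not something the spec PR defines.

```python
from typing import Tuple

# Hypothetical allowlist; the actual set would be fixed by the spec.
ALLOWED_PROVIDERS = frozenset({"icu", "spark"})


def parse_collation(identifier: str) -> Tuple[str, str]:
    """Split '<provider>.<name>' and reject providers outside the allowlist."""
    provider, sep, name = identifier.partition(".")
    if not (sep and provider and name):
        raise ValueError(f"expected '<provider>.<name>', got {identifier!r}")
    if provider not in ALLOWED_PROVIDERS:
        raise ValueError(f"unknown collation provider {provider!r}")
    return provider, name


print(parse_collation("icu.en_US-ci"))  # ('icu', 'en_US-ci')
```

An engine that meets an unknown provider fails at schema-parsing time instead of silently comparing with different rules.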




Happy to hear your thoughts on this!

Best, Alex


On 6/27/26 01:37, Szehon Ho wrote:

Very nice direction, left some comments on the spec proposal.

Thanks to you folks for working on it !
Szehon

On Fri, Jun 26, 2026 at 3:29 AM Andrei Tserakhau via dev<[email protected]> wrote:

Hi all,

I've spent some cycles on the collation discussion and made
something more concrete to react to: a spec-change PR plus
reference implementations (Go and Java).

- Spec change (apache/iceberg#16972): a "collation" annotation on
string fields, and a data_file.collation_bounds field so collated
columns stay prunable.
- Reference implementation in iceberg-go (apache/iceberg-go#1318):
the full path end to end - schema annotation, collation-aware
comparison (CLDR/UCA), collation bounds in the manifest, and
version-gated data-file pruning, with an Avro round-trip and
pruning tests.
- A lightweight Java POC (link below): the schema annotation plus
a Collator-backed comparator, to match where the discussion is. I
deliberately left the manifest/bounds side out of Java for now.

The design follows the original proposal but takes a few different
turns, mostly to adopt what we learned in Delta. The ones I'd most
like input on:

1 - Bounds store original values, not sort keys, tagged with a
per-file collation version. ICU/CLDR sort keys aren't stable
across versions, so storing keys ties every reader to one exact
version; original values plus a per-file version (readers prune
only on an exact match) degrade gracefully instead of breaking.
The schema keeps the collation name unversioned so anyone can read.

2 - A provider-qualified identifier (icu.en_US-ci), leaving room
for non-ICU collations like Spark's UTF8_LCASE, rather than
assuming ICU as the sole provider.

3 - One structural question I don't have a strong opinion on yet:
I put collation_bounds on data_file as a standalone v3 field, but
field id 146 is already the v4 content_stats struct, and collation
bounds might belong inside that typed-stats framework instead.
Worth settling before we fix field ids.

The full set of differences and the reader/writer rules are in the
PR description and the write-up. Comments very welcome — both on
the calls above and on whether the standalone-field vs
content_stats direction is the right one.

Best, Andrei

- original proposal:

https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0
- spec change: https://github.com/apache/iceberg/pull/16972
- POC in go: https://github.com/apache/iceberg-go/pull/1318
- java POC:
https://github.com/laskoviymishka/iceberg/tree/prototype/collation-support

On Mon, Mar 30, 2026 at 10:54 PM Alexander Löser
<[email protected]> wrote:

Hi Andrei,

I'm glad you're interested. Looking forward to collaborating
with you!
Thanks for all the feedback here and in the doc. I only had a
quick glance, but I think you raised some good points. I'll
address/respond to your comments as soon as I get the chance,
hopefully tomorrow.
I think you also left some comments in this mail that are not
yet in the doc - I'll move those to a dedicated section at the
end of the doc, so we can use the doc as a single source of
truth/discussion.

> Happy to share our Delta design doc and implementation
learnings in more detail.

Sure, sounds good :)

Best,
Alex

On 3/29/26 01:25, Andrei Tserakhau via dev wrote:

Hi Alexander,

This looks really interesting. We've been working on
collation support in Delta and have shipped it in production
for some time, so this is an area we care about a lot. If
this proposal moves forward we'd be happy to collaborate on
the design and implementation.

The pseudo-field approach for collation metrics is clean and
composes well with existing Iceberg infrastructure. The
specifier coverage is comprehensive.

A few areas worth discussing as this evolves:

1 - Sort key stability and versioning

ICU sort keys are not stable across versions, so a pinned ICU
version bump in a future Iceberg release would invalidate all
existing collation metrics. In multi-engine environments,
requiring all engines to converge on one ICU version is
unrealistic.

We store original string values instead of sort keys and
allow per-file version annotations -- worth discussing
whether something similar could work here.

2 - Provider abstraction

The proposal assumes ICU as the sole provider, but Spark
ships non-ICU collations like UTF8_LCASE that are widely
used. A provider or namespace layer would prevent name
collisions and support engine-specific collations without
future spec changes.

3 - Operational surface

A few things that turned out correctness-critical in our
implementation: partition transforms on collated columns
(collation-equal but byte-distinct values in different
directories), sort order semantics, equality deletes under
collation, and Parquet filter pushdown (must be disabled
since Parquet has no collation concept).

These don't all need to be solved in v1, but it would help to
scope them.

4 - Smaller items (nits)

UTF-8 bounds for the original field id should be "must write"
not "should" -- otherwise backward compat breaks for
non-aware engines. Engine fallback behavior (case-sensitive
vs older ICU vs fail) could use a recommended preference
order to avoid divergent results across engines. The
collation specifier syntax would benefit from a formal grammar.

---

Happy to share our Delta design doc and implementation
learnings in more detail. Looking forward to the discussion.

Best,
Andrei

On Sat, Mar 28, 2026 at 11:49 PM Alexander Löser
<[email protected]> wrote:

Hi everyone,

this is my first interaction with the Iceberg community,
so here are a few words about myself:
- I'm Alex, a Berlin-based software engineer
- I've been working at Snowflake for 4 years now
- I spend most of my time on data types, particularly
binary, strings and collations.

I'd like to start a discussion about adding collations to
the Iceberg spec.

Conceptually, collations are an annotation on the string
data type. By default, most engines perform string
operations case-sensitively.
Collations allow specifying alternative comparison rules.
This is useful for achieving, e.g., case- or
accent-insensitive string operations, or
language-specific string sorting.
Collations are supported by many engines: Databricks

<https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-collation>,
Spark

<https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.collate.html>,
Snowflake
<https://docs.snowflake.com/en/sql-reference/collation>,
Oracle

<https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/COLLATION.html>
- to
name just a few - this list is not complete.

In Snowflake, we see heavy use of the collation feature.
Several users have approached us, mentioning they want to
migrate to Iceberg tables, but are currently blocked by
Iceberg's lack of collation support.

Given the widespread support for collations across
different engines, I believe introducing collations to
Iceberg will increase interoperability and boost its
adoption.
I'd be curious about your thoughts.

*Goal of the proposal*
- Support collation specifications for columns
- Define how collation bounds should be stored - UTF-8
based bounds are not useful for collated columns

*Required Changes*
- Extend the schema to let (string) fields be annotated
with a collation

More details can be found in this doc

<https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0#heading=h.y1ant4w2163k>.

I'm also hoping to present the idea in the next community
sync.

Best, Alex
