Hi Andrei,
Thanks for putting together the spec PR and the detailed write-up! The
approach mostly looks solid to me. I have a few questions/initial
thoughts regarding the changes you proposed (compared to the original
proposal):
> 1 - Bounds store original values, not sort keys, tagged with a
per-file collation version. ICU/CLDR sort keys aren't stable across
versions, so storing keys ties every reader to one exact version;
original values plus a per-file version (readers prune only on an exact
match) degrade gracefully instead of breaking. The schema keeps the
collation name unversioned so anyone can read.
If I understand correctly, we’re talking about two separate things here:
1. Tagging a column vs a single file with a certain ICU version
2. Using collation keys vs original strings (the ones that will produce
the min/max collation keys)
For 1, it comes down to a tradeoff:
* If we tag the column with the ICU version, we’d force engines to
support one agreed-on ICU version if they want to prune files.
Engines would be able to prune every file (if they support the
specific ICU version), or none at all, so there is more incentive to
support a specific version.
* If we tag individual files with ICU versions, we gain the big
advantage that engines do not need to agree on a single ICU version.
However, if I understand correctly, this comes at the cost of
“fractured” pruning - engines will only be able to prune files that
were written by themselves (or rather, with the same ICU version).
As a consequence, performance might not really be interoperable
between different engines.
Regardless of the approach we choose, all engines should be able to read
the data - they might just not be able to prune files.
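
To make the file-level option concrete, here is a minimal sketch of
version-gated pruning as I understand it. The FileBounds type, its field
names and the version strings are made up for illustration, and
str.casefold is only a stand-in for a real collator's sort-key function:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FileBounds:
    """Hypothetical per-file collation bounds (names are illustrative)."""
    path: str
    collation_version: str  # version of the collation data the writer used
    lower: str              # original string with the smallest sort key
    upper: str              # original string with the largest sort key


def can_skip_file(bounds: FileBounds,
                  reader_version: str,
                  sort_key: Callable[[str], str],
                  literal: str) -> bool:
    """Return True only if the file provably holds no row equal to literal."""
    if bounds.collation_version != reader_version:
        # Different version: the stored bounds are not trusted, so the
        # file must be scanned. Reading still works, pruning does not.
        return False
    key = sort_key(literal)
    return key < sort_key(bounds.lower) or key > sort_key(bounds.upper)


# Two files with identical bounds, written with different (made-up) versions.
same = FileBounds("a.parquet", "76.1", "apple", "Mango")
other = FileBounds("b.parquet", "74.2", "apple", "Mango")

for f in (same, other):
    print(f.path, can_skip_file(f, "76.1", str.casefold, "zebra"))
```

With column-level tagging the version check would move out of the
per-file loop: it either passes for every file or for none.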
For 2, I’m not sure if I understand the advantages of original strings
yet. As you already pointed out, the collation keys depend on the ICU
version. However, if I understand correctly, the same limitation would
apply to the original strings: the sort order may (and does) change
between different ICU versions, too. As a consequence, we can’t assume
the original lower/upper bound strings we stored for version X will also
be lower/upper bounds for version Y - at least in the general case. So
if I understand correctly, we would not gain additional pruning
opportunities compared to using collation keys. Or am I missing
something here?
At the same time, sort keys do have advantages:
* Iceberg allows the truncation of upper and lower bounds. This is
trivial for binary collation keys. For original strings, the task
becomes significantly harder: truncating at a character boundary,
for example, would lead to wrong results, as there are some
context-sensitive sequences: e.g., with the CLDR root locale, abcเก
< abcเ. I think it might be doable with ICU’s
CollationElementIterator, but it will be tricky to get right.
* Lower/upper bounds are computed once, but will be compared many
times. With original strings, we would need to either convert to the
collation key on the fly, or use ICU’s collator for a direct
comparison. Both options will be slower than a raw byte-sequence
comparison.
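
A minimal sketch of why truncation is simple for binary keys, assuming
bounds are compared as unsigned byte sequences. The byte values below
are arbitrary placeholders, not real ICU sort keys, and the function
names are hypothetical:

```python
from typing import Optional


def truncate_lower(key: bytes, width: int) -> bytes:
    """A prefix never sorts after the full key, so it stays a lower bound."""
    return key[:width]


def truncate_upper(key: bytes, width: int) -> Optional[bytes]:
    """Shorten key to at most width bytes while keeping it an upper bound.

    The prefix is incremented at its last byte that is not 0xFF, so the
    result sorts after every key sharing the original prefix. Returns
    None if every byte of the prefix is 0xFF (no bound can be produced).
    """
    if len(key) <= width:
        return key
    prefix = bytearray(key[:width])
    for i in range(len(prefix) - 1, -1, -1):
        if prefix[i] != 0xFF:
            prefix[i] += 1
            return bytes(prefix[: i + 1])
    return None


key = bytes([0x2A, 0x2C, 0x2E, 0x30])  # placeholder sort key
lower = truncate_lower(key, 2)
upper = truncate_upper(key, 2)
print(lower.hex(), upper.hex(), lower <= key <= upper)
```

Pruning then needs nothing more than a bytes comparison per file. I am
not aware of an equally simple rule for original strings.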
There is one scenario where original values would shine, though. I
analyzed the order changes between various ICU versions: in many cases,
only a small range of code points changes/is moved. If we had additional
metadata about which code point ranges a file contains (e.g., whether it
is ASCII only), engines might be able to prove that the original string
bounds for version X are still valid for version Y.
If I'm not mistaken, this could allow pruning across different ICU
versions in certain situations, which I’d consider a point in favor of
original values (and file-level ICU versions).
> 2 - A provider-qualified identifier (icu.en_US-ci), leaving room for
non-ICU collations like Spark's UTF8_LCASE, rather than assuming ICU as
the sole provider.
Adding a provider mechanism sounds like a good approach to keep the spec
open for future collations :) I wonder whether we
should restrict the set of allowed providers, though, similar to how it
was done with geo
<https://lists.apache.org/thread/r5x0do8f241bpf565rx8s5s3wc9ogp0f>. My
main motivation for this proposal is interoperability. I worry that
interoperability might suffer or vanish completely if every engine can
come up with their own definitions.
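
As a sketch of what a restricted provider set could look like on the
reader side: the allow-list contents and the "spark" provider name are
my own placeholders, and the "provider.name" split is only my reading of
the icu.en_US-ci example, not a grammar from the spec PR:

```python
from typing import Tuple

# Hypothetical allow-list; a real one would be fixed by the spec.
ALLOWED_PROVIDERS = frozenset({"icu", "spark"})


def parse_collation(identifier: str) -> Tuple[str, str]:
    """Split 'provider.name' and reject providers outside the allow-list."""
    provider, sep, name = identifier.partition(".")
    if not sep or not provider or not name:
        raise ValueError(f"expected '<provider>.<name>', got {identifier!r}")
    if provider not in ALLOWED_PROVIDERS:
        raise ValueError(f"unknown collation provider {provider!r}")
    return provider, name


print(parse_collation("icu.en_US-ci"))
```

An engine that meets an unknown provider could still read the column
and would only lose collation-aware pruning.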
Happy to hear your thoughts on this!
Best, Alex
On 6/27/26 01:37, Szehon Ho wrote:
Very nice direction, left some comments on the spec proposal.
Thanks to you folks for working on it!
Szehon
On Fri, Jun 26, 2026 at 3:29 AM Andrei Tserakhau via dev
<[email protected]> wrote:
Hi all,
I've spent some cycles on the collation discussion and made
something more concrete to react to: a spec-change PR plus
reference implementations (Go and Java).
- Spec change (apache/iceberg#16972): a "collation" annotation on
string fields, and a data_file.collation_bounds field so collated
columns stay prunable.
- Reference implementation in iceberg-go (apache/iceberg-go#1318):
the full path end to end - schema annotation, collation-aware
comparison (CLDR/UCA), collation bounds in the manifest, and
version-gated data-file pruning, with an Avro round-trip and
pruning tests.
- A lightweight Java POC (link below): the schema annotation plus
a Collator-backed comparator, to match where the discussion is. I
deliberately left the manifest/bounds side out of Java for now.
The design follows the original proposal but takes a few different
turns, mostly to adopt what we learned in Delta. The ones I'd most
like input on:
1 - Bounds store original values, not sort keys, tagged with a
per-file collation version. ICU/CLDR sort keys aren't stable
across versions, so storing keys ties every reader to one exact
version; original values plus a per-file version (readers prune
only on an exact match) degrade gracefully instead of breaking.
The schema keeps the collation name unversioned so anyone can read.
2 - A provider-qualified identifier (icu.en_US-ci), leaving room
for non-ICU collations like Spark's UTF8_LCASE, rather than
assuming ICU as the sole provider.
3 - One structural question I don't have a strong opinion on yet:
I put collation_bounds on data_file as a standalone v3 field, but
field id 146 is already the v4 content_stats struct, and collation
bounds might belong inside that typed-stats framework instead.
Worth settling before we fix field ids.
The full set of differences and the reader/writer rules are in the
PR description and the write-up. Comments very welcome — both on
the calls above and on whether the standalone-field vs
content_stats direction is the right one.
Best, Andrei
- original proposal:
https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0
- spec change: https://github.com/apache/iceberg/pull/16972
- POC in go: https://github.com/apache/iceberg-go/pull/1318
- java POC:
https://github.com/laskoviymishka/iceberg/tree/prototype/collation-support
On Mon, Mar 30, 2026 at 10:54 PM Alexander Löser
<[email protected]> wrote:
Hi Andrei,
I'm glad you're interested. Looking forward to collaborating
with you!
Thanks for all the feedback here and in the doc. I only had a
quick glance, but I think you raised some good points. I'll
address/respond to your comments as soon as I get the chance,
hopefully tomorrow.
I think you also left some comments in this mail that are not
yet in the doc - I'll move those to a dedicated section at the
end of the doc, so we can use the doc as a single source of
truth/discussion.
> Happy to share our Delta design doc and implementation
learnings in more detail.
Sure, sounds good :)
Best,
Alex
On 3/29/26 01:25, Andrei Tserakhau via dev wrote:
Hi Alexander,
This looks really interesting. We've been working on
collation support in Delta and have shipped it in production
for some time, so this is an area we care about a lot. If
this proposal moves forward we'd be happy to collaborate on
the design and implementation.
The pseudo-field approach for collation metrics is clean and
composes well with existing Iceberg infrastructure. The
specifier coverage is comprehensive.
A few areas worth discussing as this evolves:
1 - Sort key stability and versioning
ICU sort keys are not stable across versions, so a pinned ICU
version bump in a future Iceberg release would invalidate all
existing collation metrics. In multi-engine environments,
requiring all engines to converge on one ICU version is
unrealistic.
We store original string values instead of sort keys and
allow per-file version annotations -- worth discussing
whether something similar could work here.
2 - Provider abstraction
The proposal assumes ICU as the sole provider, but Spark
ships non-ICU collations like UTF8_LCASE that are widely
used. A provider or namespace layer would prevent name
collisions and support engine-specific collations without
future spec changes.
3 - Operational surface
A few things that turned out correctness-critical in our
implementation: partition transforms on collated columns
(collation-equal but byte-distinct values in different
directories), sort order semantics, equality deletes under
collation, and Parquet filter pushdown (must be disabled
since Parquet has no collation concept).
These don't all need to be solved in v1, but it would help to
scope them.
4 - Smaller items (nits)
UTF-8 bounds for the original field id should be "must write"
not "should" -- otherwise backward compat breaks for
non-aware engines. Engine fallback behavior (case-sensitive
vs older ICU vs fail) could use a recommended preference
order to avoid divergent results across engines. The
collation specifier syntax would benefit from a formal grammar.
---
Happy to share our Delta design doc and implementation
learnings in more detail. Looking forward to the discussion.
Best,
Andrei
On Sat, Mar 28, 2026 at 11:49 PM Alexander Löser
<[email protected]> wrote:
Hi everyone,
this is my first interaction with the Iceberg community,
so here are a few words about myself:
- I'm Alex, a Berlin-based software engineer
- I've been working at Snowflake for 4 years now
- I spend most of my time on data types, particularly
binary, strings and collations.
I'd like to start a discussion about adding collations to
the Iceberg spec.
Conceptually, collations are an annotation on the string
data type. By default, most engines perform string
operations case-sensitively.
Collations allow specifying alternative comparison rules.
This is useful for achieving, e.g., case- or
accent-insensitive string operations, or
language-specific string sorting.
Collations are supported by many engines: Databricks
<https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-collation>,
Spark
<https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.collate.html>,
Snowflake
<https://docs.snowflake.com/en/sql-reference/collation>,
Oracle
<https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/COLLATION.html>
- to name just a few - this list is not complete.
In Snowflake, we see heavy use of the collation feature.
Several users have approached us, mentioning they want to
migrate to Iceberg tables, but are currently blocked by
Iceberg's lack of collation support.
Given the widespread support for collations across
different engines, I believe introducing collations to
Iceberg will increase interoperability and boost its
adoption.
I'd be curious about your thoughts.
*Goal of the proposal*
- Support collation specifications for columns
- Define how collation bounds should be stored - UTF-8
based bounds are not useful for collated columns
*Required Changes*
- Extend the schema to let (string) fields be annotated
with a collation
More details can be found in this doc
<https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0#heading=h.y1ant4w2163k>.
I'm also hoping to present the idea in the next community
sync.
Best, Alex