Re: [Discussion] Collation Support

Andrei Tserakhau via dev Fri, 03 Jul 2026 06:56:55 -0700

Hi Alex,

Thanks, these are the right questions. Let me answer them, but I think all
three are really facets of one decision worth pulling out, so I'll do that
at the end.


Original values vs sort keys. I don't think the two limitations are
symmetric. You're right that a bound stored under version X may not hold
under Y for either representation, and a naive reader prunes only on an
exact version match either way. But they degrade differently: a sort key
from X is incomparable under Y (nothing a Y reader can do with it) while an
original value is the actual string, so a Y reader can re-interpret it. In
the case you describe at the end of your mail, where only a small
code-point range moved and a file's values fall outside it, that reader can
prove the X bound still holds and prune across versions. Sort keys
foreclose that; original values keep it open. So original values are a
superset: worst case they match sort keys, best case they prune across
versions.
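
To make the asymmetry concrete, here is a minimal Python sketch of the check a
Y reader could run on original-value bounds written under X. Everything named
here is hypothetical: the CHANGED_RANGES table, the FileBounds fields (the
per-file code-point span is the extra metadata you suggested, not something in
the spec PR), and the 0x0E00-0x0E7F range, which is illustrative and not a real
ICU diff. It also assumes your observation holds, i.e. values outside the moved
ranges keep their relative order.

```python
from dataclasses import dataclass

# Hypothetical table: code-point ranges whose relative order changed between
# two collation versions. Real entries would come from diffing CLDR data.
CHANGED_RANGES = {
    ("X", "Y"): [(0x0E00, 0x0E7F)],  # illustrative range only
}


@dataclass
class FileBounds:
    version: str         # collation version the bounds were computed under
    lower: str           # original (untransformed) minimum value
    upper: str           # original (untransformed) maximum value
    min_code_point: int  # hypothetical metadata: lowest code point in the file
    max_code_point: int  # hypothetical metadata: highest code point in the file


def bounds_usable(file: FileBounds, reader_version: str) -> bool:
    """Return True if the reader may trust the stored bounds for pruning."""
    if file.version == reader_version:
        return True  # exact match: the naive rule
    changed = CHANGED_RANGES.get((file.version, reader_version))
    if changed is None:
        return False  # unknown version pair: fall back to scanning the file
    # Usable only if the file's code-point span misses every moved range.
    return all(
        file.max_code_point < lo or file.min_code_point > hi
        for lo, hi in changed
    )
```

An ASCII-only file written under X stays prunable under Y, while a file that
touches the moved range falls back to a scan. With sort keys there is no
equivalent check to write, because the stored bytes cannot be re-interpreted.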

Your two sort-key advantages are real; I just don't think they belong in
the format. Truncation: agreed it's hard for collated strings (your abcเก
contraction case is exactly the trap), so I sidestepped it: collation
bounds must be tight, and a writer that can't store the exact min/max omits
the bound. Collation-aware truncation with CollationElementIterator is a
possible later optimization. Compare cost: pruning is per-file at planning
time, not per-row, so collator vs byte compare is in the noise; and an
engine that wants the byte path can derive and cache the sort key from the
stored value. Original values don't block that, they just don't bake a
version-specific encoding into the format.
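
A rough Python sketch of both points, under stated assumptions: sort_key uses
str.casefold() as a stand-in for a real collator's sort key (it is not ICU),
and MAX_BOUND_BYTES, tight_bounds, cached_sort_key and may_contain are
hypothetical names, not anything from the spec PR or the reference
implementations.

```python
from functools import lru_cache
from typing import Iterable, Optional, Tuple

MAX_BOUND_BYTES = 16  # hypothetical writer-side size limit for a stored bound


def sort_key(value: str) -> bytes:
    """Stand-in for a collator sort key: case-insensitive via casefold()."""
    return value.casefold().encode("utf-8")


def tight_bounds(
    values: Iterable[str], max_bytes: int = MAX_BOUND_BYTES
) -> Optional[Tuple[str, str]]:
    """Return exact (min, max) original values, or None to omit the bound."""
    items = list(values)
    if not items:
        return None
    lower = min(items, key=sort_key)
    upper = max(items, key=sort_key)
    # No truncation: if either exact value is too large, omit the bound.
    if any(len(v.encode("utf-8")) > max_bytes for v in (lower, upper)):
        return None
    return lower, upper


@lru_cache(maxsize=None)
def cached_sort_key(value: str) -> bytes:
    """Engine-side cache: derive the byte key once per stored value."""
    return sort_key(value)


def may_contain(bounds: Optional[Tuple[str, str]], probe: str) -> bool:
    """Planning-time check; a file without bounds can never be pruned."""
    if bounds is None:
        return True
    lower, upper = bounds
    return cached_sort_key(lower) <= cached_sort_key(probe) <= cached_sort_key(upper)
```

The byte-compare path is still available to an engine that wants it; the key
is just derived locally under the engine's own version instead of being stored.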

Column vs file-level version. As you say, every engine can read regardless,
so this is a pruning-performance choice, not correctness. In the
schema-registered-metrics design a file carries bounds under a declared
(collation, version), and a reader prunes any file with a metric for a
version it can produce, not only files it wrote. So convergence on one ICU
version gives full cross-engine pruning, same as column-level, and a writer
or compaction can populate several versions at once. It gives the
column-level benefit by convention without making a version bump a
format-breaking change.
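
As a sketch of the reader rule, assuming a file's metrics can be modeled as a
map keyed by (collation, version): the pick_bounds helper, the type aliases and
the "X"/"Y" version labels below are hypothetical and only illustrate the
lookup, not the manifest encoding.

```python
from typing import Dict, Optional, Set, Tuple

Bounds = Tuple[str, str]     # (lower, upper) original values
MetricKey = Tuple[str, str]  # (collation name, collation version)


def pick_bounds(
    file_metrics: Dict[MetricKey, Bounds],
    collation: str,
    reader_versions: Set[str],
) -> Optional[Bounds]:
    """Return bounds for any version the reader can produce, else None."""
    for (name, version), bounds in file_metrics.items():
        if name == collation and version in reader_versions:
            return bounds
    return None  # no usable metric: the file must be scanned, not pruned


# A writer or compaction job may populate several versions at once.
metrics = {
    ("icu.en_US-ci", "X"): ("Apple", "cherry"),
    ("icu.en_US-ci", "Y"): ("apple", "Cherry"),
}
```

A reader on Y prunes with the Y entry no matter which engine wrote the file; a
reader on some other version gets None and falls back to a scan.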

[One data point from the engine side, since I'm coming at this from the
Databricks runtime: our runtime is effectively versionless (customers don't
pin an ICU version, and upgrades happen under them) so "the same table read
by clients on different ICU versions" isn't a corner case for us, it's the
default. That's what pushes me toward per-file versioning: pinning one
version per table or column means either forcing the whole fleet to upgrade
in lockstep or breaking pruning on every bump, and neither survives a
versionless fleet. And in practice most version bumps we've gone through
don't reorder the data in a given column at all, which is exactly why
keeping original values, and eventually your code-point-range idea, lets a
reader keep pruning across a bump instead of falling back to a scan.]

Providers. Agreed. I'll tighten the spec to a registered set, as was done
for geo: icu to start, utf8 reserved, and non-ICU collations added by spec
change rather than ad hoc. Interop is the whole point and an open namespace
undercuts it.
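
A minimal Python sketch of what a restricted provider set could look like on
the validation side. The parse_collation helper and the two sets are
hypothetical, and treating a reserved provider as "rejected until the spec
defines it" is my assumption for this sketch, not settled wording.

```python
from typing import Tuple

REGISTERED_PROVIDERS = {"icu"}  # usable today
RESERVED_PROVIDERS = {"utf8"}   # name held back, not yet defined


def parse_collation(identifier: str) -> Tuple[str, str]:
    """Split 'provider.name' and reject providers outside the registered set."""
    provider, sep, name = identifier.partition(".")
    if not sep or not provider or not name:
        raise ValueError(f"expected '<provider>.<name>', got {identifier!r}")
    if provider in RESERVED_PROVIDERS:
        raise ValueError(f"provider {provider!r} is reserved")
    if provider not in REGISTERED_PROVIDERS:
        raise ValueError(f"provider {provider!r} is not registered")
    return provider, name
```

Here parse_collation("icu.en_US-ci") returns ("icu", "en_US-ci"), while an
unregistered or reserved provider raises ValueError instead of being accepted
silently.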

-----------
Stepping back: I think the three above collapse into one question that
needs broader alignment than the two of us:

> how much cross-engine pruning interoperability should the format
> guarantee, versus leave to convention?

Original-vs-sortkey, column-vs-file version, and open-vs-restricted
providers are all that same tradeoff from different angles. It's a values
call more than a correctness one, and it binds every engine that hasn't
weighed in yet: Trino, Flink, Spark, PyIceberg, rust.

From the DBR side I can say the multi-version case is real rather than
theoretical, but that's one engine's vantage point. I'd like to get the
interop question in front of the other implementers before we fix field
IDs. The dev list is probably enough for now, and a community sync is there
if it needs more than async. No rush on that; I'd rather let the thread
settle the mechanics first.

Best,
Andrei

On Fri, Jul 3, 2026 at 12:22 AM Alexander Löser <[email protected]>
wrote:

> Hi Andrei,
>
> Thanks for putting together the spec PR and the detailed write-up! The
> approach mostly looks solid to me. I have a few questions/initial thoughts
> regarding the changes you proposed (compared to the original proposal):
>
> > 1 - Bounds store original values, not sort keys, tagged with a per-file
> > collation version. ICU/CLDR sort keys aren't stable across versions, so
> > storing keys ties every reader to one exact version; original values plus a
> > per-file version (readers prune only on an exact match) degrade gracefully
> > instead of breaking. The schema keeps the collation name unversioned so
> > anyone can read.
>
> If I understand correctly, we’re talking about two separate things here:
>
>    1. Tagging a column vs a single file with a certain ICU version
>    2. Using collation keys vs original strings (the ones that will
>    produce the min/max collation keys)
>
>
> For 1, it comes down to a tradeoff:
>
>    - If we tag the column with the ICU version, we’d force engines to
>    support one agreed-on ICU version if they want to prune files. Engines
>    would be able to prune every file (if they support the specific ICU
>    version), or none at all, so there is more incentive to support a specific
>    version
>    - If we tag individual files with ICU versions, we gain the big
>    advantage that engines do not need to agree on a single ICU version.
>    However, if I understand correctly, this comes at the cost of “fractured”
>    pruning - engines will only be able to prune files that were written by
>    themselves (or rather, with the same ICU version). As a consequence,
>    performance might not really be interoperable between different engines.
>
> Regardless of the approach we choose, all engines should be able to read
> the data - they might just not be able to prune files.
>
> For 2, I’m not sure if I understand the advantages of original strings
> yet. As you already pointed out, the collation keys depend on the ICU
> version. However, if I understand correctly, the same limitation would
> apply to the original strings: the sort order may (and does) change between
> different ICU versions, too. As a consequence, we can’t assume the original
> lower/upper bound strings we stored for version X will also be lower/upper
> bounds for version Y - at least in the general case. So if I understand
> correctly, we would not gain additional pruning opportunities compared to
> using collation keys. Or am I missing something here?
>
> At the same time, sort keys do have advantages:
>
>    - Iceberg allows the truncation of upper- and lower bounds. This is
>    trivial for binary collation keys. For original strings, the task becomes
>    significantly harder: truncating at a character boundary, for example,
>    would lead to wrong results, as there are some context-sensitive sequences:
>    e.g., with the CLDR root locale, abcเก < abcเ. I think it might be doable
>    with ICU’s CollationElementIterator, but it will be tricky to get right.
>    - Lower/upper bounds are computed once, but will be compared many
>    times. With original strings, we would need to either convert to the
>    collation key on the fly, or use ICU’s collator for a direct comparison.
>    Both options will be slower than a raw byte-sequence comparison
>
>
> There is one scenario where original values would shine, though. I
> analyzed the order-changes between various ICU versions: in many cases,
> only a small range of code points changes/is moved. If we had additional
> metadata about which code point ranges a file contains (e.g., whether it is
> ASCII only), engines might be able to prove that the original string bounds
> for version X are still valid for version Y.
> If I'm not mistaken, this could allow pruning across different ICU
> versions in certain situations, which I’d consider a point in favor of
> original values (and file-level ICU versions).
>
>
>
> > 2 - A provider-qualified identifier (icu.en_US-ci), leaving room for
> > non-ICU collations like Spark's UTF8_LCASE, rather than assuming ICU as the
> > sole provider.
>
> Adding a provider-mechanism sounds like a good approach to keep the spec
> open for future collations :) I wonder
> whether we should restrict the set of allowed providers, though, similar to
> how it was done with geo
> <https://lists.apache.org/thread/r5x0do8f241bpf565rx8s5s3wc9ogp0f>. My
> main motivation for this proposal is interoperability. I worry that
> interoperability might suffer or vanish completely if every engine can come
> up with their own definitions.
>
>
>
> Happy to hear your thoughts on this!
>
> Best, Alex
>
>
> On 6/27/26 01:37, Szehon Ho wrote:
>
> Very nice direction, left some comments on the spec proposal.
>
> Thanks to you folks for working on it !
> Szehon
>
> On Fri, Jun 26, 2026 at 3:29 AM Andrei Tserakhau via dev <
> [email protected]> wrote:
>
>> Hi all,
>>
>> I've spent some cycles on the collation discussion and made something more
>> concrete to react to: a spec-change PR plus reference implementations (go
>> and java).
>>
>> - Spec change (apache/iceberg#16972): a "collation" annotation on string
>> fields, and a data_file.collation_bounds field so collated columns stay
>> prunable.
>> - Reference implementation in iceberg-go (apache/iceberg-go#1318): the
>> full path end to end - schema annotation, collation-aware comparison
>> (CLDR/UCA), collation bounds in the manifest, and version-gated data-file
>> pruning, with an Avro round-trip and pruning tests.
>> - A lightweight Java POC (link below): the schema annotation plus a
>> Collator-backed comparator, to match where the discussion is. I
>> deliberately left the manifest/bounds side out of Java for now.
>>
>> The design follows the original proposal but takes a few different turns,
>> mostly to adopt what we learned in Delta. The ones I'd most like input on:
>>
>> 1 - Bounds store original values, not sort keys, tagged with a per-file
>> collation version. ICU/CLDR sort keys aren't stable across versions, so
>> storing keys ties every reader to one exact version; original values plus a
>> per-file version (readers prune only on an exact match) degrade gracefully
>> instead of breaking. The schema keeps the collation name unversioned so
>> anyone can read.
>>
>> 2 - A provider-qualified identifier (icu.en_US-ci), leaving room for
>> non-ICU collations like Spark's UTF8_LCASE, rather than assuming ICU as the
>> sole provider.
>>
>> 3 - One structural question I don't have a strong opinion on yet: I put
>> collation_bounds on data_file as a standalone v3 field, but field id 146 is
>> already the v4 content_stats struct, and collation bounds might belong
>> inside that typed-stats framework instead. Worth settling before we fix
>> field ids.
>>
>> The full set of differences and the reader/writer rules are in the PR
>> description and the write-up. Comments very welcome — both on the calls
>> above and on whether the standalone-field vs content_stats direction is the
>> right one.
>>
>> Best, Andrei
>>
>> - original proposal:
>> https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0
>> - spec change: https://github.com/apache/iceberg/pull/16972
>> - POC in go: https://github.com/apache/iceberg-go/pull/1318
>> - java POC:
>> https://github.com/laskoviymishka/iceberg/tree/prototype/collation-support
>>
>> On Mon, Mar 30, 2026 at 10:54 PM Alexander Löser <[email protected]>
>> wrote:
>>
>>> Hi Andrei,
>>>
>>> I'm glad you're interested. Looking forward to collaborating with you!
>>> Thanks for all the feedback here and in the doc. I only had a quick
>>> glance, but I think you raised some good points. I'll address/respond to
>>> your comments as soon as I get the chance, hopefully tomorrow.
>>> I think you also left some comments in this mail that are not yet in the
>>> doc - I'll move those to a dedicated section at the end of the doc, so we
>>> can use the doc as a single source of truth/discussion.
>>>
>>> > Happy to share our Delta design doc and implementation learnings in
>>> > more detail.
>>>
>>> Sure, sounds good :)
>>>
>>> Best,
>>> Alex
>>> On 3/29/26 01:25, Andrei Tserakhau via dev wrote:
>>>
>>> Hi Alexander,
>>>
>>> This looks really interesting. We've been working on collation support
>>> in Delta and have shipped it in production for some time, so this is an
>>> area we care about a lot. If this proposal moves forward we'd be happy to
>>> collaborate on the design and implementation.
>>>
>>> The pseudo-field approach for collation metrics is clean and composes
>>> well with existing Iceberg infrastructure. The specifier coverage is
>>> comprehensive.
>>>
>>> A few areas worth discussing as this evolves:
>>>
>>> 1 - Sort key stability and versioning
>>>
>>> ICU sort keys are not stable across versions, so a pinned ICU version
>>> bump in a future Iceberg release would invalidate all existing collation
>>> metrics. In multi-engine environments, requiring all engines to converge on
>>> one ICU version is unrealistic.
>>>
>>> We store original string values instead of sort keys and allow per-file
>>> version annotations -- worth discussing whether something similar could
>>> work here.
>>>
>>> 2 - Provider abstraction
>>>
>>> The proposal assumes ICU as the sole provider, but Spark ships non-ICU
>>> collations like UTF8_LCASE that are widely used. A provider or namespace
>>> layer would prevent name collisions and support engine-specific collations
>>> without future spec changes.
>>>
>>> 3 - Operational surface
>>>
>>> A few things that turned out correctness-critical in our implementation:
>>> partition transforms on collated columns (collation-equal but byte-distinct
>>> values in different directories), sort order semantics, equality deletes
>>> under collation, and Parquet filter pushdown (must be disabled since
>>> Parquet has no collation concept).
>>>
>>> These don't all need to be solved in v1, but it would help to scope them.
>>>
>>> 4 - Smaller items (nits)
>>>
>>> UTF-8 bounds for the original field id should be "must write" not
>>> "should" -- otherwise backward compat breaks for non-aware engines. Engine
>>> fallback behavior (case-sensitive vs older ICU vs fail) could use a
>>> recommended preference order to avoid divergent results across engines. The
>>> collation specifier syntax would benefit from a formal grammar.
>>>
>>> ---
>>>
>>> Happy to share our Delta design doc and implementation learnings in more
>>> detail. Looking forward to the discussion.
>>>
>>> Best,
>>> Andrei
>>>
>>> On Sat, Mar 28, 2026 at 11:49 PM Alexander Löser <[email protected]>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> this is my first interaction with the Iceberg community, so here are a few
>>>> words about myself:
>>>> - I'm Alex, a Berlin-based software engineer
>>>> - I've been working at Snowflake for 4 years now
>>>> - I spend most of my time on data types, particularly binary, strings
>>>> and collations.
>>>>
>>>> I'd like to start a discussion about adding collations to the Iceberg
>>>> spec.
>>>>
>>>> Conceptually, collations are an annotation on the string data type. By
>>>> default, most engines perform string operations case-sensitively.
>>>> Collations allow specifying alternative comparison rules. This is
>>>> useful for achieving, e.g., case- or accent-insensitive string operations,
>>>> or language-specific string sorting.
>>>> Collations are supported by many engines: Databricks
>>>> <https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-collation>,
>>>> Spark
>>>> <https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.collate.html>,
>>>> Snowflake <https://docs.snowflake.com/en/sql-reference/collation>,
>>>> Oracle
>>>> <https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/COLLATION.html>
>>>>  - to
>>>> name just a few - this list is not complete.
>>>>
>>>> In Snowflake, we see heavy use of the collation feature. Several users
>>>> have approached us, mentioning they want to migrate to Iceberg tables, but
>>>> are currently blocked by Iceberg's lack of collation support.
>>>>
>>>> Given the widespread support for collations across different engines, I
>>>> believe introducing collations to Iceberg will increase interoperability
>>>> and boost its adoption.
>>>> I'd be curious about your thoughts.
>>>>
>>>> *Goal of the proposal*
>>>> - Support collation specifications for columns
>>>> - Define how collation bounds should be stored - UTF-8 based bounds are
>>>> not useful for collated columns
>>>>
>>>> *Required Changes*
>>>> - Extend the schema to let (string) fields be annotated with a collation
>>>>
>>>> More details can be found in this doc
>>>> <https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0#heading=h.y1ant4w2163k>
>>>> .
>>>>
>>>> I'm also hoping to present the idea in the next community sync.
>>>>
>>>> Best, Alex
>>>>
>>>>
>>>>
