On 10/20/19 11:03 PM, Rowan Worth wrote:
> On Sun, 20 Oct 2019 at 17:04, Simon Slavin <slav...@bigfraud.org> wrote:
>
>> Another common request is full support for Unicode (searching, sorting,
>> length()).  But even just the tables required to identify character
>> boundaries are huge.
>>
> Nitpick: there are no tables required to identify character boundaries. For
> utf-8 you know if there's another byte to come which is part of the current
> codepoint based on whether the current byte's high bit is set, and
> furthermore you know how many bytes to expect based on the initial byte.
>
> I'm less familiar with utf-16 which SQLite has some support for, but a
> quick read suggests there are exactly two reserved bit patterns you need to
> care about to identify surrogate pairs and thus codepoint boundaries.
>
> Tables relating to collation order, character case, and similar codepoint
> data can of course get huge, so your point stands.
> -Rowan

My memory is that Unicode is somewhat careful NOT to define what a
'character' is, because that can get really complicated, and it often
depends on what the application wants.

You have code-units, which for utf-8 are basically bytes.

You have code-points, which are what most people think of as
'characters'; each has a single Unicode code-point number.
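
As a rough illustration of the difference between code-units and
code-points, here is a minimal Python 3 sketch. The helper name
count_code_points and the sample string are made up for this example,
and the helper assumes its input is already well-formed UTF-8. It
relies only on the fact that UTF-8 continuation bytes have the bit
pattern 10xxxxxx, so no tables are needed at this level.

```python
def count_code_points(data: bytes) -> int:
    """Count code points in well-formed UTF-8 (hypothetical helper).

    Every byte that is NOT a continuation byte (10xxxxxx) starts a
    new code point, so counting those bytes counts the code points.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


# 'a', i-diaeresis, euro sign, and an emoji outside the BMP.
text = "a\u00ef\u20ac\U0001F600"
encoded = text.encode("utf-8")

utf8_units = len(encoded)                          # UTF-8 code units (bytes)
utf16_units = len(text.encode("utf-16-le")) // 2   # UTF-16 code units
code_points = len(text)                            # Python str counts code points

print(utf8_units, utf16_units, code_points, count_code_points(encoded))
```

The same four code-points take a different number of code-units in
each encoding, and the emoji needs a surrogate pair in UTF-16.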

Then you have Graphemes, which are clusters of code-points that tend to
be expressed as a single glyph in output (and some code-points don't
generate any output at all).
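
To make the code-point versus Grapheme distinction concrete, here is a
small Python 3 sketch using the standard unicodedata module. The
function approx_grapheme_count is a hypothetical helper and only a
crude approximation: it skips combining marks and nothing else. Real
grapheme segmentation follows much more involved rules and needs
property tables that are not shown here.

```python
import unicodedata

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT


def approx_grapheme_count(s: str) -> int:
    """Crude approximation (assumption): count code points that are
    not combining marks, i.e. whose canonical combining class is 0."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# Both strings display as one glyph, but they differ in code points.
print(len(composed), len(decomposed))
print(approx_grapheme_count(composed), approx_grapheme_count(decomposed))

# Normalization to NFC folds the pair into the single code point.
print(unicodedata.normalize("NFC", decomposed) == composed)
```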

Dealing with Graphemes gets complicated, and that is where you run into
the need for lots of tables. Code-points themselves are fairly simple
to deal with; the problem is that in some languages, dealing with
code-points alone doesn't let you handle some of the 'simple' operations,
like sorting or case folding, with 100% accuracy. That sometimes
requires dealing with code-point clusters.
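
One way to see why a one-to-one code-point mapping is not enough for
case handling is the sketch below, in Python 3. The helper
caseless_equal is a hypothetical name; it combines str.casefold() with
NFD normalization from unicodedata, which is one reasonable approach
and not the only one. It is also not locale-aware.

```python
import unicodedata


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case (hypothetical helper).

    Normalizes to NFD, case-folds, then normalizes again, because
    folding can change which sequences count as canonically equal.
    """
    def fold(s: str) -> str:
        nfd = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", nfd.casefold())
    return fold(a) == fold(b)


# Upper-casing a single code point can yield two code points.
print(len("\u00df"), len("\u00df".upper()))   # sharp s becomes "SS"

# A naive lower() comparison misses this pair; case folding catches it.
print("Stra\u00dfe".lower() == "STRASSE".lower())
print(caseless_equal("Stra\u00dfe", "STRASSE"))

# A precomposed capital and a decomposed lowercase form also match.
print(caseless_equal("\u00c9", "e\u0301"))
```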

But you also run into the issue (as I understand it) that Unicode
doesn't really define a universal ordering for all characters. This
can be a language-specific problem, and Unicode can't really solve it.
(Two languages might use some of the same characters but treat
them differently for sorting.)
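
As a toy illustration of that last point, the Python 3 sketch below
sorts the same three words under two simplified orderings. The names
SWEDISH_ALPHABET, GERMAN_FOLD, swedish_key and german_key are made up
for this example, and the tables are deliberately tiny assumptions:
Swedish places a-ring, a-umlaut and o-umlaut after z, while German
dictionary order treats a-umlaut like plain a. Real collation needs
far larger, language-specific tables.

```python
# Toy Swedish alphabet: the three extra letters come after 'z'.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

# Toy German dictionary folding: umlauts sort like their base letters.
GERMAN_FOLD = str.maketrans(
    {"\u00e4": "a", "\u00f6": "o", "\u00fc": "u", "\u00df": "ss"}
)


def swedish_key(word: str) -> list:
    # Raises ValueError for letters outside the toy alphabet.
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]


def german_key(word: str) -> tuple:
    # The original word breaks ties between equal folded forms.
    return (word.lower().translate(GERMAN_FOLD), word)


words = ["zon", "\u00e4rm", "arm"]

german_order = sorted(words, key=german_key)
swedish_order = sorted(words, key=swedish_key)

print(german_order)    # a-umlaut word sorts next to "arm"
print(swedish_order)   # a-umlaut word sorts after "zon"
```

The same letter ends up at opposite ends of the list depending on
which language's rules are applied, so no single code-point ordering
can satisfy both.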

-- 
Richard Damon

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
