At 05:10 PM 5/25/2004, Mark Davis wrote:
I don't think the "fold to base" is as useful as some other information. For
those characters with a canonical decomposition, the decomposition carries more
information, since you can combine it with a "remove combining marks" folding
to get the folding to base.

I think this would have to be 'remove combining *accents*'. You wouldn't want to remove
Indic combining marks by force, if what you are interested in is Latin/Greek/Cyrillic-style
diacritic removal.


For my part, what would be more interesting would be a "full" decomposition of
the characters that don't have a canonical decomposition, e.g.

LATIN CAPITAL LETTER O WITH STROKE => O + /

I believe that when we first discussed this for TR30 it was mentioned that there are characters with diacritic-like features for which there are no combining accents, because we deemed them not productive enough, and too intractable for rendering purposes.


For those characters you wouldn't be able to make a true decomposition, but the base character may still be well-defined.
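
For illustration only, a hand-made supplement could supply the base for such characters,
roughly like this (the table entries are examples, not anything agreed in TR30):

    import unicodedata

    # Supplement for characters that have no canonical decomposition but still
    # have an obvious base letter. Illustrative entries only.
    BASE_SUPPLEMENT = {
        "\u00D8": "O",  # LATIN CAPITAL LETTER O WITH STROKE
        "\u00F8": "o",  # LATIN SMALL LETTER O WITH STROKE
        "\u0141": "L",  # LATIN CAPITAL LETTER L WITH STROKE
        "\u0142": "l",  # LATIN SMALL LETTER L WITH STROKE
        "\u0111": "d",  # LATIN SMALL LETTER D WITH STROKE
    }

    def fold_to_base(s):
        out = []
        for ch in unicodedata.normalize("NFD", s):
            if unicodedata.category(ch).startswith("M"):
                continue                             # drop combining marks
            out.append(BASE_SUPPLEMENT.get(ch, ch))  # supply bases NFD cannot
        return "".join(out)

    print(fold_to_base("\u0141\u00F3d\u017A"))  # Lodz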


I don't see where the decomposition would provide more information - nobody suggests getting rid of it. The problem is, as I mentioned on the Unicore list, how to combine flexibility for technically savvy implementers with specifications of foldings that are based on the (linguistic) facets that define the equivalence class.


This is in fact a good example: if I want to fold characters to their base form, so that a search term can be typed either on a keyboard that doesn't have accents or by a user who doesn't know which accent is correct, I can proceed in two ways. I can create a one-stop-shopping folding that takes any Unicode data stream and produces the desired result, or I can string together a number of building blocks, e.g. first normalize to NFD, then 'decompose' fully, then remove accents.

In the first approach, the tables will contain duplicate entries, and I've pushed the problem of how to factor them onto the implementer (but given that all the information is there, implementers could use semi-automated tools to create an ad-hoc factoring).

In the second approach, I'm pushing the problem of how to assemble the desired effect from building blocks onto implementers or, worse, the end users. That process quickly becomes non-intuitive, as the building blocks give no hint about how they must be assembled.
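
To illustrate the second approach, the chain of blocks might look like the sketch below;
the extra 'full decomposition' table is hypothetical, and the caller simply has to know
the right order - nothing in the blocks themselves says so:

    import unicodedata

    # Hypothetical 'full' decompositions for characters without a canonical one.
    EXTRA_DECOMP = {
        "\u00D8": "O\u0338",  # O WITH STROKE => O + COMBINING LONG SOLIDUS OVERLAY
        "\u00F8": "o\u0338",
    }

    def nfd(s):
        return unicodedata.normalize("NFD", s)

    def decompose_fully(s):
        return "".join(EXTRA_DECOMP.get(ch, ch) for ch in s)

    def remove_accents(s):
        return "".join(ch for ch in s
                       if not unicodedata.category(ch).startswith("M"))

    def pipeline(s, *steps):
        for step in steps:
            s = step(s)
        return s

    print(pipeline("\u00D8re", nfd, decompose_fully, remove_accents))  # Ore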

Kana folding and width folding, their interaction with each other (and with NFx), are another good set of examples where this problem shows up.

One problem with the 'building blocks' approach when it comes to foldings is that foldings effectively have a domain of operation (characters outside the domain are unaffected). However, certain oft-used primitives (e.g. decomposition) have a different domain of operation than common foldings (kana folding or width folding). By insisting on a chain of atomic operations, the domain of data that is affected increases: it becomes the superset of the individual domains.
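
A toy example: a katakana-to-hiragana folding by itself doesn't touch halfwidth katakana,
so one prepends NFKD, and the combined operation then also rewrites ligatures, superscripts
and fullwidth Latin, none of which were in the kana folding's domain. (The offset-based
fold below is a simplification, not a standard table.)

    import unicodedata

    # Toy katakana-to-hiragana shift; its domain is only U+30A1..U+30F6.
    def kana_fold(s):
        return "".join(chr(ord(ch) - 0x60) if 0x30A1 <= ord(ch) <= 0x30F6 else ch
                       for ch in s)

    def kana_fold_with_nfkd(s):
        return kana_fold(unicodedata.normalize("NFKD", s))

    halfwidth = "\uFF76\uFF80\uFF76\uFF85"      # halfwidth KA TA KA NA
    print(kana_fold(halfwidth))                 # unchanged: outside the folding's domain
    print(kana_fold_with_nfkd(halfwidth))       # folded to hiragana
    print(kana_fold_with_nfkd("\uFB01\uFF4C\uFF45\u00B2"))  # 'file2': no kana, yet changed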

A./
