On Thu, Sep 15 2016 at 21:56 CEST, jsb...@mimuw.edu.pl writes: [...]
> 1. Graphemes, if I understand correctly, are language dependent, textels
> are not.
>
> 2. Textel "ń" means both U+0144 and <U+006E,U+0301>, so it is a notion
> on a higher abstraction level then a grapheme cluster.

In other words, textels are equivalence classes of some set of Unicode
character strings under an equivalence relation which at the moment is
open to discussion but is very close to the official Unicode canonical
equivalence (when working on a corpus of historical Polish we noticed
some cases where standard Unicode equivalence was not convenient).

[...]

On Thu, Sep 15 2016 at 21:27 CEST, leobo...@namakajiri.net writes:

> Isn't the Swift "character" and the "textel" merely the same thing as
> what Unicode already named "grapheme clusters"?

As for the Swift "character", perhaps someone fluent in Swift will
answer the question?

> (Well, technically UAX
> #29[1] defines them as "user-perceived characters", but then says
> grapheme clusters approximate user-perceived characters
> algorithmically).
>
> And, indeed, Swift "Characters" are explicitly defined as "extended
> grapheme clusters" (also from UAX #29):
>
> https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html

Thank you very much for the link. Let me quote the relevant fragment:

--8<---------------cut here---------------start------------->8---
Extended Grapheme Clusters

Every instance of Swift’s Character type represents a single extended
grapheme cluster. An extended grapheme cluster is a sequence of one or
more Unicode scalars that (when combined) produce a single
human-readable character.

Here’s an example. The letter é can be represented as the single
Unicode scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However,
the same letter can also be represented as a pair of scalars—a standard
letter e (LATIN SMALL LETTER E, or U+0065), followed by the COMBINING
ACUTE ACCENT scalar (U+0301).
The COMBINING ACUTE ACCENT scalar is graphically applied to the scalar
that precedes it, turning an e into an é when it is rendered by a
Unicode-aware text-rendering system.

In both cases, the letter é is represented as a single Swift Character
value that represents an extended grapheme cluster. In the first case,
the cluster contains a single scalar; in the second case, it is a
cluster of two scalars:

[...]

*Two String values (or two Character values) are considered equal if
their extended grapheme clusters are canonically equivalent.*
--8<---------------cut here---------------end--------------->8---

To me this means that Swift's characters are equivalence classes of the
set of extended grapheme clusters under the canonical equivalence
relation.

> Such a notion is indeed needed, but it has been always there.
>
> [1] http://unicode.org/reports/tr29/

I don't see a notion of such equivalence classes there.

On Thu, Sep 15 2016 at 16:36 CEST, john.w.kenn...@gmail.com writes:

[...]

> In the new Swift programming language, which is white-hot in the Apple
> community, Apple is moving toward a model of a transparent, generic
> Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary,
> but in which a “character” contains however many code points it needs
> (“e” with a stacked macron, acute accent, and dieresis is
> algorithmically one “character” in Swift). Moreover,
> e-with-an-acute-accent and e followed by a combining acute accent, for
> example, compare as equal. At present, the underlying code is still
> UTF-16LE.

If you insist that Swift's "characters" are just grapheme clusters,
then you attach a different, although related, meaning to the term
"grapheme cluster". I think the notion deserves a term of its own.

Best regards

Janusz

-- 
             ,
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
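
P.S. A minimal sketch of the equivalence-class reading discussed above,
in Python rather than Swift. It assumes that the relation is exactly
the standard Unicode canonical equivalence, which is only an
approximation of the textel relation, since that one is still open to
discussion. The names textel_key and same_textel are hypothetical,
chosen for this illustration only. The sketch compares whole strings
and does not segment text into grapheme clusters.

```python
import unicodedata


def textel_key(s: str) -> str:
    """Pick one representative of the equivalence class of s.

    Assumption: the class is defined by standard canonical equivalence,
    so the NFC normal form can serve as the representative.
    """
    return unicodedata.normalize("NFC", s)


def same_textel(a: str, b: str) -> bool:
    """True if a and b are canonically equivalent strings."""
    return textel_key(a) == textel_key(b)


precomposed = "\u0144"   # ń as the single code point U+0144
decomposed = "n\u0301"   # U+006E followed by COMBINING ACUTE ACCENT

# As code point sequences the two strings differ ...
print(precomposed == decomposed)
# ... but they fall into the same class under canonical equivalence.
print(same_textel(precomposed, decomposed))
```

Under this reading the two spellings of "ń" are distinct strings but
one and the same class, which is the distinction the term "textel" is
meant to name.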