-----BEGIN PGP SIGNED MESSAGE-----

David Hopwood wrote:
> A minor modification is needed to the grapheme breaking rules. Give
> preceding tehtar a new property 'Grapheme_Precede', following tehtar
> 'Grapheme_Extend', and add some rules to prevent breaking between
> Grapheme_Precede and a following character:
>
>   Precede × Precede
>   Precede × Base
>
> This is potentially useful for other scripts as well, and it wouldn't
> increase the complexity of grapheme breaking much.
>
> [Actually, I've just noticed that there are no rules "Extend × Extend"
> and "Extend × Link". Shouldn't there be? If there aren't, then there will
> be breaks within combining sequences, and between a combining sequence
> and GRAPHEME JOINER, for example.]

I've looked at this more closely, and I'm now sure that there are two
mistakes in the breaking rules in PDUTR #28: the one described above, and
the fact that in the rule to prevent breaking CRLF, 'not CR' is used
instead of 'CR'. Here is a corrected version of the existing rules:

  CR × LF
  Base × Extend
  Extend × Extend
  Link × Base
  Link × Join_Control Base
  Base × Link
  Extend × Link
  L × (L / V / LV / LVT)
  (LV / V) × (V / T)
  (LVT / T) × T
  Any ÷

[The 6 main rules can alternatively be written as:

  (Base / Extend) × (Extend / Link)
  Link × [Join_Control] Base
]

and here are some proposed modifications to support preceding combining
marks, and slightly change the behaviour of join controls (see below):

  Precede = Join_Control / Preceding_Tehta
  Extend  = Me / Mn / Mc / Following_Tehta / Other_Extend \ Link
  Link    = GRAPHEME_JOINER / Virama
  Base    = Any \ CONTROL \ Zp \ Zs \ Precede \ Extend \ Link

  CR × LF
  Precede × Precede
  Precede × Base
  Base × Extend
  Extend × Extend
  Link × Precede
  Link × Base
  Base × Link
  Extend × Link
  L × (L / V / LV / LVT)
  (LV / V) × (V / T)
  (LVT / T) × T
  Any ÷

[The eight main rules can alternatively be written as:

  (Base / Extend) × (Extend / Link)
  (Link / Precede) × (Precede / Base)

Note that the "Link × Join_Control Base" rule is implemented instead by
"Link × Precede" and "Precede × Base". This allows more than one
Join_Control to appear between the Link and Base, but that should make no
practical difference.]

There are two differences in behaviour as a result of the modified rules:

a) a sequence of join controls is considered to belong to the grapheme
   cluster that follows them.
b) scripts like Tengwar, that have characters that combine with the
   following base character, are supported.

a) means that there are no 'invisible' grapheme clusters as a result of
join controls. This means that additional arrow keystrokes are not needed
to step over join controls, and that join controls are deleted when the
grapheme that follows them is deleted.

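As a rough illustration only (it is not part of the proposal text), the
modified rules can be sketched in Python as a table of class pairs between
which no break is allowed. Everything named here is an assumption made for
the sketch: the classify() and clusters() helpers, the tiny LINKS set (the
GRAPHEME JOINER plus a single sample virama), and the preceding_marks
parameter, which stands in for a Preceding_Tehta property that has no
assigned code points. The Hangul rules (L, V, T, LV, LVT) are left out to
keep the example short.

```python
import unicodedata

# Assumed sample data for this sketch, not real property files.
JOIN_CONTROLS = {"\u200c", "\u200d"}   # ZWNJ, ZWJ
LINKS = {"\u034f", "\u094d"}           # GRAPHEME JOINER, one sample virama

# (Base / Extend) x (Extend / Link), (Link / Precede) x (Precede / Base),
# plus CR x LF. Any pair not listed here is a break ("Any ÷").
NO_BREAK = {
    ("CR", "LF"),
    ("Base", "Extend"), ("Base", "Link"),
    ("Extend", "Extend"), ("Extend", "Link"),
    ("Link", "Precede"), ("Link", "Base"),
    ("Precede", "Precede"), ("Precede", "Base"),
}


def classify(ch, preceding_marks=frozenset()):
    """Map one character to a break class (hypothetical helper)."""
    if ch == "\r":
        return "CR"
    if ch == "\n":
        return "LF"
    # Precede = Join_Control / Preceding_Tehta (the latter supplied by caller)
    if ch in JOIN_CONTROLS or ch in preceding_marks:
        return "Precede"
    # Link is tested before Extend because Extend excludes Link.
    if ch in LINKS:
        return "Link"
    cat = unicodedata.category(ch)
    if cat in ("Mn", "Mc", "Me"):
        return "Extend"
    # CONTROL, Zp and Zs are excluded from Base.
    if cat in ("Cc", "Zp", "Zs"):
        return "Other"
    return "Base"


def clusters(text, preceding_marks=frozenset()):
    """Split text into grapheme clusters under the modified rules."""
    out = []
    prev = None
    for ch in text:
        cls = classify(ch, preceding_marks)
        if out and (prev, cls) in NO_BREAK:
            out[-1] += ch      # no break: extend the current cluster
        else:
            out.append(ch)     # break: start a new cluster
        prev = cls
    return out
```

Under these assumptions, clusters("a\u200db") gives two clusters, "a" and
ZWJ + "b", so the join control travels with the grapheme that follows it
(difference a). Passing a private-use stand-in such as
preceding_marks={"\ue000"} makes "\ue000x" a single cluster (difference b).
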
(Of course, an editor could have a mode that makes normally invisible
controls visible; in that case they would be treated like base characters
for grapheme breaking.)

There can still be invisible grapheme clusters as a result of other
characters in the set 'Default_Ignorable_Code_Point'; those should probably
all be looked at more closely, to see whether it would be better to put
some of them in the Extend or Precede categories.

(For example, why are the Mongolian and generic variation selectors not in
Grapheme_Extend? I'm confused, because they are category Mn, and not in
Grapheme_Link, but Grapheme_Extend is supposed to have been generated as
'Me + Mn + Mc + Other_Grapheme_Extend - Grapheme_Link'. The file versions
I'm looking at are DerivedCoreProperties-3.2.0d4.txt and
PropList-3.2.0d6.txt.)

- -- 
David Hopwood <[EMAIL PROTECTED]>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPDZ07TkCAxeYt5gVAQHM/ggAw8tdn/hau+/IKsQsO0ouLB+RV4gVT/1c
JzwAhsLVxcw1KaJA1Jg2eExvc8B+FrCXQw+XpGOTKaje1WoyGJm3liZNIgLrRQ3M
z8da140ahfnhOcmlk13vGdGicJOutc7gJwDeoHPMU48JUqWR7eIv8GBLsXHOQ3Yn
CoXuIoKiF7fGYTbtCTV9Ow3h4ya11+S6SmCxr/NszqMddA+vVzB8kOnYe7u5fmTE
MHivd3B4e6fMm/RE6udmFn+gseQ4cRRj3C8UDRgnIyQOFVrrd2kbeO2Xek8HNOfn
cvBJOTPP672Z+BnigDXdunNm3txeaIgBfxCOO5/yORywgIdjQANzEw==
=XW57
-----END PGP SIGNATURE-----