On 12 Sep 2011, at 08:59, Mojca Miklavec wrote:

> On Mon, Sep 12, 2011 at 09:36, Yves Codet wrote:
>> Hello.
>> 
>> A question to specialists, Arthur and Mojca maybe :) Is it necessary to have 
>> two sets of hyphenation rules, one in NFC and one in NFD? Or, if hyphenation 
>> patterns are written in NFC, for instance, will they be applied correctly to 
>> a document written in NFD?
> 
> That depends on the engine.
> 
> From what I understand, XeTeX does normalize the input, so NFD should
> work fine. But I'm only speaking from memory based on Jonathan's talk
> at BachoTeX.

xetex will normalize text as it is being read from an input file IF the 
parameter \XeTeXinputnormalization is set to 1 (NFC) or 2 (NFD), but will leave 
it untouched if it's zero (which is the initial default).
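To make the two forms concrete: here is a small Python sketch (using the standard-library unicodedata module, not anything from xetex itself) of what NFC and NFD do to the same visible text:

```python
import unicodedata

text = "caf\u00e9"          # 'café' with precomposed U+00E9
decomposed = "cafe\u0301"   # 'café' as 'e' + U+0301 COMBINING ACUTE ACCENT

# The two strings render identically but compare unequal as code points:
print(text == decomposed)                                  # False

# NFC folds combining sequences into precomposed characters where possible
# (this is what \XeTeXinputnormalization=1 does to input lines):
print(unicodedata.normalize("NFC", decomposed) == text)    # True

# NFD splits precomposed characters apart (\XeTeXinputnormalization=2):
print(unicodedata.normalize("NFD", text) == decomposed)    # True
```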

Note that this would not affect character sequences created in ways other than 
reading text files - e.g. you could still create unnormalized text within xetex 
via macros, etc.

Forcing "universal normalization" is hazardous because there are fonts that do 
not render the different normalization forms equally well, so users may have a 
specific reason for wanting to use a certain form. (This is, of course, a 
shortcoming of such fonts, but because this is the real world situation, I'm 
reluctant to switch on normalization by default in the engine.)

In principle, it seems desirable that the engine should deal with normalization 
"automatically" when applying hyphenation patterns, but this is not currently 
implemented.
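What such engine-side handling might look like, very roughly: normalize both the document text and the patterns to a single form before matching. A hypothetical Python sketch (the function name is mine, not any engine's API):

```python
import unicodedata

def find_pattern(text: str, pattern: str, form: str = "NFC") -> bool:
    """Hypothetical sketch: normalize both sides to one canonical form
    before matching, so an NFC pattern applies to NFD input and vice versa."""
    return unicodedata.normalize(form, pattern) in unicodedata.normalize(form, text)

# An NFC pattern still matches NFD input once both sides are normalized:
print(find_pattern("cafe\u0301teria", "caf\u00e9"))   # True
```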

Personally, I'd recommend the use of NFC as a "standard" in almost all 
situations, and suggest that pattern authors should operate on this assumption; 
support for non-NFC text may then be less-than-perfect, but I'd consider that a 
feature request for the engine(s) more than for the patterns.
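If pattern authors do standardize on NFC, it is easy to check a pattern set mechanically. A sketch of such a check (the helper name is my invention; it uses unicodedata.is_normalized, available in Python 3.8+):

```python
import unicodedata

def check_patterns_nfc(patterns):
    """Hypothetical helper: return the patterns that are NOT already in NFC."""
    return [p for p in patterns if not unicodedata.is_normalized("NFC", p)]

# The first pattern is precomposed (NFC); the second uses a combining accent:
print(check_patterns_nfc(["caf\u00e9", "e\u0301t"]))   # only the second is flagged
```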

> I might be wrong. I'm not sure what LuaTeX does. If one
> doesn't write the code, it might be that no normalization will ever
> take place.
> 
> I can also easily imagine that our patterns don't work with NFD input
> with Hyphenator.js. I'm not sure how patterns in Firefox or OpenOffice
> deal with normalization. I never tested that.
> 
> But in my opinion the engine *should* be capable of doing normalization.
> Otherwise you can easily end up with an exponential problem. A pattern with 3
> accented letters can easily result in 8 or even more duplicated
> patterns to cover all possible combinations of composed-or-decomposed
> characters.
> 
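That blow-up is easy to demonstrate. A rough Python sketch (the function names are mine, not anything in hyph-utf8; it treats each NFC character as one unit, which is an approximation):

```python
from itertools import product
import unicodedata

def composed_or_decomposed(ch):
    """Both canonical forms of one character (they coincide for e.g. ASCII)."""
    return {unicodedata.normalize("NFC", ch), unicodedata.normalize("NFD", ch)}

def all_variants(pattern):
    """Every composed-or-decomposed spelling of a pattern."""
    nfc = unicodedata.normalize("NFC", pattern)
    choices = [composed_or_decomposed(ch) for ch in nfc]
    return ["".join(combo) for combo in product(*choices)]

# A pattern with three accented letters yields 2**3 = 8 distinct spellings:
print(len(all_variants("\u00e9\u00e8\u00ea")))   # 8
```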
> Arthur had some plans to cover normalization in hyph-utf8, but I
> already hate the idea of duplicated apostrophe,

That's a bit different, and it's hard to see how we could avoid it except via 
special-case code somewhere that "knows" to treat U+0027 and U+2019 as 
equivalent for certain purposes, even though they are NOT canonically 
equivalent characters and would not be touched by normalization.
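This is easy to verify: no Unicode normalization form maps one apostrophe to the other, as a quick check with Python's unicodedata shows:

```python
import unicodedata

ascii_apostrophe = "\u0027"   # APOSTROPHE
curly_apostrophe = "\u2019"   # RIGHT SINGLE QUOTATION MARK

# U+2019 has no canonical or compatibility decomposition, so every
# normalization form leaves the two characters distinct:
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(unicodedata.normalize(form, curly_apostrophe) == ascii_apostrophe)   # False
```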

IMO, the "duplicated apostrophe" case is something we have to live with because 
there are, in effect, two different orthographic conventions in use, and we 
want both to be supported. They're alternate spellings of the word, and so 
require separate patterns - just like we'd require for "colour" and "color", if 
we were trying to support both British and American conventions in a single set 
of patterns.

> let alone all
> duplications just for the sake of "stupid engines that don't
> understand unicode" :).

Yes, the engine should handle that. But it doesn't (unless you enable input 
normalization that matches your patterns).

JK

--------------------------------------------------
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex