On 6/5/15 14:14, Joseph Wright wrote:

Based on the current files, we have a block to set \XeTeXcharclass,
which only applies to XeTeX. The logic followed in that code is that
characters in the file LineBreak.txt which have class "ID" (ideographs)
not only set the \XeTeXcharclass class to 1 but also set the \catcode of
the code point to 11. That leads to a difference between the two Unicode
engines. My current feeling is that the data file should split this
process such that the category code change applies to both XeTeX and
LuaTeX, with the XeTeX-specific code separate. Does this make sense and
indeed does the current assignment make sense?


ISTM that the most appropriate (default) \catcode for characters with class ID is clearly letter (11), and I would suggest that LuaTeX follow XeTeX in this.

So yes, splitting out the XeTeX-specific code and having LuaTeX share the catcode assignments makes sense.

After all, if users can write control sequences such as

  \hello
  \halló
  \Здравствуйте
  \ሰላም
  \सलाम

they should equally well be able to write

  \你好
  \こんにちは

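and have each of these treated as a single control sequence, too. This will not work if class ID characters are given catcode 12 (other): a control word can only be made up of letters, so \你好 would be read as the control symbol \你 followed by the character 好.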
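
As a rough sketch of the split (not the actual contents of the data file), the shared and XeTeX-only parts might look like the following. The two code points are just stand-ins for the class ID entries generated from LineBreak.txt, and the \ifx test assumes that \undefined really is undefined in the format being built.

```tex
% Sketch only: U+4F60 and U+597D stand in for every class ID
% code point that would be read from LineBreak.txt.

% Shared by XeTeX and LuaTeX: ideographs become letters.
\catcode"4F60=11 % U+4F60
\catcode"597D=11 % U+597D

% XeTeX only: \XeTeXcharclass does not exist in LuaTeX, so guard it.
% (Assumes \undefined has no definition.)
\ifx\XeTeXcharclass\undefined
\else
  \XeTeXcharclass"4F60=1
  \XeTeXcharclass"597D=1
\fi

% With both characters at catcode 11, this is a single control word.
\def\你好{Hello}
\你好
```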

If you're making improvements to unicode-letters.def, I would also suggest adding a section that assigns catcode 15 (invalid) to the code points "D800 to "DFFF (i.e. the UTF-16 surrogates, which should never be used in isolation as characters).
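
Something along these lines would do it; this is only a sketch using the plain TeX \loop macro and the scratch register \count255, not a tested addition to the file. The assignments are \global because the loop runs inside a group to keep the scratch register and the loop's helper macros local.

```tex
% Sketch only: mark the UTF-16 surrogates U+D800..U+DFFF as invalid.
\begingroup
  \count255="D800
  \loop
    % Catcode assignments are local, so make them global to
    % survive the enclosing group.
    \global\catcode\count255=15\relax
    \ifnum\count255<"DFFF
      \advance\count255 by 1
  \repeat
\endgroup
```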

JK


