On 1/2/16 10:25, David Carlisle wrote:
Thanks for the test sources,

It all seems to work for me (texlive 2015/cygwin 64 build), but..

I do wonder if this change is going in the right direction.

The main problem with the char classes is not the overall number. In
fact, since what matters when specifying code is the boundary between
different classes rather than the classes themselves, there are now
around 300 million such boundaries that could be specified, which
seems more than enough!

The main problem is that each character can only be in one class, which
means that it is very hard to use these for any generic code. If you
have already classified characters by (say) line breaking properties and
then another package wants to classify by Unicode block, or by default
writing direction, then the only way to handle that is to enumerate all
the intersecting properties and assign a unique character class to
each intersection. This leads to a combinatorial explosion in the number of
boundary tokens that need to be specified. Where you may have had a
single specification for the boundary between LTR and RTL, if you also
want to classify each Unicode block you need separate classes for LTR
and RTL characters in each block, and then need to specify the same
boundary tokens for all the possible changes of LTR in one block
followed by RTL in another.
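
To put rough numbers on that explosion, here is a small sketch. The property names and the choice of four blocks are made up for illustration; they are not taken from any existing package.

```python
from itertools import product

# Hypothetical classifications that two independent packages might want.
directions = ["LTR", "RTL"]
blocks = ["Latin", "Greek", "Hebrew", "Arabic"]  # illustrative subset only

# One class per character means one class per (direction, block) pair.
combined_classes = list(product(directions, blocks))

# Every LTR-to-RTL rule must now be repeated for each pair of blocks.
ltr_to_rtl_rules = [
    (left, right)
    for left, right in product(combined_classes, repeat=2)
    if left[0] == "LTR" and right[0] == "RTL"
]

print(len(combined_classes))  # classes needed once the properties are intersected
print(len(ltr_to_rtl_rules))  # boundary rules that replace the single LTR/RTL rule
```

With two directions and four blocks that is already 8 classes and 16 LTR-to-RTL rules, where a single rule used to suffice.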

That limitation of course has always been there, but increasing the
number of classes available highlights it more strongly.

You're right, of course; this is a limitation of the concept as currently implemented.

In practice, I suppose I don't expect there to be all that many "generic purposes" for which intercharclass is really a useful tool. For example, it's hard to see how it could work well for bidi issues, because of the problem of resolving neutral characters -- especially run-initial neutrals.


Would it be impossibly difficult to extend the concept so that a
character takes a list of character classes so that you can classify
characters in more than one way without needing impossibly many
character classes to do that?

There would be two aspects to this: first, extending the character class storage so as to allow a list rather than a single number. Currently, it's stashed in the upper part of the word where sfcode already lives, making the implementation very simple and cheap.
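
As a rough illustration of why the current storage is cheap, here is a sketch of packing a class number into the same word as the sfcode. The 16-bit split and the function names are assumptions made for the example, not a description of the actual XeTeX source.

```python
# Assumed layout for illustration: low 16 bits hold the sfcode,
# everything above that holds the character class.
SF_BITS = 16
SF_MASK = (1 << SF_BITS) - 1

def pack(sfcode, char_class):
    """Store both values in one integer word."""
    return (char_class << SF_BITS) | (sfcode & SF_MASK)

def sfcode_of(word):
    """Recover the sfcode from the low bits."""
    return word & SF_MASK

def class_of(word):
    """Recover the character class from the high bits."""
    return word >> SF_BITS

word = pack(1000, 7)
print(sfcode_of(word), class_of(word))
```

A fixed-width field like this has room for one class number only; a list of classes per character would need separate storage.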

And second, checking for the existence of a token list for the current boundary would become significantly more expensive. Currently, we just combine the two classes at the boundary to get a single 32-bit number, and do a simple lookup (in a sparse array) to see if there's anything defined. With class lists, we'd need to do this for each of the classes in the two lists -- i.e. m * n sparse-array lookups. Or perhaps go at it from the other direction: iterate over a list of defined transitions, and check whether each of them applies.
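
A minimal sketch of the three lookup strategies just described. A Python dict stands in for the sparse array, and the 16-bit shift used to combine the two classes is an assumption for the example; all names are hypothetical.

```python
CLASS_BITS = 16                      # assumed width of one class number
CLASS_MASK = (1 << CLASS_BITS) - 1

def key(left, right):
    """Combine the two classes at a boundary into a single number."""
    return (left << CLASS_BITS) | right

# Stand-in for the sparse array of defined transitions.
inter_char_toks = {key(1, 2): "foo"}

def lookup_single(left, right):
    """Current scheme: one class per character, one probe."""
    return inter_char_toks.get(key(left, right))

def lookup_lists(left_classes, right_classes):
    """Class lists: probe every pair, i.e. m * n lookups."""
    return [
        (l, r, inter_char_toks[key(l, r)])
        for l in left_classes
        for r in right_classes
        if key(l, r) in inter_char_toks
    ]

def lookup_by_transitions(left_classes, right_classes):
    """Other direction: walk the defined transitions, test each one."""
    lefts, rights = set(left_classes), set(right_classes)
    hits = []
    for k, toks in inter_char_toks.items():
        l, r = k >> CLASS_BITS, k & CLASS_MASK
        if l in lefts and r in rights:
            hits.append((l, r, toks))
    return hits

print(lookup_single(1, 2))
print(lookup_lists([1, 5], [2, 6]))
print(lookup_by_transitions([1, 5], [2, 6]))
```

Which of the two list-based variants is cheaper would depend on whether the class lists or the table of defined transitions tends to be smaller.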

Oh, and if there are multiple matches at a given boundary, what happens? Using an imaginary extension to support lists:

  \XeTeXintercharclasses `A = { 1, 2 }
  \XeTeXintercharclasses `B = { 3, 4 }

  \XeTeXinterchartoks 1 3 = { foo }
  \XeTeXinterchartoks 1 4 = { bar }
  \XeTeXinterchartoks 2 3 = { xyzzy }
  \XeTeXinterchartoks 2 4 = { plugh }

What happens at the boundary in "AB"? Should it depend on the numerical values of the classes, or the order in which the transitions were specified, or what?
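
To make the ambiguity concrete, here is a sketch that models the imaginary extension above and applies a few candidate resolution policies. The policies and all names are hypothetical; none of this is existing XeTeX behaviour.

```python
# Transitions in the order they were specified (dicts keep insertion order).
transitions = {
    (1, 3): "foo",
    (1, 4): "bar",
    (2, 3): "xyzzy",
    (2, 4): "plugh",
}
char_classes = {"A": [1, 2], "B": [3, 4]}

def matches(left_char, right_char):
    """All defined transitions that apply at this boundary."""
    lefts, rights = char_classes[left_char], char_classes[right_char]
    return [
        (pair, toks)
        for pair, toks in transitions.items()
        if pair[0] in lefts and pair[1] in rights
    ]

def first_defined(hits):
    return hits[0][1]

def last_defined(hits):
    return hits[-1][1]

def lowest_classes(hits):
    # Smallest (left, right) pair by numerical value.
    return min(hits)[1]

def insert_all(hits):
    return " ".join(toks for _, toks in hits)

hits = matches("A", "B")
for policy in (first_defined, last_defined, lowest_classes, insert_all):
    print(policy.__name__, "->", policy(hits))
```

All four transitions match at the boundary in "AB", and each policy is defensible, so the extension would have to pick one and document it.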

(I'm not saying the idea is a bad one; I can imagine it might be quite useful. But I can also imagine it getting a bit hairy......)

JK


