[XeTeX] Contextual shaping

2013-11-27 Thread Simon Cozens

This is possibly a daft question, but...

In traditional TeX, character tokens are processed and put into boxes 
individually, with fairly primitive ligature tables. Obviously XeTeX doesn't 
do this, using Harfbuzz (or ICU or whatever) to do the shaping and layout.


My question is, if you're not showing individual characters to the shaping 
engine for it to consider, what defines how big a string of characters to 
shape at a time? Does XeTeX break at the word level and then shape a word, 
and if so what defines a word? (Chinese has no word breaks!) Or does it shape 
an entire paragraph of text at a time (!) and then box up the glyphs 
individually? Or...?


(I've tried starting at layoutChars in XeTeXLayoutInterface.cpp and working 
backwards but I can't understand where I end up: measure_native_node shapes a 
node, but what's a node?)



--
Subscriptions, Archive, and List information, etc.:
 http://tug.org/mailman/listinfo/xetex


Re: [XeTeX] Contextual shaping

2013-11-27 Thread Zdenek Wagner
2013/11/27 Simon Cozens si...@simon-cozens.org:
 This is possibly a daft question, but...

 In traditional TeX, character tokens are processed and put into boxes
 individually, with fairly primitive ligature tables. Obviously XeTeX doesn't
 do this, using Harfbuzz (or ICU or whatever) to do the shaping and layout.

 My question is, if you're not showing individual characters to the shaping
 engine for it to consider, what defines how big a string of characters to
 shape at a time? Does XeTeX break at the word level and then shape a word,
 and if so what defines a word? (Chinese has no word breaks!) Or does it
 shape an entire paragraph of text at a time (!) and then box up the glyphs
 individually? Or...?

I am not an expert, but shaping can be changed within a paragraph. The
following file produces the attached PDF:

\documentclass{article}
\usepackage{fontspec,polyglossia}
\setotherlanguages{hindi,sanskrit}
\newfontfamily\hindifont[Language=Hindi,Script=Devanagari,Mapping=velthuis]{FreeSerif}
\newfontfamily\sanskritfont[Language=Sanskrit,Script=Devanagari,Mapping=velthuis]{FreeSerif}
\def\shakti{sakti}
\begin{document}

Hindi: \texthindi{\shakti}, Sanskrit: \textsanskrit{\shakti}

\end{document}


 (I've tried starting at layoutChars in XeTeXLayoutInterface.cpp and working
 backwards but I can't understand where I end up: measure_native_node shapes
 a node, but what's a node?)





-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz


sh.pdf
Description: Adobe PDF document




Re: [XeTeX] Contextual shaping

2013-11-27 Thread Paul Isambert
Simon Cozens si...@simon-cozens.org:
 This is possibly a daft question, but...
 
 In traditional TeX, character tokens are processed and put into boxes
 individually, with fairly primitive ligature tables. Obviously XeTeX
 doesn't
 do this, using Harfbuzz (or ICU or whatever) to do the shaping and
 layout.
 
 My question is, if you're not showing individual characters to the
 shaping
 engine for it to consider, what defines how big a string of
 characters to
 shape at a time? Does XeTeX break at the word level and then shape
 a word,
 and if so what defines a word? (Chinese has no word breaks!) Or does
 it shape
 an entire paragraph of text at a time (!) and then box up the glyphs
 individually? Or...?
 
 (I've tried starting at layoutChars in XeTeXLayoutInterface.cpp and
 working
 backwards but I can't understand where I end up: measure_native_node
 shapes a
 node, but what's a node?)

I don't know how Harfbuzz and/or ICU work exactly, but:

- Characters are never put into individual boxes;
- Whatever shaping must take place is defined by sequences of characters: you
look at each character, see whether it must be processed (possibly as part
of a larger sequence), move on to the next character (unless it has already
been processed as part of a sequence), and so on. Most of the rules you must
follow to process glyphs are explained here:
http://www.microsoft.com/typography/otspec/
So your question (as I understand it) is really about processing OT fonts. The
sequences of characters I have mentioned (your string of characters) are
defined in the font itself (for complex sequences, see e.g. contextual
lookups).

As for a node, it is whatever TeX processes internally to build a page: it can
be a character, a kern, a whatsit, a box...

Best,
Paul




Re: [XeTeX] Contextual shaping

2013-11-27 Thread Khaled Hosny
On Wed, Nov 27, 2013 at 09:10:02PM +0900, Simon Cozens wrote:
 This is possibly a daft question, but...
 
 In traditional TeX, character tokens are processed and put into boxes
 individually, with fairly primitive ligature tables. Obviously XeTeX doesn't
 do this, using Harfbuzz (or ICU or whatever) to do the shaping and layout.
 
 My question is, if you're not showing individual characters to the shaping
 engine for it to consider, what defines how big a string of characters to
 shape at a time? Does XeTeX break at the word level and then shape a word,
 and if so what defines a word? (Chinese has no word breaks!) Or does it
 shape an entire paragraph of text at a time (!) and then box up the glyphs
 individually? Or...?

XeTeX shapes words one at a time. A word is basically any consecutive
sequence of character nodes (using the same font) after TeX has done its
macro expansion and is ready to typeset the material. The AAT code
additionally tries to merge word sequences separated by spaces into one
node.
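
As a rough illustration of that definition of a "word", here is a toy Python
model. It is not XeTeX's code: the function name shaping_runs and the
(character, font) input format are made up for this sketch, and it ignores
the AAT merging and everything else TeX does while building the list. It
only shows a run ending at a space or at a font change.

```python
def shaping_runs(chars):
    """Split (character, font) pairs into runs that would be shaped together.

    Toy model only: a run ends at whitespace (which TeX turns into glue)
    or when the font changes.
    """
    runs = []
    text, font = [], None
    for ch, ch_font in chars:
        if ch.isspace():
            # A space is glue, not a character node: close the current run.
            if text:
                runs.append(("".join(text), font))
                text = []
            continue
        if text and ch_font != font:
            # A font change also closes the current run.
            runs.append(("".join(text), font))
            text = []
        text.append(ch)
        font = ch_font
    if text:
        runs.append(("".join(text), font))
    return runs


# Hypothetical input: "office " in one font, "fi" in another, "sh" in the first.
sample = (
    [(c, "roman") for c in "office "]
    + [(c, "italic") for c in "fi"]
    + [(c, "roman") for c in "sh"]
)
print(shaping_runs(sample))
```

Each tuple in the result stands for one string that would be handed to the
shaper as a unit.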

 (I've tried starting at layoutChars in XeTeXLayoutInterface.cpp and working
 backwards but I can't understand where I end up: measure_native_node shapes
 a node, but what's a node?)

measure_native_node is called by the WEB code (where it is called
set_native_metrics). Check xetex.web for collect_native:, which is where the
bulk of the work is done. Check also @Merge sequences of words using AAT.

Regards,
Khaled




Re: [XeTeX] Contextual shaping

2013-11-27 Thread Jonathan Kew

On 27/11/13 12:46, Khaled Hosny wrote:

On Wed, Nov 27, 2013 at 09:10:02PM +0900, Simon Cozens wrote:

This is possibly a daft question, but...

In traditional TeX, character tokens are processed and put into boxes
individually, with fairly primitive ligature tables. Obviously XeTeX doesn't
do this, using Harfbuzz (or ICU or whatever) to do the shaping and layout.

My question is, if you're not showing individual characters to the shaping
engine for it to consider, what defines how big a string of characters to
shape at a time? Does XeTeX break at the word level and then shape a word,
and if so what defines a word? (Chinese has no word breaks!) Or does it
shape an entire paragraph of text at a time (!) and then box up the glyphs
individually? Or...?


XeTeX shapes words one at a time, a word is basically any consecutive
sequence of character nodes (using the same font) after TeX has done its
macro expansion and is ready to typeset the material. The AAT code,
additionally, tries to merge word sequences separated by spaces into one
node.



In particular, in case it's not sufficiently clear from the above, note 
that spaces, being glue nodes, are NOT part of such a consecutive 
sequence of character nodes. And therefore a known limitation of xetex 
is that OpenType lookups that try to match the space glyph will not 
work. Shaping happens only within a run of non-space characters in a 
given font.


Most fonts are not affected by this, but it is an issue for certain 
fonts that want to do complex multi-word ligatures, or contextual forms 
that depend on the adjacent space glyph.
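
A small Python sketch of why such rules cannot fire. Everything here is
hypothetical: the LIGATURES table, the glyph names and the helper functions
are invented for illustration and do not correspond to any real font or to
the HarfBuzz API. The point is only that a rule containing a space never sees
its input when the text is split at spaces before shaping.

```python
# Made-up substitution rules: one ordinary ligature, one that spans a space.
LIGATURES = {
    ("f", "i"): "f_i",
    ("a", " ", "b"): "a_space_b",
}


def substitute(glyphs):
    """Scan left to right, replacing the longest matching sequence."""
    rules = sorted(LIGATURES.items(), key=lambda kv: -len(kv[0]))
    out = []
    i = 0
    while i < len(glyphs):
        for seq, lig in rules:
            if tuple(glyphs[i:i + len(seq)]) == seq:
                out.append(lig)
                i += len(seq)
                break
        else:
            # No rule starts here: keep the glyph and move on.
            out.append(glyphs[i])
            i += 1
    return out


def shape_per_word(text):
    """Shape each space-separated word on its own (the behaviour described above)."""
    return [substitute(list(word)) for word in text.split(" ") if word]


def shape_whole_line(text):
    """Shape the line as one string, spaces included, for comparison."""
    return substitute(list(text))


print(shape_per_word("fa bi"))    # the space rule never gets a chance to match
print(shape_whole_line("fa bi"))  # here it does
print(shape_per_word("fi"))       # rules inside a word still apply
```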


JK





Re: [XeTeX] Contextual shaping

2013-11-27 Thread Simon Cozens

On 27/11/2013 21:46, Khaled Hosny wrote:

measure_native_node is called by the WEB code (called set_native_metrics
there), check xetex.web for collect_native:, that is where bulk of the
work is done. Check also @Merge sequences of words using AAT.


Aha, I see it now, I think! Reading the WEB documentation for native_word_node 
helped.


So a run of letter characters in the same font is assembled into a 
native_word_node by collect_native, and then shaped by set_native_metrics. 
That makes sense - thanks!


