[XeTeX] Contextual shaping
This is possibly a daft question, but...

In traditional TeX, character tokens are processed and put into boxes individually, with fairly primitive ligature tables. Obviously XeTeX doesn't do this, using Harfbuzz (or ICU or whatever) to do the shaping and layout.

My question is, if you're not showing individual characters to the shaping engine for it to consider, what defines how big a string of characters to shape at a time? Does XeTeX break at the word level and then shape a word, and if so what defines a word? (Chinese has no word breaks!) Or does it shape an entire paragraph of text at a time (!) and then box up the glyphs individually? Or...?

(I've tried starting at layoutChars in XeTeXLayoutInterface.cpp and working backwards but I can't understand where I end up: measure_native_node shapes a node, but what's a node?)

--
Subscriptions, Archive, and List information, etc.: http://tug.org/mailman/listinfo/xetex
Re: [XeTeX] Contextual shaping
2013/11/27 Simon Cozens si...@simon-cozens.org:
> My question is, if you're not showing individual characters to the shaping engine for it to consider, what defines how big a string of characters to shape at a time? Does XeTeX break at the word level and then shape a word, and if so what defines a word? (Chinese has no word breaks!) Or does it shape an entire paragraph of text at a time (!) and then box up the glyphs individually? Or...?

I am not an expert, but shaping can be changed within a paragraph. The following file produces the attached PDF:

\documentclass{article}
\usepackage{fontspec,polyglossia}
\setotherlanguages{hindi,sanskrit}
\newfontfamily\hindifont[Language=Hindi,Script=Devanagari,Mapping=velthuis]{FreeSerif}
\newfontfamily\sanskritfont[Language=Sanskrit,Script=Devanagari,Mapping=velthuis]{FreeSerif}
\def\shakti{sakti}
\begin{document}
Hindi: \texthindi{\shakti}, Sanskrit: \textsanskrit{\shakti}
\end{document}

--
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz

Attachment: sh.pdf (Adobe PDF document)
Re: [XeTeX] Contextual shaping
Simon Cozens si...@simon-cozens.org:
> My question is, if you're not showing individual characters to the shaping engine for it to consider, what defines how big a string of characters to shape at a time? [...]
> (I've tried starting at layoutChars in XeTeXLayoutInterface.cpp and working backwards but I can't understand where I end up: measure_native_node shapes a node, but what's a node?)

I don't know exactly how Harfbuzz and/or ICU work, but:

- Characters are never put into individual boxes;
- Whatever shaping must take place is defined by sequences of characters: you look at each character, see whether it must be processed (possibly as part of a larger sequence), move to the next character (unless it has already been processed as part of a sequence), and so on.

Most of the rules you must follow to process glyphs are explained here: http://www.microsoft.com/typography/otspec/

So your question (as I understand it) is really about processing OT fonts. The sequences of characters I have mentioned (your strings of characters) are defined in the font itself (for complex sequences, see e.g. contextual lookups).

As for a node, it is whatever TeX processes internally to build a page: it can be a character, a kern, a whatsit, a box...

Best,
Paul
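[Editorial sketch] The character-by-character scan Paul describes can be illustrated with a toy ligature substitution. This is only a minimal sketch under stated assumptions: the table `LIGATURES`, the glyph names, and the function `toy_substitute` are all hypothetical, and a real OpenType engine works on glyph IDs with lookups read from the font, not on a Python dictionary.

```python
# Hypothetical ligature table: character sequence -> glyph name.
# In a real font these sequences come from the font's own lookups.
LIGATURES = {"ffi": "f_f_i", "fi": "f_i", "ff": "f_f"}


def toy_substitute(text, ligatures=LIGATURES):
    """Scan left to right; at each position try the longest sequence first.

    If a sequence matches, emit its glyph and skip every character it
    consumed; otherwise emit the character unchanged and move on by one.
    """
    by_length = sorted(ligatures, key=len, reverse=True)
    glyphs = []
    i = 0
    while i < len(text):
        for seq in by_length:
            if text.startswith(seq, i):
                glyphs.append(ligatures[seq])
                i += len(seq)  # these characters are already processed
                break
        else:
            glyphs.append(text[i])
            i += 1
    return glyphs


print(toy_substitute("office"))  # ['o', 'f_f_i', 'c', 'e']
```

Trying the longest sequence first is a design choice of this sketch, so that "ffi" wins over "ff" followed by "i"; in a real font the outcome depends on how the lookups are ordered.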
Re: [XeTeX] Contextual shaping
On Wed, Nov 27, 2013 at 09:10:02PM +0900, Simon Cozens wrote:
> My question is, if you're not showing individual characters to the shaping engine for it to consider, what defines how big a string of characters to shape at a time? Does XeTeX break at the word level and then shape a word, and if so what defines a word? (Chinese has no word breaks!) Or does it shape an entire paragraph of text at a time (!) and then box up the glyphs individually? Or...?

XeTeX shapes words one at a time. A word is basically any consecutive sequence of character nodes (using the same font) after TeX has done its macro expansion and is ready to typeset the material. The AAT code, additionally, tries to merge word sequences separated by spaces into one node.

> (I've tried starting at layoutChars in XeTeXLayoutInterface.cpp and working backwards but I can't understand where I end up: measure_native_node shapes a node, but what's a node?)

measure_native_node is called by the WEB code (where it is called set_native_metrics). Check xetex.web for collect_native:, which is where the bulk of the work is done. Check also @Merge sequences of words using AAT.

Regards,
Khaled
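[Editorial sketch] The definition of a "word" given above (a consecutive run of character nodes in the same font) can be sketched as follows. This is not the xetex.web code: the classes `Char`, `Glue`, and `NativeWord` and the functions `to_nodes` and `collect_words` are hypothetical stand-ins, and things such as hyphenation and the AAT merging of words are left out.

```python
from dataclasses import dataclass


@dataclass
class Char:          # hypothetical stand-in for a character node
    char: str
    font: str


@dataclass
class Glue:          # hypothetical stand-in for interword glue
    pass


@dataclass
class NativeWord:    # a run that would be handed to the shaper as one unit
    text: str
    font: str


def to_nodes(text, font):
    """Turn a string into nodes; each space becomes glue, not a character."""
    return [Glue() if c == " " else Char(c, font) for c in text]


def collect_words(nodes):
    """Merge adjacent Char nodes that share a font into NativeWord runs."""
    out = []
    for node in nodes:
        if isinstance(node, Char):
            last = out[-1] if out else None
            if isinstance(last, NativeWord) and last.font == node.font:
                last.text += node.char      # extend the current run
            else:
                out.append(NativeWord(node.char, node.font))
        else:
            out.append(node)                # any other node ends the run
    return out


nodes = to_nodes("fi re", "A") + [Char("x", "B")]
words = collect_words(nodes)
print([w.text for w in words if isinstance(w, NativeWord)])  # ['fi', 're', 'x']
```

In this sketch both the glue and the font change from "A" to "B" end a run, so the shaper would be called three times, once per run.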
Re: [XeTeX] Contextual shaping
On 27/11/13 12:46, Khaled Hosny wrote:
> XeTeX shapes words one at a time, a word is basically any consecutive sequence of character nodes (using the same font) after TeX has done its macro expansion and is ready to typeset the material. The AAT code, additionally, tries to merge word sequences separated by spaces into one node.

In particular, in case it's not sufficiently clear from the above, note that spaces, being glue nodes, are NOT part of such a consecutive sequence of character nodes. And therefore a known limitation of xetex is that OpenType lookups that try to match the space glyph will not work. Shaping happens only within a run of non-space characters in a given font.

Most fonts are not affected by this, but it is an issue for certain fonts that want to do complex multi-word ligatures, or contextual forms that depend on the adjacent space glyph.

JK
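[Editorial sketch] The limitation described above can be shown with a toy contextual rule that depends on a following space. Everything here is an assumption made for illustration: the rule "s before a space becomes `s.fina`", the glyph name, and the function `toy_contextual` are hypothetical and do not come from any real font or from XeTeX itself.

```python
def toy_contextual(text):
    """Toy contextual form: 's' directly followed by a space -> 's.fina'."""
    glyphs = []
    for i, ch in enumerate(text):
        following = text[i + 1] if i + 1 < len(text) else None
        if ch == "s" and following == " ":
            glyphs.append("s.fina")
        else:
            glyphs.append(ch)
    return glyphs


text = "as is"

# If the shaper saw the whole string, the space would be visible to the rule.
whole = toy_contextual(text)

# Shaping run by run: spaces are glue between runs, so the rule never
# sees a space and never fires.
per_run = [toy_contextual(run) for run in text.split(" ")]

print(whole)    # ['a', 's.fina', ' ', 'i', 's']
print(per_run)  # [['a', 's'], ['i', 's']]
```

The same reasoning applies to a ligature that spans two words: each run is shaped without ever containing the space that the lookup is trying to match.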
Re: [XeTeX] Contextual shaping
On 27/11/2013 21:46, Khaled Hosny wrote:
> measure_native_node is called by the WEB code (called set_native_metrics there), check xetex.web for collect_native:, that is where bulk of the work is done. Check also @Merge sequences of words using AAT.

Aha, I see it now, I think! Reading the WEB documentation for native_word_node helped. So a run of letter characters in the same font is assembled into a native_word_node by collect_native, and then shaped by set_native_metrics.

That makes sense - thanks!