Thanks Behdad, the info on how it works in Pango is indeed super useful.
An attempt to recap, using my original Japanese example:

ユニコードは、すべての文字に固有の番号を付与します

ICU's "scrptrun" detects Katakana, Hiragana and Han scripts.

Case 1: no "input list of languages" is provided.

a) For Katakana and Hiragana items, "ja" will be selected, with the help
of http://goo.gl/mpD9Fg
In turn, MTLmr3m.ttf (the default for "ja" on my system) will be used.

b) For Han items, no language will be selected, because of
http://goo.gl/xusqwn
At this stage we still need to pick a font, so I guess we choose
DroidSansFallback.ttf (the default for Han on my system), unless some
additional strategy is used, like observing the surrounding items?

Case 2: we use "ja" (say, collected from the locale) as the "input
language".

For all the items, "ja" will be selected, because the three scripts are
all valid for writing this language, as defined in http://goo.gl/hwQri5

By the way, I wonder why Korean does not include Han (see
http://goo.gl/bI5BLj), in contradiction to the explanations in
http://goo.gl/xusqwn?

On Mon, Dec 23, 2013 at 1:35 AM, Behdad Esfahbod <beh...@behdad.org> wrote:

> On 13-12-22 06:17 PM, Ariel Malka wrote:
> >> As it happens, those three scripts are all considered "simple", so
> >> the shaping logic in HarfBuzz is the same for all three.
> >
> > Good to know. For the record, there's a function for checking if a
> > script is complex in the recent HarfBuzz-flavored Android OS:
> > http://goo.gl/KL1KUi
>
> Please NEVER use something like that. It's broken by design. It exists
> in Android for legacy reasons, and will eventually be removed.
>
> >> Where it does make a difference is if the font has ligatures,
> >> kerning, etc. for those. OpenType organizes those features by
> >> script, and if you request the wrong script you will miss out on
> >> the features.
> >
> > Makes sense to me for Hebrew, Arabic, Thai, etc., but I was a bit
> > surprised to find out that LATN was also a complex script.
>
> LATN uses the "generic" shaper, so it's not complex, no.
> > So for instance, if I would shape some text containing Hebrew and
> > English solely using the HEBR script, I would probably lose kerning
> > and ffi-like ligatures for the English part
>
> Correct.
>
> > (this is what I'm actually doing currently in my "simple" BIDI
> > implementation...)
>
> Then fix it. BIDI and script itemization are two separate issues.
>
> >> How you do font selection and what script you pass to HarfBuzz are
> >> two completely separate issues. Font fallback stack should be
> >> per-language.
> >
> > I understand that the best scenario will always be to take decisions
> > based on "language" rather than solely on "script", but it creates a
> > problem:
> >
> > Say you work on an API for Unicode text rendering: you can't promise
> > your users a solution where they would use arbitrary text without
> > providing language context per span.
>
> These are very good questions. And we have answers to all.
> Unfortunately there's no single location with all this information.
> I'm working on documenting them, but it looks like replying to you and
> letting you document it is better.
>
> What Pango does is: it takes an input list of languages (through
> $LANGUAGE for example), and whenever there's an item of text with
> script X, it assigns a language to the item in this manner:
>
> - If a language L is set on the item (through xml:lang, or whatever
> else the user can use to set a language), and script X may be used to
> write language L, then resolve to language L and return,
>
> - For each language L in the list of default languages $LANGUAGE, if
> script X may be used to write language L, then resolve to language L
> and return,
>
> - If there's a predominant language L that is likely for script X,
> resolve to language L and return,
>
> - Assign no language.
>
> This algorithm needs two tables of data:
>
> - List of scripts a language tag may possibly use. This is, for
> example, available in pango-script-lang-table.h.
> It's generated from fontconfig orth files using
> pango/tools/gen-script-for-lang.c. Feel free to copy it.
>
> - List of the most likely language for each script. This is available
> in CLDR:
>
> http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/likely_subtags.html
>
> Pango has its own manually compiled list in pango-language.c
>
> Again, all these are on my plate for the next library I'm going to
> design. It will take a while though...
>
> behdad
>
> > Or, to come back to the origin of the message: solutions like ICU's
> > "scrptrun" which do script detection are not appropriate (because
> > they won't help you find the right font, due to the lack of language
> > context...)
> >
> > I guess the problem is even more generic, like with utf8-encoded
> > html pages rendered in modern browsers, as demonstrated by the
> > creator of liblinebreak: http://wyw.dcweb.cn/lang_utf8.htm
> >
> > On Sun, Dec 22, 2013 at 10:47 PM, Behdad Esfahbod
> > <beh...@behdad.org> wrote:
> >
> > On 13-12-22 10:10 AM, Ariel Malka wrote:
> > > I'm trying to render "regular" (i.e. modern, horizontal) Japanese
> > > with HarfBuzz.
> > >
> > > So far, I have been using HB_SCRIPT_KATAKANA and it looks similar
> > > to what is rendered via browsers.
> > >
> > > But after examining other rendering solutions I can see that
> > > "automatic script detection" can often take place.
> > >
> > > For instance, the Mapnik project is using ICU's "scrptrun", which,
> > > given the following sentence:
> > >
> > > ユニコードは、すべての文字に固有の番号を付与します
> > >
> > > would detect a mix of Katakana, Hiragana and Han scripts.
> > >
> > > But for instance, it would not change anything if I'd render the
> > > sentence by mixing the 3 different scripts (i.e. instead of using
> > > only HB_SCRIPT_KATAKANA.)
> > >
> > > Or are there situations where it would make a difference?
> >
> > As it happens, those three scripts are all considered "simple", so
> > the shaping logic in HarfBuzz is the same for all three. Where it
> > does make a difference is if the font has ligatures, kerning, etc.
> > for those. OpenType organizes those features by script, and if you
> > request the wrong script you will miss out on the features.
> >
> > > I'm asking that because I suspect a catch-22 situation here. For
> > > example, the word "diameter" in Japanese is 直径 which, given to
> > > "scrptrun", would be detected as Han script.
> > >
> > > As far as I understand, it could be a problem on systems where
> > > DroidSansFallback.ttf is used, because the word would look like
> > > Simplified Chinese.
> > >
> > > Now, if we were using MTLmr3m.ttf, which is preferred for
> > > Japanese, the word would have been rendered as intended.
> >
> > How you do font selection and what script you pass to HarfBuzz are
> > two completely separate issues. Font fallback stack should be
> > per-language.
> >
> > > Reference: https://code.google.com/p/chromium/issues/detail?id=183830
> > >
> > > Any feedback would be appreciated. Note that the wisdom
> > > accumulated here will be translated into tangible info and code
> > > samples (see https://github.com/arielm/Unicode)
> > >
> > > Thanks!
> > > Ariel
> > >
> > > _______________________________________________
> > > HarfBuzz mailing list
> > > HarfBuzz@lists.freedesktop.org
> > > http://lists.freedesktop.org/mailman/listinfo/harfbuzz
> >
> > --
> > behdad
> > http://behdad.org/
>
> --
> behdad
> http://behdad.org/
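P.S. To make sure I understood the resolution steps Behdad describes, here is a rough Python sketch of them. The two tables below are tiny made-up stand-ins for pango-script-lang-table.h and CLDR's likely-subtags data, not the real contents:

```python
# Which scripts may be used to write a given language
# (stand-in for pango-script-lang-table.h).
SCRIPTS_FOR_LANG = {
    "ja": {"Katakana", "Hiragana", "Han"},
    "ko": {"Hangul"},
    "en": {"Latin"},
    "zh": {"Han"},
}

# Most likely language per script (stand-in for CLDR likely subtags).
LIKELY_LANG_FOR_SCRIPT = {
    "Katakana": "ja",
    "Hiragana": "ja",
    "Hangul": "ko",
    "Latin": "en",
    # "Han" intentionally absent: no single predominant language.
}

def resolve_language(script, item_lang=None, default_langs=()):
    """Assign a language to an item of text with the given script."""
    # 1. A language set on the item (e.g. via xml:lang) wins,
    #    if the script can actually be used to write it.
    if item_lang and script in SCRIPTS_FOR_LANG.get(item_lang, set()):
        return item_lang
    # 2. Otherwise, walk the default language list (e.g. from $LANGUAGE).
    for lang in default_langs:
        if script in SCRIPTS_FOR_LANG.get(lang, set()):
            return lang
    # 3. Otherwise, fall back to the predominant language for the script.
    if script in LIKELY_LANG_FOR_SCRIPT:
        return LIKELY_LANG_FOR_SCRIPT[script]
    # 4. Give up: assign no language.
    return None

# Case 1 from the recap: no input list of languages.
print(resolve_language("Katakana"))                   # -> ja
print(resolve_language("Han"))                        # -> None
# Case 2: "ja" collected from the locale.
print(resolve_language("Han", default_langs=["ja"]))  # -> ja
```

This reproduces both cases from my recap: with no language list, Han items end up with no language (and we fall back to DroidSansFallback.ttf), while with "ja" in the list all three scripts resolve to "ja".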
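And a toy itemizer in the spirit of ICU's "scrptrun", just to illustrate the run splitting on my example sentence. It cheats by guessing the script from Unicode character names (a real implementation would use proper Script property data), and folds "common" characters such as 、 and ー into the preceding run:

```python
import unicodedata

def toy_script(ch):
    """Very rough script guess based on the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("KATAKANA-HIRAGANA"):  # e.g. prolonged sound mark
        return "Common"
    if name.startswith("KATAKANA"):
        return "Katakana"
    if name.startswith("HIRAGANA"):
        return "Hiragana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han"
    return "Common"  # punctuation etc.

def itemize(text):
    """Split text into (run, script) pairs, merging Common characters
    into the preceding run, the way real itemizers resolve them."""
    runs = []
    for ch in text:
        script = toy_script(ch)
        if runs and (script == "Common" or script == runs[-1][1]):
            runs[-1][0] += ch
        else:
            runs.append([ch, script])
    return [(run, script) for run, script in runs]

sentence = "ユニコードは、すべての文字に固有の番号を付与します"
for run, script in itemize(sentence):
    print(script, run)
```

On the example sentence this yields a Katakana run (ユニコード), Hiragana runs, and Han runs (文字, 固有, 番号, 付与), i.e. the same three-script mix scrptrun reports.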
_______________________________________________
HarfBuzz mailing list
HarfBuzz@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/harfbuzz