Re: [HarfBuzz] Ligatures

2020-05-22 Thread Khaled Hosny



> On May 22, 2020, at 9:32 PM, Eli Zaretskii wrote:
> 
> Hi,
> 
> This is a bit off-topic, but I thought it could be appropriate to ask
> here, since we have here some of the best experts on this subject.
> 
> We are discussing support for ligatures in Emacs, specifically when
> using HarfBuzz as the shaping engine.  See the discussion from
> 
>  https://lists.gnu.org/archive/html/emacs-devel/2020-05/msg02493.html
> 
> The current support for producing ligatures works in the same way as
> complex text shaping for scripts that require that, like Arabic and
> Khmer: the sequences of characters that can be displayed as ligatures
> are identified in advance with suitable regular expressions, and the
> display engine then passes these sequences to hb_shape to produce the
> ligatures.
> 
> This works well for scripts that require complex shaping, because such
> scripts generally have well-defined rules for the sequences of
> codepoints that need shaping.  My original thoughts were that
> ligatures could be supported in the same way, based on the assumption
> that the list of possible ligatures is finite and can be stored in a
> suitable data structure in advance.

I might be stating the obvious, but what Emacs is doing reflects a very
outdated view of text layout. The schism between so-called complex text and
simple text does not actually exist. There are script-specific shaping rules
that layout engines know and apply, and there are additional, complementary
rules provided by the font that layout engines also apply.

As far as applications are concerned, they have text with certain properties
and fonts; they hand them to the layout engine and get back positioned glyphs.
Any attempt to second-guess the layout engine and classify the text into parts
that do or do not need shaping is futile.

Fonts can, and do, provide any number of arbitrary glyph interactions (not just 
ligatures), and the only reliable way to know that is to shape and check the 
output.
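
A minimal sketch of the "shape and check the output" idea, in Python. The
TOY_LIGATURES table and the toy_shape function are hypothetical stand-ins for a
font's substitution rules and for a shaping engine; they are not HarfBuzz API.
The only point illustrated is that the caller learns about a ligature from the
shaper's output (one glyph covering several characters), not from a list known
in advance.

```python
# Hypothetical per-font ligature table: character sequence -> glyph name.
# A real font encodes such rules in its own tables; this dict is only a toy.
TOY_LIGATURES = {"ffi": "f_f_i", "fi": "f_i", "Th": "T_h", "==>": "arrow_double"}


def toy_shape(text, ligatures=TOY_LIGATURES):
    """Return a list of (glyph_name, cluster_start, char_count) tuples.

    Greedy longest-match substitution, standing in for a real shaper.
    """
    longest = max((len(k) for k in ligatures), default=1)
    glyphs = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to two characters.
        for n in range(min(longest, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in ligatures:
                glyphs.append((ligatures[chunk], i, n))
                i += n
                break
        else:
            # No rule matched: one glyph per character.
            glyphs.append((text[i], i, 1))
            i += 1
    return glyphs


def ligated_clusters(glyphs):
    """Clusters where a single glyph covers more than one character."""
    return [(start, count) for _, start, count in glyphs if count > 1]


shaped = toy_shape("The office ==> x")
print(ligated_clusters(shaped))  # [(0, 2), (5, 3), (11, 3)]
```

Swapping in a different table (that is, a different font) changes the result
without any change to the caller, which is why the caller cannot enumerate the
interesting sequences up front.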

I think I have said this before, but Emacs should indiscriminately give all
the text to HarfBuzz (or any other text layout engine it additionally supports)
and give up on trying to pre-classify text, which is what pretty much any other
sensible application is doing already. There are many ways to solve potential
performance issues that do not involve compromising on the text layout.

> However, I'm being told that this assumption is false, and that each
> font defines ligatures from any number of arbitrary combinations of
> characters, and therefore the exhaustive list of the ligatures is in
> practice infinite and cannot be provided in advance.

That is true.

> The only way of
> doing this right, I'm told, is to either (a) query the font to get the
> list of all the ligatures it supports, or (b) assume any combination
> of characters can produce a ligature, and therefore we need to pass
> all the characters intended for display through hb_shape.  The latter
> in particular is in stark contrast to how the current Emacs display
> code is designed and implemented.

(a) is not realistically possible as doing it properly has pretty much the same 
cost as shaping the text. So your only reliable option is (b).

Regards,
Khaled
___
HarfBuzz mailing list
HarfBuzz@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/harfbuzz


Re: [HarfBuzz] Ligatures

2020-05-22 Thread Eli Zaretskii
> Date: Fri, 22 May 2020 22:22:49 +0100
> From: Richard Wordingham 
> 
> > The current support for producing ligatures works in the same way as
> > complex text shaping for scripts that require that, like Arabic and
> > Khmer: the sequences of characters that can be displayed as ligatures
> > are identified in advance with suitable regular expressions, and the
> > display engine then passes these sequences to hb_shape to produce the
> > ligatures.
> > 
> > This works well for scripts that require complex shaping, because such
> > scripts generally have well-defined rules for the sequences of
> > codepoints that need shaping.
> 
> They may of course have more than one set of such rules, with the rule
> sets defining different sets of sequences.

Who are "they" in this context?

> > However, I'm being told that this assumption is false, and that each
> > font defines ligatures from any number of arbitrary combinations of
> > characters, and therefore the exhaustive list of the ligatures is in
> > practice infinite and cannot be provided in advance.
> 
> This arbitrariness is true.  Over the set of all credible fonts for a
> given character repertoire, the number of ligating combinations is
> unbounded.

I understand that the number of combinations is theoretically
unbounded.  I'm asking if it is also unbounded in practice.  That is,
do font designers add ligatures for arbitrary combinations of
characters, regardless of some reasonable set of requirements?  For
example, is the set of ligatures of Latin characters shown here:

  https://en.wikipedia.org/wiki/Orthographic_ligature#Latin_alphabet

reasonably complete, or should I expect any number of other arbitrary
combinations of Latin characters popping up in fonts?  And if the
latter, then what is the purpose of providing such arbitrary
ligatures?

> > To be specific, I'm talking about 2 kinds of ligatures:
> > 
> >   . ligatures made of Latin characters, like "ffi" and "Th"
> >   . ligatures produced from symbols, like "==>" that is
> >     converted into ⟹

Yes, these are the only cases that I'm asking here about.  I'm not
asking about shaping complex scripts such as Arabic, where this
problem doesn't exist AFAIK.

> Have you addressed the cursive scripts yet, such as Arabic?  At its
> simplest, most consonants have four shapes, initial, medial, final and
> isolated, and roughly speaking the shape used depends on the adjacent
> spacing characters.  For the most part, Emacs would have to pass whole
> words into HarfBuzz for shaping.  In some of the more advanced fonts,
> the vowel marks in a word may also affect the shape of the consonant
> skeleton.  And of course, sometimes the Arabic script prefers to join
> letters vertically, as well as having a few straightforward ligatures.

I'm not talking about Arabic.  Emacs has a set of regular expressions
for sequences of Arabic characters that need shaping; see misc-lang.el
in Emacs.  If the set is incomplete, we can augment it.

> A cursive Latin script font may behave in the same way, with the shape
> of letters depending on what precedes and follows them.  With a small
> enough character repertoire, there might be no ligatures, but your
> rendering logic would fail miserably.

If a font requires special shaping for any sequence of any number of
26 (or maybe 52) ASCII letters, then the Emacs display engine will
need to be redesigned.  So this extreme possibility doesn't bother me.

> How would you handle the possibility that all three of <æ>,  and
>  might be rendered by the same glyph, although they are
> comprised of 1, 2 and 3 characters respectively?

By using a composition rule that matches both  and .
The rules are regexp-based, and expressing the above as a regexp is
simple.  Once a sequence of characters matches the regexp, Emacs calls
the shaper (hb_shape etc.) to produce the font glyphs for the
sequence, and displays the glyphs that the shaper returns.
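
The flow described above can be sketched in Python as follows. This is an
illustration under stated assumptions, not Emacs code: COMPOSITION_RULE is a
made-up pattern, and fake_shaper merely stands in for the call to hb_shape.

```python
import re

# Hypothetical composition rule: sequences assumed worth sending to the shaper.
COMPOSITION_RULE = re.compile(r"ffi|ffl|ff|fi|fl|Th|==>|=>|->")


def split_for_shaping(text, rule=COMPOSITION_RULE):
    """Split text into (segment, needs_shaping) pairs, in order."""
    pieces = []
    pos = 0
    for m in rule.finditer(text):
        if m.start() > pos:
            pieces.append((text[pos:m.start()], False))
        pieces.append((m.group(), True))
        pos = m.end()
    if pos < len(text):
        pieces.append((text[pos:], False))
    return pieces


def display(text, shaper):
    """Send only rule-matching segments to `shaper`; pass the rest through."""
    out = []
    for segment, needs_shaping in split_for_shaping(text):
        out.extend(shaper(segment) if needs_shaping else list(segment))
    return out


def fake_shaper(segment):
    # Stand-in for the real shaper: pretend each matched sequence
    # comes back as a single glyph.
    return ["[" + segment + "]"]


print(display("office ==> Thx", fake_shaper))
# ['o', '[ffi]', 'c', 'e', ' ', '[==>]', ' ', '[Th]', 'x']
```

Any sequence the rule does not list never reaches the shaper in this scheme,
which is exactly the limitation being discussed in this thread.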

> And if Emacs is not imposing a normalisation, then all the
> precomposed characters in Unicode might have been entered as one or
> as more than one character?

If you are talking about composition with combining characters, Emacs
already has the rules to compose them as described above.  You can try
this in your Emacs: insert a, then U+0301 COMBINING ACUTE ACCENT, and
you should see them composed into a single glyph (provided that you
use a suitable font).

But I'm not asking about character composition in general, I'm asking
specifically about ligatures of ASCII characters, without any
non-ASCII codepoints or combining accents.


Re: [HarfBuzz] Ligatures

2020-05-22 Thread Richard Wordingham
On Fri, 22 May 2020 22:32:04 +0300
Eli Zaretskii wrote:

> Can someone please tell what are the recommended practices regarding
> these ligatures?  Is the set of possible ligatures indeed infinite and
> impossible to know in advance?  And does HarfBuzz have APIs to query a
> font about the ligatures it supports?

hb_ot_layout_get_ligature_carets() is liable to be garbage in, garbage
out.  While caret positions were included in OTL fonts to assist
cursor placement, the scheme obviously fails when the components are
stacked vertically.  Microsoft gave up on it and, if I remember the
informal statement correctly, just divides the ligature up evenly
between the characters or grapheme clusters.  Many OpenType fonts don't
populate the relevant section of the GDEF table.  And, of course, one
has real trouble when one glyph can come from different numbers of
components.
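
The even-division fallback mentioned above can be sketched as follows. The
function name even_carets and the numbers in the example are hypothetical;
this shows the arithmetic only and is not how any particular implementation
is known to be written.

```python
def even_carets(x_origin, advance, n_components):
    """Caret x-positions inside a ligature glyph, dividing its advance evenly.

    Returns n_components + 1 positions: before the first component, between
    each pair of components, and after the last one.
    """
    if n_components < 1:
        raise ValueError("a glyph must cover at least one component")
    step = advance / n_components
    return [x_origin + k * step for k in range(n_components + 1)]


# Hypothetical "ffi" ligature: 3 characters, advance of 900 font units,
# placed at x = 100.
print(even_carets(100, 900, 3))  # [100.0, 400.0, 700.0, 1000.0]
```

The sketch takes the component count as an input, so it inherits the problem
noted above: if the same glyph can come from different numbers of characters,
the caller has to know which case applies.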

LibreOffice takes (or took) a different approach, and uses the width of
the characters logically before the insertion point.  It's rather
disconcerting when the cursor jumps backwards as one steps through the
string.  It could happen with the Latin script string "a͡i", for the
'double' inverted breve should shorten when the second letter is 'i'.
One can get the effect in Indic scripts because of spacing viramas.

Richard.


Re: [HarfBuzz] Ligatures

2020-05-22 Thread Richard Wordingham
On Fri, 22 May 2020 22:32:04 +0300
Eli Zaretskii wrote:

> Hi,
> 
> This is a bit off-topic, but I thought it could be appropriate to ask
> here, since we have here some of the best experts on this subject.
> 
> We are discussing support for ligatures in Emacs, specifically when
> using HarfBuzz as the shaping engine.  See the discussion from
> 
>   https://lists.gnu.org/archive/html/emacs-devel/2020-05/msg02493.html
> 
> The current support for producing ligatures works in the same way as
> complex text shaping for scripts that require that, like Arabic and
> Khmer: the sequences of characters that can be displayed as ligatures
> are identified in advance with suitable regular expressions, and the
> display engine then passes these sequences to hb_shape to produce the
> ligatures.
> 
> This works well for scripts that require complex shaping, because such
> scripts generally have well-defined rules for the sequences of
> codepoints that need shaping.

They may of course have more than one set of such rules, with the rule
sets defining different sets of sequences.

> My original thoughts were that
> ligatures could be supported in the same way, based on the assumption
> that the list of possible ligatures is finite and can be stored in a
> suitable data structure in advance.

At one level, this is true for any individual font, for it cannot have
more than 65,536 glyphs.

> However, I'm being told that this assumption is false, and that each
> font defines ligatures from any number of arbitrary combinations of
> characters, and therefore the exhaustive list of the ligatures is in
> practice infinite and cannot be provided in advance.

This arbitrariness is true.  Over the set of all credible fonts for a
given character repertoire, the number of ligating combinations is
unbounded.

> The only way of
> doing this right, I'm told, is to either (a) query the font to get the
> list of all the ligatures it supports, or (b) assume any combination
> of characters can produce a ligature, and therefore we need to pass
> all the characters intended for display through hb_shape.  The latter
> in particular is in stark contrast to how the current Emacs display
> code is designed and implemented.

> To be specific, I'm talking about 2 kinds of ligatures:
> 
>   . ligatures made of Latin characters, like "ffi" and "Th"
>   . ligatures produced from symbols, like "==>" that is
>     converted into ⟹
> 
> Can someone please tell what are the recommended practices regarding
> these ligatures?  Is the set of possible ligatures indeed infinite and
> impossible to know in advance?  And does HarfBuzz have APIs to query a
> font about the ligatures it supports?

Have you addressed the cursive scripts yet, such as Arabic?  At its
simplest, most consonants have four shapes, initial, medial, final and
isolated, and roughly speaking the shape used depends on the adjacent
spacing characters.  For the most part, Emacs would have to pass whole
words into HarfBuzz for shaping.  In some of the more advanced fonts,
the vowel marks in a word may also affect the shape of the consonant
skeleton.  And of course, sometimes the Arabic script prefers to join
letters vertically, as well as having a few straightforward ligatures.
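
A rough Python sketch of the simplest case described above, where the form of
each letter depends on its neighbours. It assumes the input is a single word
made only of Arabic letters (no vowel marks, spaces or punctuation), and the
RIGHT_JOINING set is deliberately incomplete. Real shaping involves far more
than this, which is the reason for passing whole words to the shaper.

```python
# Letters that join only to the preceding letter, so the letter after them
# starts a new connected group.  Deliberately incomplete:
# alef (plain and with hamza/madda), dal, thal, reh, zain, waw,
# waw with hamza, teh marbuta.
RIGHT_JOINING = set(
    "\u0627\u0623\u0625\u0622\u062f\u0630\u0631\u0632\u0648\u0624\u0629"
)


def positional_forms(word):
    """Label each letter of an unvowelled Arabic word with its form."""
    forms = []
    for i, ch in enumerate(word):
        # Connected to the previous letter if there is one and it joins forward.
        joins_prev = i > 0 and word[i - 1] not in RIGHT_JOINING
        # Connected to the next letter if there is one and this letter
        # joins forward.
        joins_next = i + 1 < len(word) and ch not in RIGHT_JOINING
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# kaf, teh, alef, beh
word = "\u0643\u062a\u0627\u0628"
print(positional_forms(word))  # ['initial', 'medial', 'final', 'isolated']
```

Because the form of every letter depends on letters on both sides, shaping a
fragment of the word in isolation gives different results from shaping the
word as a whole.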

A cursive Latin script font may behave in the same way, with the shape
of letters depending on what precedes and follows them.  With a small
enough character repertoire, there might be no ligatures, but your
rendering logic would fail miserably.

How would you handle the possibility that all three of <æ>,  and
 might be rendered by the same glyph, although they are
comprised of 1, 2 and 3 characters respectively?  And if Emacs is not
imposing a normalisation, then all the precomposed characters in
Unicode might have been entered as one or as more than one character? 

Richard.


[HarfBuzz] Ligatures

2020-05-22 Thread Eli Zaretskii
Hi,

This is a bit off-topic, but I thought it could be appropriate to ask
here, since we have here some of the best experts on this subject.

We are discussing support for ligatures in Emacs, specifically when
using HarfBuzz as the shaping engine.  See the discussion from

  https://lists.gnu.org/archive/html/emacs-devel/2020-05/msg02493.html

The current support for producing ligatures works in the same way as
complex text shaping for scripts that require that, like Arabic and
Khmer: the sequences of characters that can be displayed as ligatures
are identified in advance with suitable regular expressions, and the
display engine then passes these sequences to hb_shape to produce the
ligatures.

This works well for scripts that require complex shaping, because such
scripts generally have well-defined rules for the sequences of
codepoints that need shaping.  My original thoughts were that
ligatures could be supported in the same way, based on the assumption
that the list of possible ligatures is finite and can be stored in a
suitable data structure in advance.

However, I'm being told that this assumption is false, and that each
font defines ligatures from any number of arbitrary combinations of
characters, and therefore the exhaustive list of the ligatures is in
practice infinite and cannot be provided in advance.  The only way of
doing this right, I'm told, is to either (a) query the font to get the
list of all the ligatures it supports, or (b) assume any combination
of characters can produce a ligature, and therefore we need to pass
all the characters intended for display through hb_shape.  The latter
in particular is in stark contrast to how the current Emacs display
code is designed and implemented.

To be specific, I'm talking about 2 kinds of ligatures:

  . ligatures made of Latin characters, like "ffi" and "Th"
  . ligatures produced from symbols, like "==>" that is
    converted into ⟹

Can someone please tell what are the recommended practices regarding
these ligatures?  Is the set of possible ligatures indeed infinite and
impossible to know in advance?  And does HarfBuzz have APIs to query a
font about the ligatures it supports?

Thanks in advance for any help.