"Asmus Freytag"  wrote:
> The Fraktur problem is one where one typestyle requires additional 
> information (e.g. when to select long s) that is not required for 
> rendering the same text in another typestyle. If it is indeed desirable 
> (and possible) to create a correctly encoded string that can be rendered 
> without further change automatically in both typestyles, then adding any 
> necessary variation sequences to ensure that ability might be useful. 
> However, that needs to be addressed in the context of a precise 
> specification of how to encode texts so that they are dual renderable. 
> Only addressing some isolated variation sequences makes no sense.

I don't think so.

If a text initially used a round s, nothing prohibits rendering it in Fraktur style, but even in that case, converting it to a "long s" would be inappropriate. So the Fraktur "round s" should be used directly.

A text in Fraktur absolutely requires the "long s" only when the original text was already using this "long s". In that case, encode the "long s": the text will render with a "long s" both in "modern" Latin font styles like Bodoni (with a possible fallback to the modern "round s" if that font does not have a "long s"), and in "classic" Fraktur font styles (here too with a possible fallback to the Fraktur "round s" if the Fraktur font omits the long s from its repertoire of supported glyphs).

In other words, you don't need any variation sequence: "s + VS1" would encode strictly the same thing as the existing encoded "long s". Adding this variation selector would just be pollution (an unjustified disunification). The two existing characters already clearly state their semantic difference, so we should continue to use them.

This does not mean that fonts should not continue to be enhanced, or that font renderers and text-layout engines should not be corrected to support more fallbacks (in fact, it will be simpler to implement these fallbacks within text renderers than to require a new font version).

You can apply the same policy to the French narrow no-break space NNBSP (called "fine" in French), which fonts do not need to map, provided that font renderers or text-layout engines correctly infer its best fallback as THIN SPACE, before retrying with the FOUR-PER-EM SPACE or SIX-PER-EM SPACE characters, and then with a standard SPACE rendered at a reduced width...
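Such a chain is trivial for a renderer to walk. Here is a minimal sketch in Python, assuming a hypothetical `font.has_glyph()` test against the font's character map (the chain below just restates the preference order given above; it is illustrative, not standardized data):

```python
# Sketch of a per-character fallback chain in a text renderer.
# Hypothetical API: font.has_glyph(ch) tests the font's cmap.

NNBSP_FALLBACKS = ["\u2009",  # THIN SPACE (best fallback)
                   "\u2005",  # FOUR-PER-EM SPACE
                   "\u2006",  # SIX-PER-EM SPACE
                   "\u0020"]  # SPACE (to be rendered with reduced width)

def resolve_nnbsp(font):
    """Return the first character of the chain the font can map, else None."""
    for ch in NNBSP_FALLBACKS:
        if font.has_glyph(ch):
            return ch
    return None  # let the engine try the next fallback font
```

If the chain is exhausted, the engine moves on to the next font rather than showing a missing-glyph box.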

That's because fonts never care about line-breaking properties, which are implemented only in text-layout engines. The same applies to NBSP, if a font does not map it (the text renderer just has to fall back to SPACE to find the glyph in the selected font), and to the NON-BREAKING HYPHEN (just infer the fallback to the standard HYPHEN, then to HYPHEN-MINUS).

In fact, it would be more elegant if Unicode provided a new property file suggesting the best fallbacks (ordered by preference) for each character, these fallbacks possibly having their own fallbacks, which would be retried if all the suggested ordered fallbacks fail. In most cases, only one fallback will be needed; in very few cases, several ordered fallbacks should be listed, if the implied sub-fallbacks are not in the correct order of resolution.
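To make the recursion concrete, here is a small sketch of how such ordered fallbacks could be expanded: the listed fallbacks are tried first, and each fallback's own fallbacks are retried only afterwards. The table is illustrative, not standardized data:

```python
# Illustrative fallback table: each entry maps a character to its ordered
# preferred fallbacks. (NNBSP -> THIN SPACE, NBSP; NBSP -> SPACE; etc.)
FALLBACKS = {
    "\u202F": ["\u2009", "\u00A0"],   # NARROW NO-BREAK SPACE
    "\u00A0": ["\u0020"],             # NO-BREAK SPACE
    "\u2009": ["\u200A"],             # THIN SPACE -> HAIR SPACE
}

def expanded_chain(ch, seen=None):
    """Cycle-safe expansion: listed fallbacks first, then each
    fallback's own chain, preserving the order of preference."""
    seen = set() if seen is None else seen
    chain = []
    for fb in FALLBACKS.get(ch, []):
        if fb not in seen:
            seen.add(fb)
            chain.append(fb)
    for fb in list(chain):  # only then recurse into sub-fallbacks
        chain.extend(c for c in expanded_chain(fb, seen) if c not in chain)
    return chain
```

With this table, the NNBSP expands to THIN SPACE, then NBSP, then HAIR SPACE, then SPACE, so the compatibility decomposition to SPACE is still reached, but only last.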

It would avoid selecting glyphs from other fallback fonts with very different metrics. Some of these fallbacks are already listed in the main UCD file, but they are too generic (because the compatibility mappings must resolve ONLY to non-compatibility-decomposable characters). For example, NNBSP has a compatibility decomposition to 0020, just like many other whitespace characters, so it completely loses the width information.

If we had standardized fallback resolution sequences implemented in text renderers, we would not need to update complex fonts; the job of font designers would be much simpler, and users of existing fonts could continue to use them, even as new characters are encoded.

I took the example of NNBSP because it is a character that was encoded long ago, yet vendors still forget to provide a glyph mapping for it (for example in core fonts of Windows 7 such as the new "Segoe UI" font, even though Microsoft included an explicit mapping for NNBSP in Times New Roman). It is one of the frequent cases that could be solved very simply by the text renderer itself.

The same should be done to provide a correct fallback to the "round s" if any font does not map the "long s".

I also suggest that the lists of standard character fallbacks be scanned within the first selected font, without trying other fallback fonts (including the multiple font families specified in a stylesheet or generic CSS fonts), unless the list of fallback characters includes, in the middle of the list, a specifier indicating that all the characters so far (the original one, or the fallback characters already specified before it) should first be searched in the other fonts (this will be useful mostly for symbol/pictogram characters).

As the ordered list of suggested fallback characters will then be rescanned against the other fonts anyway, once it reaches the end without finding any fallback in the current font, it is not necessary to include such a specifier at the end of the list.

Example of standardized fallback data for "long s" and symbols:

> 00A0 ; 0020 # NO-BREAK SPACE
> 00B2 ; 0032 # SUPERSCRIPT TWO
> 017F ; 0073 # LATIN SMALL LETTER LONG S
> 202F ; 2009, 200A, 2008, 2005, 2006, 00A0 # NARROW NO-BREAK SPACE
> 0283 ; 0073 # LATIN SMALL LETTER ESH
> 02A6 ; 0074 200D 0073 # LATIN SMALL LETTER TS DIGRAPH
> 02A7 ; 0074 200D 0283 # LATIN SMALL LETTER TESH DIGRAPH
> 20A7 ; , 0050 200D 0074 200D 0073 # PESETA SIGN
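A renderer could load such data with a very small parser. Here is a rough sketch, assuming the line syntax shown above ("codepoint ; alternatives separated by commas # name", each alternative being one or more space-separated code points); the exact file syntax is of course not standardized:

```python
# Rough parser for the fallback-data lines shown above (assumed syntax:
# "<codepoint> ; <alt>, <alt>, ... # <name>", each <alt> being one or
# more hexadecimal code points forming a fallback sequence).

def parse_fallback_line(line):
    line = line.lstrip("> ").split("#", 1)[0]   # drop quoting and comment
    cp_field, _, fb_field = line.partition(";")
    cp = chr(int(cp_field.strip(), 16))
    fallbacks = []
    for alt in fb_field.split(","):             # ordered alternatives
        alt = alt.strip()
        if alt:                                 # skip empty slots gracefully
            # an alternative may be a sequence, e.g. "0050 200D 0074"
            fallbacks.append("".join(chr(int(c, 16)) for c in alt.split()))
    return cp, fallbacks
```

Empty alternatives (such as a slot left by an unspecified qualifier) are simply skipped here; a real parser would attach the qualifiers to each alternative.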

Here the Peseta symbol (a "Pts" ligature) will first be searched in other fonts, before trying to infer a ligature of the three letters. Because all the other fonts will have been scanned for the precomposed "Pts" symbol only, processing will continue by trying to represent the ligature of the three letters: the renderer will attempt to locate such a ligature in the primary font, and as it will likely fail, it will immediately reprocess the sequence ignoring the ZWJ characters, showing the three letters "Pts", which are very likely to succeed in many fonts (in fact, almost all Latin fonts).

If it still fails at this point (because the primary font was not designed for Latin), it reaches the end of the list of standard fallbacks, so it will rescan the other fonts for all the fallbacks suggested after the last such specifier (no need to rescan for the symbol): the other fonts will then be scanned successively for the three letters with ZWJ and immediately without it, before trying the next fallback fonts in the specified stylesheet, and then the system-specific fallback fonts.
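The resolution order just described can be sketched as a simple two-pass search. This is a simplification under stated assumptions: the `has_glyph` API is hypothetical, and the "search other fonts first" specifier is omitted, so every candidate is tried in the primary font before any other font is consulted:

```python
# Two-pass resolution sketch: exhaust the fallback list in the primary
# font first, then rescan the remaining fonts with the same candidates.
# fonts[0] is the primary font; each font exposes has_glyph(ch).

def resolve(ch, fallbacks, fonts):
    candidates = [ch] + fallbacks       # original character first
    for cand in candidates:             # pass 1: primary font only
        if all(fonts[0].has_glyph(c) for c in cand):
            return fonts[0], cand
    for font in fonts[1:]:              # pass 2: the other fonts
        for cand in candidates:
            if all(font.has_glyph(c) for c in cand):
                return font, cand
    return None, None                   # nothing found: missing glyph
```

For the Peseta example, a primary font without U+20A7 or ZWJ support would end up rendering the plain letters "Pts" from whichever font maps them.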

Each listed fallback starts with a qualifier, which is intended to be processed by the text renderer when the listed characters are successfully resolved in the current font: it provides synthetic information. One qualifier does not alter the rendering at all; it just specifies that no change is necessary to the rendered glyph or its metrics and position.

The other specifiers alter the rendering appropriately, by synthetic style modifications or by metric modifications (font size, advance width...). They may be combined for the same specified fallback...

If the renderer finds the characters listed in the fallback mapped in the currently scanned font, it will render the mapped glyphs, applying the style modifications indicated by the specifiers (similar to the equivalent ones available in CSS).

Some specifiers could also be defined as aliases for combinations of other specifiers...

We could have such data for many of the proposed emoji for emotional faces (most probably using one of these style-modifying qualifiers).

Note that the fully expanded list (after recursion) should contain somewhere 
the compatibility mappings listed in 
the main UCD file. For example:

> 202F ; 2009, 200A, 2008, 2005, 2006, 00A0 # NARROW NO-BREAK SPACE

complies with this, because it lists a fallback using U+00A0, which itself already falls back with:

> 00A0 ; 0020 # NO-BREAK SPACE

So the NNBSP will effectively fall back (at least in the end) to its existing compatibility decomposition in the main UCD file. The data above also includes the compatibility fallback of the long s to the round s, already specified in the main UCD file.

Compliant renderers will have to support the list, in the specified order, at least in the version they claim to implement. (This list of standard fallbacks should not be subject to the encoding stability principles that apply to the compatibility mappings in the UCD, but it should still be asserted that it lists the compatibility mappings.) We could, however, get a smaller data file if we dropped this requirement, by also dropping the fallbacks already specified in the UCD, such as:

> # 00A0 ; 0020 # NO-BREAK SPACE

which would be commented out in a complete version of the file (for clarity), 
as it is inferable from the UCD 
compatibility decomposition mappings.

Likewise, the canonical decompositions need not be specified in this new fallback data file: ALL of them are implied, and there should be NO attempt to override them. If a canonically decomposable character is not found in the current font, the renderer will IMMEDIATELY look for its canonical equivalent in the same font, as if the rule were present, before retrying at the end with the list of fallback fonts:

A font renderer that finds a mapping for a precomposed character, but not for its NFD equivalent, should still use that glyph for any canonically equivalent string.

Fonts may also be built with mappings only for decomposed sequences: the 
renderer should be able to locate the 
mapped glyphs in the same way. This will simplify the development of fonts, 
because the other mappings will only be 
needed for legacy systems that still don't use this fallback mechanism.
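In Python terms, this canonical-equivalence lookup can be sketched in a few lines, again with a hypothetical `has_glyph` API (the standard `unicodedata` module provides the NFD normalization):

```python
# Sketch: treat canonically equivalent forms as one when searching a
# font's mappings (hypothetical font.has_glyph API).

import unicodedata

def find_mapping(font, ch):
    """Try the character itself, then its canonical decomposition."""
    if font.has_glyph(ch):
        return ch
    nfd = unicodedata.normalize("NFD", ch)
    if nfd != ch and all(font.has_glyph(c) for c in nfd):
        return nfd                    # render base + combining marks
    return None                       # defer to the fallback-font scan
```

A font mapping only "e" and U+0301 would thus still display "é"; the symmetric case (precomposed glyph used for a decomposed input string) requires normalizing the input before the lookup.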


Philippe.
