Re: Tamil Brahmi Short Mid Vowels

2018-07-20 Thread Shriramana Sharma via Unicode
This is a unique problem because this is probably the only case where the
same script produces conjuncts for one language and not for another. I had
asked for a separate Tamil Brahmi virama to be encoded which would obviate
this problem but that was shot down. Maybe that case should be reopened?

On Sat 21 Jul, 2018, 06:33 Richard Wordingham via Unicode, <
unicode@unicode.org> wrote:

> A problem has been spotted with the rendering of Tamil Brahmi vowels -
> in particular the sequence <U+11044 BRAHMI VOWEL SIGN O, U+11046
> BRAHMI VIRAMA> does not conform to the grammar
> of the Universal Shaping Engine (USE); a dotted circle may be inserted
> between the vowel and the pulli.
>
> When considering font-level remedies, I realised that there may be a
> problem with a following consonant - is <U+11013 BRAHMI LETTER KA,
> U+11044 BRAHMI VOWEL SIGN O, U+11046 BRAHMI VIRAMA, U+11022 BRAHMI
> LETTER TA> a correct encoding of what may be
> transliterated as _kŏta_?
>
> The nearest to a convincing justification I can find for it to require
> U+200C ZWNJ after the virama is the text in TUS Section 12.1 for
> *Explicit Virama*, but that merely says that ZWNJ is required to
> produce explicit virama rather than a _conjunct_.  As I understand
> it, a subscript final consonant would be encoded as consonant+virama
> rather than virama+consonant, so there is no ambiguity in Brahmi text.
> (If we try to make a rule out of two conflicting mechanisms, the
> difference might be that one is used for viramas and the other is used
> for invisible stackers, though that would require changing U+10A3F
> KHAROSHTHI VIRAMA back to being a virama.) The problem is that a font
> that tries to recover the situation might interpret <U+11013, U+11044,
> U+25CC DOTTED CIRCLE, U+11046, U+11022> as having TA
> subscripted to the dotted circle.  If ZWNJ is required for _kŏta_, what
> text if any in TUS requires it?
>
> Richard.
>
>


Tamil Brahmi Short Mid Vowels

2018-07-20 Thread Richard Wordingham via Unicode
A problem has been spotted with the rendering of Tamil Brahmi vowels -
in particular the sequence <U+11044 BRAHMI VOWEL SIGN O, U+11046
BRAHMI VIRAMA> does not conform to the grammar
of the Universal Shaping Engine (USE); a dotted circle may be inserted
between the vowel and the pulli.

When considering font-level remedies, I realised that there may be a
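problem with a following consonant - is <U+11013 BRAHMI LETTER KA,
U+11044 BRAHMI VOWEL SIGN O, U+11046 BRAHMI VIRAMA, U+11022 BRAHMI
LETTER TA> a correct encoding of what may be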
transliterated as _kŏta_?

The nearest to a convincing justification I can find for it to require
U+200C ZWNJ after the virama is the text in TUS Section 12.1 for
*Explicit Virama*, but that merely says that ZWNJ is required to
produce explicit virama rather than a _conjunct_.  As I understand
it, a subscript final consonant would be encoded as consonant+virama
rather than virama+consonant, so there is no ambiguity in Brahmi text.
(If we try to make a rule out of two conflicting mechanisms, the
difference might be that one is used for viramas and the other is used
for invisible stackers, though that would require changing U+10A3F
KHAROSHTHI VIRAMA back to being a virama.) The problem is that a font
that tries to recover the situation might interpret <U+11013, U+11044,
U+25CC DOTTED CIRCLE, U+11046, U+11022> as having TA
subscripted to the dotted circle.  If ZWNJ is required for _kŏta_, what
text if any in TUS requires it?
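
For concreteness, the two candidate encodings of _kŏta_ discussed above
can be written out as code point sequences. This is only a sketch in
Python: the choice of U+11013 BRAHMI LETTER KA as the initial consonant
is an assumption inferred from the transliteration, and the helper name
`describe` is purely illustrative. It says nothing about how any font or
shaping engine will render either sequence.

```python
import unicodedata

KA = "\U00011013"      # BRAHMI LETTER KA (assumed initial consonant of "kŏta")
SIGN_O = "\U00011044"  # BRAHMI VOWEL SIGN O
PULLI = "\U00011046"   # BRAHMI VIRAMA, acting as the pulli in Tamil Brahmi
TA = "\U00011022"      # BRAHMI LETTER TA
ZWNJ = "\u200C"        # ZERO WIDTH NON-JOINER


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Candidate 1: the pulli is followed directly by the next consonant.
without_zwnj = KA + SIGN_O + PULLI + TA

# Candidate 2: ZWNJ after the pulli, as TUS describes for explicit virama.
with_zwnj = KA + SIGN_O + PULLI + ZWNJ + TA

for label, seq in (("without ZWNJ", without_zwnj), ("with ZWNJ", with_zwnj)):
    print(label)
    for line in describe(seq):
        print("  " + line)
```

The open question in the message is which of the two sequences TUS
actually requires; the code only makes the difference between them
explicit.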

Richard.



Consonant shifters and ZWNJ in Khmer

2018-07-20 Thread Norbert Lindenberg via Unicode
The section on consonant shifters in the Khmer section of the Unicode standard 
(page 647 of Unicode 11 [1]) isn’t entirely clear on where the zero width 
non-joiner should be placed to prevent a consonant shifter that’s followed by 
an above-base vowel from being changed to a below-base glyph.

First, it says “U+200C zero width non-joiner should be inserted before the 
consonant shifter” to prevent the change. Then it continues “in such cases, 
U+200C zero width non-joiner is inserted before the vowel sign”, which could be 
interpreted as “after the consonant shifter”. Finally, the examples show ZWNJ 
inserted before the consonant shifter.

The OpenType Khmer shaping description [2], on the other hand, expects ZWNJ to 
be inserted between the consonant shifter (here called RegShift) and the 
above-base vowel.
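
The two placements can be spelled out as code point sequences. The
following Python sketch is only an illustration: the consonant U+1794
KHMER LETTER BA and the vowel U+17B7 KHMER VOWEL SIGN I are arbitrary
example choices, and the function name `zwnj_placement` is hypothetical.
It classifies where ZWNJ sits relative to a consonant shifter; it does
not predict what any shaping engine does with either sequence.

```python
BA = "\u1794"       # KHMER LETTER BA (arbitrary example consonant)
TRIISAP = "\u17CA"  # KHMER SIGN TRIISAP, a consonant shifter
SIGN_I = "\u17B7"   # KHMER VOWEL SIGN I, an above-base vowel
ZWNJ = "\u200C"     # ZERO WIDTH NON-JOINER

# The two consonant shifters: MUUSIKATOAN and TRIISAP.
SHIFTERS = frozenset({"\u17C9", "\u17CA"})

# Placement shown in the examples of the Unicode Standard.
zwnj_before_shifter = BA + ZWNJ + TRIISAP + SIGN_I

# Placement expected by the OpenType Khmer shaping description.
zwnj_after_shifter = BA + TRIISAP + ZWNJ + SIGN_I


def zwnj_placement(text):
    """Classify the first ZWNJ that is adjacent to a consonant shifter."""
    for i, ch in enumerate(text):
        if ch != ZWNJ:
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        if next_ch in SHIFTERS:
            return "before shifter"
        if prev_ch in SHIFTERS:
            return "after shifter"
    return None


print(zwnj_placement(zwnj_before_shifter))
print(zwnj_placement(zwnj_after_shifter))
```

A check like this could be run over a corpus to see which placement
occurs in existing Khmer text.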

Questions to the people here who have dealt with Khmer: How is this handled in 
real life?

Thanks,
Norbert

[1] https://www.unicode.org/versions/Unicode11.0.0/ch16.pdf
[2] https://docs.microsoft.com/en-us/typography/script-development/khmer


old Polish and Unicode (was: Variation Sequences (and L2-11/059))

2018-07-20 Thread Janusz S. Bień via Unicode


I apologize for sending by mistake the previous post with no new
content.

On Thu, Jul 19 2018 at 17:47 +0100, wjgo_10...@btinternet.com writes:

[...]

> I found the following.
>
> https://en.wikipedia.org/wiki/Old_Polish_language

Thanks again for your interest in the Polish language.

There is also

https://en.wikipedia.org/wiki/History_of_Polish
https://en.wikipedia.org/wiki/Middle_Polish_language
https://en.wikipedia.org/wiki/Polish_orthography
https://en.wikipedia.org/wiki/History_of_Polish_orthography

To make a long story short, this is just a mess. Looking for a good link
to recommend I just found

https://culture.pl/en/article/a-foreigners-guide-to-the-polish-alphabet

which seems worth looking at (but the multimedia version doesn't work
for me).

I used to recommend the paper

http://wbl.klf.uw.edu.pl/45/

which unfortunately seems to be no longer available on the Internet.

>
> WJGO >> So you could if you wish try to make your own font
>
> JSB >Actually I tried:
>
> JSB > https://bitbucket.org/jsbien/parkosz-font/
>
> Thank you for the link to the font. I have studied the font in the 
> FontCreator program (version 8).


Please revisit the site, I just added some links and comments. This
project is now orphaned. 

>
> I remember that I produced an OpenType font using Variation Selectors
> and OpenType Glyph Substitution back in April 2017. I wrote about it
> and provided a link to the font and a link to a typecase document.
>
> https://forum.high-logic.com/viewtopic.php?f=10&t=7033
>
> Although that font is about chess, I am thinking that that is the sort
> of font that is needed for what you are wanting to do. This could use
> variation selectors or could use circled digits as desired.

Thanks for the link. I think I will do some tests with XeLaTeX.

>
> I am a researcher and I am looking for a worthwhile project related to
> typography in which to participate from time to time - no money
> charged, no money to pay - and I am interested in printed books of the
> incunabula period and the early sixteenth century.
>
> I do not know any Polish, but I do not need to be involved in choosing
> which glyphs are needed, so my not knowing any Polish would not seem
> to be a problem.

Please feel free to take over the font for Parkosz's treatise, if you
wish to.

I think another interesting challenge is "Nowy Karakter Polski", a 16th
century treatise comparing several proposals of Polish spelling, which
uses various strange characters. You can find the scan in various places
and in various formats, e.g.

https://books.google.pl/books?id=Z3ojMAAJ
http://www.dbc.wroc.pl/publication/4239

The treatise is one of the important sources used by the
dictionary of the 16th century Polish language:

http://spxvi.edu.pl/

The only English language presentation of the dictionary seems to be

Luto-Kamińska, A. (2017). Several words on the dictionary of the 16th
century Polish language.

unfortunately behind a paywall:

http://www.dbpia.co.kr/Journal/ArticleList/VOIS00297995#

The history of the dictionary is long and sad. The work started in 1949
(!)  and after the initial enthusiasm and generous funding the team had
to struggle with various difficulties; as a consequence the dictionary
is still unfinished, but the work continues, although rather slowly.

In my unpublished presentation

http://bc.klf.uw.edu.pl/179/

I show how the editors managed quoting "Nowy Karakter" (slides
26-35). It looks like in the time of hot type the strange letters were
written by hand, and there was a regression when the dictionary started
to be typeset on computer.

In my presentation I made some suggestions how to use Unicode for "Nowy
Karakter" (slides 40-69). Unfortunately the dictionary editors were not
interested in the proposal (they had at the time much more important
problems).

Not long ago the team received a long-awaited grant for computerizing the
work on the dictionary, in particular for creating a corpus of 16th
century texts. It looks like the corpus was prepared rather in a hurry and
there was no time or money to develop a faithful rendering of "Nowy
Karakter". The work exists in the corpus in two forms:

PDF: http://rcin.org.pl/publication/82568
HTML: http://spxvi.edu.pl/korpus/teksty/JanNKar/

I must say that for a typical user of the dictionary the solution
applied is probably a good one. The spelling has been modernized but the
occurrences of strange characters have been marked with color in PDF, and
in HTML additionally with some information displayed when you hover over
the appropriate fragment of the text.

This solution is however not applicable to e.g. quotations in a research
paper when color is for some reason not allowed.

So encoding "Nowy Karakter Polski" in Unicode and providing a font for
it is still in my opinion an interesting open problem.

Cf. also the thread

http://www.unicode.org/mail-arch/unicode-ml/y2010-m04/0024.html

BTW, I was definitely too optimistic...

Best regards

Janusz


Re: UAX #9: applicability of higher-level protocols to bidi plaintext

2018-07-20 Thread Shai Berger via Unicode
Hi Ken (and all),

Thanks for your time and patience with this.

On Thu, 19 Jul 2018 18:10:49 -0700
Ken Whistler via Unicode wrote:

> On 7/19/2018 12:38 AM, Shai Berger via Unicode wrote:
> > If I cannot trust that
> > people I communicate with make the same choices I make, plain text
> > cannot be used.  
> 
> Here is a counterexample [a table rendered in plain text, which is
> only truly legible using a fixed-width font].
> 
> It isn't that "plain text cannot be used" to convey this content. The 
> content is certainly "legible" in the minimal sense required by the 
> Unicode Standard, and it is interchangeable without data corruption.
> The problem is that for optimal display and interpretation as
> intended, I also need to convey (and/or have the reader guess) the
> higher-level protocol requirement that this particular plain text
> needs to be displayed with a monowidth font.
> 

If I understand correctly, you are rejecting my claim that
directionality is an issue of content, and claiming that, just like
the crumbling-down of your table, it is an issue of display. But that
argument is clearly disproved by the mere presence of the
directionality-setting characters (RLM, LRE, etc) in the Unicode
character set; in other words, your example would be convincing if
Unicode included characters like "start table row" and "close table
cell", AND there was an annex saying that your lines (for whatever
reason) are to be treated as table rows unless a higher-level-protocol
said otherwise. I believe this is not the case.

> > If the Unicode standard does not impose a
> > universal default, it does not define interchangeable plain text.  
> 
> And that is simply not the case. If your text is <abc!> (<L, L, L,
> ON>), that will display as {abc!} in a LTR paragraph directional
> context and as {!abc} in a RTL paragraph directional context.
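
The quoted example can be checked mechanically. What follows is a
deliberately reduced sketch in Python, not an implementation of UAX #9:
it assumes the text contains only characters of bidi class L plus
neutrals (ON, WS), resolves each neutral from its nearest strong
neighbours with the paragraph direction standing in at both ends, and
then applies the reversal step of rule L2. The function names
`resolve_levels`, `reorder` and `display_order` are illustrative.

```python
import unicodedata

NEUTRALS = {"ON", "WS"}


def resolve_levels(text, paragraph_level):
    """Toy level resolution for text made only of L and neutral characters."""
    classes = [unicodedata.bidirectional(ch) for ch in text]
    para_dir = "R" if paragraph_level % 2 else "L"
    levels = []
    for i, cls in enumerate(classes):
        if cls == "L":
            direction = "L"
        elif cls in NEUTRALS:
            # Nearest strong type on each side; the paragraph direction
            # stands in for the start and end of the text.
            before = next((c for c in reversed(classes[:i]) if c == "L"), para_dir)
            after = next((c for c in classes[i + 1:] if c == "L"), para_dir)
            direction = before if before == after else para_dir
        else:
            raise ValueError(f"bidi class {cls} is outside this sketch")
        # A character running against the paragraph direction goes up a level.
        levels.append(paragraph_level if direction == para_dir else paragraph_level + 1)
    return levels


def reorder(text, levels):
    """Reverse runs from the highest level down to the lowest odd level."""
    chars, levels = list(text), list(levels)
    odd = [lvl for lvl in levels if lvl % 2]
    if not odd:
        return text
    for level in range(max(levels), min(odd) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < len(chars) and levels[j] >= level:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            levels[i:j] = reversed(levels[i:j])
            i = j
    return "".join(chars)


def display_order(text, paragraph_level):
    return reorder(text, resolve_levels(text, paragraph_level))


print(display_order("abc!", 0))  # LTR paragraph
print(display_order("abc!", 1))  # RTL paragraph
```

Under these assumptions the same stored text "abc!" comes out as
"abc!" at paragraph level 0 and as "!abc" at paragraph level 1, which
is the point of contention: the stored characters are identical and
only the paragraph direction differs.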


> [...] if plain text doesn't forcefully carry with it and
> require how it must be displayed, well, then it isn't really
> interchangeable.
> 
> But that isn't what the Unicode Standard means by plain text. And
> isn't what it requires for interchangeability of plain text.

If I understood your argument correctly, it amounts to a claim that
Unicode defines plain text as a component in a data format, but not to
be used as a full document. If that is correct, then there is much to
fix -- I think that quite a lot of existing technology assumes the
opposite (e.g. the use of "Content-Type: text/plain; charset=UTF-8" in
MIME should be strongly discouraged, if the people who designed
Unicode and UTF-8 think it is not appropriate for full documents).

If I misunderstood, please correct me.

> >
> > My main point, whose rejection baffles me to no end, is that it
> > should.  
> 
> Well, I'm not expecting that I can make you feel good about the 
> situation. ;-) But perhaps the UTC position will seem a little less 
> baffling.

As I hope I've shown above, there's plenty of reason for bafflement.
The UTC defines code points to encode directionality, but then refuses
to treat directionality as content when it comes to paragraph
directionality; it defines a higher-level-protocol as an agreement, and
then turns around and says the word "agreement" actually means
"decision".

I can guess reasons why things are the way they are, but not
justifications. I remain baffled.

Thanks,
Shai.


Re: Variation Sequences (and L2-11/059)

2018-07-20 Thread Janusz S. Bień via Unicode
On Thu, Jul 19 2018 at 17:47 +0100, wjgo_10...@btinternet.com writes:
> Janusz S. Bien wrote:
>
>> You seem to assume that my concern is only rendering.
>
> Well my thinking is that what you are wanting is a way to accurately
> transcribe documents and maybe printed books from Old Polish into a
> Unicode-based electronic format so that the information can be more
> readily studied, while retaining glyph information that is not
> presently representable using Unicode characters.
>
> I found the following.
>
> https://en.wikipedia.org/wiki/Old_Polish_language
>
> WJGO >> So you could if you wish try to make your own font
>
> JSB >Actually I tried:
>
> JSB > https://bitbucket.org/jsbien/parkosz-font/
>
> Thank you for the link to the font. I have studied the font in the 
> FontCreator program (version 8).
>
> I remember that I produced an OpenType font using Variation Selectors
> and OpenType Glyph Substitution back in April 2017. I wrote about it
> and provided a link to the font and a link to a typecase document.
>
> https://forum.high-logic.com/viewtopic.php?f=10&t=7033
>
> Although that font is about chess, I am thinking that that is the sort
> of font that is needed for what you are wanting to do. This could use
> variation selectors or could use circled digits as desired.
>
> I am a researcher and I am looking for a worthwhile project related to
> typography in which to participate from time to time - no money
> charged, no money to pay - and I am interested in printed books of the
> incunabula period and the early sixteenth century.
>
> I do not know any Polish, but I do not need to be involved in choosing
> which glyphs are needed, so my not knowing any Polish would not seem
> to be a problem.
>
> William Overington
>
> Thursday 19 July 2018
>

-- 
 ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien


RE: Unicode 11 Georgian uppercase vs. fonts

2018-07-20 Thread Peter Constable via Unicode
IMO, the correct answer is 2, except that “all common fonts” is more sweeping 
than necessary: it’s sufficient to have fonts used for fallback in platforms 
and browsers, and the related fallback logic, to get updated. Of course, that 
takes some time, and it’s not even two months since Unicode 11 was released. 
The Georgian community understood that it would take time to get 
implementations in place, and that they would need to take measures to smooth 
over that transition — which can include having Web sites for Georgian 
businesses and institutions using fonts to match the requirements of the 
content.


Peter

From: Unicore On Behalf Of Markus Scherer via Unicore
Sent: Wednesday, July 18, 2018 3:05 PM
To: unicore UnicoRe Discussion
Cc: mark
Subject: Unicode 11 Georgian uppercase vs. fonts


Dear fellow Unicoders,



We’ve run into some significant problems with the Georgian capital letters 
added in Unicode 11. If you have run into them yourselves, or have feedback on 
our brainstormed solutions below, we’d love to hear your thoughts.


Here's the problem. The vast majority of Georgian fonts do not yet have the new 
uppercase characters. So when any system uses case mapping to uppercase text 
(e.g. browsers interpreting CSS’s text-transform: uppercase), then the users 
of Georgian will see boxes (“tofu”) if the font they are using does not have 
the glyphs.


For example, a program constructs a web page with buttons. It uses a CSS style 
to uppercase text in buttons, as a house style. Unless the user has a very 
up-to-date font, they see tofu (boxes). If a server does backend rendering, its 
font has to be very up-to-date. We also saw this problem in a program that was 
doing titlecasing, but on the first character it used the uppercase mappings 
rather than titlecase mappings. Not the right thing to do, of course, but code 
that accidentally works (most of the time) doesn't get fixed if nobody reports 
a bug about it.


All of these will result in bad bugs in the UI, in software that formerly 
worked fine.


We brainstormed some options to fix this:


  1.  Get all call sites to change their code to not uppercase Georgian (and
      fix titlecasing to use the titlecase mappings, not the uppercase
      mappings). Since we have no control over call sites and release cycles
      of affected software, this would not help Georgian users for a long
      time, if ever. We’d eventually want to retract these changes, creating
      even more work.
  2.  Change all common fonts with Georgian characters to add the U11.0 ones.
      This should eventually happen but would probably take a couple of years
      at least, which does not help users in the short term.
  3.  Hack font CMAPs to just map the new characters to the glyphs of the old
      ones. Works but only when a programmer can control the fonts used, such
      as with server-side rendering or downloadable fonts.
  4.  Remove the uppercase mappings for Georgian, until the fonts catch up.
      *   Would at least have to be done in all browsers, otherwise web apps
          will still break for Georgian.
      *   A broader alternative is to do it in ICU. Because that is used by
          the majority of the browser implementations, it would solve the
          short-term problem for the browsers — and many other programs.
          Drawback: Non-conformant, and uppercasing will be inconsistent
          depending on who has which variant of ICU (with vs. without hack,
          on top of: with Unicode 11 vs. before Unicode 11).
      *   One precedent is that in CLDR we deliberately hold back from using
          new currency characters until the font support is sufficiently
          widespread. (Wishing we'd held back the uppercase mappings in
          Unicode 11.0 too!)
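
As an illustration of what option 1 asks of a call site, here is a
minimal Python sketch. It is not taken from ICU or any browser; the
function names are hypothetical. The first function uppercases text
while leaving Mkhedruli letters (U+10D0..U+10FA, U+10FD..U+10FF)
untouched. The second is a text-level analogue of option 3, folding
Mtavruli capitals (U+1C90..U+1CBA, U+1CBD..U+1CBF) back to Mkhedruli
before display; a real option 3 would instead change the font's cmap.

```python
# Distance between a Mtavruli capital and its Mkhedruli counterpart.
MTAVRULI_OFFSET = 0x1C90 - 0x10D0


def is_mkhedruli_letter(ch):
    """True for Mkhedruli letters that gained uppercase forms in Unicode 11."""
    cp = ord(ch)
    return 0x10D0 <= cp <= 0x10FA or 0x10FD <= cp <= 0x10FF


def is_mtavruli_letter(ch):
    """True for the Mtavruli capitals added in Unicode 11."""
    cp = ord(ch)
    return 0x1C90 <= cp <= 0x1CBA or 0x1CBD <= cp <= 0x1CBF


def upper_except_georgian(text):
    """Uppercase text but leave Mkhedruli letters as they are (option 1)."""
    return "".join(ch if is_mkhedruli_letter(ch) else ch.upper() for ch in text)


def fold_mtavruli(text):
    """Replace Mtavruli capitals with Mkhedruli letters before display."""
    return "".join(
        chr(ord(ch) - MTAVRULI_OFFSET) if is_mtavruli_letter(ch) else ch
        for ch in text
    )


print(upper_except_georgian("ok \u10D0\u10D1\u10D2"))
print(fold_mtavruli("\u1C90\u1C91\u1C92"))
```

Both functions avoid the platform's own case mappings for Georgian, so
their result does not depend on which Unicode version the runtime
ships. As the message notes for option 1, such workarounds would
eventually have to be retracted once fonts catch up.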


Mark & Markus