Re: Tamil Brahmi Short Mid Vowels
This is a unique problem, because it is probably the only case where the same script produces conjuncts for one language and not for another. I had asked for a separate Tamil Brahmi virama to be encoded, which would have obviated this problem, but that was shot down. Maybe that case should be reopened?

On Sat 21 Jul, 2018, 06:33 Richard Wordingham via Unicode, <unicode@unicode.org> wrote:

> A problem has been spotted with the rendering of Tamil Brahmi vowels -
> in particular the sequence <U+11044 BRAHMI VOWEL SIGN O, U+11046
> BRAHMI VIRAMA> does not conform to the grammar of the Universal
> Shaping Engine (USE); a dotted circle may be inserted between the
> vowel and the pulli.
>
> When considering font-level remedies, I realised that there may be a
> problem with a following consonant - is <U+11013 BRAHMI LETTER KA,
> U+11044 BRAHMI VOWEL SIGN O, U+11046 BRAHMI VIRAMA, U+11022 BRAHMI
> LETTER TA> a correct encoding of what may be transliterated as _kŏta_?
>
> The nearest to a convincing justification I can find for it to require
> U+200C ZWNJ after the virama is the text in TUS Section 12.1 for
> *Explicit Virama*, but that merely says that ZWNJ is required to
> produce an explicit virama rather than a _conjunct_. As I understand
> it, a subscript final consonant would be encoded as consonant+virama
> rather than virama+consonant, so there is no ambiguity in Brahmi text.
> (If we try to make a rule out of two conflicting mechanisms, the
> difference might be that one is used for viramas and the other is used
> for invisible stackers, though that would require changing U+10A3F
> KHAROSHTHI VIRAMA back to being a virama.) The problem is that a font
> that tries to recover the situation might interpret <U+11013, U+11044,
> U+25CC DOTTED CIRCLE, U+11046, U+11022> as having TA subscripted to
> the dotted circle. If ZWNJ is required for _kŏta_, what text if any in
> TUS requires it?
>
> Richard.
Tamil Brahmi Short Mid Vowels
A problem has been spotted with the rendering of Tamil Brahmi vowels - in particular the sequence <U+11044 BRAHMI VOWEL SIGN O, U+11046 BRAHMI VIRAMA> does not conform to the grammar of the Universal Shaping Engine (USE); a dotted circle may be inserted between the vowel and the pulli.

When considering font-level remedies, I realised that there may be a problem with a following consonant - is <U+11013 BRAHMI LETTER KA, U+11044 BRAHMI VOWEL SIGN O, U+11046 BRAHMI VIRAMA, U+11022 BRAHMI LETTER TA> a correct encoding of what may be transliterated as _kŏta_?

The nearest to a convincing justification I can find for it to require U+200C ZWNJ after the virama is the text in TUS Section 12.1 for *Explicit Virama*, but that merely says that ZWNJ is required to produce an explicit virama rather than a _conjunct_. As I understand it, a subscript final consonant would be encoded as consonant+virama rather than virama+consonant, so there is no ambiguity in Brahmi text. (If we try to make a rule out of two conflicting mechanisms, the difference might be that one is used for viramas and the other is used for invisible stackers, though that would require changing U+10A3F KHAROSHTHI VIRAMA back to being a virama.)

The problem is that a font that tries to recover the situation might interpret <U+11013, U+11044, U+25CC DOTTED CIRCLE, U+11046, U+11022> as having TA subscripted to the dotted circle. If ZWNJ is required for _kŏta_, what text if any in TUS requires it?

Richard.
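Not part of the original thread: a small Python sketch spelling out the sequences under discussion with the standard-library unicodedata module. The <KA, VOWEL SIGN O, VIRAMA, TA> ordering for _kŏta_ is the encoding being questioned in the message, not a settled answer.

```python
# Sketch: the Tamil Brahmi sequences under discussion, spelled out with
# Python's unicodedata. Short Tamil Brahmi ŏ is the vowel sign O followed
# by the virama (used as the Tamil pulli).
import unicodedata

KA = "\U00011013"      # BRAHMI LETTER KA
O_SIGN = "\U00011044"  # BRAHMI VOWEL SIGN O
VIRAMA = "\U00011046"  # BRAHMI VIRAMA
TA = "\U00011022"      # BRAHMI LETTER TA

kota = KA + O_SIGN + VIRAMA + TA  # the questioned encoding of _kŏta_

for ch in kota:
    print(f"U+{ord(ch):05X} {unicodedata.name(ch)}")
# U+11013 BRAHMI LETTER KA
# U+11044 BRAHMI VOWEL SIGN O
# U+11046 BRAHMI VIRAMA
# U+11022 BRAHMI LETTER TA
```

Whether a dotted circle appears between the vowel sign and the virama depends on the shaping engine's cluster grammar, which is exactly the problem reported above.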
Consonant shifters and ZWNJ in Khmer
The section on consonant shifters in the Khmer section of the Unicode Standard (page 647 of Unicode 11 [1]) isn't entirely clear on where the zero width non-joiner should be placed to prevent a consonant shifter that's followed by an above-base vowel from being changed to a below-base glyph.

First, it says "U+200C zero width non-joiner should be inserted before the consonant shifter" to prevent the change. Then it continues "in such cases, U+200C zero width non-joiner is inserted before the vowel sign", which could be interpreted as "after the consonant shifter". Finally, the examples show ZWNJ inserted before the consonant shifter.

The OpenType Khmer shaping description [2], on the other hand, expects ZWNJ to be inserted between the consonant shifter (here called RegShift) and the above-base vowel.

Questions to the people here who have dealt with Khmer: How is this handled in real life?

Thanks, Norbert

[1] https://www.unicode.org/versions/Unicode11.0.0/ch16.pdf
[2] https://docs.microsoft.com/en-us/typography/script-development/khmer
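Not from the original message: a sketch of the two competing ZWNJ placements, using BA + MUUSIKATOAN + VOWEL SIGN II as an example syllable (my choice of example, not one taken from the thread). Which ordering actually renders without the shifter dropping below the base depends on the shaping engine and font.

```python
# Sketch: the two competing ZWNJ placements for a Khmer consonant shifter
# followed by an above-base vowel. Both are shown as plain code point
# sequences; no claim is made here about which one a given shaper accepts.
BA = "\u1794"           # KHMER LETTER BA
MUUSIKATOAN = "\u17C9"  # KHMER SIGN MUUSIKATOAN (a consonant shifter)
SIGN_II = "\u17B8"      # KHMER VOWEL SIGN II (above-base)
ZWNJ = "\u200C"         # ZERO WIDTH NON-JOINER

# Reading of the TUS examples: ZWNJ before the consonant shifter.
tus_reading = BA + ZWNJ + MUUSIKATOAN + SIGN_II
# Reading of the OpenType spec: ZWNJ between shifter (RegShift) and vowel.
opentype_reading = BA + MUUSIKATOAN + ZWNJ + SIGN_II

for label, s in [("TUS examples ", tus_reading),
                 ("OpenType spec", opentype_reading)]:
    print(label, " ".join(f"U+{ord(c):04X}" for c in s))
```

The two sequences contain the same four code points; only the position of U+200C differs, which is precisely the ambiguity the message is asking about.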
old Polish and Unicode (was: Variation Sequences (and L2-11/059))
I apologize for sending the previous post, with no new content, by mistake.

On Thu, Jul 19 2018 at 17:47 +0100, wjgo_10...@btinternet.com writes:

[...]

> I found the following.
>
> https://en.wikipedia.org/wiki/Old_Polish_language

Thanks again for your interest in the Polish language. There is also

https://en.wikipedia.org/wiki/History_of_Polish
https://en.wikipedia.org/wiki/Middle_Polish_language
https://en.wikipedia.org/wiki/Polish_orthography
https://en.wikipedia.org/wiki/History_of_Polish_orthography

To make a long story short, it is just a mess. Looking for a good link to recommend, I just found

https://culture.pl/en/article/a-foreigners-guide-to-the-polish-alphabet

which seems worth looking at (but the multimedia version doesn't work for me). I used to recommend the paper

http://wbl.klf.uw.edu.pl/45/

which unfortunately seems to be no longer available on the Internet.

> WJGO >> So you could if you wish try to make your own font
>
> JSB > Actually I tried:
>
> JSB > https://bitbucket.org/jsbien/parkosz-font/
>
> Thank you for the link to the font. I have studied the font in the
> FontCreator program (version 8).

Please revisit the site; I have just added some links and comments. This project is now orphaned.

> I remember that I produced an OpenType font using Variation Selectors
> and OpenType Glyph Substitution back in April 2017. I wrote about it
> and provided a link to the font and a link to a typecase document.
>
> https://forum.high-logic.com/viewtopic.php?f=10&t=7033
>
> Although that font is about chess, I am thinking that that is the sort
> of font that is needed for what you are wanting to do. This could use
> variation selectors or could use circled digits as desired.

Thanks for the link. I think I will do some tests with XeLaTeX.
> I am a researcher and I am looking for a worthwhile project related to
> typography in which to participate from time to time - no money
> charged, no money to pay - and I am interested in printed books of the
> incunabula period and the early sixteenth century.
>
> I do not know any Polish, but I do not need to be involved in choosing
> which glyphs are needed, so my not knowing any Polish would not seem
> to be a problem.

Please feel free to take over the font for Parkosz's treatise, if you wish to.

I think another interesting challenge is "Nowy Karakter Polski", a 16th-century treatise comparing several proposals for Polish spelling, which uses various strange characters. You can find the scan in various places and in various formats, e.g.

https://books.google.pl/books?id=Z3ojMAAJ
http://www.dbc.wroc.pl/publication/4239

The treatise is one of the important sources used by the dictionary of the 16th-century Polish language:

http://spxvi.edu.pl/

The only English-language presentation of the dictionary seems to be

Luto-Kamińska, A. (2017). Several words on the dictionary of the 16th century Polish language.

unfortunately behind a paywall:

http://www.dbpia.co.kr/Journal/ArticleList/VOIS00297995#

The history of the dictionary is long and sad. The work started in 1949 (!), and after the initial enthusiasm and generous funding the team had to struggle with various difficulties; as a consequence, the dictionary is still unfinished, though the work continues, albeit rather slowly.

In my unpublished presentation

http://bc.klf.uw.edu.pl/179/

I show how the editors managed quoting "Nowy Karakter" (slides 26-35). It looks like in the days of hot metal type the strange letters were drawn by hand, and there was a regression when the dictionary started to be typeset on computer. In my presentation I made some suggestions on how to use Unicode for "Nowy Karakter" (slides 40-69).
Unfortunately the dictionary editors were not interested in the proposal (they had much more important problems at the time). Not long ago the team received a long-awaited grant for computerizing the work on the dictionary, in particular for creating a corpus of 16th-century texts. It looks like the corpus was prepared rather in a hurry, and there was no time or money to develop a faithful rendering of "Nowy Karakter". The work exists in the corpus in two forms:

PDF: http://rcin.org.pl/publication/82568
HTML: http://spxvi.edu.pl/korpus/teksty/JanNKar/

I must say that for a typical user of the dictionary the solution applied is probably a good one. The spelling has been modernized, but the occurrences of strange characters have been marked with color in the PDF, and in the HTML additionally with some information displayed when you hover over the appropriate fragment of the text. This solution is, however, not applicable to e.g. quotations in a research paper where color is for some reason not allowed.

So encoding "Nowy Karakter Polski" in Unicode and providing a font for it is still, in my opinion, an interesting open problem. Cf. also the thread

http://www.unicode.org/mail-arch/unicode-ml/y2010-m04/0024.html

BTW, I was definitely too optimistic...

Best regards

Janusz

--
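Not from the original message: the variation-selector approach mentioned earlier in this thread can be sketched at the character level in a few lines of Python. The base letter and selector chosen here are illustrative only, not a registered variation sequence; whether a distinct glyph actually appears depends entirely on the font (e.g. via an OpenType cmap format 14 subtable).

```python
# Sketch: a variation sequence is just a base character followed by a
# variation selector. The pair below is hypothetical - "a" stands in for
# a letter needing a special Old Polish form, and VS1 is an arbitrary
# selector; no font or registry is implied.
import unicodedata

base = "a"
VS1 = "\uFE00"  # VARIATION SELECTOR-1
seq = base + VS1

print(len(seq), unicodedata.name(VS1))
# 2 VARIATION SELECTOR-1
```

A renderer that has no matching variation sequence in its font simply ignores the selector and shows the base glyph, which is what makes this mechanism attractive for transcription: the text degrades gracefully.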
Re: UAX #9: applicability of higher-level protocols to bidi plaintext
Hi Ken (and all),

Thanks for your time and patience with this.

On Thu, 19 Jul 2018 18:10:49 -0700 Ken Whistler via Unicode wrote:

> On 7/19/2018 12:38 AM, Shai Berger via Unicode wrote:
> > If I cannot trust that
> > people I communicate with make the same choices I make, plain text
> > cannot be used.
>
> Here is a counterexample [a table rendered in plain text, which is
> only truly legible using a fixed-width font].
>
> It isn't that "plain text cannot be used" to convey this content. The
> content is certainly "legible" in the minimal sense required by the
> Unicode Standard, and it is interchangeable without data corruption.
> The problem is that for optimal display and interpretation as
> intended, I also need to convey (and/or have the reader guess) the
> higher-level protocol requirement that this particular plain text
> needs to be displayed with a monowidth font.

If I understand correctly, you are rejecting my claim that directionality is an issue of content, and claiming that, just like the crumbling-down of your table, it is an issue of display. But that argument is clearly disproved by the mere presence of the directionality-setting characters (RLM, LRE, etc.) in the Unicode character set; in other words, your example would be convincing if Unicode included characters like "start table row" and "close table cell", AND there was an annex saying that your lines (for whatever reason) are to be treated as table rows unless a higher-level protocol said otherwise. I believe this is not the case.

> > If the Unicode standard does not impose a
> > universal default, it does not define interchangeable plain text.
>
> And that is simply not the case. If your text is <abc!>, i.e.
> <L, L, L, ON>, that will display as {abc!} in an LTR paragraph
> directional context and as {!abc} in a RTL paragraph directional
> context.
> [...] if plain text doesn't forcefully carry with it and
> require how it must be displayed, well, then it isn't really
> interchangeable.
> But that isn't what the Unicode Standard means by plain text. And
> isn't what it requires for interchangeability of plain text.

If I understood your argument correctly, it amounts to a claim that Unicode defines plain text as a component in a data format, but not as something to be used as a full document. If that is correct, then there is much to fix -- I think that quite a lot of existing technology assumes the opposite (e.g. the use of "Content-Type: text/plain; charset=UTF-8" in MIME should be strongly discouraged, if the people who designed Unicode and UTF-8 think it is not appropriate for full documents). If I misunderstood, please correct me.

> > My main point, whose rejection baffles me to no end, is that it
> > should.
>
> Well, I'm not expecting that I can make you feel good about the
> situation. ;-) But perhaps the UTC position will seem a little less
> baffling.

As I hope I've shown above, there's plenty of reason for bafflement. The UTC defines code points to encode directionality, but then refuses to treat directionality as content when it comes to paragraph directionality; it defines a higher-level protocol as an agreement, and then turns around and says the word "agreement" actually means "decision". I can guess reasons why things are the way they are, but not justifications. I remain baffled.

Thanks, Shai.
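Not part of the original thread: the Bidi_Class values behind Ken's <L, L, L, ON> example can be inspected directly with Python's unicodedata. This only looks up character classes; it does not run the full Unicode Bidirectional Algorithm.

```python
# Sketch: the bidi character classes underlying the "abc!" example. A
# trailing neutral (ON) resolves to the paragraph direction, which is why
# the same code point sequence displays as {abc!} in an LTR paragraph and
# as {!abc} in an RTL one.
import unicodedata

text = "abc!"
print([unicodedata.bidirectional(c) for c in text])
# ['L', 'L', 'L', 'ON']

# The directionality-setting characters the thread mentions are encoded
# as ordinary code points with strong or embedding bidi classes:
for cp in ("\u200E", "\u200F", "\u202A", "\u202B"):  # LRM, RLM, LRE, RLE
    print(f"U+{ord(cp):04X}", unicodedata.bidirectional(cp))
```

The fact that RLM and friends are plain code points with bidi properties is the crux of Shai's argument that directionality is content, not just display.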
Re: Variation Sequences (and L2-11/059)
On Thu, Jul 19 2018 at 17:47 +0100, wjgo_10...@btinternet.com writes:

> Janusz S. Bien wrote:
>
>> You seem to assume that my concern is only rendering.
>
> Well my thinking is that what you are wanting is a way to accurately
> transcribe documents and maybe printed books from Old Polish into a
> Unicode-based electronic format so that the information can be more
> readily studied, while retaining glyph information that is not
> presently representable using Unicode characters.
>
> I found the following.
>
> https://en.wikipedia.org/wiki/Old_Polish_language
>
> WJGO >> So you could if you wish try to make your own font
>
> JSB > Actually I tried:
>
> JSB > https://bitbucket.org/jsbien/parkosz-font/
>
> Thank you for the link to the font. I have studied the font in the
> FontCreator program (version 8).
>
> I remember that I produced an OpenType font using Variation Selectors
> and OpenType Glyph Substitution back in April 2017. I wrote about it
> and provided a link to the font and a link to a typecase document.
>
> https://forum.high-logic.com/viewtopic.php?f=10&t=7033
>
> Although that font is about chess, I am thinking that that is the sort
> of font that is needed for what you are wanting to do. This could use
> variation selectors or could use circled digits as desired.
>
> I am a researcher and I am looking for a worthwhile project related to
> typography in which to participate from time to time - no money
> charged, no money to pay - and I am interested in printed books of the
> incunabula period and the early sixteenth century.
>
> I do not know any Polish, but I do not need to be involved in choosing
> which glyphs are needed, so my not knowing any Polish would not seem
> to be a problem.
>
> William Overington
>
> Thursday 19 July 2018

--
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien
RE: Unicode 11 Georgian uppercase vs. fonts
IMO, the correct answer is 2, except that "all common fonts" is more sweeping than necessary: it's sufficient for the fonts used for fallback in platforms and browsers, and the related fallback logic, to get updated. Of course, that takes some time, and it's not even two months since Unicode 11 was released.

The Georgian community understood that it would take time to get implementations in place, and that they would need to take measures to smooth over that transition — which can include having Web sites for Georgian businesses and institutions use fonts that match the requirements of the content.

Peter

From: Unicore On Behalf Of Markus Scherer via Unicore
Sent: Wednesday, July 18, 2018 3:05 PM
To: unicore UnicoRe Discussion
Cc: mark
Subject: Unicode 11 Georgian uppercase vs. fonts

Dear fellow Unicoders,

We've run into some significant problems with the Georgian capital letters added in Unicode 11. If you have run into them yourselves, or have feedback on our brainstormed solutions below, we'd love to hear your thoughts.

Here's the problem. The vast majority of Georgian fonts do not yet have the new uppercase characters. So when any system uses case mapping to uppercase text (e.g. browsers interpreting CSS's text-transform: capitalize), the users of Georgian will see boxes ("tofu") if the font they are using does not have the glyphs.

For example, a program constructs a web page with buttons. It uses a CSS style to uppercase text in buttons, as a house style. Unless the user has a very up-to-date font, they see tofu (boxes). If a server does backend rendering, its font has to be very up-to-date.

We also saw this problem in a program that was doing titlecasing, but on the first character it used the uppercase mappings rather than the titlecase mappings. Not the right thing to do, of course, but code that accidentally works (most of the time) doesn't get fixed if nobody reports a bug about it.
All of these will result in bad bugs in the UI, in software that formerly worked fine. We brainstormed some options to fix this:

1. Get all call sites to change their code to not uppercase Georgian (and fix titlecasing to use the titlecase mappings, not the uppercase mappings). Since we have no control over call sites and release cycles of affected software, this would not help Georgian users for a long time, if ever. We'd eventually want to retract these changes, creating even more work.

2. Change all common fonts with Georgian characters to add the U11.0 ones. This should eventually happen but would probably take a couple of years at least, which does not help users in the short term.

3. Hack font CMAPs to just map the new characters to the glyphs of the old ones. Works, but only when a programmer can control the fonts used, such as with server-side rendering or downloadable fonts.

4. Remove the uppercase mappings for Georgian, until the fonts catch up.
   * Would at least have to be done in all browsers, otherwise web apps will still break for Georgian.
   * A broader alternative is to do it in ICU. Because ICU is used by the majority of the browser implementations, it would solve the short-term problem for the browsers — and many other programs. Drawback: non-conformant, and uppercasing will be inconsistent depending on who has which variant of ICU (with vs. without the hack, on top of: with Unicode 11 vs. before Unicode 11).
   * One precedent is that in CLDR we deliberately hold back from using new currency characters until the font support is sufficiently widespread. (Wishing we'd held back the uppercase mappings in Unicode 11.0 too!)

Mark & Markus
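Not from the original message: the problem described above can be reproduced from any Python whose Unicode data includes Unicode 11 (CPython 3.7 and later). Uppercasing Mkhedruli letters now yields Mtavruli code points in the U+1C90 block, which are exactly the characters missing from pre-Unicode-11 fonts.

```python
# Sketch: the Unicode 11 Georgian case mappings as seen from Python 3.7+.
# Uppercasing the Mkhedruli letters ani, bani, gani produces the new
# Mtavruli code points - the ones that render as tofu in older fonts.
word = "\u10D0\u10D1\u10D2"  # Mkhedruli ani, bani, gani
upper = word.upper()

print([f"U+{ord(c):04X}" for c in upper])
# ['U+1C90', 'U+1C91', 'U+1C92']  - Mtavruli block, absent from older fonts
```

Any code path that calls an uppercasing function on Georgian text (CSS text-transform, ICU, a language runtime) hits the same mapping, which is why the options above all revolve around either the fonts or the mapping itself.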