Re: Encoding italic
On Thursday, 31 January 2019, James Kass via Unicode wrote:
> As for use of other variant letter forms enabled by the math alphanumerics, the situation exists. It’s an interesting phenomenon which is sometimes worthy of comment and relates to this thread because the math alphanumerics include italics. One of the web pages referring to third-party input tools calls the practice “super cool Unicode text magic”.

Not all devices can render such text, though. Many Android handsets on the market do not have a sufficiently recent version of Android to have system fonts that can render such existing usage. -- Andrew Cunningham lang.supp...@gmail.com
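The "text magic" these tools perform is a plain codepoint remapping into the Mathematical Alphanumeric Symbols block. A minimal sketch in Python (note the hole in the italic range: there is no MATHEMATICAL ITALIC SMALL H, because that letter was encoded much earlier as U+210E PLANCK CONSTANT):

```python
# Map ASCII letters onto the Mathematical Italic letters
# (capitals U+1D434..U+1D44D, small letters U+1D44E..U+1D467).
# The slot U+1D455 is reserved; small h must map to U+210E instead.
def to_math_italic(text: str) -> str:
    out = []
    for ch in text:
        if ch == "h":
            out.append("\u210e")  # PLANCK CONSTANT stands in for italic h
        elif "A" <= ch <= "Z":
            out.append(chr(0x1D434 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D44E + ord(ch) - ord("a")))
        else:
            out.append(ch)  # digits, spaces and punctuation pass through
    return "".join(out)
```

The result is still plain text, which is why it slips through systems that forbid markup; but as noted above, a handset whose system fonts lack Plane 1 coverage will show only missing-glyph boxes.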
Re: Ancient Greek apostrophe marking elision
On Sunday, 27 January 2019, Asmus Freytag via Unicode wrote:
> Choice of quotation marks is language-based and for novels, many times there are additional conventions that may differ by publisher.
>
> Wonder why the publisher is forcing single quotes on them

In theory quotation marks are language-based, but many languages have had the punctuation and typographic conventions of colonial languages imposed, even when it isn't the best choice. And publishers are following established patterns. The publishers that care about the language do try to distinguish or refine these characters typographically. Andrew -- Andrew Cunningham lang.supp...@gmail.com
Re: Encoding italic
Assuming some mechanism for italics is added to Unicode, when converting between the new plain text and HTML there would be insufficient information to convert correctly: many elements may have italic styling, and there would be no meta-information in the Unicode text to indicate the appropriate HTML element.

On Friday, 25 January 2019, wjgo_10...@btinternet.com via Unicode < unicode@unicode.org> wrote:
> Asmus Freytag wrote:
>> Other schemes, like a VS per code point, also suffer from being different in philosophy from "standard" rich text approaches. Best would be as standard extension to all the messaging systems (e.g. a common markdown language, supported by UI). A./
>
> Yet that claim of what would be best would be stateful and statefulness is the very thing that Unicode seeks to avoid.
>
> Plain text is the basic system and a Variation Selector mechanism after each character that is to become italicized is not stateful and can be implemented using existing OpenType technology.
>
> If an organization chooses to develop and use a rich text format then that is a matter for that organization and any changing of formatting of how italics are done when converting between plain text and rich text is the responsibility of the organization that introduces its rich text format.
>
> Twitter was just an example that someone introduced along the way, it was not the original request.
>
> Also this is not only about messaging. Of primary importance is the conservation of texts in plain text format, for example, where a printed book has one word italicized in a sentence and the text is being transcribed into a computer.
>
> William Overington
> Friday 25 January 2019

-- Andrew Cunningham lang.supp...@gmail.com
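To make the lossiness concrete, here is a sketch under a purely invented convention (no such convention exists in Unicode): suppose each italicised character were followed by a variation selector. Any converter must then pick one HTML element for every italic run, with no way to recover whether the author meant <em>, <cite>, <var>, or merely styled <i>:

```python
VS = "\ufe0e"  # illustrative stand-in for a hypothetical "italic" selector

def to_html(text: str, element: str = "i") -> str:
    # Wrap each run of selector-tagged characters in one element.
    # The element choice is a blind guess: the plain text carries no
    # record of which semantic the italics originally conveyed.
    out, italic, i = [], False, 0
    while i < len(text):
        ch = text[i]
        is_italic = i + 1 < len(text) and text[i + 1] == VS
        if is_italic and not italic:
            out.append(f"<{element}>")
            italic = True
        elif not is_italic and italic:
            out.append(f"</{element}>")
            italic = False
        out.append(ch)
        i += 2 if is_italic else 1
    if italic:
        out.append(f"</{element}>")
    return "".join(out)
```

Round-tripping through rich text and back would then flatten <em>, <cite> and friends into the same selector sequence, which is the conversion problem described above.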
Re: Encoding italic (was: A last missing link)
Hi Victor, an off-list reply. The contents are just random thoughts sparked by an interesting conversation.

On Wed, 16 Jan 2019 at 22:44, Victor Gaultney via Unicode < unicode@unicode.org> wrote:
> - It finally, and conclusively, would end the decades of the mess in HTML that surrounds <i> and <em>.

I am not sure that would fix the issue; more likely it would compound it, making it even more blurry what the semantic purpose is. HTML5 makes both <i> and <em> semantic ... and by definition the style of those elements is not necessarily italic. <i>, for instance, would be script-dependent, or partially script-dependent when another appropriate semantic tag is missing. A character/encoding-level distinction is just going to compound the mess. And then there are all the other script-specific typographic / typesetting conventions that should also be considered.

> My main point in suggesting that Unicode needs these characters is that italic has been used to indicate specific meaning - this text is somehow special - for over 400 years, and that content should be preserved in plain text.

Underlining, bold text, interletter spacing, colour change, font style change all are used to apply meaning in various ways. Not sure why italic is special in this sense. Additionally, without encoding the meaning of italic, all you know is that it is italic, not what convention of semantic meaning lies behind it.

And I am curious on your thoughts: if we distinguish italic in Unicode and encode some way of specifying italic text, wouldn't it make more sense to do away with italic fonts altogether and just roll the italic glyphs into the regular font? In theory, changing italic from a stylistic choice, as it currently is, to an encoding/character-level semantic is a paradigm shift. We don't have separate fonts for variation selectors or any other mechanism in Unicode, and it would seem to make sense to roll character glyph variation into a single font.
And it would potentially exclude italicisation from being a viable axis in a variable font. Just speculation on my part. To clarify, I am neither for nor against encoding italics, but so far there doesn't seem to be a robust case for it. If it were introduced, I would prefer a system that was more inclusive of all scripts, giving proper analysis of typesetting and typographic conventions in each script and well-founded decisions on which should be encoded. Cherry-picking one feature relevant to a small set of scripts seems to be a problematic path. I have enough trouble with ordered and unordered lists and list markers in HTML without expanding the italics mess in HTML. -- Andrew Cunningham lang.supp...@gmail.com
Re: Private Use areas
On Wednesday, 22 August 2018, Mark E. Shoulson via Unicode < unicode@unicode.org> wrote:
> On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote:
>
> Best we can do is shout loudly at OpenType tables and hope to cram in behavior (or at least appearance, which is more likely all we can get) that vaguely resembles what we're after. And that's not SO awful, given what we're dealing with.

At the moment I am looking at implementing three unencoded Arabic characters in the PUA. For the foreseeable future OpenType is a non-starter, so I will look at implementing them in Graphite tables in a font. Andrew -- Andrew Cunningham lang.supp...@gmail.com
Re: Unicode 11 Georgian uppercase vs. fonts
On Saturday, 28 July 2018, Asmus Freytag (c) via Unicode < unicode@unicode.org> wrote:
> A real plan would have consisted of documentation suggesting how to roll out library update, whether to change/augment CSS styling keywords, what types of locale adaptations of case transforms should be implemented, how to get OSs to deliver fonts to people, etc., etc.

It can be dealt with in various ways in CSS as it is. The question is why the designer chose to apply capitals, the purpose behind it, and how that should be appropriately internationalised. For instance, for Cherokee you may want to lowercase instead of uppercase, assuming this is wise. For other languages you may want to embolden text, italicise it, underline it, change colour, change interletter spacing, etc. Ultimately it's a question of whether you want a single UI design or a language-responsive UI design. -- Andrew Cunningham lang.supp...@gmail.com
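As a sketch of what "dealt with in CSS as it is" can look like: declare the design intent (capitals) once, then override it per language with `:lang()` selectors. The specific languages and substitutions below are illustrative choices, not recommendations:

```css
/* Design default: headings in capitals. */
h1 { text-transform: uppercase; }

/* Cherokee: lowercase forms may be preferred over forced uppercase. */
h1:lang(chr) { text-transform: lowercase; }

/* Scripts with no case distinction: drop the transform and carry the
   emphasis some other way, e.g. weight. */
h1:lang(my) { text-transform: none; font-weight: 700; }
```

This keeps a single stylesheet while letting the rendering respond to the declared content language, which is the "language-responsive UI design" option.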
Re: Northern Khmer on iPhone
On iOS it is fairly straightforward to arrange solutions for minority languages. Android has always been a challenge. Older versions of Android might not have rendering support for the script. Most handset manufacturers don't allow users to change fonts. A couple of handset manufacturers allow users to change between preinstalled fonts and in some cases allow installation of fonts via licensed solutions like FlipFont. There are a few apps available that allow you to install additional fonts, but changing the fonts is still device-dependent unless you root the handset. If you want to discuss specific devices or approaches, it is easiest to do so off-list. Andrew

On Wednesday, 1 March 2017, Richard Wordingham < richard.wording...@ntlworld.com> wrote:
> On Tue, 28 Feb 2017 23:09:05 +0100
> Philippe Verdy <verd...@wanadoo.fr> wrote:
>
>> ... default stock fonts will be enough if they fit the basic need for the language users want to use and will be rarely updated, unless they buy a new phone with a newer version of the OS featuring better stock fonts.
>
> I'm not sure that that applies to minority languages. I'm currently exploring the hypothesis that there is very little in the way of Northern Khmer on the web in the Thai script because input methods or rendering prevent or penalise (e.g. by dotted circles) its use. I am therefore interested in how compatible it is with mobile phones. Chatting with family and childhood friends is one place where using one's mother tongue might make good sense.
>
> Richard.

-- Andrew Cunningham lang.supp...@gmail.com
Re: Possible to add new precomposed characters for local language in Togo?
Thanks Doug, That would be welcome. On Saturday, 5 November 2016, Doug Ewell <d...@ewellic.org> wrote: > I am seeking technical information from a Microsoft team member. > Hopefully we will soon have definitive answers to replace all the > controversy. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -- Andrew Cunningham lang.supp...@gmail.com
Re: font-encoded hacks
Hi Neil,

I tend to prefer referring to them as pseudo-Unicode solutions, rather than hacked fonts or ad hoc fonts, and differentiating them from legacy or 8-bit solutions. My preferred approach would be to treat them as a separate encoding, but I doubt that will happen. It doesn't help that a mobile device I purchase in Australia will ship with a Unicode font installed, while the same device and model may ship with a non-Unicode font installed in Myanmar and potentially other parts of SE Asia.

Andrew

On 7 Oct 2016 22:04, "Neil Harris" wrote:
> On 07/10/16 07:42, Denis Jacquerye wrote:
>> In may case people resort to these hacks because it is an easier short term solution. All they have to do is use a specific font. They don't have to switch or find and install a keyboard layout and they don't have to upgrade to an OS that supports their script with Unicode properly. Because of these sort term solutions it's hard for a switch to Unicode to gain proper momentum. Unfortunately, not everybody sees the long term benefit, or often they see it but cannot do it practically.
>>
>> Too often Unicode compliant fonts or keyboard layouts have been lacking or at least have taken much longer to be implemented. One could wonder if a technical group for keyboards layouts would help this process.
>
> What might also help is a reconceptualization of these hacks as being in effect non-standard character encodings: the existing software infrastructure for handling charsets could then be co-opted to convert them to (and possibly from) Unicode if desired.
>
> Neil
Re: font-encoded hacks
Hi Mark, The converters would be interesting to see, and would be personally useful to me. But the type of keyboard layouts and input frameworks reflected in CLDR have limited bearing on issues related to the uptake of Unicode for Myanmar script. Andrew On 7 Oct 2016 17:54, "Mark Davis ☕️" <m...@macchiato.com> wrote: > We do provide data for keyboard mappings in CLDR (http://unicode.org/cldr/ > charts/latest/keyboards/index.html). There are some further pieces we > need to put into place. > >1. Provide a bulk uploader that applies our sanity-checking tests for >a proposed keyboard mapping, and provides real-time feedback to users about >the problems they need to fix. >2. Provide code that converts from the CLDR format into the major >platforms' formats (we have the reverse direction already). >3. (Optional) Prettier charts! > > > Mark > > On Fri, Oct 7, 2016 at 8:42 AM, Denis Jacquerye <moy...@gmail.com> wrote: > >> In may case people resort to these hacks because it is an easier short >> term solution. All they have to do is use a specific font. They don't have >> to switch or find and install a keyboard layout and they don't have to >> upgrade to an OS that supports their script with Unicode properly. Because >> of these sort term solutions it's hard for a switch to Unicode to gain >> proper momentum. Unfortunately, not everybody sees the long term benefit, >> or often they see it but cannot do it practically. >> >> Too often Unicode compliant fonts or keyboard layouts have been lacking >> or at least have taken much longer to be implemented. >> One could wonder if a technical group for keyboards layouts would help >> this process. >> >> On Fri, Oct 7, 2016, 07:12 Martin J. Dürst <due...@it.aoyama.ac.jp> >> wrote: >> >>> Hello Andrew, >>> >>> On 2016/10/07 11:11, Andrew Cunningham wrote: >>> > Considering the mess that adhoc fonts create. What is the best way >>> forward? >>> >>> That's very clear: Use Unicode. 
>>> >>> > Zwekabin, Mon, Zawgyi, and Zawgyi-Tai and their ilk? >>> > >>> > Most governemt translations I am seeing in Australia for Burmese are in >>> > Zawgyi, while most of the Sgaw Karen tramslations are routinely in >>> legacy >>> > 8-bit fonts. >>> >>> Why don't you tell the Australian government? >>> >>> Regards, Martin. >>> >> >
Re: font-encoded hacks
Hi Denis,

In some ways, it was easier. But looking at each language, the issues seem to have a slightly different slant.

Sgaw Karen is interesting in comparison to Burmese. There is some use of the hacked Zwekabin font by bloggers, but most content and key media still use 8-bit fonts, and there is little use of Unicode. The lack of uptake of Unicode fonts seems to lie in the fact that the default rendering for most Myanmar script fonts is Burmese. If Sgaw Karen, etc. are supported, it is via locl features. If Sgaw Karen users are using the font in software where they can't control the necessary OpenType features, or don't know that they can and need to, you will eventually get a perception that their language isn't supported. There are font developers among the Burmese, Mon and Shan ethnic groups developing Unicode fonts tailored for their needs.

The Burmese situation is quite different, a topic that I have discussed often with Burmese colleagues. I have my theories, but the current resurgence of Zawgyi very much depends on the ability of mobile devices to render Myanmar Unicode, and the choices telcos and handset manufacturers make regarding system fonts.

Regarding keyboards, it is interesting comparing Khmer and Burmese. Uptake of Unicode was earlier and quicker for Khmer. When Khmer keyboards were developed, the Khmer developers chose to live with the severe limitations of system-level input frameworks. It is only this year that I have started to see truly innovative research into what a Khmer input system should be. Burmese Unicode developers, on the other hand, were never satisfied with those limitations, and various developers looked into alternatives.

Andrew

On 7 Oct 2016 17:42, "Denis Jacquerye" <moy...@gmail.com> wrote:
> In may case people resort to these hacks because it is an easier short term solution. All they have to do is use a specific font.
They don't have to switch or find and install a keyboard layout and they don't have to upgrade to an OS that supports their script with Unicode properly. Because of these sort term solutions it's hard for a switch to Unicode to gain proper momentum. Unfortunately, not everybody sees the long term benefit, or often they see it but cannot do it practically. > > Too often Unicode compliant fonts or keyboard layouts have been lacking or at least have taken much longer to be implemented. > One could wonder if a technical group for keyboards layouts would help this process. > > > On Fri, Oct 7, 2016, 07:12 Martin J. Dürst <due...@it.aoyama.ac.jp> wrote: >> >> Hello Andrew, >> >> On 2016/10/07 11:11, Andrew Cunningham wrote: >> > Considering the mess that adhoc fonts create. What is the best way forward? >> >> That's very clear: Use Unicode. >> >> > Zwekabin, Mon, Zawgyi, and Zawgyi-Tai and their ilk? >> > >> > Most governemt translations I am seeing in Australia for Burmese are in >> > Zawgyi, while most of the Sgaw Karen tramslations are routinely in legacy >> > 8-bit fonts. >> >> Why don't you tell the Australian government? >> >> Regards, Martin.
Re: font-encoded hacks
On 7 Oct 2016 17:08, "Martin J. Dürst" <due...@it.aoyama.ac.jp> wrote:
> Hello Andrew,
>
> On 2016/10/07 11:11, Andrew Cunningham wrote:
>> Considering the mess that ad hoc fonts create, what is the best way forward?
>
> That's very clear: Use Unicode.

LOL, thanks Martin. That has been my position for a long time.

>> Zwekabin, Mon, Zawgyi, and Zawgyi-Tai and their ilk?
>>
>> Most government translations I am seeing in Australia for Burmese are in Zawgyi, while most of the Sgaw Karen translations are routinely in legacy 8-bit fonts.
>
> Why don't you tell the Australian government?

Easier to tell the state governments than the Federal government. But it is something I am working on.

> Regards, Martin.
font-encoded hacks
Considering the mess that ad hoc fonts create, what is the best way forward? Zwekabin, Mon, Zawgyi, and Zawgyi-Tai and their ilk? Most government translations I am seeing in Australia for Burmese are in Zawgyi, while most of the Sgaw Karen translations are routinely in legacy 8-bit fonts. Andrew

On Friday, 7 October 2016, Ken Whistler <kenwhist...@att.net> wrote:
> By the way, the biggest ongoing problem we deal with here is the continuing urge to proliferate font-encoded hacks for particular languages and writing systems. The text interchange problems that such schemes pose on an ongoing basis far far outweigh issues like what to do with a Shibuya 109 emoji, imo.

-- Andrew Cunningham lang.supp...@gmail.com
Myanmar Scripts and Languages FAQ
Hi,

I just finished looking at the Myanmar Scripts and Languages FAQ. A few comments.

Most of the questions and answers are specific to the Myanmar (Burmese) language. When discussing the ad hoc fonts, it would be useful to indicate that the ones already mentioned are Burmese-specific, and that each of the major languages has its own ad hoc font(s). Mon, Shan, Sgaw Karen and Western Pwo Karen have their own specific fonts.

It is also worth warning that most detectors and converters are language-specific. If your data has content in a range of Myanmar script languages, the results from such detectors and converters will be less than ideal.

Andrew
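To illustrate why such detectors are language- and encoding-specific, here is a deliberately naive sketch of one well-known Burmese heuristic (real detectors use trained models and many more signals; this is illustrative only). In Unicode encoding order, the vowel sign E (U+1031) follows the consonant it attaches to, while Zawgyi stores text in visual order, so U+1031 appears before the consonant:

```python
# Codepoints after which U+1031 may plausibly appear in Unicode-ordered
# Burmese: consonants (U+1000..U+1021) and medial signs (U+103B..U+103E).
# A simplification for illustration; not a production detector.
VALID_BEFORE_E = set(range(0x1000, 0x1022)) | set(range(0x103B, 0x103F))

def looks_like_zawgyi(text: str) -> bool:
    for i, ch in enumerate(text):
        if ord(ch) == 0x1031:
            if i == 0 or ord(text[i - 1]) not in VALID_BEFORE_E:
                return True  # E vowel with nothing plausible before it
    return False
```

Even where something like this works for Burmese, it says nothing about Zwekabin-encoded Sgaw Karen or the Mon and Shan ad hoc fonts, which is exactly the limitation worth flagging in the FAQ.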
RE: Myanmar character set
Hi Andrew, I assume the issue is with mym2 shaper? Andrew C On 13 Aug 2016 5:02 am, "Andrew Glass" <andrew.gl...@microsoft.com> wrote: > > Hi Taylor and Andrew, > > > > This is a known issue with the Myanmar engine on Windows. We are tracking the issue, but don’t have a date for the fix at this time. > > > > Cheers, > > > > Andrew > > > > From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Andrew Cunningham > Sent: Thursday, August 11, 2016 8:51 PM > To: Taylor Canning <taylorcann...@outlook.com> > Cc: Unicode Mailing List <unicode@unicode.org> > > Subject: Re: Myanmar character set > > > > Hi Taylor, > > This should work fine in theory. Are you using a mymr or mym2 style opentype font? > > What applications, operating system and fonts are you using? > > Andrew > > > > On 12 Aug 2016 12:55 pm, "Taylor Canning" <taylorcann...@outlook.com> wrote: >> >> Hi there, has anyone had any issues with the Myanmar character set – i have raised an issue recently where the combination ၣ and ် does not combine correctly to make ၣ် on my windows devices. It used to work just fine. It is am extremely common tonal marker and is a big issue for anyone who types the S’Gaw Karen language, which is a lot of people ! >> >> Thanks, Taylor >> >> >> >> Sent from my Windows 10 phone >> >>
Re: Myanmar character set
Hi Taylor, This should work fine in theory. Are you using a mymr or mym2 style opentype font? What applications, operating system and fonts are you using? Andrew On 12 Aug 2016 12:55 pm, "Taylor Canning"wrote: > Hi there, has anyone had any issues with the Myanmar character set – i > have raised an issue recently where the combination ၣ and ် does not > combine correctly to make ၣ် on my windows devices. It used to work just > fine. It is am extremely common tonal marker and is a big issue for anyone > who types the S’Gaw Karen language, which is a lot of people ! > > Thanks, Taylor > > > > Sent from my Windows 10 phone > > >
Re: Mende Kikakui Number 10
Marcel, it isn't so much that the conversation was exhausted, rather that the original question has been sufficiently answered. A. On Sunday, 12 June 2016, Marcel Schneider <charupd...@orange.fr> wrote: > On Sat, 11 Jun 2016 12:25:39 +0200, Philippe Verdy wrote: >> >> Exactly, Unicode should not create its own logic about scripts or numeral systems. >> >> All looks like the encoding of 10 as a pair (ONE+combining TENS) was a severe >> conceptual error that could have been avoided by NOT encoding "TENS" as combining >> but as a regular number/digit TEN usable isolately, and forming a contectual >> ligature with a previous digit from TWO to NINE. >> >> The encoding of 10 as (ONE+TENS) is superfluously needing an artificial leading >> ONE. This is purely an Unicode construction, foreign to the logic of the numeral >> system. >> > > > Seeing the discussion exhausted, I join my hope to Philippe Verdyʼs, > and reinforce by quoting Asmus Freytag on backcompat vs enhancement, > before bringing another concern: > > «If you add a feature to match behavior somewhere else, > it rarely pays to make that perform "better", because > it just means it's now different and no longer matches. > The exception is a feature for which you can establish > unambiguously that there is a metric of correctness or > a widely (universally?) shared expectation by users > as to the ideal behavior. In that case, being compatible > with a broken feature (or a random implementation of one) > may in fact be counter productive.» > > http://www.unicode.org/mail-arch/unicode-ml/y2016-m03/0109.html > > Being bound with stability guarantees, Unicode could eventually add a _new_ > > *1E8D7 MENDE KIKAKUI NUMBER TEN > > Best wishes, > > Marcel > > -- Andrew Cunningham lang.supp...@gmail.com
Re: Mende Kikakui Number 10
rlig is the quickest and easiest approach, but in theory it could be done in other, more complicated ways. There are currently no OpenType implementations that I know of, and no known shapers. rlig hopefully works with general shapers, but what OT features will be expected by a script-specific shaper is still unknown. On Saturday, 11 June 2016, Michael Everson <ever...@evertype.com> wrote: > On 11 Jun 2016, at 02:47, Andrew Cunningham <lang.supp...@gmail.com> wrote: > >> It can be done via a ligature. It would have to be a required ligature. Since other ligature types may or may not be enabled in various contexts. And we dont want default substitution and mark positioning to generate a non-ligature equivalent. > > Aren’t all of the number combinations required ligatures? > > Michael > -- Andrew Cunningham lang.supp...@gmail.com
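In OpenType feature syntax, the required ligature described here might look like the following sketch. The glyph names are hypothetical; a real font would use its own names for U+1E8C7 MENDE KIKAKUI DIGIT ONE and U+1E8D1 MENDE KIKAKUI COMBINING NUMBER TENS:

```fea
# Hypothetical AFDKO feature-file sketch: substitute DIGIT ONE followed by
# COMBINING NUMBER TENS with a single "ten" ligature glyph. Using rlig
# (required ligatures) means the substitution still applies in contexts
# where discretionary ligatures are disabled.
feature rlig {
    sub u1E8C7 u1E8D1 by u1E8C7_u1E8D1.liga;
} rlig;
```

Putting it in rlig rather than liga or dlig addresses the concern quoted below: users and applications can switch the discretionary ligature features off, but required ligatures are expected to stay on.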
Re: Mende Kikakui Number 10
I am not suggesting it needs to be encoded. And I did suggest that using the digit one and the symbol for tens was an option. It can be done via a ligature. It would have to be a required ligature, since other ligature types may or may not be enabled in various contexts, and we don't want default substitution and mark positioning to generate a non-ligature equivalent. A. And it will be interesting to see which rendering engines handle Kikakui. A. On Saturday, 11 June 2016, Ken Whistler <kenwhist...@att.net> wrote: > > On 6/10/2016 5:34 PM, Andrew Cunningham wrote: >> >> There are two few descriptions of the system for me to be definitive but the number ten seems hold a unique position within the numeral system. > > As does the number 10 in every decimal numeral system. ;-) > > But that doesn't automatically require that it be *encoded* with a single character. After all the number 10 in the European decimal numeral system is also represented with a character sequence: <0031, 0030>. > > --Ken > > -- Andrew Cunningham lang.supp...@gmail.com
Re: Mende Kikakui Number 10
On Saturday, 11 June 2016, Ken Whistler <kenwhist...@att.net> wrote: > > I disagree about that. There is no reason to depart from the logic of the system for this one value. Add one ligature glyph to your font for the sequence for 10, and you're done. > > There is the logic of how Kikakui numbers are encoded in Unicode and there is the internal logic of the numeral system itself. They are not necessarily the same. There are too few descriptions of the system for me to be definitive, but the number ten seems to hold a unique position within the numeral system. A. -- Andrew Cunningham lang.supp...@gmail.com
Re: Mende Kikakui Number 10
Hi Philippe,

On Saturday, 11 June 2016, Philippe Verdy <verd...@wanadoo.fr> wrote:
> OK, <ONE;combining TEENS> represents 11, but <ONE;combining TENS> is not clearly represents 10, and the proposals do not exhibit 10 with the same glyph as PU (even if it is based on it, in fact the combining TENS is a small subscript glyph variant of letter/syllable PU intended to mark digits).

The Mende Kikakui script displays a high degree of glyph variation, some of it minor, some more substantive. The syllable PU can be found as it is in the charts, and it can be found looking like the number 10. Other variations are also observed. The ideal situation would have been to encode the number 10, but in its absence, I guess ONE+TENS may be the approach, even though it seems less than ideal. A. -- Andrew Cunningham lang.supp...@gmail.com
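As a concrete sketch of the ONE+TENS representation (with the font then expected to ligate the pair into the PU-like glyph for ten):

```python
# Mende Kikakui ten as a two-codepoint sequence:
# U+1E8C7 MENDE KIKAKUI DIGIT ONE
# + U+1E8D1 MENDE KIKAKUI COMBINING NUMBER TENS
TEN = "\U0001E8C7\U0001E8D1"

# For contrast, eleven uses the TEENS combining mark (U+1E8D0):
ELEVEN = "\U0001E8C7\U0001E8D0"

assert len(TEN) == 2   # two codepoints, one expected ligature glyph
assert TEN != ELEVEN   # TENS and TEENS are distinct marks
```

Without font support the sequence falls back to a digit-one glyph plus a combining mark, which is the "less than ideal" rendering risk mentioned above.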
Re: Mende Kikakui Number 10
The original proposals included a specific number 10 codepoint. I assume it was removed and its representation was to be generated by use of the combining characters. In the original proposal there was nothing corresponding to ONE+TENS; instead there was a distinct number TEN, whose glyph was identical to the glyph for the syllable PU. A. On Friday, 10 June 2016, Philippe Verdy <verd...@wanadoo.fr> wrote: > I do not contest that about number 11, and it was not the question ! > The question was about number **10**: > * ONE+TENS or ONE+TEENS ? > This is NOT specified clearly in TUS Chapter 19 which speaks about numbers 1-9 then 11-19 for TEENS, and TENS for numbers 20-99. > The question is the same about 110,210,...,910: > * (ONE..NINE)+HUNDREDS+ONE+TENS or (ONE..NINE)+HUNDREDS+ONE+TEENS ? > For me it seems that both questions will repy with ONE+TENS, not ONE+TEENS. > > 2016-06-10 9:00 GMT+02:00 Andrew Cunningham <lang.supp...@gmail.com>: >> >> Hi Phillipe, >> >> ONE+TEENS (1E8C7,1E8D0) is definitely the number 11 >> >> A. >> >> On 10 Jun 2016 4:53 pm, "Philippe Verdy" <verd...@wanadoo.fr> wrote: >>> >>> Given that there's no digit for zero, you need to append combining characters to digits 1-9 in order to multiply them by a base 10/100/1,000/10,000/100,000/1,000,000. The system is then additive. I don't know how zero is represented. Note that for base 10, when the first digit is 1 (i.e. for numbers 11-19), the combining character is not 1E8D1 (TENS) but 1E8D0 (TEENS), i.e. the slash-like glyph. But the description says that TEENS is only for numbers 11-19, not for number 10. >>> But I agree that there should be a reference in http://www.unicode.org/charts/PDF/U1E800.pdf, to the description in http://www.unicode.org/versions/Unicode8.0.0/ch19.pdf (section 19.8, pages 722-723) that would explain how to render 10 (add some rows in table 19-6 for the numbers 10/100/.../1,000,000). >>> This leaves a hole in the description. 
I'm not sure that the glyph for PU is exactly the glyph for 10. Or what is the appropriate sequence: ONE+TENS (1E8C7,1E8D1) or ONE+TEENS (1E8C7,1E8D0) ? The description is ambiguous, and probably both sequences should produce the equivalent glyph. However the letter PU (when meaning number 10) looks more like the glyph produced by ONE+TEN (1E8C7,1E8D1). >>> Then how to represent zero ? Probably by a syllable or word meaning "none" (don't know which it is), or by using European or Arabic digits (as indicated in Chapter 19). >>> >>> >>> 2016-06-10 8:15 GMT+02:00 Andrew Cunningham <lang.supp...@gmail.com>: >>>> >>>> Ok looking at issue again I guess the other alternative is to have a discontiguous set of numbers. Represent 10 as U+1E8C7 U+1E8D1 and map it within the font to the PU glyph. >>>> >>>> And hope that font developers don't create a glyph based on shape of U+1E8C7 and U+1E8D1, but PU instead. >>>> >>>> Andrew >>>> >>>> On Friday, 10 June 2016, Andrew Cunningham <lang.supp...@gmail.com> wrote: >>>> > Hi, >>>> > Currently I am doing some work on the Mende Kikakui script, and I was wondering what the best way was to represent the number 10. >>>> > In the early proposals for the script there was a glyph and codepoint specifically for the number 10. When the model for Mende Kikakui numbers was changed before the finalising of the code block, the number ten was removed. But using existing digits and numbers we can produce 1-9 and 11 -> but we can not produce the number 10 from digits and numbers. >>>> > The number ten uses the same glyph as syllable PU U+1E88E. >>>> > Should I use U+1E88E to represent both the number 10 and the syllable PU? >>>> > Andrew >>>> > >>>> > -- >>>> > Andrew Cunningham >>>> > lang.supp...@gmail.com >>>> > >>>> > >>>> > >>>> >>>> -- >>>> Andrew Cunningham >>>> lang.supp...@gmail.com >>>> >>>> >>>> >>> > > -- Andrew Cunningham lang.supp...@gmail.com
Mende Kikakui Number 10
I'd agree that it is likely ONE+TENS. Looking at the original proposal and articles on the number system, it was originally 1-9, 10, 11-19, 20-99, etc., but became 1-9, 11-19, 20-99, etc. during the deliberations on the model the numbers would follow. A. At least that's how I reconstruct it from the public documents I have seen. On Friday, 10 June 2016, Philippe Verdy <verd...@wanadoo.fr> wrote: > I do not contest that about number 11, and it was not the question ! > The question was about number **10**: > * ONE+TENS or ONE+TEENS ? > This is NOT specified clearly in TUS Chapter 19 which speaks about numbers 1-9 then 11-19 for TEENS, and TENS for numbers 20-99. > The question is the same about 110,210,...,910: > * (ONE..NINE)+HUNDREDS+ONE+TENS or (ONE..NINE)+HUNDREDS+ONE+TEENS ? > For me it seems that both questions will repy with ONE+TENS, not ONE+TEENS. > > 2016-06-10 9:00 GMT+02:00 Andrew Cunningham <lang.supp...@gmail.com>: >> >> Hi Phillipe, >> >> ONE+TEENS (1E8C7,1E8D0) is definitely the number 11 >> >> A. >> >> On 10 Jun 2016 4:53 pm, "Philippe Verdy" <verd...@wanadoo.fr> wrote: >>> >>> Given that there's no digit for zero, you need to append combining characters to digits 1-9 in order to multiply them by a base 10/100/1,000/10,000/100,000/1,000,000. The system is then additive. I don't know how zero is represented. Note that for base 10, when the first digit is 1 (i.e. for numbers 11-19), the combining character is not 1E8D1 (TENS) but 1E8D0 (TEENS), i.e. the slash-like glyph. But the description says that TEENS is only for numbers 11-19, not for number 10. >>> But I agree that there should be a reference in http://www.unicode.org/charts/PDF/U1E800.pdf, to the description in http://www.unicode.org/versions/Unicode8.0.0/ch19.pdf (section 19.8, pages 722-723) that would explain how to render 10 (add some rows in table 19-6 for the numbers 10/100/.../1,000,000). >>> This leaves a hole in the description. I'm not sure that the glyph for PU is exactly the glyph for 10. 
Or what is the appropriate sequence: ONE+TENS (1E8C7,1E8D1) or ONE+TEENS (1E8C7,1E8D0) ? The description is ambiguous, and probably both sequences should produce the equivalent glyph. However the letter PU (when meaning number 10) looks more like the glyph produced by ONE+TEN (1E8C7,1E8D1). >>> Then how to represent zero ? Probably by a syllable or word meaning "none" (don't know which it is), or by using European or Arabic digits (as indicated in Chapter 19). >>> >>> >>> 2016-06-10 8:15 GMT+02:00 Andrew Cunningham <lang.supp...@gmail.com>: >>>> >>>> Ok looking at issue again I guess the other alternative is to have a discontiguous set of numbers. Represent 10 as U+1E8C7 U+1E8D1 and map it within the font to the PU glyph. >>>> >>>> And hope that font developers don't create a glyph based on shape of U+1E8C7 and U+1E8D1, but PU instead. >>>> >>>> Andrew >>>> >>>> On Friday, 10 June 2016, Andrew Cunningham <lang.supp...@gmail.com> wrote: >>>> > Hi, >>>> > Currently I am doing some work on the Mende Kikakui script, and I was wondering what the best way was to represent the number 10. >>>> > In the early proposals for the script there was a glyph and codepoint specifically for the number 10. When the model for Mende Kikakui numbers was changed before the finalising of the code block, the number ten was removed. But using existing digits and numbers we can produce 1-9 and 11 -> but we can not produce the number 10 from digits and numbers. >>>> > The number ten uses the same glyph as syllable PU U+1E88E. >>>> > Should I use U+1E88E to represent both the number 10 and the syllable PU? >>>> > Andrew >>>> > >>>> > -- >>>> > Andrew Cunningham >>>> > lang.supp...@gmail.com >>>> > >>>> > >>>> > >>>> >>>> -- >>>> Andrew Cunningham >>>> lang.supp...@gmail.com >>>> >>>> >>>> >>> > > -- Andrew Cunningham lang.supp...@gmail.com -- Andrew Cunningham lang.supp...@gmail.com
Re: Mende Kikakui Number 10
Hi Philippe, ONE+TEENS (1E8C7,1E8D0) is definitely the number 11 A. On 10 Jun 2016 4:53 pm, "Philippe Verdy" <verd...@wanadoo.fr> wrote: > Given that there's no digit for zero, you need to append combining > characters to digits 1-9 in order to multiply them by a base > 10/100/1,000/10,000/100,000/1,000,000. The system is then additive. I don't > know how zero is represented. Note that for base 10, when the first digit > is 1 (i.e. for numbers 11-19), the combining character is not 1E8D1 (TENS) > but 1E8D0 (TEENS), i.e. the slash-like glyph. But the description says that > TEENS is only for numbers 11-19, not for number 10. > > But I agree that there should be a reference in > http://www.unicode.org/charts/PDF/U1E800.pdf, to the description in > http://www.unicode.org/versions/Unicode8.0.0/ch19.pdf (section 19.8, > pages 722-723) that would explain how to render 10 (add some rows in table > 19-6 for the numbers 10/100/.../1,000,000). > > This leaves a hole in the description. I'm not sure that the glyph for PU > is exactly the glyph for 10. Or what is the appropriate sequence: > ONE+TENS (1E8C7,1E8D1) or ONE+TEENS (1E8C7,1E8D0) ? The description is > ambiguous, and probably both sequences should produce the equivalent glyph. > However the letter PU (when meaning number 10) looks more like the glyph > produced by ONE+TENS (1E8C7,1E8D1). > > Then how to represent zero ? Probably by a syllable or word meaning "none" > (don't know which it is), or by using European or Arabic digits (as > indicated in Chapter 19). > > > > 2016-06-10 8:15 GMT+02:00 Andrew Cunningham <lang.supp...@gmail.com>: > >> OK, looking at the issue again, I guess the other alternative is to have a >> discontiguous set of numbers. Represent 10 as U+1E8C7 U+1E8D1 and map it >> within the font to the PU glyph. >> >> And hope that font developers don't create a glyph based on shape of >> U+1E8C7 and U+1E8D1, but PU instead. 
>> >> Andrew >> >> >> On Friday, 10 June 2016, Andrew Cunningham <lang.supp...@gmail.com> >> wrote: >> > Hi, >> > Currently I am doing some work on the Mende Kikakui script, and I was >> wondering what the best way was to represent the number 10. >> > In the early proposals for the script there was a glyph and codepoint >> specifically for the number 10. When the model for Mende Kikakui numbers >> was changed before the finalising of the code block, the number ten was >> removed. But using existing digits and numbers we can produce 1-9 and 11 -> >> but we can not produce the number 10 from digits and numbers. >> > The number ten uses the same glyph as syllable PU U+1E88E. >> > Should I use U+1E88E to represent both the number 10 and the syllable >> PU? >> > Andrew >> > >> > -- >> > Andrew Cunningham >> > lang.supp...@gmail.com >> > >> > >> > >> >> -- >> Andrew Cunningham >> lang.supp...@gmail.com >> >> >> >> >
Re: Mende Kikakui Number 10
OK, looking at the issue again, I guess the other alternative is to have a discontiguous set of numbers. Represent 10 as U+1E8C7 U+1E8D1 and map it within the font to the PU glyph. And hope that font developers don't create a glyph based on the shape of U+1E8C7 and U+1E8D1, but PU instead. Andrew On Friday, 10 June 2016, Andrew Cunningham <lang.supp...@gmail.com> wrote: > Hi, > Currently I am doing some work on the Mende Kikakui script, and I was wondering what the best way was to represent the number 10. > In the early proposals for the script there was a glyph and codepoint specifically for the number 10. When the model for Mende Kikakui numbers was changed before the finalising of the code block, the number ten was removed. But using existing digits and numbers we can produce 1-9 and 11 -> but we can not produce the number 10 from digits and numbers. > The number ten uses the same glyph as syllable PU U+1E88E. > Should I use U+1E88E to represent both the number 10 and the syllable PU? > Andrew > > -- > Andrew Cunningham > lang.supp...@gmail.com > > > -- Andrew Cunningham lang.supp...@gmail.com
Re: Joined "ti" coded as "O" in PDF
The t_i instance will depend on the quality of the font. If it's a standard ligature there should be a glyph-to-codepoints assignment in the cmap table or the ToUnicode mapping in the PDF file. As David indicates, it isn't a Unicode issue. It is an issue with the font used and/or the tools used. PDFs have always been problematic. That isn't going to change anytime soon. For archivable or accessible PDFs, the person generating the PDFs should select the best tools for the job and test the PDF, then fix any problems. Andrew On Sunday, 8 May 2016, David Perry <hospe...@scholarsfonts.net> wrote: > I agree that it's a real-world problem -- PDFs really should be searchable -- but I do not see that it's a Unicode issue. Unicode defines the basic building blocks of LATIN SMALL LETTER T and LATIN SMALL LETTER I; that's its job. Unicode is not responsible for font construction or creating PDF software. Furthermore, even if Unicode did want to do something about it, I can't imagine what that could be -- aside perhaps from using its bully pulpit to urge PDF creators and font creators to do their jobs better. > > The fact that some PDF apps do not search and copy/paste text correctly when unencoded characters are given PUA values has been known for many years. In the case of Calibri, I looked at the font (version installed on my Win7 system) and found that the 'ti' ligature is named t_i, which follows good naming practices, and it does not have a PUA assignment. Given this, any well-constructed PDF app should be able to decode the ligature correctly. > > David > > On 5/6/2016 11:49 AM, Steve Swales wrote: >> >> This discussion seems to have fizzled out, but I’m concerned that >> there’s a real world problem here which is at least partially the >> concern of the consortium, so let me stir the pot and see if there’s >> still any meat left. 
>> >> On the current release of MacOS (including the developer beta, for >> your reference, Peter), if you use Calibri font, for example, in any >> app (e.g. notes), to write words with “ti” (like >> internationalization), then press “Print" and “Open PDF in Preview”, >> you get a PDF document with the joined “ti”. Subsequently cutting and >> pasting produces mojibake, and searching the document for words >> with“ti” doesn’t work, as previously noted. >> >> I suppose we can look on this as purely a font handling/MacOS bug, but >> I’m wondering if we should be providing accommodations or conveniences >> in Unicode for it to work as desired. >> >> -steve >> > -- Andrew Cunningham lang.supp...@gmail.com
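David's point about well-mapped ligatures can be illustrated with a toy model of the extraction step. The glyph IDs below are invented and this is not a real PDF parser; it only shows why a ligature glyph that is missing from the mapping becomes unsearchable:

```python
# Toy model of PDF text extraction (glyph IDs invented for illustration).
# A ToUnicode-style map may send one glyph to one code point, or one glyph
# (such as a t_i ligature) to several code points.
to_unicode = {
    0x0055: "t",
    0x0056: "i",
    0x0121: "ti",   # a well-mapped t_i ligature glyph
}

def extract_text(glyph_run):
    """Recover text from a run of glyph IDs; unmapped glyphs come out as U+FFFD."""
    return "".join(to_unicode.get(g, "\ufffd") for g in glyph_run)

print(extract_text([0x0055, 0x0056]))   # "ti" via two glyphs
print(extract_text([0x0121]))           # "ti" via the ligature glyph
print(extract_text([0x0999]))           # an unmapped glyph is unrecoverable
```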
Re: Joined "ti" coded as "O" in PDF
My understanding is that searchability comes down to two factors: 1) the ToUnicode mapping, which maps glyphs in the font or subsetted font to Unicode codepoints. Mappings take the form of one glyph to one codepoint, or one glyph to two or more codepoints. Obviously any glyph that doesn't resolve by default to a codepoint isn't in the mapping, nor does the mapping handle glyphs that have been visually reordered during rendering. 2) the next step is to tag the PDF, then use the ActualText label of each tag. So for some languages, with the right fonts, step one is all that is needed. And this is fairly standard in PDF generation tools. The font itself can impact on this of course. But for other languages you need to go to the second step. With the languages I work with, I might have some PDFs that just require the visible text layer; others will need ActualText as well as a visible text layer. For the PDF to be searchable, the search tools not only need to be able to handle the text layer but also ActualText attributes when necessary. And that all comes down to decisions the tool developer has taken on how to handle searching when both visible text layers and ActualText labels are present. I have been told in accessibility lists that the PDF specs leave that implementation detail to the developer based on their requirements. So in some cases you need to go the extra step and add ActualText. But you also need to evaluate your search tools to ensure they do what you expect. Andrew On Saturday, 7 May 2016, Steve Swales <st...@swales.us> wrote: > This discussion seems to have fizzled out, but I’m concerned that there’s a real world problem here which is at least partially the concern of the consortium, so let me stir the pot and see if there’s still any meat left. > On the current release of MacOS (including the developer beta, for your reference, Peter), if you use Calibri font, for example, in any app (e.g. 
notes), to write words with “ti” (like internationalization), then press “Print" and “Open PDF in Preview”, you get a PDF document with the joined “ti”. Subsequently cutting and pasting produces mojibake, and searching the document for words with “ti” doesn’t work, as previously noted. > I suppose we can look on this as purely a font handling/MacOS bug, but I’m wondering if we should be providing accommodations or conveniences in Unicode for it to work as desired. > -steve > > > On Mar 21, 2016, at 1:40 AM, Philippe Verdy <verd...@wanadoo.fr> wrote: > Are those PDF supposed to be searchable inside of them ? For archival purpose, the PDF are stored in their final form, and search is performed by creating a database of descriptive metadata. Each time one wants formal details, they have to read the original the way it was presented (many PDFs are just scanned facsimiles of old documents which originally were not even in numeric plain-text, they were printed or typewritten, frequently they include graphics, handwritten signatures, stamped seals...) > Being able to search plain-text inside a PDF is not the main objective (and not the priority). The archival however is a top priority (and there's no money to finance a digitisation and no human resource available to redo this old work, if needed other contributors will recreate a plain-text version, possibly with rich-text features, e.g. in Wikisource for old documents that fall in the public domain). > PDF/A-1a is meant only for creating new documents from an original plain-text or rich-text document created with modern word-processing applications. But this specification will frequently have to be broken, if there's the need to include handwritten or supplementary elements (signatures, seals...) 
whose source is not the original electronic document but the printed paper over which the annotations were made: it is this paper document, not the electronic document which is the official final source (we've got some important legal paper whose original has other marks including traces of beer or coffee, or partly burnt, the paper itself has several alterations, but it is the original "as is", and for legal purpose the only acceptable archival form as a PDF must ignore all the PDF/A-1a constraints, not meant to represent originals accurately). > 2016-03-20 20:52 GMT+01:00 Tom Gewecke <t...@bluesky.org>: >> >> > On Mar 20, 2016, at 12:24 PM, Asmus Freytag (t) < asmus-...@ix.netcom.com> wrote: >> > >> > Usually, the archive feature pertains only to the fact that you can reproduce the final form, not to being able to get at the correct source (plain text backbone) for the document. >> >> My understanding is that PDF/A-1a is supposed to be searchable. >> >> >> > > > -- Andrew Cunningham lang.supp...@gmail.com
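The one-glyph-to-many-code-points case described in this thread is expressed inside a PDF's ToUnicode CMap roughly as follows. This is an illustrative fragment only: the glyph IDs are invented, and the destination strings are UTF-16BE code units, which is how one glyph can map to several code points:

```
% Illustrative ToUnicode CMap fragment (glyph IDs invented).
% <0121> stands for a t_i ligature glyph; its destination <00740069>
% is the UTF-16BE string "ti", so extraction recovers two code points.
2 beginbfchar
<0055> <0074>
<0121> <00740069>
endbfchar
```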
Re: Non-standard 8-bit fonts still in use
Don, Most African communities I work with within the diaspora are using Unicode, although 8-bit legacy content is still in use. Probably the most use I see of legacy encodings is among the Karen languages. Sgaw Karen users seem to still be using 8-bit fonts. There is a pseudo-Unicode solution, but 8-bit fonts still dominate. The problem for Karen is that the default rendering for Unicode fonts isn't suitable, and locl support in applications has been lagging. The ideal Unicode font for the Myanmar script would have somewhere between 8 and 10 language systems. Cross-platform support is lacking. Currently the best approach is a separate font for each language system. Andrew On Friday, 16 October 2015, Don Osborn <d...@bisharat.net> wrote: > I was surprised to learn of continued reference to and presumably use of 8-bit fonts modified two decades ago for the extended Latin alphabets of Malian languages, and wondered if anyone has similar observations in other countries. Or if there have been any recent studies of adoption of Unicode fonts in the place of local 8-bit fonts for extended Latin (or non-Latin) in local language computing. > > At various times in the past I have encountered the idea that local languages with extended alphabets in Africa require special fonts (that region being my main geographic area of experience with multilingual computing), but assumed that this notion was fading away. > > See my recent blog post for a quick and by no means complete discussion about this topic, which of course has to do with more than just the fonts themselves: http://niamey.blogspot.com/2015/10/the-secret-life-of-bambara-arial.html > > TIA for any feedback. > > Don Osborn > > > -- Andrew Cunningham lang.supp...@gmail.com
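The per-language rendering Andrew describes is what the OpenType locl feature provides. A hypothetical feature-file (FEA) fragment, where the glyph names are invented, mym2 is the Myanmar script tag, and KSW is the OpenType language system tag for Sgaw Karen:

```
# Hypothetical FEA sketch: give Sgaw Karen (KSW) its own glyph form
# when the text run is tagged with that language system. Glyph names
# like "uni1004.ksw" are invented for illustration.
languagesystem mym2 dflt;
languagesystem mym2 KSW;

feature locl {
    script mym2;
    language KSW;
    sub uni1004 by uni1004.ksw;
} locl;
```

This only helps where the application actually passes the language tag through to the shaper, which is the lagging locl support Andrew mentions.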
Joined "ti" coded as "Ɵ" in PDF
Janusz, It is all smoke and mirrors. For English you have to choose the right font. Simple: no advanced features. Disable advanced typographic features in the application if you can. Ensure the cmap table in the font is sufficiently comprehensive. The issues Don raises still exist in PDF/A. You would need to make fundamental changes to the PDF spec for it to work for any language. For other languages, especially those in complex scripts, the situation is more dire, especially when glyphs have been reordered. The accepted workaround is ActualText. But you don't necessarily need ActualText; it depends on the font and language. But the rub is that it is left to implementors to decide if and when the ActualText is used. All aspects of the document ecosystem need to be looked at: which tools can use ActualText instead of the visible text layer. The PDF/UA spec is probably closer to the mark than the PDF/A spec. But since most archives have no control over PDF production, authors' or publishers' font selection, tools used, etc., working with PDFs can be fairly hit and miss. For languages written in complex scripts, it's usually a miss rather than a hit. I rarely see ActualText in PDF files, even in those that need it. Andrew On Sunday, 20 March 2016, Janusz S. Bien <jsb...@mimuw.edu.pl> wrote: > Quote/Cytat - Andrew Cunningham <lang.supp...@gmail.com> (Sun 20 Mar 2016 12:06:29 AM CET): > >> Hi Don, >> >> Latin is fine if you keep to simple well made fonts and avoid using more >> sophisticated typographic features available in some fonts. >> >> Dumb it down typographically and it works fine. PDF, despite all the >> current rhetoric coming from PDF software developers, is a preprint format. >> Not an archival format. > > What about PDF/A, ISO 19005-1:2005 Document Management – Electronic document file format for long term preservation? > > Best regards > > Janusz > > -- > Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) > Prof. Janusz S. 
Bień - University of Warsaw (Formal Linguistics Department) > jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ > > -- Andrew Cunningham lang.supp...@gmail.com
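The ActualText workaround discussed in this thread looks roughly like this inside a tagged PDF's content stream. The operands are illustrative (the glyph code is shown as an arbitrary octal escape):

```
% Illustrative marked-content sequence in a PDF content stream.
% The visible glyph run is wrapped in a /Span whose /ActualText carries
% the intended characters, so a conforming reader can recover "ti" for
% search and copy even when the glyph run itself is unmappable.
/Span << /ActualText (ti) >> BDC
  BT /F1 12 Tf (\025) Tj ET
EMC
```

As Andrew notes, whether a given viewer prefers ActualText over the visible text layer is left to the implementor.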
Re: Joined "ti" coded as "Ɵ" in PDF
Hi Don, Latin is fine if you keep to simple, well-made fonts and avoid using more sophisticated typographic features available in some fonts. Dumb it down typographically and it works fine. PDF, despite all the current rhetoric coming from PDF software developers, is a preprint format. Not an archival format. The PDF format is less than ideal. But it is widely used, often in a way the format was never really created for. There are alternatives that preserve the text. But they have never really taken off (compared to PDF) for various reasons. Andrew On Sunday, 20 March 2016, Don Osborn <d...@bisharat.net> wrote: > Thanks Andrew, Looking at the issue of ToUnicode mapping you mention, why in the 1-many mapping of ligatures (for fonts that have them) do the "many" not simply consist of the characters ligated? Maybe that's too simple (my understanding of the process is clearly inadequate). > > The "string of random ASCII characters" (per Leonardo) used in the Identity H system for hanzi raises other questions: (1) How are the ASCII characters interpreted as a 1-many sequence representing a hanzi rather than just a series of 1-1 mappings of themselves? (2) Why not just use the Unicode code point? > > The details may or may not be relevant to the list topic, but as a user of documents in PDF format, I fail to see the benefit of such obscure mappings. And as a creator of PDFs ("save as") looking at others' PDFs I've just encountered with these mappings, I'm wondering whether those concerned knew how the font & mapping results turned out as they did. It is certain that the creators of the documents didn't intend results that would not be searchable by normal text, but it seems possible that a particular font choice with these ligatures unwittingly produced these results. If the latter, the software at the very least should show a caveat about such mappings when generating PDFs. 
> > Maybe it's unrealistic to expect a simple implementation of Unicode in PDFs (a topic we've discussed before but which I admit not fully grasping). Recalling I once had some wild results copy/pasting from an N'Ko PDF, and ended up having to obtain the .docx original to obtain text for insertion in a blog posting. But while it's not surprising to encounter issues with complex non-Latin scripts from PDFs, I'd gotten to expect predictability when dealing with most Latin text. > > Don > > > > On 3/17/2016 7:34 PM, Andrew Cunningham wrote: > > There are a few things going on. > > In the first instance, it may be the font itself that is the source of the problem. > > My understanding is that PDF files contain a sequence of glyphs. A PDF file will contain a ToUnicode mapping between glyphs and codepoints. This is either a 1-1 mapping or a 1-many mapping. The 1-many mapping provides support for ligatures and variation sequences. > > I assume it uses the data in the font's cmap table. If the ligature isn't mapped then you will have problems. I guess the problem could be either the font or the font subsetting and embedding performed when the PDF is generated. > > Although, it is worth noting that in OpenType fonts not all glyphs will have mappings in the cmap table. > > The remedy is to extensively tag the PDF and add ActualText attributes to the tags. > > But the PDF specs leave it up to the developer to decide what happens when there is both a visible text layer and ActualText. So even in an ideal PDF, results will vary from software to software when copying text or searching a PDF. > > At least that's my current understanding. > > Andrew > > On 18 Mar 2016 7:47 am, "Don Osborn" <d...@bisharat.net> wrote: >> >> Thanks all for the feedback. >> >> Doug, It may well be my clipboard (running Windows 7 on this particular laptop). Get same results pasting into Word and EmEditor. 
>> >> So, when I did a web search on "internaƟonal," as previously mentioned, and come up with a lot of results (mostly PDFs), were those also a consequence of many not fully Unicode compliant conversions by others? >> >> A web search on what you came up with - "InternaƟonal" - yielded many more (82k+) results, again mostly PDFs, with terms like "interna onal" (such as what Steve noted) and "interna<onal" and perhaps others (given the nature of, or how Google interprets, the private use character?). >> >> Searching within the PDF document already mentioned, "international" comes up with nothing (which is a major fail as far as usability). Searching the PDF in a Firefox browser window, only "internaƟonal" finds the occurrences of what displays as "international." However after downloading the document and searching it in Acrobat, only a search for "internaƟonal" will find
Re: Joined "ti" coded as "Ɵ" in PDF
There are a few things going on. In the first instance, it may be the font itself that is the source of the problem. My understanding is that PDF files contain a sequence of glyphs. A PDF file will contain a ToUnicode mapping between glyphs and codepoints. This is either a 1-1 mapping or a 1-many mapping. The 1-many mapping provides support for ligatures and variation sequences. I assume it uses the data in the font's cmap table. If the ligature isn't mapped then you will have problems. I guess the problem could be either the font or the font subsetting and embedding performed when the PDF is generated. Although, it is worth noting that in OpenType fonts not all glyphs will have mappings in the cmap table. The remedy is to extensively tag the PDF and add ActualText attributes to the tags. But the PDF specs leave it up to the developer to decide what happens when there is both a visible text layer and ActualText. So even in an ideal PDF, results will vary from software to software when copying text or searching a PDF. At least that's my current understanding. Andrew On 18 Mar 2016 7:47 am, "Don Osborn" wrote: > Thanks all for the feedback. > > Doug, It may well be my clipboard (running Windows 7 on this particular > laptop). Get same results pasting into Word and EmEditor. > > So, when I did a web search on "internaƟonal," as previously mentioned, > and come up with a lot of results (mostly PDFs), were those also a > consequence of many not fully Unicode compliant conversions by others? > > A web search on what you came up with - "InternaƟonal" - yielded many > more (82k+) results, again mostly PDFs, with terms like "interna onal" > (such as what Steve noted) and "interna<onal" and perhaps others (given the nature of, or how Google interprets, the private use character?). > > Searching within the PDF document already mentioned, "international" comes > up with nothing (which is a major fail as far as usability). 
Searching the > PDF in a Firefox browser window, only "internaƟonal" finds the occurrences > of what displays as "international." However after downloading the document > and searching it in Acrobat, only a search for "internaƟonal" will find > what displays as "international." > > A separate web search on "Eīects" came up with 300+ results, including > some GoogleBooks which in the texts display "effects" (as far as I > checked). So this is not limited to Adobe? > > Jörg, With regard to "Identity H," a quick search gives the impression > that this encoding has had a fairly wide and not so happy impact, even if > on the surface level it may have facilitated display in a particular style > of font in ways that no one complains about. > > Altogether a mess, from my limited encounter with it. There must have been > a good reason for or saving grace of this solution? > > Don > > On 3/17/2016 2:17 PM, Steve Swales wrote: > >> Yes, it seems like your mileage varies with the PDF >> viewer/interpreter/converter. Text copied from Preview on the Mac replaces >> the ti ligature with a space. Certainly not a Unicode problem, per se, but >> an interesting problem nevertheless. >> >> -steve >> >> On Mar 17, 2016, at 11:11 AM, Doug Ewell wrote: >>> >>> Don Osborn wrote: >>> >>> Odd result when copy/pasting text from a PDF: For some reason "ti" in the (English) text of the document at http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf is coded as "Ɵ". Looking more closely at the original text, it does appear that the glyph is a "ti" ligature (which afaik is not coded as such in Unicode). >>> When I copy and paste the PDF text in question into BabelPad, I get: >>> >>> InternaƟonal Order and the DistribuƟon of IdenƟty in 1950 (By invitaƟon only) >>> The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use >>> character. 
>>> >>> Truncating this character to 16 bits, which is a Bad Thing™, yields >>> U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either >>> Don's clipboard or the editor he pasted it into is not fully >>> Unicode-compliant. >>> >>> Don's point about using alternative characters to implement ligatures, >>> thereby messing up web searches, remains valid. >>> >>> -- >>> Doug Ewell | http://ewellic.org | Thornton, CO >>> >>> >>> >> >
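Doug's truncation diagnosis is easy to verify: keeping only the low 16 bits of the Plane 16 private-use code point lands exactly on U+019F.

```python
# Reproducing the diagnosis above: the Plane 16 private-use character
# U+10019F, truncated to 16 bits, becomes U+019F LATIN CAPITAL LETTER O
# WITH MIDDLE TILDE -- the Ɵ that appeared in the pasted text.
pua = 0x10019F
truncated = pua & 0xFFFF   # a buggy 16-bit clipboard/editor path
print(hex(truncated))      # 0x19f
print(chr(truncated))      # Ɵ
```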
Re: Windows keyboard restrictions
On Saturday, 8 August 2015, Richard Wordingham richard.wording...@ntlworld.com wrote: Michael did do a series of blog posts on building TSF-based input methods years ago. Something I tinkered with off and on. What we're waiting for is a guide we can follow, or some code we can ape. Such should be, or should have been, available in a Tavultesoft Keyman rip-off. I don't believe in rip-offs, especially when there are free versions and the enhanced version doesn't cost much. But that said, there is KMFL on Linux which handles a subset of the Keyman definition files. And Keith Stribley, before he died, did a port of the KMFL lib to Windows. But I doubt anyone is maintaining it. But the reality is that the use cases discussed in this and related threads do not need fairly complex or sophisticated layouts. So KMFL and derivatives should be fine despite how limited I consider them. Alternatively there are a range of input frameworks developed in SE Asia that would be easy to work with as well. Alternative input frameworks have been around for years. It's up to us to use them or not. I don't see much point bleating about the limitations of the win32 keyboard model. Just use an alternative input framework, whether it is TSF table-based input, Keyman, the KMFL port to Windows or any of a large slew of input frameworks that are available out there. Andrew -- Andrew Cunningham Project Manager, Research and Development (Social and Digital Inclusion) Public Libraries and Community Engagement State Library of Victoria 328 Swanston Street Melbourne VIC 3000 Australia Ph: +61-3-8664-7430 Mobile: 0459 806 589 Email: acunning...@slv.vic.gov.au lang.supp...@gmail.com http://www.openroad.net.au/ http://www.mylanguage.gov.au/ http://www.slv.vic.gov.au/
Re: Unicode of Death
Geez Philippe, It was tongue in cheek. A. On Saturday, 30 May 2015, Philippe Verdy verd...@wanadoo.fr wrote: 2015-05-28 23:36 GMT+02:00 Andrew Cunningham lang.supp...@gmail.com: Not the first time Unicode crashes things. There was the Google Chrome bug on OS X that crashed the tab for any Syriac text. Unicode crashes things? Unicode has nothing to do with those crashes caused by bugs in applications that make incorrect assumptions (in fact not even related to characters themselves but to the supposed behavior of the layout engine). Programmers and designers for example VERY frequently forget the constraints for RTL languages and make incorrect assumptions about left and right sides when sizing objects, or they don't expect that the cursor will advance backward and forget that some measurements can be negative: if they use this negative value to compute the size of a bitmap rendering surface, they'll get out of memory, unchecked null pointers returned, then they will crash assuming the buffer was effectively allocated. These are the same kind of bugs as with the too common buffer overruns with unchecked assumptions: the code is kept because it works as is in their limited immediate tests. Producing full coverage tests is a difficult and lengthy task, that programmers not always have the time to do, when they are urged to produce a workable solution for some clients and then given no time to improve the code before the same code is distributed to a wider range of clients. Commercial staff do that frequently, they can't even read the technical limitations even when they are documented by programmers... in addition the commercial staff like selling software that will cause customers to ask for support... that will be billed ! After that, programmers are overwhelmed by bug reports and support requests, and have even less time to design other things that they are working on and still have to produce. 
QA tools may help programmers in this case by providing statistics about the effective costs of producing new software with better quality, and the cost of supporting it when it contains too many bugs: commercial teams like those statistics because they can convert them to costs, commercial margins, and billing rates. (When such QA tools are not used, programmers will rapidly leave the place; they are fed up with the growing pressure to do ever more in the same time, with a growing number of urgent support requests.) Those that say Unicode crashes things do the same thing: they make broad unchecked assumptions about how things are really made or how things are actually working. -- Andrew Cunningham Project Manager, Research and Development (Social and Digital Inclusion) Public Libraries and Community Engagement State Library of Victoria 328 Swanston Street Melbourne VIC 3000 Australia Ph: +61-3-8664-7430 Mobile: 0459 806 589 Email: acunning...@slv.vic.gov.au lang.supp...@gmail.com http://www.openroad.net.au/ http://www.mylanguage.gov.au/ http://www.slv.vic.gov.au/
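The negative-measurement failure Philippe describes can be sketched in a few lines. The numbers are made up, and Python big integers are used here only to simulate C's conversion of a negative product to an unsigned 64-bit size_t:

```python
# Simulating the bug class described above: a negative advance from an
# RTL layout pass, multiplied into a buffer size and then converted to
# an unsigned size_t, wraps to an absurdly large allocation request.
advance, height, bytes_per_pixel = -12, 16, 4   # made-up values
product = advance * height * bytes_per_pixel    # -768
size_t_request = product & (2**64 - 1)          # what (size_t)(-768) gives on LP64
print(size_t_request)                           # 18446744073709550848
# An unchecked malloc() of this size fails, and dereferencing the
# resulting null pointer is the crash.
```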
Re: Unicode of Death
Not the first time Unicode crashes things. There was the Google Chrome bug on OS X that crashed the tab for any Syriac text. A. On Friday, 29 May 2015, Bill Poser billpos...@gmail.com wrote: No doubt the evil Unicode Consortium is in league with the Trilateral Commission, the Elders of Zion, and the folks at NASA who faked the moon landing :) On Thu, May 28, 2015 at 7:53 AM, Doug Ewell d...@ewellic.org wrote: Unicode is in the news today as some folks with waaay too much time on their hands have discovered a string consisting of Latin, Arabic, Devanagari, and CJK characters that crashes Apple devices when it appears as a pop-up message. Although most people seem to identify it correctly as a CoreText bug, there are a handful, as you might expect, who attribute it to some shady weirdness in Unicode itself. My favorite quote from a Reddit user was this: Every character you use has a unicode value which tells your phone what to display. One of the unicode values is actually never-ending and so when the phone tries to read it it goes into an infinite loop which crashes it. I've read TUS Chapter 4 and UTR #23 and I still can't find the never-ending Unicode property. Perhaps astonishingly to some, the string displays fine on all my Windows devices. Not all apps get the directionality right, but no crashes. -- Doug Ewell | http://ewellic.org | Thornton, CO -- Andrew Cunningham Project Manager, Research and Development (Social and Digital Inclusion) Public Libraries and Community Engagement State Library of Victoria 328 Swanston Street Melbourne VIC 3000 Australia Ph: +61-3-8664-7430 Mobile: 0459 806 589 Email: acunning...@slv.vic.gov.au lang.supp...@gmail.com http://www.openroad.net.au/ http://www.mylanguage.gov.au/ http://www.slv.vic.gov.au/
Re: Combined Yorùbá characters with dot below and tonal diacritics
On 12/04/2015 7:27 PM, Ilya Zakharevich nospam-ab...@ilyaz.org wrote: On Sun, Apr 12, 2015 at 07:07:01AM +0200, Philippe Verdy wrote: MSKLC does not provide a way to build another geometry and map geometric keys to vkeys (or the reverse). Again, this has nothing to do with MSKLC. If you are compiling a keyboard driver from source, then it has nothing to do with MSKLC. But for a general answer, for the average user who needs to develop a keyboard, MSKLC is very pertinent. Note also that (since always), MSKLC-generated drivers have never allowed us to change the mapping of scancodes (from hardware keyboards) to virtual keys, aka vkeys, or to WM_SYSKEY (this is hardwired in a lower internal level). Wrong. Look for any French or German keyboard. Microsoft has a tendency never to change a keyboard or how it operates, so there are a lot of bad design decisions and cruft still there. Just because something can be done, doesn't mean it should be done. These drivers only map sequences of one or more vkeys (and a few supported states, it's not possible to add keyboard states other than CTRL, SHIFT, CAPSLOCK, ALTGR2, and custom states for dead keys) How do you think I do it in my layout? There are Microsoft keyboard layouts that use other states, the Canadian multilingual keyboard comes to mind, mainly to comply with a Canadian standard. But Microsoft themselves recommend keeping to the four keyboard states Philippe lists. to only one WM_CHAR. I have no idea why you would mix in WM_* stuff into this discussion… Depending on your perspective it is pertinent or not. And it's not possible to change the mapping of vkeys to WM_SYSCHAR (this is also hardwired at a lower level). I have no clue what you are talking about now… Andrew
Re: Combined Yorùbá characters with dot below and tonal diacritics
Hi Ilya, The problem with the approach documented below is twofold: 1) the characters required do not all exist as precomposed characters, thus Microsoft's dead key sequences will not work for Yoruba. 2) certain AltGr sequences are not guaranteed to work in all programs. Some programs treat an AltGr sequence as the equivalent of the Alt key sequence, with program shortcuts overriding keyboard input. From memory this was a problem we would have with MS Word. Care needs to be taken selecting AltGr sequences to implement in a keyboard, and adding frequently typed characters like vowels and tone marks to AltGr is usually a bad idea. It is easier to move less-needed sequences to the AltGr state, putting frequently typed characters on the normal and shift states. Andrew On Sunday, 12 April 2015, Ilya Zakharevich nospam-ab...@ilyaz.org wrote: On Sat, Apr 11, 2015 at 01:19:23AM +0100, Luis de la Orden wrote: Thanks for challenging my understanding of dead keys. I have a layout in my Mac that works like a charm to write Yorùbá, Portuguese and Spanish with the UK layout. I am having trouble with the Windows layout and should have mentioned that more clearly. Nevertheless, I was using Microsoft Keyboard Layout Creator and assumed that the limitations of the software (or the limitations of my knowledge of the software) were the limitations of the technology as a whole. I see no problem with using MSKLC with Yorùbá. Just make AltGr-e, AltGr-o, AltGr-s produce e̩, o̩, and s̩. Then make AltGr--, AltGr-' and AltGr-` into prefix keys (deadkeys) converting characters into accented forms. IIRC, this would work fine also with “base keys” producing Unicode clusters (like those above) (check in the document below). For details, see the corresponding sections of http://search.cpan.org/~ilyaz/UI-KeyboardLayout/lib/UI/KeyboardLayout.pm [I do not think the “standard” keyboard input on Windows is documented anywhere else :-( ].
Hope this helps, Ilya -- Andrew Cunningham lang.supp...@gmail.com
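A quick way to check Andrew's point 1 — whether a base-plus-marks combination has a precomposed character that a Windows dead key could emit — is to see whether NFC collapses the sequence to a single code point. A minimal Python sketch (the e̩ here is e + U+0329 as in Ilya's example; the dead-key constraint of a single output character is how MSKLC dead keys behave):

```python
import unicodedata

def has_precomposed(base, marks):
    # A dead key can only emit a single character, so NFC must
    # collapse the whole sequence to length 1 for it to be usable.
    return len(unicodedata.normalize("NFC", base + marks)) == 1

print(has_precomposed("e", "\u0323"))        # e + dot below: U+1EB9 exists
print(has_precomposed("e", "\u0329"))        # e + vertical line below: no precomposed form
print(has_precomposed("e", "\u0323\u0301"))  # e + dot below + acute: no single character
```

Combinations for which this returns False are exactly the ones where a layout must emit combining marks directly.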
Re: Avoidance variants
Or is it a markup issue rather than something for plain text? On 26 March 2015 at 13:30, Mark E. Shoulson m...@kli.org wrote: So, not much in the way of discussion regarding the TETRAGRAMMATON issue I raised the other week. OK; someone'll eventually get to it I guess. Another thing I was thinking about, while toying with Hebrew fonts. Often, letters are substituted in _nomina sacra_ in order to avoid writing a holy name, much as the various symbols for the tetragrammaton are used. And indeed, sometimes they're used in that name too, as I mentioned, usages like ידוד or ידוה and so on. There's an example in the paper that shows אלדים instead of אלהים. Much more common today would be אלקים and in fact people frequently even pronounce it that way (when it refers to big-G God, in non-sacred contexts. But for little-g gods, the same word is pronounced without the avoidance, because it isn't holy. It's weird.) I wonder if it makes sense maybe to encode not a codepoint, but a variant sequence(s) to represent this sort of defaced or altered letter HEH. It's still a HEH, it just looks like another letter, right? (QOF or DALET or occasionally HET) That would keep some consistency to the spelling. On the other hand, the spelling with a QOF is already well entrenched in texts all over the internet. But maybe it isn't right. And what about the use of ה׳ or ד׳ for the tetragrammaton? Are they both HEHs, one altered, or is one really a DALET? Any thoughts? (and seriously, what to do about all those tetragrammaton symbols?) 
~mark ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
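Mark's suggestion — a variation sequence rather than a new code point — would keep the underlying text a HEH for searching and spelling, with only the display altered. A purely hypothetical sketch (no such variation sequence is registered; U+FE00 is used here only for illustration):

```python
# hypothetical: HEH (U+05D4) followed by a variation selector requesting
# the "defaced" glyph; the selector is default-ignorable for matching
HEH = "\u05D4"
altered = HEH + "\uFE00"  # NOT a registered sequence; illustration only

# a search for the plain spelling still matches once selectors are stripped
print(altered.replace("\uFE00", "") == HEH)  # True
```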
Re: Android 5.1 ships with support for several minority scripts
Comment on Cham was informational. What is in the Unicode charts was based on Eastern Cham only. Proposals to add the Cham and Arabic characters needed to support Western Cham are under development. Testing on Tai Tham will occur ... I was curious as to what the original design parameters for the font were. It is easier to evaluate a font's language support knowing what was originally intended. For instance, I do not assume that the Myanmar font was designed to support all languages that use the Myanmar script. I can also make assumptions about Latin script coverage and languages that are supported/unsupported. Andrew On Sunday, 15 March 2015, Roozbeh Pournader rooz...@unicode.org wrote: Andrew, I don't know the answer to your questions unfortunately. You can investigate the fonts yourself (they are available at https://code.google.com/p/noto/), or ask for support for Western Cham (assuming it's already properly encoded at Unicode) at the Noto issue tracker at https://code.google.com/p/noto/issues/entry. On Fri, Mar 13, 2015 at 8:27 PM, Andrew Cunningham lang.supp...@gmail.com wrote: Hi Roozbeh, a point of clarification and a question: * the Cham font is actually an Eastern Cham font supporting Akhar Thrah, the Eastern variety of the script. Akhar Srak, the Western Cham script, remains unsupported. Which languages was the Tai Tham font designed to support? And which variety of the script? Andrew On Saturday, 14 March 2015, Roozbeh Pournader rooz...@unicode.org wrote: Android 5.1, released earlier this week, has added support for 25 minority scripts. The wide coverage can be reproduced by almost everybody for free, thanks to the Noto and HarfBuzz projects, both of which are open source. (Android itself is open source too.)
By my count, these are the new scripts added in Android 5.1: Balinese, Batak, Buginese, Buhid, Cham, Coptic, Glagolitic, Hanunoo, Javanese, Kayah Li, Lepcha, Limbu, Meetei Mayek, Ol Chiki, Oriya, Rejang, Saurashtra, Sundanese, Syloti Nagri, Tagbanwa, Tai Le, Tai Tham, Tai Viet, Thaana, and Tifinagh. (Android 5.0, released last year, had already added the Georgian lari, complete Unicode 7.0 coverage for Latin, Greek, and Cyrillic, and seven new scripts: Braille, Canadian Aboriginal Syllabics, Cherokee, Gujarati, Gurmukhi, Sinhala, and Yi.) Note that different Android vendors and carriers may choose to ship more fonts or fewer, but Android One phones and most Nexus devices will support all the above scripts out of the box. None of this would have been possible without the efforts of Unicode volunteers who worked hard to encode the scripts in Unicode. Thanks to the efforts of Unicode, Noto, and HarfBuzz, thousands of communities around the world can now read and write their language on smartphones and tablets for the first time. -- Andrew Cunningham lang.supp...@gmail.com
Re: Android 5.1 ships with support for several minority scripts
Hi Roozbeh, a point of clarification and a question: * the Cham font is actually an Eastern Cham font supporting Akhar Thrah, the Eastern variety of the script. Akhar Srak, the Western Cham script, remains unsupported. Which languages was the Tai Tham font designed to support? And which variety of the script? Andrew On Saturday, 14 March 2015, Roozbeh Pournader rooz...@unicode.org wrote: Android 5.1, released earlier this week, has added support for 25 minority scripts. The wide coverage can be reproduced by almost everybody for free, thanks to the Noto and HarfBuzz projects, both of which are open source. (Android itself is open source too.) By my count, these are the new scripts added in Android 5.1: Balinese, Batak, Buginese, Buhid, Cham, Coptic, Glagolitic, Hanunoo, Javanese, Kayah Li, Lepcha, Limbu, Meetei Mayek, Ol Chiki, Oriya, Rejang, Saurashtra, Sundanese, Syloti Nagri, Tagbanwa, Tai Le, Tai Tham, Tai Viet, Thaana, and Tifinagh. (Android 5.0, released last year, had already added the Georgian lari, complete Unicode 7.0 coverage for Latin, Greek, and Cyrillic, and seven new scripts: Braille, Canadian Aboriginal Syllabics, Cherokee, Gujarati, Gurmukhi, Sinhala, and Yi.) Note that different Android vendors and carriers may choose to ship more fonts or fewer, but Android One phones and most Nexus devices will support all the above scripts out of the box. None of this would have been possible without the efforts of Unicode volunteers who worked hard to encode the scripts in Unicode. Thanks to the efforts of Unicode, Noto, and HarfBuzz, thousands of communities around the world can now read and write their language on smartphones and tablets for the first time.
-- Andrew Cunningham lang.supp...@gmail.com
Re: Western Cham in Akhar Jawi
Thanks Roozbeh, I will most likely write a proposal; at the moment I am still mapping character usage to see if other unencoded characters pop up. I am also doing the same for the Western Cham script; some of the more recent reforms (within the past 10 years) in Cambodia don't appear to be encoded. Andrew On 28 October 2014 02:26, Roozbeh Pournader rooz...@unicode.org wrote: This is the first time I'm seeing the character. I suggest writing a Unicode proposal. On Oct 26, 2014 10:42 PM, Andrew Cunningham lang.supp...@gmail.com wrote: Hi all, When Western Cham is written in the Arabic script, there is regional variation in the Arabic characters used. Two varieties I am looking at use a character that I can't see in the Unicode charts, although I may have missed it. The character is an alef with three dots above (with the dots pointing upwards); see the attached images. Has anyone come across this character used in other contexts? Andrew -- Andrew Cunningham lang.supp...@gmail.com
Western Cham in Akhar Jawi
Hi all, When Western Cham is written in the Arabic script, there is regional variation in the Arabic characters used. Two varieties I am looking at use a character that I can't see in the Unicode charts, although I may have missed it. The character is an alef with three dots above (with the dots pointing upwards); see the attached images. Has anyone come across this character used in other contexts? Andrew -- Andrew Cunningham lang.supp...@gmail.com
Re: Current support for N'Ko
On 29/09/2014 11:02 PM, Frédéric Grosshans frederic.grossh...@gmail.com wrote: Le 27/09/2014 01:10, Andrew Cunningham a écrit : * NEVER try to copy and paste text from PDF. It is a preprint format and should be treated as such. Well... Having access to the raw text is often useful (for example, to allow the blind to have access to the content of PDF documents, or to search a word in a scanned historical document), and cut and pasting text from PDF often works, even if the “rich text” formatting is lost. The problem is that often the actual text isn't necessarily the same as the original text used to generate the PDF. Results will vary according to the fonts used and the tools used to generate the PDF. Even Adobe Acrobat contains different tools which can give vastly different results. It is best to think of PDF as dealing with glyphs rather than characters. I tend to mainly work with complex scripts, and the results with those are usually not encouraging. I know there is ActualText, but honestly I don't ever remember seeing a complex script PDF I could copy and paste from without post-processing of the text. The average person creating PDF files has no knowledge of how to achieve optimal results. N'Ko is one of the easier scripts to deal with, thankfully. In the case of the Ebola FAQs ( https://sites.google.com/site/athinkra/ebola-faqs) discussed here, it almost worked perfectly on my computer (Ubuntu Linux 14.04) for N’Ko (diacritics are shifted by one character) and Vai. Of course, the Adlam was not working (somehow converted to Arabic), but it was expected, since Adlam is not (yet?) in Unicode.
Re: Current support for N'Ko
On 30/09/2014 4:11 AM, David Starner prosfil...@gmail.com wrote: On Fri, Sep 26, 2014 at 4:10 PM, Andrew Cunningham lang.supp...@gmail.com wrote: * NEVER try to copy and paste text from PDF. It is a preprint format and should be treated as such. I'd try and cut and paste from print if I could. People are going to cut and paste from anything if it saves them a little time. If you disable cut and pasting from PDF, those who have easy access to OCR may just print to image and OCR it to cut and paste. To say don't do this is unproductive. OK, what I should say is that in the best-case scenario for complex script text you can copy and paste and then do post-processing on the extracted text to get the actual text. Post-processing may involve reordering characters, or systematic conversions of glyph sequences. In the worst-case scenario you get utter garbage from which you cannot reconstruct the text. Searching and indexing is even more problematic. Honestly, for the languages I work with it would be quicker and more accurate in many cases to use OCR (even at 80% accuracy) than to cut and paste from PDF. As I said in a previous email, results and effectiveness will differ depending on the fonts used and the PDF generator used. PDF was designed for preprint, not archival purposes. -- Kie ekzistas vivo, ekzistas espero.
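As an illustration of the kind of reordering post-processing Andrew mentions, here is a minimal, hypothetical sketch: assume extraction has yielded Myanmar text in visual order, with the e-vowel sign U+1031 emitted before its consonant, and move it back to logical order. Real extraction damage varies by font and PDF generator, so any such rule set has to be built per document; this single rule is only an example.

```python
import re

# hypothetical repair rule: a visually-ordered e-vowel (U+1031) appearing
# before a consonant (U+1000-U+1021) is moved back after that consonant
def to_logical_order(text):
    return re.sub("\u1031([\u1000-\u1021])", "\\g<1>\u1031", text)

# visual order extracted E + MA becomes logical order MA + E
print(to_logical_order("\u1031\u1019"))  # -> "\u1019\u1031"
```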
Re: Current support for N'Ko
Hi Don, I will give a detailed reply offline to you and Charles. I am slowly working on notes on web deployment of various languages in my spare time. I've been held up unpicking the Myanmar script and possible errata/additions to UTN 11. But N'Ko is on my list of scripts to document. I will need to look at your pages and unpick them. But a couple of reflections. Your blog post is dealing with multiple issues: * bidi support in HTML5 and CSS3, and to what extent scripts like N'Ko are taken into account. * What rendering system is being used by the browser. * What font is being used: OpenType, Graphite, AAT ... this will affect rendering in browsers. For OpenType, which script tag is being used will affect which OpenType features are processed. So getting the font stack right is important, and the font stack will differ from browser to browser. I need to check for the existence of a cross-platform N'Ko font. * NEVER try to copy and paste text from PDF. It is a preprint format and should be treated as such. Andrew On 27/09/2014 12:45 AM, d...@bisharat.net wrote: Some observations concerning N'Ko support in browsers may be of interest: http://niamey.blogspot.com/2014/09/nko-on-web-review-of-experience-with.html This is pursuant to reposting a translation in N'Ko of a World Health Organization FAQ on ebola. That translation was one of several facilitated by Athinkra LLC, and available at https://sites.google.com/site/athinkra/ebola-faqs Don Osborn
Re: Editing Sinhala and Similar Scripts
LOL, that's why, if the input framework allows it, it's easier to support both approaches to backspace, or at least an option to choose one or the other. ; ) Andrew On 19/03/2014 11:37 PM, Doug Ewell d...@ewellic.org wrote: Richard Wordingham richard dot wordingham at ntlworld dot com wrote: Typing is a nightmare. When you backspace it destroys multiple keystrokes. I suspect this is a widespread and unsolved problem. There are two types of people: 1. those who fully expect Backspace to erase a single keystroke, and feel it is a fatal flaw if it erases an entire combination, and 2. those who fully expect Backspace to erase an entire combination, and feel it is a fatal flaw if it erases just a single keystroke. Unfortunately, both types exist in significant numbers. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell
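The two behaviours Doug describes can be sketched directly; a minimal Python illustration, using combining-class checks to approximate a base-plus-marks combination (a real editor would segment by full grapheme clusters, which also cover viramas and joiners):

```python
import unicodedata

def backspace_codepoint(s):
    # type 1: erase only the last code point (the last "keystroke")
    return s[:-1]

def backspace_combination(s):
    # type 2: erase the trailing combining marks and their base together
    i = len(s)
    while i > 0 and unicodedata.combining(s[i - 1]):
        i -= 1
    return s[: max(i - 1, 0)]

typed = "e\u0323\u0301"  # e + dot below + acute: three keystrokes, one combination
print(backspace_codepoint(typed) == "e\u0323")  # True: only the acute goes
print(backspace_combination(typed) == "")       # True: the whole combination goes
```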
Re: Editing Sinhala and Similar Scripts
There is also a distinction between editing an existing document that you opened, as distinct from writing a document and going back to a certain point in the document to edit that section within the same editing session. In the first case there is no history; in the second case there may be history to work with. Andrew On 20 March 2014 14:43, Peter Constable peter...@microsoft.com wrote: If you click into the existing text in this email and backspace, what keystroke will you expect to be erased? Your system has no way of knowing what keystroke might have been involved in creating the text. What it _can_ make sense to talk about is to say that a user expects execution of a particular key sequence, such as pressing a Backspace key, to have a particular editing effect on the content of text. Erasing a keystroke and keystrokes resulting in edits are different things. One makes sense, the other does not. It may seem like I'm being pedantic, but I think the distinction is important. Our failure is in framing our thinking from years of experience (and perhaps some behaviours originally influenced by typewriter and teletype technologies) in which a keyboard has a bunch of keys that add characters, and variations on that which even include a lot of logic so that input keying sequences can generate tens of thousands of different characters; but then one or two keys (delete, backspace) that can only operate in very dumb ways. (We've also always assumed that any logic in keying behaviours can be conditioned only by the input sequences, but not by any existing content, but that steps beyond my earlier point.) These constraints in how we think limit possibilities. Peter -Original Message- From: Doug Ewell [mailto:d...@ewellic.org] Sent: March 19, 2014 9:39 AM To: Peter Constable; unicode@unicode.org Subject: RE: Editing Sinhala and Similar Scripts Peter Constable petercon at microsoft dot com wrote: There are two types of people: 1.
those who fully expect Backspace to erase a single keystroke It is nonsensical to talk about erasing a _keystroke_. But that's what they expect. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell -- Andrew Cunningham lang.supp...@gmail.com
Re: Editing Sinhala and Similar Scripts
On 20 March 2014 15:17, J. Leslie Turriff jlturr...@centurylink.net wrote: Perhaps it might be useful to be able to distinguish between an editing mode and a composition mode: editing mode would be active when a document is first loaded into the editor, when the editor has no keystroke history to consult, and in this mode the backspace key would merely remove text glyph by glyph, so to speak, as happens with ASCII text; composition mode would be active when keystrokes have been recorded in a buffer, so that backspace could be used to unstroke the original strokes; the unstroke operations would mimic the order in which the originals were entered, even if the editor had optimized the composition. Although that requires an input framework and application that utilise that buffer in various ways during composition mode. It is possible, and in the past I have written a manual and run training on advanced editing for Dinka language translators on how to utilise such features. But not many applications support such features. Andrew -- Andrew Cunningham lang.supp...@gmail.com
Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
Chris, Keyman is capable of doing that and a lot more, but few keyboard layout developers use it to its full potential. As an example, I was asked by Harari teachers here in Melbourne to develop a set of three keyboard layouts for them and their students. The three keyboards were for three different orthographies in the following scripts: 1) Latin 2) Ethiopic 3) Arabic They wanted all three layouts to work identically, using the keystrokes used on the Latin keyboard. The Ethiopic and Arabic keyboard layouts required extensive remapping of key sequences to output. If I were a programmer I could have done something more elegant by building an external library Keyman could call, but as it is we could do a lot inside the Keyman keyboard layout itself. For Myanmar script keyboard layouts we allow visual input for the e-vowel sign and medial Ra, with the layout handling reordering. One of the Latin layouts I use supports combining diacritics and reorders sequences of diacritics to their canonical order regardless of the order of input, assuming a maximum of one diacritic below and two diacritics above the base character. Analysis and creativity can produce some very effective Keyman layouts. Andrew On 18/03/2014 7:23 PM, Christopher Fynn chris.f...@gmail.com wrote: MSKLC and KeyMan are fairly crude ways of creating input methods. For what you want to do, you probably need a memory-resident program that traps the Latin input from the keyboard, processes the (transliterated) input strings converting them into Unicode Sinhala strings, and then injects these back into the input queue in place of the Latin characters. There are a couple of utilities that do this for typing transliterated/romanised Tibetan in Windows and getting Tibetan Unicode output.
http://tise.mokhin.org/ http://www.thubtenrigzin.fr/denjongtibtype/en.html But I think both of these were written in C as they have to do a lot of processing which is far beyond what can be accomplished with MSKLC and even KeyMan - C
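The reordering trick Andrew describes — accepting diacritics in any input order and emitting them in canonical order — is exactly what Unicode canonical ordering does; a small Python check (the e + dot below + acute combination is just an illustrative choice, not from his Keyman layouts):

```python
import unicodedata

# a below-base mark (combining class 220) canonically precedes an
# above-base mark (class 230), so both typing orders normalize the same
typed_acute_first = unicodedata.normalize("NFD", "e\u0301\u0323")
typed_dot_first = unicodedata.normalize("NFD", "e\u0323\u0301")
print(typed_acute_first == typed_dot_first == "e\u0323\u0301")  # True
```

A Keyman layout implements the same reordering with context rules at input time, so the stored text is already in canonical order.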
Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
I suspect it was a fishing expedition to illustrate how awkward it is to type on Unicode keyboard layouts versus his system. I.e., there is still no clear separation of input and encoding in his responses. On 19/03/2014 6:39 AM, Doug Ewell d...@ewellic.org wrote: Tom, with a typo spotted and corrected by Jean-François, seems to have found it: කාර්ය්යාලවල යනහ්ර පඩකහි The sequence of code points would thus be: 0D9A 0DCF 0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA 0DCF 0DBD 0DC0 0DBD 0020 0DBA 0DB1 0DC4 0DCA 200D 0DBB 0020 0DB4 0DA9 0D9A 0DC4 0DD2 Naena, is this what you were looking for? -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell
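For anyone wanting to reproduce Doug's string, the code point list maps back to text mechanically; a small Python sketch:

```python
# the code points Doug lists, including the ZWJ (U+200D) joiners
code_points = [
    0x0D9A, 0x0DCF, 0x0DBB, 0x0DCA, 0x200D, 0x0DBA, 0x0DCA, 0x200D,
    0x0DBA, 0x0DCF, 0x0DBD, 0x0DC0, 0x0DBD, 0x0020, 0x0DBA, 0x0DB1,
    0x0DC4, 0x0DCA, 0x200D, 0x0DBB, 0x0020, 0x0DB4, 0x0DA9, 0x0D9A,
    0x0DC4, 0x0DD2,
]
text = "".join(chr(cp) for cp in code_points)
print(len(text))  # 26 code points forming three space-separated words
```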
Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
Different individuals, groups and communities can bring their own expectations to input layout designs. Design is a balance between the capabilities and limitations of the input framework versus the expectations of the user community around how their language should work. I work with multiple operating systems and even more input frameworks. I have my preferred input frameworks. But ultimately it is a question of knowing your tools. For instance, if you compile a keyboard layout from the command line with MSKLC you can chain deadkeys, build against custom locales in Vista and Win7, or build against unsupported language codes in Win8+. Andrew On 19/03/2014 9:13 AM, Tom Gewecke t...@bluesky.org wrote: On Mar 18, 2014, at 12:52 PM, Andrew Cunningham wrote: I suspect it was a fishing expedition to illustrate how awkward it is to type on Unicode keyboard layouts versus his system. Interesting question perhaps. Is it more awkward to type 14 strokes as k a a r y y a a l a v a l a or to type 9 as ක ා ර ්ය ්ය ා ල ව ල ?
Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
On 18/03/2014 11:23 AM, Naena Guru naenag...@gmail.com wrote: I tried to make a phonetic one to kind of relate to the English keys. Still, you need to have many shifted keys to get common letters. No you don't; you just need to understand the possibilities of what your input framework is capable of and the best way to implement what you want to achieve. The Windows input system is probably the most constrained, but to look at a good phonetic layout, have a look at the Cherokee Phonetic layout on Windows 8+. Designing a good layout requires using the right tools, knowing the limits and capabilities of those tools, and using them in creative ways. On Mon, Mar 17, 2014 at 11:38 AM, Doug Ewell d...@ewellic.org wrote: Naena Guru naenaguru at gmail dot com wrote: Making a keyboard [layout] is not hard. You can either edit an existing one or make one from scratch. I made the latest Romanized Singhala one from scratch. The earlier one was an edit of US-International. I've made a couple dozen of them myself, with MSKLC. When you type a key on the physical keyboard, you generate what is called a scan code for that key so that the keyboard driver knows which key was pressed. (During DOS days, we used to catch them to make menus.) Now, you assign one or a sequence of Unicode characters you want to generate for the keypress. Precisely. As Marc Durdin said, you can create a keyboard layout just as easily for Unicode characters as for ASCII and Latin-1 characters. You can also assign a combination of characters to a single key. So it is not true that typing Unicode Sinhala requires you to learn a key map that is entirely different from the familiar English keyboard, while losing some marks and signs too. Unicode does not prescribe any key map. You can have whatever layout you like. As Marc also said, if you think there are marks and signs missing from Unicode, that is another matter.
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell
Re: Romanized Singhala got great reception in Sri Lanka
On 17/03/2014 6:55 AM, Jean-François Colson j...@colson.eu wrote: Le 16/03/14 14:10, William_J_G Overington a écrit : Is the Romanized Singhala system a way to enter the characters into a computer using only a QWERTY keyboard? It is easy to input (phonetically) using a keyboard layout slightly altered from QWERTY. How is the keyboard altered from QWERTY please? Are you publishing the font please? In fact, I think he was speaking of the bare American (US) qwerty. An international version of it should do the job. Looking at his site http://lovatasinhala.com/ and making a copy and paste of the page contents, you see he uses 7-bit ASCII, a few Latin-1 accented vowels, and a few additional “letters” such as ð, Ð, þ, æ and µ. He also makes a case distinction, where upper and lowercase versions of some characters produce different Sinhala characters. Naena Guru’s aim is not to make an input method to type Sinhalese. Sinhalese keyboard layouts already exist: http://www.microsoft.com/resources/msdn/goglobal/keyboards/kbdsn1.html http://www.microsoft.com/resources/msdn/goglobal/keyboards/kbdsw09.html http://kaputa.com/uniwriter/apple.gif http://www.nongnu.org/sinhala/doc/keymaps/sinhala-keyboard_3.html His aim is rather to make an 8-bit font to replace that “difficult” and “expensive” Unicode-compliant Sinhalese. Creating a new set of difficulties. Andrew
RE: Diacritical marks: Single character or combined character?
To add to the other comments, I would add two points: 1. It depends on the language you deal with. 2. It depends on the input framework you are using. A number of the languages I deal with use combinations of base characters and diacritics where some combinations have precomposed forms and others don't. When developing keyboard layouts for such languages using simple input frameworks, you have to use combining diacritics or a weird mix of combining and composed. With more sophisticated input frameworks you have more flexibility and control. Andrew
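A minimal Python illustration of that mixed situation (e + dot below + acute is an assumed example, typical of a Latin orthography with both a dot-below letter and tone marks): NFC yields a precomposed character where one exists and leaves a combining mark where none does, so the stored text inevitably mixes the two forms.

```python
import unicodedata

# dot below composes with e (U+1EB9 exists), but no single character
# exists for the full base + dot below + acute combination
seq = "e\u0323\u0301"
nfc = unicodedata.normalize("NFC", seq)
print([hex(ord(c)) for c in nfc])  # ['0x1eb9', '0x301']: composed base, combining tone mark
```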
Re: Dotted Circle plus Combining Mark as Text
I suspect it is a font issue rather than a renderer issue, but then using a dotted circle is a convention used in the Unicode charts and in the Unicode spec. It is not a combination I'd expect a font developer from SE Asia to necessarily support, since publications in SE Asia have their own typographic conventions for displaying isolated combining marks. Andrew
Re: Can a single text document use multiple character encodings?
I can think of a few websites that mix legacy-encoded content within a UTF-8 document. Often done as a practicality. Or alternatively mixing Unicode and pseudo-Unicode in the same document. Andrew On 30/08/2013 11:14 PM, Ilya Zakharevich nospam-ab...@ilyaz.org wrote: On Wed, Aug 28, 2013 at 07:07:23PM +, Costello, Roger L. wrote: For example, can some text be encoded as UTF-8 while other text is encoded as UTF-16 - within the same document? I think it is a very interesting question. A Perl program is (obviously) a text document. On the other hand, in two minutes I could deduce a few ways to mix many different encodings into the same document. My current record is 5 different encodings; some of them are arbitrary, some of them should satisfy certain compatibility requirements (something like =cut CR and =pod CR being encoded the same in two encodings). And, on top of this, there is yet another way to mix encodings arbitrarily. The tricks are threefold: ◌ First, a Perl program is actually a mixture of 3 different documents: the program stream, the data-for-the-program stream, and the documentation stream. There are certain rules for interleaving them (except for DATA, which should be at the end!), and there are documented ways to specify the encodings of the streams. ◌ Second, the string and regular-expression literals are “interpreted” by the lexer: there is a way for the program to specify a way to “massage” the literals before they are handed to the interpreter. This gives yet other ways for strings and/or regular expressions to be in a different encoding. (Note that this may lead to “doubly encoded” phenomena if the “ambient” encoding is not “raw”.) ◌ Third, there is a way to switch the encoding of a Perl program on the fly (at the end-of-line of the current encoding). To be honest, I should have better tested all this before posting — but I did not. On the practical side, how is this useful? 
Having a different encoding for DATA and the program, and/or for the documentation and the program, may be quite widely used. The other hacks may have been used at least in the (enormous!) Perl test suite. Ilya
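The hazard Ilya describes generalizes beyond Perl: once one byte stream carries sections in different encodings, each section needs its own decoder, and a single whole-document decode fails. A hypothetical Python sketch (the marker convention here is invented purely for illustration):

```python
# One byte stream, two sections in different encodings, split by
# an ASCII marker line (an invented convention for this sketch).
MARKER = b"\n--encoding: latin-1--\n"
doc = "café".encode("utf-8") + MARKER + "café".encode("latin-1")

utf8_bytes, latin1_bytes = doc.split(MARKER)
text_a = utf8_bytes.decode("utf-8")
text_b = latin1_bytes.decode("latin-1")
print(text_a == text_b)  # True: per-section decoding recovers the text

# Decoding the whole stream with a single codec does not work:
try:
    doc.decode("utf-8")
except UnicodeDecodeError as e:
    print("single-codec decode fails:", e.reason)
```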
Re: Ways to show Unicode contents on Windows?
Writing an IME from scratch is beyond the skill set of most of us, although there are Text Services Framework table-based IMEs. I did hear a rumour that support for those may disappear; not sure if that is true or not. But since Windows 8 it has become even more difficult to track what is happening in terms of input, especially since there are more input frameworks than there used to be. One of the reasons I prefer using non-Microsoft tools for complex input requirements. The Microsoft typography team has done some very good work. But Microsoft is so large, things are becoming fragmented. Interesting tools like Locale Builder were never maintained. And it is becoming more difficult to develop solutions for lesser-used languages. It is the nature of the beast, not just an issue with Microsoft and Windows 8, but with internationalisation support in many large projects. Andrew On 19/07/2013 5:47 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: On Thu, 18 Jul 2013 17:11:45 -0700 Ilya Zakharevich nospam-ab...@ilyaz.org wrote: Just in case: do you realize that out-of-BMP must be specified via the LIGATURES section? Yes, for 'character' read UTF-16 code element. Even worse, you can't use dead keys outside the BMP, which prevents one using MSKLC for typing in natural language in cuneiform orthography. (Plain text Egyptian is no more supported than is plain text calculus.) However, I recall that one can use a simple IME instead. Richard.
Re: Ways to show Unicode contents on Windows?
Hi Ilya, That is part of the story. There are tidbits scattered all through Michael's blog. On 19/07/2013 11:53 AM, Ilya Zakharevich nospam-ab...@ilyaz.org wrote: On Wed, Jul 17, 2013 at 12:04:10AM +0100, Richard Wordingham wrote: (LCID); I don't see any way to check what the general .klc file format is - the format seemed very delicate when I had to edit it by hand, at least, not for the SMP. I wonder whether this link is relevant to what you discuss: http://blogs.msdn.com/b/michkap/archive/2013/04/16/10233999.aspx Myself, I found very few problems with manipulation of .klc files. (See the first dozen of Windows GOTCHAS in http://search.cpan.org/~ilyaz/UI-KeyboardLayout/lib/UI/KeyboardLayout.pm ) Just in case: do you realize that out-of-BMP must be specified via LIGATURES section? (Put %% instead of the characters, and put in LIGATURES: the VK, the modification column, and the “content”: up to 4 16-bit numbers.) My sources are in k.ilyaz.org/iz/windows/src.zip. Yours, Ilya
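Ilya's recipe for non-BMP characters can be sketched as a fragment of a .klc file. This is a hypothetical illustration pieced together only from his description (a %% placeholder in the layout cell, then a LIGATURE entry giving the virtual key, the modifier column, and up to four 16-bit code units); the key choice and the character U+10400 (DESERET CAPITAL LETTER LONG I, shown as the surrogate pair D801 DC00) are invented for the example.

```text
// Layout row: %% marks a position whose output is defined
// in the LIGATURE section below (VK_A, unshifted column).
41	A	0	%%	...

LIGATURE

//VK_	Mod#	Char0	Char1
VK_A	0	d801	dc00	// U+10400 as a UTF-16 surrogate pair
```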
Re: Ways to show Unicode contents on Windows?
Hi Richard, Yes, you can build against a custom locale in Vista onwards. It requires editing the source file in a text editor, then building the keyboard layouts from the command line using MSKLC. Andrew On 17 July 2013 09:04, Richard Wordingham richard.wording...@ntlworld.com wrote: On Mon, 15 Jul 2013 18:19:34 +1000 Andrew Cunningham lang.supp...@gmail.com wrote: On 15/07/2013 6:02 PM, Christopher Fynn chris.f...@gmail.com wrote: MS Office seems to want to apply fonts based on the language being used - the input language being determined by the keyboard or IME currently selected. When using a custom keyboard (e.g. one created with MSKLC) or IME, MS Office frequently does not accurately determine the language and consequently overrides your font selection. I am wondering if building an MSKLC layout against a custom locale will get around the problem, or would make no difference? Can one actually build MSKLC against a custom locale? The documentation on the easiest way of building a custom locale implies that it is only available for Windows Vista, whereas I only have XP and Windows 7. The .klc files MSKLC created use a numerical locale ID (LCID); I don't see any way to check what the general .klc file format is - the format seemed very delicate when I had to edit it by hand, at least, not for the SMP. Neither Akkadian nor Hittite comes up on the pick list. (I might choose Hittite because the cuneiform font I have is for Hittite.) I suppose I might have problems with cuneiform because I chose the only Mesopotamian locale available - Iraqi Arabic. Richard. -- Andrew Cunningham Project Manager, Research and Development (Social and Digital Inclusion) Public Libraries and Community Engagement State Library of Victoria 328 Swanston Street Melbourne VIC 3000 Australia Ph: +61-3-8664-7430 Mobile: 0459 806 589 Email: acunning...@slv.vic.gov.au lang.supp...@gmail.com http://www.openroad.net.au/ http://www.mylanguage.gov.au/ http://www.slv.vic.gov.au/
Re: Ways to show Unicode contents on Windows?
On 15/07/2013 6:02 PM, Christopher Fynn chris.f...@gmail.com wrote: MS Office seems to want to apply fonts based on the language being used - the input language being determined by the keyboard or IME currently selected. When using a custom keyboard (e.g. one created with MSKLC) or IME, MS Office frequently does not accurately determine the language and consequently overrides your font selection. I am wondering if building an MSKLC layout against a custom locale will get around the problem, or would make no difference? Andrew
Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?
Hi Roger, The situation is complex. Few applications and web services bother with normalisation, so what you get, i.e. NFC or NFD or other, often depends on which language you are using and what input framework you are using. Some keyboard layouts will produce NFC output; some keyboard layouts will produce neither NFC nor NFD; some keyboard layouts will produce NFD. Some keyboard layouts may produce NFD only if the typist enters the characters in the right order, if the language uses multiple combining diacritics and some of the combining diacritics do not interact typographically. You need very specific input frameworks supporting constraints and reordering to guarantee either NFC or NFD for some languages. And for some languages, different keyboard layouts will produce different output, e.g. some Vietnamese input tools produce NFC, while others produce neither NFC nor NFD. Library data is also problematic. Some ILMs will output NFC but this is not the norm. Usually they will leave it in their internal format. For MARC21, the character repertoire taken as a whole will produce data that is neither NFC nor NFD, but if you look at subsets of data by language, a lot of the data is effectively NFD. But not all. Andrew On Feb 2, 2013 1:19 AM, Costello, Roger L. coste...@mitre.org wrote: Hi Folks, The W3C recommends [1] text sent out over the Internet be in Normalized Form C (NFC): This document therefore chooses NFC as the base for Web-related early normalization. So why would one ever generate text in decomposed form (NFD)? Do any programming languages output text in NFD? Does Java? Python? C#? Perl? JavaScript? Do any tools produce text in NFD? Should I assume that any text my applications receive will always be normalized to NFC form? Is NFD dead? /Roger [1] http://www.w3.org/TR/charmod-norm/#sec-ChoiceNFC
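The Vietnamese case is easy to demonstrate with Python's standard unicodedata module. A small sketch: ế is one code point in NFC, three in NFD, and a keyboard that appends the acute to a precomposed ê produces a sequence matching neither form until it is normalized.

```python
import unicodedata

composed = "\u1ebf"           # ế: single precomposed code point (NFC)
decomposed = "e\u0302\u0301"  # e + circumflex + acute (NFD)
typed = "\u00ea\u0301"        # ê + acute: neither NFC nor NFD

print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(typed in (composed, decomposed))                       # False
print(unicodedata.normalize("NFC", typed) == composed)       # True
```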
Re: Normalization rate on the Web
Hi Denis, A few thoughts ... library data may be NFC or NFD, but is more likely to conform to the MARC character repertoire, so isn't exactly NFD. Vietnamese data is either 1) NFC or 2) neither NFC nor NFD. It would be rare to find Vietnamese data in NFD. For a range of African languages, mainly ones using diacritics and diacritic stacking, it may be 1) NFC, 2) NFD or 3) neither NFC nor NFD, depending on the input framework used. On Jan 22, 2013 3:26 AM, Denis Jacquerye moy...@gmail.com wrote: Does anybody have any idea of how much of the Web is normalized in NFC or NFD? Or how much not normalized? How would one find out or try to make a smart guess? I know a lot of library catalogue data is in NFD or somewhat decomposed. Is there any other field that heavily uses decomposition? -- Denis Moyogo Jacquerye African Network for Localisation http://www.africanlocalisation.net/ Nkótá ya Kongó míbalé --- http://info-langues-congo.1sd.org/ DejaVu fonts --- http://www.dejavu-fonts.org/
Re: cp1252 decoder implementation
Hi On 21 November 2012 16:42, Philippe Verdy verd...@wanadoo.fr wrote: But maybe we could ask Microsoft to map officially C1 controls on the remaining holes of windows-1252, to help improve the interoperability in HTML5 with a predictable and stable behavior across HTML5 applications. In that case the W3C needs not doing anything else and there's no need to update the IANA registry. Not sure what the purpose of this would be, or the need for it. It seems to be a vision of an ideal world that does not exist. If such remapping occurred then some legacy content would potentially be broken. Many languages, and many character encodings, did not go through a formal standardization or registration process. Thus they were not officially supported, and most of the time worked by 1) declaring themselves as iso-8859-1 or windows-1252 and 2) specifying specific fonts. Web browsers support a small, limited number of character encodings, and redefining or changing how key character encodings work will have implications for legacy data and for languages currently unsupported by Unicode, or languages with limited practical support from vendors. OK, not many, but there are a few still out there, and I still come across content being created in legacy encodings. Andrew Project Manager, Research and Development Social and Digital Inclusion Unit Public Libraries and Community Engagement State Library of Victoria 328 Swanston Street Melbourne VIC 3000 Australia Ph: +61-3-8664-7430 Mobile: 0459 806 589 Email: acunning...@slv.vic.gov.au lang.supp...@gmail.com http://www.openroad.net.au/ http://www.mylanguage.gov.au/ http://www.slv.vic.gov.au/
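For the record, the windows-1252 "holes" Philippe refers to are the five unassigned bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D. Python's strict cp1252 codec rejects them, while the WHATWG encoding that HTML5 browsers follow maps them through to the corresponding C1 controls, which is the behaviour the suggestion would have formalized. A quick standard-library check:

```python
# 0x93 is assigned in windows-1252: LEFT DOUBLE QUOTATION MARK.
quote = b"\x93".decode("cp1252")
print(hex(ord(quote)))  # 0x201c

# 0x81 is one of the five unassigned "holes": a strict decoder
# rejects it (HTML5's windows-1252 instead passes it through
# as the C1 control U+0081).
try:
    b"\x81".decode("cp1252")
except UnicodeDecodeError:
    print("0x81 is undefined in strict windows-1252")
```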
Re: [indic] Re: Lack of Complex script rendering support on Android
I'd agree with Ed, it's a broader problem than just India, and a problem not just based on market segments. I use Android devices often, but cannot use them as a serious tool for work because of what I would classify as serious limitations in the OS and its internationalisation model. At the moment mobile devices are still toys, unable to deal with most community languages I need to work with or support in Australia, including many Latin-script languages. This is a limitation in all mobile OSes, not just Android. I just have to look through my Facebook and Google+ accounts to see many messages in African and South East Asian languages that will not display. I do love the irony of Android devices not being able to display all content available through Google services. Andrew
Re: Multiple private agreements (was: RE: Code pages and Unicode)
So you will end up with the CSUR AND the registry Philippe is suggesting AND all the existing uses of the PUA that will not end up in the CSUR or the other registry. Sounds like it will be a mess. It's bad enough dealing with Unicode and pseudo-Unicode in the Myanmar script; adding the PUA potentially into the mix ... ummm ... On 25 August 2011 11:55, Philippe Verdy verd...@wanadoo.fr wrote: 2011/8/24 Doug Ewell d...@ewellic.org: As Richard said, and you probably already know, there is no chance that UTC will ever do anything with the PUA, especially anything that gives the appearance of endorsing its use. I'm just thankful they haven't deprecated it. The appearance of endorsing its use would only come if the website describing the registry was using a frame with the Unicode logo. It can act exactly like the CSUR registry, as an independent project (with its own membership and participation policies), that would also be helpful for collaborating with liaison members, ISO NBs, or some local cultural organizations or collaborative projects. The focus of this registry would only be on helping the encoding process: registered PUAs or PUA ranges would not survive finalized proposals that were formally proposed and rejected by both the UTC and WG2, and abandoned as well by their initial promoters in the registry (no new updated proposal), or proposals that have finally been released in the UCS (and there would likely be a short timeframe for the death of these registrations, probably not exceeding one year). 
It would be different from the CSUR, because the CSUR also focuses on supporting PUAs that will never be supported in the UCS (for example, due to legal reasons, such as copyright, which would restrict the publication of any representative glyph in the UCS charts), or creative/artistic designs (for example, I'm still not convinced that Klingon qualifies for encoding in the UCS, because of copyright restrictions and the absence of a formal free licence from the rights owners; the same would apply to any collection of logos, including the logos of national or international standards bodies that you can find on lots of manufactured products and in their documentation, because the usage of these logos is severely restricted and often implies contractual assessments by those displaying them on their products or publications; this would also apply to corporate logos, even if they are widely used, sometimes with permission, but this time because these logos frequently change for marketing reasons). -- Andrew Cunningham Senior Project Manager, Research and Development Vicnet State Library of Victoria Australia andr...@vicnet.net.au lang.supp...@gmail.com
Re: ch ligature in a monospace font
On 30 June 2011 07:59, Richard Wordingham richard.wording...@ntlworld.com wrote: On Wed, 29 Jun 2011 03:49:42 + Peter Constable peter...@microsoft.com wrote: That would appear to be a limitation of the input method. It is indeed a limitation of X. I get round it on Ubuntu by using IBus and KMFL (Keyman for Linux), which then allows me to use dead keys for sequences, something which is (or used to be) beyond MSKLC. I assume you mean KMFL (Key Manager for Linux), which uses an extremely old version of the Keyman syntax. From memory you should be able to get more mileage out of MSKLC than you seem to have. Andrew -- Andrew Cunningham Senior Project Manager, Research and Development Vicnet State Library of Victoria Australia andr...@vicnet.net.au lang.supp...@gmail.com
Re: Using Javascript to Detect Script Support in a Browser
Hi Ed, On 22 June 2010 11:51, Ed Trager ed.tra...@gmail.com wrote: Thanks, Andrew! I like Keith's approach. I have been looking at Lanna a little bit and I am not sure if *any* OS shaper currently really has fully implemented correct shaping support for Lanna? In any event, Lanna is quite similar to Myanmar, so Keith's approach could be used very successfully. Since there is no specific guidance for developing Lanna or Myanmar OpenType fonts, I assume that Lanna fonts have been developed using some of the more generic OpenType features, much like Myanmar, and thus should shape correctly on the same platforms as Myanmar Unicode fonts do. I guess the key issue is who your target audience is and what the oldest OS versions likely to be used on your site are. Out of curiosity, which OpenType fonts are you using for Lanna? When experimenting with Myanmar web fonts, I made one big mistake: I relied on some of the web-based tools for generating web fonts, which broke the complex rendering. Best to generate the web fonts from the available command-line tools. It might be interesting to see if Keith's approach can be generalized a bit to detect whether correct rendering is available for a number of those related S and SE Asian scripts: Myanmar, Lanna, Khmer, Kannada, etc. Should be possible. -- Andrew Cunningham Senior Project Manager, Research and Development Vicnet State Library of Victoria Australia andr...@vicnet.net.au lang.supp...@gmail.com
Re: Using Javascript to Detect Script Support in a Browser
It is an issue that we've struggled with for a while. EOT, TTF font linking, WOFF and SVG fonts all play a part in a possible solution. For my projects I also have to consider whether clients are likely to be using older operating systems, and thus may not have rendering support. So detecting whether appropriate fonts are available isn't enough on its own. Keith Stribley used a similar approach, see http://www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/WebDevelopers/#detect For Myanmar he compared U+1000 U+1000 to U+1000 U+1039 U+1000, which not only allowed him to see if an appropriate font was available, but whether appropriate rendering was occurring. On 17 June 2010 07:29, Ed Trager ed.tra...@gmail.com wrote: On Tue, Jun 15, 2010 at 5:52 PM, Marc Durdin Couldn't you do this just using font fallback in CSS, and just leave it to the user agent to sort out? Two examples: P { font-family: Code2000, MyCode2000; } @font-face { font-family: MyCode2000; src: url('code2000.ttf'); } Or: P { font-family: MyCode2000; } @font-face { font-family: MyCode2000; src: local(Code2000), url('code2000.ttf'); } for the browsers that can handle src local() syntax. I cannot conclusively say at this point whether my planned dynamic solution is better than a more static let the UA figure it out approach, but I'm going to try it and see how it goes. Both approaches have their benefits; it really depends on what you are trying to achieve. But I suspect that the static solution is more scalable. -- Andrew Cunningham Senior Project Manager, Research and Development Vicnet State Library of Victoria Australia andr...@vicnet.net.au lang.supp...@gmail.com
Re: OpenType vs TrueType (was current version of unicode-font)
OK, a slight variation on the questions to date. Which OpenType fonts (other than Doulos SIL and Code2000) support the placement of combining diacritics? Andrew Andrew Cunningham andj_c at iprimus.com.au andrewc at vicnet.net.au
Re: Combining diacriticals and Cyrillic
Hi Vladimir, Yes, in theory your answer is Unicode, i.e. Cyrillic plus combining diacritics, although the actual application of the theory will differ from operating system to operating system. I did a quick test on Windows in both word processors and web browsers. Everything displayed correctly (given certain combinations of fonts and applications). There are two elements that need to be addressed: 1) Appropriate fonts. I only know of two that are suitable: Code2000 (v. 1.13) has the appropriate OpenType tables (I believe it uses the OpenType MarkToBase feature - others on the list will correct me if my memory is faulty). The second font is Doulos SIL (v 0.6 - Beta). This font has both OpenType tables and Graphite tables. Graphite is a rendering system developed by SIL International. 2) You need a rendering system that supports the features. On Windows, this means that you will need a version of Uniscribe that supports the use of combining diacritics with Cyrillic characters. Currently none are available, except for the version in the MS Office 2003 Beta. I did a quick test using the two fonts above, and the characters displayed correctly. So from the point of view of word processing, there is a solution coming. This approach will also work with other applications that support Uniscribe, although you might have to wait until Microsoft releases a service pack that contains the Uniscribe update. I assume that Microsoft will update one or more fonts with the necessary features when they release Office 2003. I also tested the software in some Graphite-enabled software (WorldPad and a Graphite-enabled version of Mozilla). It seemed to work fine as well. [EMAIL PROTECTED] wrote: Dear Ladies and Gentlemen, Currently there is an ongoing effort in Bulgaria trying to resolve an issue concerning the way we write in Bulgarian. Our problem is: usually a Bulgarian regular user does not need to write accented characters. 
There is one middle-sized exception to this, but generally we do fine without accented characters. The problem is that in some special cases, or in more serious linguistic work, one definitely needs to be able to write accented characters (accented vowels). One of the ideas is to invent a new ASCII-based encoding containing the accented characters we need. This would introduce additional disorder into the current mess of Cyrillic encodings, and would introduce problems with automated spellcheck. Generally I believe it would be best to invent a Unicode-based solution. Such a solution is, for example, combining diacritical signs with the Cyrillic symbols. I composed a demo page: http://v.bulport.com/bugs/opera/426/balhaah_lonex_org/ and then made 10-20 shots of the results on Opera and IE on Linux, Windows 98 and Windows XP: http://v.bulport.com/bugs/opera/426/balhaah_lonex_org/shots.html You can see that this approach yields _quite_ inconsistent and useless results, depending on the font, application and operating system being used. Finally, I wonder if you could give us some advice: 1. Is it possible somehow to improve this approach? I imagine, e.g., if the font can provide prepared combined symbols whenever the application asks for a combined Cyrillic+diacritical, instead of leaving the application to do the combination. 2. Do you see another Unicode-based approach to the Bulgarian problem? 3. Do you believe the approach should be looked for outside Unicode? Please excuse me for wasting your time, Vladimir, Bulgaria . -- Andrew Cunningham Multilingual Technical Officer Online Projects Team, Vicnet State Library of Victoria 328 Swanston Street Melbourne VIC 3000 Australia [EMAIL PROTECTED] Ph. +61-3-8664-7430 Fax: +61-3-9639-2175 http://www.openroad.net.au/ http://www.libraries.vic.gov.au/ http://www.vicnet.net.au/
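The same precomposed-versus-combining split shows up in Cyrillic, which is why the accented vowels Vladimir needs are awkward: a few combinations (like й) have precomposed characters, but a stressed Bulgarian а́ does not. A small Python sketch using only the standard library:

```python
import unicodedata

# и + combining breve composes to the precomposed й (U+0439).
short_i = unicodedata.normalize("NFC", "\u0438\u0306")
print(short_i == "\u0439")  # True

# а + combining acute (a stressed Bulgarian vowel) has no
# precomposed form, so NFC keeps the two-code-point sequence
# and rendering is left entirely to the font and shaper.
stressed_a = unicodedata.normalize("NFC", "\u0430\u0301")
print(len(stressed_a))  # 2
```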
Re: 4701
From memory, although my memory may be faulty, there are some slight differences between the animals assigned in the Chinese calendars and the animals assigned in the Vietnamese calendar. In the Vietnamese sequence it is the goat, while most Chinese sources indicate sheep (occasionally they say ram, but sheep is most common). At least that's what I seem to remember. But then there have been so many firecrackers going off over the three days of Tet that something might have rattled loose in my memory. Andrew Michael Everson wrote: At 10:19 -0800 2003-02-01, Eric Muller wrote: Michael Everson wrote: Happy New Year of the Yáng to everybody! (I can't work out whether it's the Year of the Sheep, the Goat, or the Ram.) Ram. europe.cnn.com (which I was looking at for other, sadder reasons), says Goat. My local Superquinn's (large grocery chain) has had signs on all the Chinese food for weeks which say Ram. My Chinese dictionary says Sheep.
Re: glyph selection for Unicode in browsers
Hi Tex, Tex Texin wrote: In the case of HTML, XML, CSS, ways to specify typographic preferences exist, and language can be expressed via lang. We just need browsers and other user agents to make use of the lang information as part of font selection. For me, this is the crux: that browsers have not implemented the CSS :lang selector. Things would be easier if we could tie presentation (via CSS) to the specified language of a document or part of a document. Andrew -- Andrew Cunningham Multilingual Technical Officer OPT, Vicnet State Library of Victoria Australia [EMAIL PROTECTED] Ph: +61-3-8664-7001 Fax: +61-3-9639-2175 http://home.vicnet.net.au/~andrewc/ http://www.openroad.net.au/
Re: Can browsers show text? I don't think so
I was intending to avoid this whole thread, but considering some of the comments that have been made in it, I'm forced to make one comment: I find the naivety displayed on this list relating to issues of multilingual PUBLIC internet access disturbing. Andrew Andrew Cunningham [EMAIL PROTECTED]
Re: Unicode Latin combining diacritics - Looking for real-world example documents
Hi Chris, I'm currently typesetting some Dinka poetry for a friend. Dinka requires a combining diaeresis with open-o and open-e; info at http://www.openroad.net.au/languages/african/dinka-2.html A sample UTF-8 web page is at http://www.openroad.net.au/languages/samples/dinka.html This poem plus another one are attached as text documents (UTF-8). Additionally, for linguistic purposes you could also add tone markers (grave and acute), but this isn't used in day-to-day writing. I'll try to source some Nuer text which uses a macron below. Chris Pratley wrote: Does anyone have real-world documents in Unicode that take advantage of Latin Combining Diacritics (U+0300 range and others) to accurately represent the text content? If so, I would appreciate links or docs mailed to me. We’re doing some testing of Latin Diacritic support for IPA and African languages, romanizations, etc., and it is (understandably) very hard to find any “real” text in languages that require this support where the diacritics have not been left out in order to work around the lack of software support. (Catch-22!). I’m looking for text (especially with stacked diacritics) in IPA, Hausa, Ewe, or other West African languages, Mixtec or other Mexican languages, Dinka, Nuer, etc. Basically anything that is real-world and shows off typical or tricky diacritic combinations. If you could include an image or at least a verbal description to show what the display would be if it were correct, that would be lovely. I’m not promising anything, but I know that there are several (many) people on this list who would be interested in having this support in Word or other Microsoft products, so now’s your chance to influence the outcome – if we’re going to get it done right I need your help! Thanks in advance, Chris Pratley Group Program Manager Microsoft Word Sent with OfficeXP on WindowsXP Yeŋa ba wɛ̈l ɣakɔ̈u M. A. M. Yeŋa ba wɛ̈l ɣakɔ̈u Yeŋa ba wɛ̈l ɣakɔ̈u Yeŋa bï kɔ̈ɔ̈c ɣakɔ̈u Yeŋa ba wɛ̈l ɣakɔ̈u. 
Na tiëŋ cam ku cuëc, Ka cïn raan kääc ke ɣɛɛn. Na liɛɛc ɣakɔ̈u ku ɣanhom, Ka cïn raan, kääc ke ɣɛn. Kuatdiɛ̈ adaai të nɛ̈k alei ɣa thïn Apirika acä wɛ̈l yekɔ̈u. Yeŋö cï wuɔ̈ɔ̈t ɣa maan Yeŋö cï wuɔ̈ɔ̈t ɣa maan Kɔc wäär bïï yanhden Aa kääc roor të mec Ku alei anɛ̈k ɣa yanhden. Yeŋö ye wuɔ̈ɔ̈t yethok mat Bïk ɣa cɔl adhur kuat ce thok mat. Yeŋö ye alei lɔ̈ɔ̈r yupic Yeŋo yen lɔ̈ɔ̈r guɔ̈t köök Yeŋo ye alei ɣa guɔ̈t pïny wakɔ̈u. Acï weŋ peec ku peec thɔ̈k Acï Deŋ peec ku peec Nyankiir Adhur ɣɛn ke wämäth akën Ku kɔc ken ye kek yanh tök theek Ku acïn raan kony ɣɛɛn. Yeŋa ba wɛ̈l ɣakɔ̈u Yeŋa ba wɛ̈l ɣakɔ̈u Apirika, acä wɛ̈l yekɔ̈u Acä päl alei. Bï alei ɣa näk bɛɛŋ Ku ke yic yen anɛ̈k ɣɛɛn Yiny wïc ɣɛn piɛnydiɛ̈. Aŋic pinynhom lɔc cï Muɔnyjäŋ thuɔu gam Ku thou atɛɛr ke pïïr Na bɔ̈ piɛnydiɛ̈ bei Ke Nhialic abä bɛn cuɔ̈ɔ̈t Köŋ cuëëc ë Deŋ Abuk Aba bɛn cuɔ̈ɔ̈t Ku abä wɛ̈ɛ̈r bei. Yic acie tiaam Yic acie thou Cɔk run bɛ̈ɛ̈n Loŋär bɛ̈n Ke Nhialic abä bɛn cuɔ̈ɔ̈t. Acïn të liu yuɔmkiɛ̈ thïn Yïn kiir ɣer ku kiir col Cäk ɣa päl alei Cäk ɣa päl alei Yeŋa cït ɣɛɛn Yeŋa cï yɔŋ yaaŋ yic buɔɔt Ɣɛn acï cuäny ɣöt ke miëthkiɛ̈ Arak thiäär. Ɣɛn acï nɔ̈k bï ɣa luɔ̈i akuut Ku acïn raan cï alei thiëëc. Kɔc ë pinynhom kɔc tɛ̈k yiith Cäk la dë Cï alei week dɔ̈m määth Na week kɔc ye ɣok, ke yanh tök theek Cï alei week ɣɔɔc Na cäk jai Jesu, Ke ɣok aabï rɔ̈m pan nhial ë Kristo. Wun ë Tiɛɛl acie Mac ku Pan Cïnic Bɛ̈ny ee riääk aköl Muɔrwël Ater Muɔrwël Jɔlku muɔ̈l teer Ku yeku röt nhiaar Acïn raan ben Muɔnyjäŋ nhiaar Ee Muɔnyjäŋ yen acï ya anyaar Yen ajɔl wuɔ̈ɔ̈t thäär. Ariɔ̈ny kiith wäär cï thiaan Kek aa jam ka bï Muɔnyjäŋ tiaam Ku këden acä ye bɛ̈ɛ̈r Ɣok aa kɔc thiääk Aköl le ɣɛn rɔ̈m ke keek Aabï dhiau arak thiäär Rin ɣɛn ee moc arak thiäär. Ee raan dhɛ̈l ɣa yen acä ye ŋuään Ku yen aye cuɔ̈p teer Ku na cɔk kë ɣa keer Ke ɣɛn acï kɛt keek. Thɔndït ee nöök të piiny, Ɣɛn ee Muɔnyjäŋ. Ɣɛn ee Muɔnyjäŋ. 
Jɔlku muɔ̈l teer Ku yeku röt theek Rin puɔ̈n cïnic teer Yen abï Nhialic ɣok thiee Ku yen abï Nhialic ɣok röt kueeŋ Ku yen abï ɣok röt deer. Na cuk röt ë theek Ka alei ee ɣo wïc yiic Ë raan cuai bï keeth Ku rum käkua bï pïïr ë keek Ku benku ya dhiau ɣok aacï alei peec Ku ke tiɛldan cï ɣok ë mat Yen acï alei ɣok ë theek. Duɔ̈kkë ye mïth ë röt Acï kɔc ë leec Ku ee yï mac theer Ku acïn Muɔnyjäŋ ye mac theer Diët Muɔnyjäŋ acie baai keer Aa mïïth kek acï ɣok yiëk teer Ku yen acï alei ɣok ë theek Ku rum käkua bï pïïr ë keek Ku benku ya dhiau ɣok aacï alei peec Ku ke tiɛldan cï ɣok ë mat Yen acï alei ɣok ë theek. Jɔlku muɔ̈l teer Alei acï ɣok leel Buk yanhde ya theek Ku cuk yäthkua ye theek Ku na cɔkku yanhden theek Ka ŋuɔt cïk gam buk nhïïm thöŋ ke keek Rin alei, ku cɔk yiëk moor Ka cï yï kɔŋ leec. Matku ɣo yiic Matku ɣo yiic Ku yeku röt deet Ëtë bï ɣok piir thïn ke keek Ee
Re: How to make oo with combining breve/macron over pair?
Hi Dan, At 08:39 PM 3/3/02 -0800, Dan Wood wrote: Hi, I'm not finding hints of this in any of the FAQ or where's my character docs I'm trying to create (or find) the oo pair with a combining macron (0304) and combining breve (0306) over the pair of them together, as in these images: http://www.bartleby.com/images/pronunciation/oomacr.gif Is it a combining macron you need, or a combining overline? From my understanding the overline is supposed to connect on the left and right, and I'd assume the macron isn't supposed to. So maybe U+006F U+0305 U+006F U+0305 would suit. Either that, or we need two additional combining double diacritics added to Unicode. Andrew
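As it turned out, Unicode 4.1 later added exactly these double diacritics: U+035D COMBINING DOUBLE BREVE and U+035E COMBINING DOUBLE MACRON, which follow the first base letter and render across it and the next one. A small Python check using the standard library:

```python
import unicodedata

print(unicodedata.name("\u035d"))  # COMBINING DOUBLE BREVE
print(unicodedata.name("\u035e"))  # COMBINING DOUBLE MACRON

# The double diacritic is placed after the first of the two
# base letters it spans: o͞o
oo_macron = "o\u035eo"
print(len(oo_macron))  # 3
```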
Re: Unicode Search Engines
At 08:13 AM 2/19/02 -0800, Doug Ewell wrote: Asmus Freytag [EMAIL PROTECTED] wrote: So if some language turns out to need a with horn in the future, its readers will have to cross their fingers that rendering engines become capable of displaying U+0061 U+031B properly. Support for such arbitrary combinations is apparently in the works in several camps - it's needed in African languages for one. And judging from Marco's unrelated post about Yoruba q-tilde, in which I *did* see the tilde positioned correctly (more or less) over the q, I guess support is more advanced than I thought. Terrific. Ummm ... it may work for lowercase, if you're not fussy about the precise location of the diacritic. I suspect that the diacritic would overstrike the uppercase character, though. === Andrew Cunningham Multilingual Technical Officer Accessibility and Evaluation Unit, VICNET State Library of Victoria Australia http://www.openroad.net.au/ [EMAIL PROTECTED] +61-3-8664-7001 ===
Unicode-Afrique forum
Hi everyone, thought I'd pass on the info below. A French-language forum discussing the potential of Unicode for African languages has been launched. Details below. Andrew == Unicode-Afrique http://groups.yahoo.com/group/Unicode-Afrique/ Unicode probably represents the best chance to promote computing and Internet content in African languages. The current plurality of fonts and of mutually incompatible encoding systems for special or non-Latin characters prevents true multilingualism in ICT in Africa (and the world). This e-group exists to: publicize projects in Africa that use Unicode; discuss practical questions and problems with Unicode and character sets for African languages; and share useful experience in developing and using Unicode fonts for African languages. It is therefore not in competition with the Unicode newsgroup "fr.comp.normes.unicode", nor with general discussion lists on ICT in Africa such as "afrique-informatique."
Re: Problems with viewing Hindi Unicode Page
- Original Message - From: [EMAIL PROTECTED] The version of Arial Unicode MS on my system does have layout tables for Devanagari. I don't know with what product this version was introduced to my system -- I've got Win2K, IE5.5 and Office XP. I guess the question becomes: which version of Arial Unicode MS? I suspect that the version of Arial Unicode MS you have must be from Office XP. Andj
Re: Inuktitut, Cree, Ojibwe input methods?
There is also CreeKeyUni, which uses Tavultesoft Keyman 5, available at http://www.creeculture.ca/e/language/fonts_kbds.html Andrew Cunningham At 03:09 PM 10/29/01 -0800, John Hudson wrote: At 10:43 10/29/2001, Mark Leisher wrote: Does anyone have any pointers to keyboard layouts/input methods for these (or related) languages? There is an official Inuktitut keyboard developed for the government, language commission and land rights organisations in Nunavut. A driver has been made for Windows NT/2000/XP, and my understanding is that Microsoft are reviewing this for possible inclusion in the OS. This keyboard driver is downloadable from http://www.assembly.nu.ca/test/unicode/, which also has some fonts and utilities for converting documents using older non-standard encodings to Unicode. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] Afghan warlord kills own troops, sells drugs, plays with dead goats - and he's on our side. National Post headline Friday, October 19, 2001
Re: Inuktitut, Cree, Ojibwe input methods?
Hi Peter and everyone, I'd be interested in seeing the Keyman file you generated for Eastern Cree. Most of the keyboards I've seen have been designed for specific languages. Has anyone come across a single keyboard layout intended to support all of UCAS? A friend at the National Library of Canada was interested in a single keyboard layout that their staff and the public could use to access Unicode-based catalogues and databases. On public workstations it would be easier to support a single layout, rather than different layouts for different languages that use Syllabics. Andj Andrew Cunningham Multilingual Technical Officer Accessibility and Evaluation Unit, Vicnet State Library of Victoria Australia At 10:21 PM 10/29/01 -0600, [EMAIL PROTECTED] wrote: On 10/29/2001 04:13:39 PM James Kass wrote: And, here is a page which illustrates different layouts for Eastern and Western Syllabics (and has fonts, too): http://www.knet.on.ca/keyboard.html I just did a quick Keyman file for one of these layouts (the Eastern Syllabics layout -- generating Unicode, not the custom encoding of their fonts), though there are a few symbols in their chart where it's not clear to me just what they want. Anyway, I'll make it available if anyone wants to use it. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
Re: OT Nastaleeq conforming to Unicode
Hi Abdul-Majid, I'd be very interested in hearing more about your font development project. Andj. At 10:53 AM 9/6/01 -0700, Majid Bhurgri wrote: A few days ago I posted the following message, which was received well, and I received quite a few responses. But as I was on vacation, I only briefly reviewed some of the messages and somehow, in the meanwhile, all the messages got deleted before I could respond or even save them. I apologize for the inconvenience, and request you to kindly resend your messages to me so that I can respond to you individually. Thanks and regards. Abdul-Majid Bhurgri I have developed a prototype Nastaleeq (Urdu) font of the same quality as the currently available Nastaleeq fonts used for typesetting, which also conforms to the Unicode Standard and OpenType specs and as such works smoothly in MS Windows and multilingual Windows applications (MS Word, Excel, Access etc.). Completion of the project needs time and resources. Anyone interested may contact me at [EMAIL PROTECTED]
Re: Latin w/ diacritics (was Re: benefits of unicode)
Quoting John Hudson [EMAIL PROTECTED]: Although there has not been any official announcement from Microsoft, and no release date, my understanding is that 'generic' shaping is being added to Uniscribe. This includes support for diacritic composition using OpenType mark-to-base and mark-to-mark positioning lookups. The font support is already in place (see the OpenType specification v1.3, published last week, at http://www.microsoft.com/typography ), and the system support is on the way. This is good news, whenever it does finally eventuate. I'll look at the new spec. Andrew Cunningham Multilingual Technical Project Officer Vicnet, State Library of Victoria [EMAIL PROTECTED]
Re: Latin w/ diacritics (was Re: benefits of unicode)
Quoting James Kass [EMAIL PROTECTED]: Waiting isn't much of an option, the users need results now. Even when the rendering technology catches up, the old 386's and such that are in use in places like the Sudan may not be able to support an OS capable of using new rendering technology. Similar circumstances may apply to many of the hundreds or thousands of 'Unicode-challenged' writing systems mentioned by Peter Constable. Actually, not Unicode-challenged, since Unicode has a mechanism to support them; more OS- and software-challenged. Andrew also mentioned custom (8-bit) code pages, which are widely used. Lately, people who haven't considered the lack of alternatives have taken to criticizing such practicality, calling it "font-hacks" and Actually, I don't think they're widely used. But I'd rather not get into Sudanese politics at the moment. so forth. If you do make custom code page web sites, perhaps you should consider maintaining duplicate web pages in Unicode. Even though the Unicode pages wouldn't display, they would be handy to send as links in response to anyone complaining about non-standard code pages. Our initial intention was to use a Unicode solution, but we have also investigated a custom 8-bit code page. One of the areas that has interested me for a while is language retention among refugee communities. My Dinka friends are hoping to develop a trilingual web site (English, Arabic, Dinka) that would provide information about their culture and provide resources that can be used to teach their children their own language. This could be done in print; the reason they wish to place the resources online is to provide them to other Dinka refugees who have settled in other countries. Whether the PUA or custom code pages are used, some kind of software which converts to and from Unicode would be helpful to assure that users of older hardware can continue to communicate with the "modern" world. 
Philosophically, I'd prefer not to use the PUA. It's quite possible that we'll use an 8-bit character set initially, and that I'll construct Unicode versions for private testing and evaluation. Since I'm not a programmer, I'm not able to throw together such a utility. I've seen a number of utilities that allow you to convert between Unicode and a range of defined character sets and encodings, but I haven't found one that also lets you easily construct custom mapping tables to use with it. If anyone is aware of such a tool, I'd be interested in hearing about it. Andj. Andrew Cunningham Multilingual Technical Project Officer Vicnet, State Library of Victoria [EMAIL PROTECTED]
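[For what it's worth, the core of the utility being asked for above is small. The sketch below is an editor-added illustration: the byte values in the table are hypothetical, not a real Dinka code page; the point is only that a user-editable mapping table driving a converter is a few lines of code.]

```python
# Hypothetical custom 8-bit font encoding -> Unicode mapping table.
# In a real tool, this table would be loaded from a user-supplied file.
CUSTOM_TO_UNICODE = {
    0xF0: "\u0254",        # LATIN SMALL LETTER OPEN O
    0xF1: "\u0254\u0308",  # open o + combining diaeresis
    0xF2: "\u025B",        # LATIN SMALL LETTER OPEN E
    0xF3: "\u025B\u0308",  # open e + combining diaeresis
}

def decode_custom(data: bytes) -> str:
    # Bytes absent from the table are assumed to be plain ASCII here.
    return "".join(CUSTOM_TO_UNICODE.get(b, chr(b)) for b in data)

print(decode_custom(b"p\xf1n"))  # a word using open o with diaeresis
```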
Re: benefits of unicode
Quoting "Michael (michka) Kaplan" [EMAIL PROTECTED]: From: "Andrew Cunningham" [EMAIL PROTECTED] Well, I guess this is one of those huge "maybe" type questions, since there is no universal definition of what "supports Unicode x.xx" means. Here are some sample posers: LOL, yep, I understand and agree ... I suppose that, working predominantly with community languages in Australia, I tend to get asked more often for those scripts in Unicode 3.0 that Microsoft doesn't support yet in any way, shape or form. *shrugs* 'tis the weave. One of the inherent problems with working with multilingual community information. Life would be easier if I was working on the business side rather than the community side of the field. and if only they did allow Latin script support in Uniscribe, but I guess support for African languages is extremely low on their list of priorities. I would not ever presume such a thing... what issues in latin scripts are you referring to? I am not sure Uniscribe is where such a fix would be (all the issues I know of would involve keyboards and potentially fonts). Let's see ... one problem I'm having at the moment is how to support Dinka (Southern Sudan) in Unicode on web pages displayed on Windows 95/98/ME/NT4/2000. Four characters come to mind; each of the four can ideally be represented by a pair of code points: U+0254 U+0308 LATIN SMALL LETTER OPEN O + COMBINING DIAERESIS U+0186 U+0308 LATIN CAPITAL LETTER OPEN O + COMBINING DIAERESIS U+025B U+0308 LATIN SMALL LETTER OPEN E + COMBINING DIAERESIS U+0190 U+0308 LATIN CAPITAL LETTER OPEN E + COMBINING DIAERESIS Also, there is a convention for indicating tone that is not part of the formal orthography of the language, but is useful in materials designed for students learning the language. A set of combining diacritics is used to indicate tone: rising tone indicated by an acute, and falling tone indicated by a grave. 
So U+0254, U+0186, U+025B and U+0190 would have to combine with an acute and a grave. All breathy vowels (indicated by a diaeresis) would also have to combine with a grave or acute. So you'd have a base vowel (a, e, open e, i, o, open o, or u), each with two combining diacritics: one a diaeresis and the second an acute or a grave. Theoretically I know what Unicode characters would be in the data stream; I could use Keyman, for instance, to input the appropriate vowels and combining diacritics. The problem comes with display. I can cheat and create glyphs in the PUA for all the necessary characters. That would mean that instead of entering U+0254 U+0308, I'd have the input software enter a single code point in the PUA ... a rather daft approach for future compatibility, since an appropriate code point sequence already exists (U+0254 U+0308). In theory this could be handled using glyph substitution; it's possible to create an OpenType font that uses glyph substitution to render the required glyphs. But this is where the problems start: from my understanding, Adobe's InDesign supports some OpenType font features for the Latin script, but Microsoft's Uniscribe does not support the Latin script. Since my knowledge of font rendering technology is rather limited, are you aware of another way I can render these characters in IE5+ on the various Windows platforms? I suppose if I restricted myself to fixed-width fonts I could create combining diacritics that would be correctly spaced, but since I really need proportional fonts, I'm not sure how to proceed. Currently we're using custom character sets (8-bit) that were explicitly made for the Dinka language. This problem isn't unique to Dinka; you'll find it exists in other African and some Australian Aboriginal languages. So the question is: how should one handle languages that use combinations of Latin letters and diacritics where a precomposed form does not exist? Andj. 
Andrew Cunningham Multilingual Technical Project Officer Vicnet, State Library of Victoria [EMAIL PROTECTED]
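[The four base/diacritic pairs listed in the message above can be built as plain combining sequences. This short Python check, added by the editor for illustration, confirms that none of them has a precomposed form, so normalization leaves the two-code-point sequences intact; the rendering problem described is purely a font/shaping-engine issue, not an encoding one.]

```python
import unicodedata

dinka_pairs = [
    "\u0254\u0308",  # LATIN SMALL LETTER OPEN O + COMBINING DIAERESIS
    "\u0186\u0308",  # LATIN CAPITAL LETTER OPEN O + COMBINING DIAERESIS
    "\u025B\u0308",  # LATIN SMALL LETTER OPEN E + COMBINING DIAERESIS
    "\u0190\u0308",  # LATIN CAPITAL LETTER OPEN E + COMBINING DIAERESIS
]

for s in dinka_pairs:
    base = s[0]
    # No precomposed forms exist for these, so NFC is a no-op.
    assert unicodedata.normalize("NFC", s) == s
    print(f"U+{ord(base):04X} {unicodedata.name(base)} + U+0308")
```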
Re: benefits of unicode
Hi James, Quoting James Kass [EMAIL PROTECTED]: Many African adaptations of the Latin script require characters which aren't precomposed in Unicode. Yep, and you can add a number of Australian Aboriginal languages to that list as well. One example of a common problem is with combining diacritics designed for lower case letters. When the diacritic is used with a capital letter, it appears at the default height and is superimposed on the capital letter rather than appearing above it. The new OpenType font specifications enable such combinations to be displayed correctly. Uniscribe is the mechanism that accesses OpenType features on Microsoft OSes. The version of Uniscribe currently available doesn't yet have support for Latin OpenType features enabled. I seem to recall having read recently that Latin features would soon be enabled in Uniscribe, perhaps as early as this summer. I hope so; the last comment I remember reading on the VOLT mailing list seemed to indicate they weren't overly interested in supporting Latin. Hope they do. And depending on how they do it, it might make Unicode for African languages possible. Andj. Andrew Cunningham Multilingual Technical Project Officer Vicnet, State Library of Victoria [EMAIL PROTECTED]
Re: benefits of unicode
- Original Message - From: Michael (michka) Kaplan [EMAIL PROTECTED] It DOES, however, underscore the fact that Unicode support is so much easier than supporting every random code page that the only reasonable way vendors can keep up with every single market is to have a good story for Unicode support. True, though personally I'd rather see Microsoft complete their Unicode support first before doing anything with other character sets ... quite a few years off full support for Unicode 3.0 and 3.1. And if only they did allow Latin script support in Uniscribe, but I guess support for African languages is extremely low on their list of priorities. Andj.
[OT] Re: relation between unicode and font
Hi everyone, actually there is a bug in the browsers, or at least in Internet Explorer. It's been there in versions 4, 5 and 5.5. Yes, a lot of 8-bit fonts exist. Many of these 8-bit fonts follow Microsoft's code pages rather than the ISO 8859 series, in that they place characters in the C1 zone. For instance, if I was creating a Vietnamese page in the VISCII encoding, I'd associate the VISCII fonts with the User Defined encoding in the web browsers. This works fine in Netscape, but doesn't work in Internet Explorer. For some reason known only to Microsoft, since version 4 of their browser the User Defined slot carries out a conversion similar to the Western (Windows) encoding: the characters in the C1 zone are remapped, based on Win-1252, to the corresponding values in Unicode. Why this mapping was ever applied to the User Defined slot, I'll never know. If you prepare a VISCII web page containing all the lower case Vietnamese vowels, you'll discover that some of the vowels cannot be displayed in Internet Explorer at all, while Netscape 4.x passes them through as-is and will display them. Unicode is a boon these days: it means I can create a Vietnamese web page that displays in Netscape AND Internet Explorer. Any custom 8-bit encoding that has characters in the C1 zone may have the same problem. Working with multilingual public internet access becomes problematic: IE is only suitable for encodings that have built-in support in the browser, and useless for encodings like VISCII that are transformed by the browser, making some of the characters undisplayable. This is one of the reasons that my industry hasn't widely accepted Internet Explorer as a default browser; it can't handle the languages we need to use, community languages rather than commercial languages. It is also one of the reasons that we try to encourage the use of Unicode. 
Andj Andrew Cunningham Multilingual Technical Project Officer VICNET, State Library of Victoria Australia [EMAIL PROTECTED] - Original Message - From: Yung-Fong Tang [EMAIL PROTECTED] To: Unicode List [EMAIL PROTECTED] Cc: Unicode List [EMAIL PROTECTED] Sent: Saturday, January 06, 2001 6:29 AM Subject: Re: relation between unicode and font Not really a browser bug. It is a bug in the FONT. Some of the font basically claim they are design for a certain encoding which 0x00-0x7F represent ASCII while the glyph in that font in those position have shape in non ASCII. If font author *lie* to browser, in the information which encoded in the font, there are no thing the browser (or browser developer) can do. [EMAIL PROTECTED] wrote: On Thu, 4 Jan 2001, sreekant wrote: font face="Tikkana"A B /font is being shown as some telugu characters. That's basically a browser bug, though some people have seen it as a method of extending character repertoire. It has absolutely nothing to do with Unicode. For an explanation of the fallacy, see http://ppewww.ph.gla.ac.uk/%7eflavell/charset/fontface-harmful.html http://babel.alis.com/web_ml/html/fontface.html
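[The C1-zone behaviour described in the message above can be illustrated with Python's codecs; this is an after-the-fact sketch added by the editor, not part of the thread. Decoding a C1 byte as windows-1252 remaps it to a different Unicode character, and the five bytes windows-1252 leaves undefined have no mapping at all, which mirrors how some characters in a custom encoding become undisplayable after such a remapping.]

```python
# Byte 0x95 passed through as-is (latin-1) stays a C1 control code,
# but under windows-1252 it is remapped to U+2022 BULLET.
print(hex(ord(b"\x95".decode("latin-1"))))  # 0x95, a C1 control
print(hex(ord(b"\x95".decode("cp1252"))))   # 0x2022, BULLET

# Five C1 bytes are undefined in windows-1252, so a custom encoding
# using them cannot survive a cp1252-based remapping at all.
for byte in (0x81, 0x8D, 0x8F, 0x90, 0x9D):
    try:
        bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        print(f"0x{byte:02X}: undefined in cp1252")
```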
Re: Mixing languages on a Web site
Hi Mike, To use Microsoft's Global IME for Japanese on NT4, there is one very important step you need to do: install NT4's Japanese support. There are a few articles about it in the Microsoft Knowledge Base; I have the URLs at work, but don't have them with me at the moment. On the Win NT4 CD-ROM there is a folder called langpacks. Use Windows Explorer to look in it; there is a file called japanese.inf. Right mouse click on it and a pop-up menu will appear; one of the menu items is 'Install'. Select this and it will install NT4's Japanese language support. This should be installed before the Global IME for Japanese, otherwise it will not work ... at least that's the story. ciao Andrew Andrew Cunningham [EMAIL PROTECTED] - Original Message - From: Ayers, Mike [EMAIL PROTECTED] To: Unicode List [EMAIL PROTECTED] Sent: Saturday, 1 July 2000 3:49 Subject: RE: Mixing languages on a Web site From: Michael (michka) Kaplan [mailto:[EMAIL PROTECTED]] Sent: Friday, June 30, 2000 4:28 AM To prove #4 will work, see http://www.trigeminal.com/samples/provincial.html Along with 102 other languages, this page includes both Japanese and Turkish. UTF-8 is what makes that possible. michka I checked it out, and with IE5 I can now view almost all of it. There are 5 lines that I cannot view and for which there are no fonts available, but otherwise great. Netscape does not show nearly as many (hints?). On a possibly entirely unrelated subject, I downloaded Microsoft's IMEs for Chinese and Japanese, hoping to learn to use them. However, I cannot figure out how to enable them, and can't locate any helpful info on Microsoft's site. I am running NT4. Any tips greatly appreciated. Thanks, /|/|ike