Re: Encoding italic

2019-01-31 Thread Andrew Cunningham via Unicode
On Thursday, 31 January 2019, James Kass via Unicode 
wrote:
>
>
> As for use of other variant letter forms enabled by the math
> alphanumerics, the situation exists.  It’s an interesting phenomenon which
> is sometimes worthy of comment and relates to this thread because the math
> alphanumerics include italics.  One of the web pages referring to
> third-party input tools calls the practice “super cool Unicode text magic”.
>
>
Not all devices can render such text, though. Many Android handsets on the
market do not have a sufficiently recent version of Android, and so lack
system fonts that can render such existing usage.
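Incidentally, because the math alphanumerics carry compatibility
decompositions, NFKC normalization folds such text back to plain letters.
A quick Python sketch:

    import unicodedata

    # Mathematical italic letters (e.g. U+1D456 MATHEMATICAL ITALIC SMALL I)
    # decompose to plain ASCII, so NFKC strips the "super cool Unicode text
    # magic" entirely.
    pseudo_italic = "\U0001D456\U0001D461\U0001D44E\U0001D459\U0001D456\U0001D450"
    print(unicodedata.normalize("NFKC", pseudo_italic))  # -> "italic"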




-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Andrew Cunningham via Unicode
On Sunday, 27 January 2019, Asmus Freytag via Unicode 
wrote:

>
> Choice of quotation marks is language-based and for novels, many times
> there are
> additional conventions that may differ by publisher.
>
> Wonder why the publisher is forcing single quotes on them
>

In theory quotation marks are language-based, but many languages have had
the punctuation and typographic conventions of colonial languages imposed
on them, even when those are not the best choice.

And publishers are following established patterns. The publishers that care
about the language do try to distinguish or refine these characters
typographically.

Andrew


-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Encoding italic

2019-01-25 Thread Andrew Cunningham via Unicode
Assuming some mechanism for italics is added to Unicode, when converting
between the new plain text and HTML there is insufficient information to
convert correctly: many elements may carry italic styling, and there would
be no metainformation in Unicode to indicate the appropriate HTML
element.
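As a sketch of that information loss, suppose italics were marked with a
variation selector after each italicised character (VS1 here is purely
hypothetical; no such convention exists). A converter can only ever pick
one element:

    import re

    ITALIC_VS = "\uFE00"  # VS1 as a hypothetical "italic" selector, illustration only

    def to_html(text: str) -> str:
        # Every VS-marked run is wrapped in <i>; the plain text carries no
        # hint that <em>, <cite>, <var> or <dfn> might have been intended.
        runs = re.split(f"((?:.{ITALIC_VS})+)", text)
        return "".join(
            "<i>" + run.replace(ITALIC_VS, "") + "</i>"
            if run.endswith(ITALIC_VS) else run
            for run in runs
        )

    marked = f"H{ITALIC_VS}o{ITALIC_VS}m{ITALIC_VS}e{ITALIC_VS} is where the heart is."
    print(to_html(marked))  # -> <i>Home</i> is where the heart is.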




On Friday, 25 January 2019, wjgo_10...@btinternet.com via Unicode <
unicode@unicode.org> wrote:

> Asmus Freytag wrote;
>
> Other schemes, like a VS per code point, also suffer from being different
>> in philosophy from "standard" rich text approaches. Best would be as
>> standard extension to all the messaging systems (e.g. a common markdown
>> language, supported by UI). A./
>>
>
> Yet that claim of what would be best would be stateful and statefulness is
> the very thing that Unicode seeks to avoid.
>
> Plain text is the basic system and a Variation Selector mechanism after
> each character that is to become italicized is not stateful and can be
> implemented using existing OpenType technology.
>
> If an organization chooses to develop and use a rich text format then that
> is a matter for that organization and any changing of formatting of how
> italics are done when converting between plain text and rich text is the
> responsibility of the organization that introduces its rich text format.
>
> Twitter was just an example that someone introduced along the way, it was
> not the original request.
>
> Also this is not only about messaging. Of primary importance is the
> conservation of texts in plain text format, for example, where a printed
> book has one word italicized in a sentence and the text is being
> transcribed into a computer.
>
> William Overington
> Friday 25 January 2019
>
>

-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Encoding italic (was: A last missing link)

2019-01-16 Thread Andrew Cunningham via Unicode
HI Victor, an off list reply. The contents are just random thoughts sparked
by an interesting conversation.

On Wed, 16 Jan 2019 at 22:44, Victor Gaultney via Unicode <
unicode@unicode.org> wrote:

>
> - It finally, and conclusively, would end the decades of the mess in HTML
> that surrounds <i> and <em>.
>

I am not sure that would fix the issue; more likely it would compound it,
making it even blurrier what the semantic purpose is. HTML5 makes both
<i> and <em> semantic, and by definition the style of those elements is
not necessarily italic. The rendering of <em>, for instance, would be
script-dependent, and <i> may be partially script-dependent when another
appropriate semantic tag is missing. A character/encoding-level distinction
is just going to compound the mess.

And then there are all the other script specific typographic / typesetting
conventions that should also be considered.


> My main point in suggesting that Unicode needs these characters is that
> italic has been used to indicate specific meaning - this text is somehow
> special - for over 400 years, and that content should be preserved in plain
> text.
>
>
Underlining, bold text, interletter spacing, colour change, and font-style
change are all used to apply meaning in various ways. I am not sure why
italic is special in this sense. Additionally, without encoding the meaning
of italic, all you know is that the text is italic, not what convention or
semantic meaning lies behind it.

And I am curious about your thoughts: if we distinguish italic in Unicode
and encode some way of specifying italic text, wouldn't it make more sense
to do away with italic fonts altogether and just roll the italic glyphs
into the regular font?

In theory, changing italic from the stylistic choice it currently is to an
encoding/character-level semantic is a paradigm shift. We don't have
separate fonts for variation selectors or any other mechanism in Unicode,
and it would seem to make sense to roll character glyph variation into a
single font, and potentially exclude italicisation from being a viable axis
in a variable font. Just speculation on my part.

To clarify, I am neither for nor against encoding italics. But so far there
doesn't seem to be a robust case for it. If it were introduced, I would
prefer a system that was more inclusive of all scripts, giving proper
analysis of the typesetting and typographic conventions in each script and
well-founded decisions on which should be encoded. Cherry-picking one
feature relevant to a small set of scripts seems to be a problematic path.

I have enough trouble with ordered and unordered lists and list markers in
HTML without expanding the italics mess in HTML.

-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Private Use areas

2018-08-21 Thread Andrew Cunningham via Unicode
On Wednesday, 22 August 2018, Mark E. Shoulson via Unicode <
unicode@unicode.org> wrote:

> On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote:
>
>>
>>
> Best we can do is shout loudly at OpenType tables and hope to cram in
> behavior (or at least appearance, which is more likely all we can get) that
> vaguely resembles what we're after.  And that's not SO awful, given what
> we're dealing with.
>
>>
>>
At the moment I am looking at implementing three unencoded Arabic
characters in the PUA.

For the foreseeable future OpenType is a non-starter, so I will look at
implementing them in Graphite tables in a font.

Andrew



-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Unicode 11 Georgian uppercase vs. fonts

2018-07-28 Thread Andrew Cunningham via Unicode
On Saturday, 28 July 2018, Asmus Freytag (c) via Unicode <
unicode@unicode.org> wrote:

>
>
> A real plan would have consisted of documentation suggesting how to roll
> out library update, whether to change/augment CSS styling keywords, what
> types of locale adaptations of case transforms should be implemented, how
> to get OSs to deliver fonts to people, etc., etc..
>
>
It can be dealt with in various ways in CSS as it is. The question is why
the designer chose to apply capitals, the purpose behind it, and how that
should be appropriately internationalised. For instance, for Cherokee you
may want to lowercase instead of uppercase, assuming this is wise. For
other languages you may want to embolden text, italicise it, underline it,
change colour, change intercharacter or interword spacing, etc.
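The default case transforms are already script-aware; the per-language
design decision is which transform, if any, to apply. A quick Python sketch
(Python 3.7+, whose Unicode data includes the new Georgian casing):

    # Default case transforms are script-aware; the per-language decision
    # is which transform, if any, the design should apply.
    print("ქართული".upper())  # Georgian Mkhedruli -> Mtavruli (Unicode 11)
    print("ᏣᎳᎩ".lower())      # Cherokee: lowercasing may serve the design better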

Ultimately it's a question of whether you want a single UI design or a
language-responsive UI design.




-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Northern Khmer on iPhone

2017-02-28 Thread Andrew Cunningham
On iOS it is fairly straightforward to arrange solutions for minority
languages.

Android has always been a challenge.

Older versions of Android might not have rendering support for the script.

Most handset manufacturers don't allow users to change fonts.

A couple of handset manufacturers allow users to change between
preinstalled fonts, and in some cases allow installation of fonts via
licensed solutions like FlipFont.

There are a few apps available that allow you to install additional fonts.
But changing the fonts is still device dependent unless you jailbreak the
handset.

If you want to discuss specific devices or approaches easiest to do it
offlist.

Andrew

On Wednesday, 1 March 2017, Richard Wordingham <
richard.wording...@ntlworld.com> wrote:
> On Tue, 28 Feb 2017 23:09:05 +0100
> Philippe Verdy <verd...@wanadoo.fr> wrote:
>
>> ... default stock fonts will be enough if they fit the basic
>> need for the language users want to use and will be rarely updated,
>> unless they buy a new phone with a newer version of the OS featuring
>> better stock fonts.
>
> I'm not sure that that applies to minority languages.  I'm currently
> exploring the hypothesis that there is very little in the way of
> Northern Khmer on the web in the Thai script because input methods or
> rendering prevent or penalise (e.g. by dotted circles) its use.  I am
> therefore interested in how compatible it is with mobile phones.
> Chatting with family and childhood friends is one place where using
> one's mother tongue might make good sense.
>
> Richard.
>

-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Possible to add new precomposed characters for local language in Togo?

2016-11-04 Thread Andrew Cunningham
Thanks Doug,

That would be welcome.


On Saturday, 5 November 2016, Doug Ewell <d...@ewellic.org> wrote:
> I am seeking technical information from a Microsoft team member.
> Hopefully we will soon have definitive answers to replace all the
> controversy.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>

-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: font-encoded hacks

2016-10-07 Thread Andrew Cunningham
Hi Neil,

I tend to prefer referring to them as pseudo-Unicode solutions, rather than
hacked fonts or ad hoc fonts, and differentiating them from legacy or 8-bit
solutions.

My preferred approach would be to treat them as a separate encoding, but
I doubt that will happen.
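As a sketch of what that would mean in practice — a str-to-str converter
with a deliberately fake table, since a real Zawgyi conversion needs
contextual reordering rules rather than a simple codepoint map:

    # Placeholder table only: real pseudo-Unicode conversion is contextual,
    # so this one-to-one map is illustrative, not usable Zawgyi data.
    PSEUDO_TO_UNICODE = {
        "\u1060": "\u1039\u1000",  # hypothetical sample entry
    }

    def pseudo_to_unicode(text: str) -> str:
        """Treat the font hack as its own 'encoding' and transcode from it."""
        return "".join(PSEUDO_TO_UNICODE.get(ch, ch) for ch in text)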

It doesn't help that a mobile device I purchase in Australia will ship
with a Unicode font installed, while the same device and model may ship
with a non-Unicode font installed in Myanmar and potentially other parts of
SE Asia.

Andrew

On 7 Oct 2016 22:04, "Neil Harris"  wrote:

> On 07/10/16 07:42, Denis Jacquerye wrote:
>
>> In may case people resort to these hacks because it is an easier short
>> term
>> solution. All they have to do is use a specific font. They don't have to
>> switch or find and install a keyboard layout and they don't have to
>> upgrade
>> to an OS that supports their script with Unicode properly. Because of
>> these
>> sort term solutions it's hard for a switch to Unicode to gain proper
>> momentum. Unfortunately, not everybody sees the long term benefit, or
>> often
>> they see it but cannot do it practically.
>>
>> Too often Unicode compliant fonts or keyboard layouts have been lacking or
>> at least have taken much longer to be implemented.
>> One could wonder if a technical group for keyboards layouts would help
>> this
>> process.
>>
>
> What might also help is a reconceptualization of these hacks as being in
> effect non-standard character encodings: the existing software
> infrastructure for handling charsets could then be co-opted to convert them
> to (and possibly from) Unicode if desired.
>
> Neil
>
>


Re: font-encoded hacks

2016-10-07 Thread Andrew Cunningham
Hi Mark,

The converters would be interesting to see, and would be personally useful
to me.

But the type of keyboard layouts and input frameworks reflected in CLDR
has limited bearing on issues related to the uptake of Unicode for the
Myanmar script.

Andrew

On 7 Oct 2016 17:54, "Mark Davis ☕️" <m...@macchiato.com> wrote:

> We do provide data for keyboard mappings in CLDR (http://unicode.org/cldr/
> charts/latest/keyboards/index.html). There are some further pieces we
> need to put into place.
>
>1. Provide a bulk uploader that applies our sanity-checking tests for
>a proposed keyboard mapping, and provides real-time feedback to users about
>the problems they need to fix.
>2. Provide code that converts from the CLDR format into the major
>platforms' formats (we have the reverse direction already).
>3. (Optional) Prettier charts!
>
>
> Mark
>
> On Fri, Oct 7, 2016 at 8:42 AM, Denis Jacquerye <moy...@gmail.com> wrote:
>
>> In may case people resort to these hacks because it is an easier short
>> term solution. All they have to do is use a specific font. They don't have
>> to switch or find and install a keyboard layout and they don't have to
>> upgrade to an OS that supports their script with Unicode properly. Because
>> of these sort term solutions it's hard for a switch to Unicode to gain
>> proper momentum. Unfortunately, not everybody sees the long term benefit,
>> or often they see it but cannot do it practically.
>>
>> Too often Unicode compliant fonts or keyboard layouts have been lacking
>> or at least have taken much longer to be implemented.
>> One could wonder if a technical group for keyboards layouts would help
>> this process.
>>
>> On Fri, Oct 7, 2016, 07:12 Martin J. Dürst <due...@it.aoyama.ac.jp>
>> wrote:
>>
>>> Hello Andrew,
>>>
>>> On 2016/10/07 11:11, Andrew Cunningham wrote:
>>> > Considering the mess that adhoc fonts create. What is the best way
>>> forward?
>>>
>>> That's very clear: Use Unicode.
>>>
>>> > Zwekabin, Mon, Zawgyi, and Zawgyi-Tai and their ilk?
>>> >
>>> > Most governemt translations I am seeing in Australia for Burmese are in
>>> > Zawgyi, while most of the Sgaw Karen tramslations are routinely in
>>> legacy
>>> > 8-bit fonts.
>>>
>>> Why don't you tell the Australian government?
>>>
>>> Regards,   Martin.
>>>
>>
>


Re: font-encoded hacks

2016-10-07 Thread Andrew Cunningham
Hi Denis,

In some ways, it was easier. But looking at each language, the issues seem
to have a slightly different slant.

Sgaw Karen is interesting in comparison to Burmese. There is some use of
the hacked Zwekabin font by bloggers, but most content, and key media,
still use 8-bit fonts; there is little use of Unicode.

The lack of uptake of Unicode fonts seems to lie in the fact that the
default rendering for most Myanmar-script fonts is Burmese. If Sgaw Karen,
etc. are supported, it is via locl features. If Sgaw Karen users are using
such a font in software where they can't control the necessary OpenType
features, or don't know that they can and need to, you will eventually get
a perception that their language isn't supported.
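One way to see what a font actually declares is to list the language
systems in its GSUB table — a sketch with fontTools, where "Padauk.ttf" is
just a placeholder filename and 'KSW ' is the OpenType tag for Sgaw Karen:

    from fontTools.ttLib import TTFont

    # List the language systems declared in a font's GSUB table; Sgaw Karen
    # support would show up under the 'KSW ' language system tag.
    font = TTFont("Padauk.ttf")  # placeholder filename
    for rec in font["GSUB"].table.ScriptList.ScriptRecord:
        tags = [lsr.LangSysTag for lsr in rec.Script.LangSysRecord]
        print(rec.ScriptTag,
              "default" if rec.Script.DefaultLangSys else "no default",
              tags)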

There are font developers among the Burmese, Mon, and Shan ethnic groups
developing Unicode fonts tailored for their needs.

The Burmese situation is quite different, and a topic that I have discussed
often with Burmese colleagues. I have my theories, but the current
resurgence of Zawgyi very much depends on the ability of mobile devices to
render Myanmar Unicode, and the choices telcos and handset manufacturers
make regarding system fonts.

Regarding keyboards, it is interesting to compare Khmer and Burmese. Uptake
of Unicode was earlier and quicker for Khmer. When Khmer keyboards were
developed, the Khmer developers chose to live with the severe limitations
of system level input frameworks. It is only this year that I have started
to see truly innovative research into what a Khmer input system should be.

Burmese Unicode developers on the other hand were never satisfied with
those limitations, and various developers looked into alternatives.

Andrew

On 7 Oct 2016 17:42, "Denis Jacquerye" <moy...@gmail.com> wrote:
>
> In may case people resort to these hacks because it is an easier short
term solution. All they have to do is use a specific font. They don't have
to switch or find and install a keyboard layout and they don't have to
upgrade to an OS that supports their script with Unicode properly. Because
of these sort term solutions it's hard for a switch to Unicode to gain
proper momentum. Unfortunately, not everybody sees the long term benefit,
or often they see it but cannot do it practically.
>
> Too often Unicode compliant fonts or keyboard layouts have been lacking
or at least have taken much longer to be implemented.
> One could wonder if a technical group for keyboards layouts would help
this process.
>
>
> On Fri, Oct 7, 2016, 07:12 Martin J. Dürst <due...@it.aoyama.ac.jp> wrote:
>>
>> Hello Andrew,
>>
>> On 2016/10/07 11:11, Andrew Cunningham wrote:
>> > Considering the mess that adhoc fonts create. What is the best way
forward?
>>
>> That's very clear: Use Unicode.
>>
>> > Zwekabin, Mon, Zawgyi, and Zawgyi-Tai and their ilk?
>> >
>> > Most governemt translations I am seeing in Australia for Burmese are in
>> > Zawgyi, while most of the Sgaw Karen tramslations are routinely in
legacy
>> > 8-bit fonts.
>>
>> Why don't you tell the Australian government?
>>
>> Regards,   Martin.


Re: font-encoded hacks

2016-10-07 Thread Andrew Cunningham
On 7 Oct 2016 17:08, "Martin J. Dürst" <due...@it.aoyama.ac.jp> wrote:
>
> Hello Andrew,
>
>
> On 2016/10/07 11:11, Andrew Cunningham wrote:
>>
>> Considering the mess that adhoc fonts create. What is the best way
forward?
>
>
> That's very clear: Use Unicode.
>

LOL, thanks Martin. That has been my position for a long time.

>
>> Zwekabin, Mon, Zawgyi, and Zawgyi-Tai and their ilk?
>>
>> Most governemt translations I am seeing in Australia for Burmese are in
>> Zawgyi, while most of the Sgaw Karen tramslations are routinely in legacy
>> 8-bit fonts.
>
>
> Why don't you tell the Australian government?

Easier to tell the state governments than the Federal government. But it
is something I am working on.

>
> Regards,   Martin.


font-encoded hacks

2016-10-06 Thread Andrew Cunningham
Considering the mess that ad hoc fonts create, what is the best way forward
for Zwekabin, Mon, Zawgyi, Zawgyi-Tai, and their ilk?

Most government translations I am seeing in Australia for Burmese are in
Zawgyi, while most of the Sgaw Karen translations are routinely in legacy
8-bit fonts.

Andrew

On Friday, 7 October 2016, Ken Whistler <kenwhist...@att.net> wrote:
> By the way, the biggest ongoing problem we deal with here is the
continuing urge to proliferate font-encoded hacks for particular languages
and writing systems. The text interchange problems that such schemes pose
on an ongoing basis far far outweigh issues like what to do with a Shibuya
109 emoji, imo.



-- 
Andrew Cunningham
lang.supp...@gmail.com


Myanmar Scripts and Languages FAQ

2016-09-26 Thread Andrew Cunningham
Hi,

I just finished looking at the Myanmar Scripts and Languages FAQ.

A few comments.

Most of the questions and answers are specific to the Myanmar (Burmese)
language.

When discussing the ad hoc fonts, it would be useful to indicate that the
ones already mentioned are Burmese-specific, and that each of the major
languages has its own ad hoc font(s): Mon, Shan, Sgaw Karen, and Western
Pwo Karen all have their own specific fonts.

It is also worth warning that most detectors and converters are
language-specific. If your data has content in a range of Myanmar-script
languages, the results from such detectors and converters will be less than ideal.
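For example, a minimal sketch using Google's myanmar-tools package (a later
arrival, and trained specifically on Burmese), which illustrates the
limitation:

    from myanmartools import ZawgyiDetector

    detector = ZawgyiDetector()
    # Probability in [0, 1] that the input is Zawgyi-encoded; the model is
    # trained on Burmese, so scores for other Myanmar-script languages are
    # unreliable.
    print(detector.get_zawgyi_probability("မြန်မာ"))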

Andrew


RE: Myanmar character set

2016-08-13 Thread Andrew Cunningham
Hi Andrew,

I assume the issue is with the mym2 shaper?

Andrew C

On 13 Aug 2016 5:02 am, "Andrew Glass" <andrew.gl...@microsoft.com> wrote:
>
> Hi Taylor and Andrew,
>
>
>
> This is a known issue with the Myanmar engine on Windows. We are tracking
the issue, but don’t have a date for the fix at this time.
>
>
>
> Cheers,
>
>
>
> Andrew
>
>
>
> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Andrew
Cunningham
> Sent: Thursday, August 11, 2016 8:51 PM
> To: Taylor Canning <taylorcann...@outlook.com>
> Cc: Unicode Mailing List <unicode@unicode.org>
>
> Subject: Re: Myanmar character set
>
>
>
> Hi Taylor,
>
> This should work fine in theory. Are you using a mymr or mym2 style
opentype font?
>
> What applications, operating system and fonts are you using?
>
> Andrew
>
>
>
> On 12 Aug 2016 12:55 pm, "Taylor Canning" <taylorcann...@outlook.com>
wrote:
>>
>> Hi there, has anyone had any issues with the Myanmar character set – i
have raised an issue recently where the combination ၣ and ် does not
combine correctly to make ၣ် on my windows devices. It used to work just
fine. It is am extremely common tonal marker and is a big issue for anyone
who types the S’Gaw Karen language, which is a lot of people !
>>
>> Thanks, Taylor
>>
>>
>>
>> Sent from my Windows 10 phone
>>
>>


Re: Myanmar character set

2016-08-11 Thread Andrew Cunningham
Hi Taylor,

This should work fine in theory. Are you using a mymr- or mym2-style
OpenType font?

What applications, operating system and fonts are you using?

Andrew

On 12 Aug 2016 12:55 pm, "Taylor Canning"  wrote:

> Hi there, has anyone had any issues with the Myanmar character set – i
> have raised an issue recently where the combination ၣ and ် does not
> combine correctly to make ၣ် on my windows devices. It used to work just
> fine. It is am extremely common tonal marker and is a big issue for anyone
> who types the S’Gaw Karen language, which is a lot of people !
>
> Thanks, Taylor
>
>
>
> Sent from my Windows 10 phone
>
>
>


Re: Mende Kikakui Number 10

2016-06-11 Thread Andrew Cunningham
Marcel, it isn't so much that the conversation was exhausted, rather that
the original question has been sufficiently answered.

A.



On Sunday, 12 June 2016, Marcel Schneider <charupd...@orange.fr> wrote:
> On Sat, 11 Jun 2016 12:25:39 +0200, Philippe Verdy wrote:
>>
>> Exactly, Unicode should not create its own logic about scripts or
numeral systems.
>>
>> All looks like the encoding of 10 as a pair (ONE+combining TENS) was a
severe
>> conceptual error that could have been avoided by NOT encoding "TENS" as
combining
>> but as a regular number/digit TEN usable isolately, and forming a
contectual
>> ligature with a previous digit from TWO to NINE.
>>
>> The encoding of 10 as (ONE+TENS) is superfluously needing an artificial
leading
>> ONE. This is purely an Unicode construction, foreign to the logic of the
numeral
>> system.
>>
>
>
> Seeing the discussion exhausted, I join my hope to Philippe Verdyʼs,
> and reinforce by quoting Asmus Freytag on backcompat vs enhancement,
> before bringing another concern:
>
> «If you add a feature to match behavior somewhere else,
> it rarely pays to make that perform "better", because
> it just means it's now different and no longer matches.
> The exception is a feature for which you can establish
> unambiguously that there is a metric of correctness or
> a widely (universally?) shared expectation by users
> as to the ideal behavior. In that case, being compatible
> with a broken feature (or a random implementation of one)
> may in fact be counter productive.»
>
> http://www.unicode.org/mail-arch/unicode-ml/y2016-m03/0109.html
>
> Being bound with stability guarantees, Unicode could eventually add a
_new_
>
> *1E8D7 MENDE KIKAKUI NUMBER TEN
>
> Best wishes,
>
> Marcel
>
>

-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Mende Kikakui Number 10

2016-06-10 Thread Andrew Cunningham
rlig is the quickest and easiest approach, but in theory it could be done
in other, more complicated ways.

There are currently no OpenType implementations that I know of, and no
known shapers. rlig hopefully works with generic shapers, but which OT
features will be expected by a script-specific shaper is still an unknown.
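For what it's worth, a sketch of adding such a required ligature with
fontTools feaLib; the glyph names and font file are hypothetical:

    from fontTools.ttLib import TTFont
    from fontTools.feaLib.builder import addOpenTypeFeaturesFromString

    # Required-ligature rule composing ONE + COMBINING TENS into a single
    # PU-shaped glyph for 10; all glyph names here are hypothetical.
    FEA = """
    feature rlig {
        sub mende.one mende.tens by mende.ten;
    } rlig;
    """

    font = TTFont("MendeKikakui.ttf")  # placeholder font file
    addOpenTypeFeaturesFromString(font, FEA)
    font.save("MendeKikakui-rlig.ttf")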

On Saturday, 11 June 2016, Michael Everson <ever...@evertype.com> wrote:
> On 11 Jun 2016, at 02:47, Andrew Cunningham <lang.supp...@gmail.com>
wrote:
>
>> It can be done via a ligature. It would have to be a required ligature.
Since other ligature types may or may not be enabled in various contexts.
And we dont want default substitution and mark positioning to generate a
non-ligature equivalent.
>
> Aren’t all of the number combinations required ligatures?
>
> Michael
>

-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Mende Kikakui Number 10

2016-06-10 Thread Andrew Cunningham
I am not suggesting it needs to be encoded. And I did suggest that using
the digit one and the symbol for tens was an option.

It can be done via a ligature, and it would have to be a required ligature,
since other ligature types may or may not be enabled in various contexts,
and we don't want default substitution and mark positioning to generate a
non-ligature equivalent.

And it will be interesting to see which rendering engines handle Kikakui.

A.


On Saturday, 11 June 2016, Ken Whistler <kenwhist...@att.net> wrote:
>
> On 6/10/2016 5:34 PM, Andrew Cunningham wrote:
>>
>> There are two few descriptions of the system for me to be definitive
 but the number ten seems hold a unique position within the numeral
system.
>
> As does the number 10 in every decimal numeral system. ;-)
>
> But that doesn't automatically require that it be *encoded* with a single
character. After all the number 10 in the European decimal numeral system
is also represented with a character sequence: <0031, 0030>.
>
> --Ken
>
>

-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Mende Kikakui Number 10

2016-06-10 Thread Andrew Cunningham
On Saturday, 11 June 2016, Ken Whistler <kenwhist...@att.net> wrote:
>
> I disagree about that. There is no reason to depart from the logic of the
system for this one value. Add one ligature glyph to your font for the
sequence for 10, and you're done.
>
>

There is the logic of how Kikakui numbers are encoded in Unicode, and there
is the internal logic of the numeral system itself. They are not
necessarily the same.

There are too few descriptions of the system for me to be definitive, but
the number ten seems to hold a unique position within the numeral system.

A.

-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Mende Kikakui Number 10

2016-06-10 Thread Andrew Cunningham
Hi Philippe,

On Saturday, 11 June 2016, Philippe Verdy <verd...@wanadoo.fr> wrote:

> OK, <ONE;combining TEENS> represents 11, but <ONE;combining TENS> is not
clearly represents 10, and the proposals do not exhibit 10 with the same
glyph as PU (even if it is based on it, in fact the combining TENS is a
small subscript glyph variant of letter/syllable PU intended to mark
digits).
>

The Mende Kikakui script displays a high degree of glyph variation, some of
it minor, some more substantive.

The syllable PU can be found as it appears in the charts, and it can be
found looking like the number 10. Other variations are also observed.

The ideal situation would have been to encode the number 10. But in its
absence, I guess ONE+TENS may be the approach, even though it seems less
than ideal.

A.

-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Mende Kikakui Number 10

2016-06-10 Thread Andrew Cunningham
The original proposals included a specific number 10 codepoint. I assume it
was removed, and its representation was to be generated by use of the
combining characters.

In the original proposal there was nothing corresponding to ONE+TENS;
instead there was a distinct number TEN. The glyph for number 10 was
identical to the glyph for the syllable PU.

A.



On Friday, 10 June 2016, Philippe Verdy <verd...@wanadoo.fr> wrote:
> I do not contest that about number 11, and it was not the question !
> The question was about number **10**:
> * ONE+TENS or ONE+TEENS ?
> This is NOT specified clearly in TUS Chapter 19 which speaks about
numbers 1-9 then 11-19 for TEENS, and TENS for numbers 20-99.
> The question is the same about 110,210,...,910:
> * (ONE..NINE)+HUNDREDS+ONE+TENS or (ONE..NINE)+HUNDREDS+ONE+TEENS ?
> For me it seems that both questions will repy with ONE+TENS, not
ONE+TEENS.
>
> 2016-06-10 9:00 GMT+02:00 Andrew Cunningham <lang.supp...@gmail.com>:
>>
>> Hi Phillipe,
>>
>> ONE+TEENS (1E8C7,1E8D0) is definitely the number 11
>>
>> A.
>>
>> On 10 Jun 2016 4:53 pm, "Philippe Verdy" <verd...@wanadoo.fr> wrote:
>>>
>>> Given that there's no digit for zero, you need to append combining
characters to digits 1-9 in order to multiply them by a base
10/100/1,000/10,000/100,000/1,000,000. The system is then additive. I don't
know how zero is represented. Note that for base 10, when the first digit
is 1 (i.e. for numbers 11-19), the combining character is not 1E8D1 (TENS)
but 1E8D0 (TEENS), i.e. the slash-like glyph. But the description says that
TEENS is only for numbers 11-19, not for number 10.
>>> But I agree that there should be a reference in
http://www.unicode.org/charts/PDF/U1E800.pdf, to the description in
http://www.unicode.org/versions/Unicode8.0.0/ch19.pdf (section 19.8, pages
722-723) that would explain how to render 10 (add some rows in table 19-6
for the numbers 10/100/.../1,000,000).
>>> This leaves a hole in the description. I'm not sure that the glyph for
PU is exactly the glyph for 10. Or what is the appropriate sequence:
ONE+TENS (1E8C7,1E8D1) or ONE+TEENS (1E8C7,1E8D0) ? The description is
ambiguous, and probably both sequences should produce the equivalent glyph.
However the letter PU (when meaning number 10) looks more like the glyph
produced by ONE+TEN (1E8C7,1E8D1).
>>> Then how to represent zero ? Probably by a syllable or word meaning
"none" (don't know which it is), or by using European or Arabic digits (as
indicated in Chapter 19).
>>>
>>>
>>> 2016-06-10 8:15 GMT+02:00 Andrew Cunningham <lang.supp...@gmail.com>:
>>>>
>>>> Ok looking at issue again I guess the other alternative is to have a
discontiguous set of numbers. Represent 10 as U+1E8C7 U+1E8D1 and map it
within the font to the PU glyph.
>>>>
>>>> And hope that font developers don't create a glyph based on shape of
 U+1E8C7 and U+1E8D1,  but PU instead.
>>>>
>>>> Andrew
>>>>
>>>> On Friday, 10 June 2016, Andrew Cunningham <lang.supp...@gmail.com>
wrote:
>>>> > Hi,
>>>> > Currently I am doing some work on the Mende Kikakui script, and I
was wondering what the best way was to represent the number 10.
>>>> > In the early proposals for the script there was a glyph and
codepoint specifically for the number 10. When the model for Mende Kikakui
numbers was changed before the finalising of the code block, the number ten
was removed. But using existing digits and numbers we can produce 1-9 and
11 -> but we can not produce the number 10 from digits and numbers.
>>>> > The number ten uses the same glyph as  syllable PU U+1E88E.
>>>> > Should I use U+1E88E to represent both the number 10 and the
syllable PU?
>>>> > Andrew
>>>> >
>>>> > --
>>>> > Andrew Cunningham
>>>> > lang.supp...@gmail.com
>>>> >
>>>> >
>>>> >
>>>>
>>>> --
>>>> Andrew Cunningham
>>>> lang.supp...@gmail.com
>>>>
>>>>
>>>>
>>>
>
>

-- 
Andrew Cunningham
lang.supp...@gmail.com


Mende Kikakui Number 10

2016-06-10 Thread Andrew Cunningham
I'd agree that it is likely ONE+TENS.

Looking at the original proposal and articles on the number system, it was
originally 1-9, 10, 11-19, 20-99, etc., but became 1-9, 11-19, 20-99, etc.
during the deliberations on the model the numbers would follow.

A.

At least that's how I reconstruct it from the public documents I have seen.
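A sketch of that model in Python for values up to 99, taking 10 to be
ONE + COMBINING TENS as discussed in this thread (the function and its name
are mine, purely illustrative):

    DIGITS = {d: chr(0x1E8C7 + d - 1) for d in range(1, 10)}  # U+1E8C7..U+1E8CF
    TEENS = "\U0001E8D0"  # MENDE KIKAKUI COMBINING NUMBER TEENS
    TENS = "\U0001E8D1"   # MENDE KIKAKUI COMBINING NUMBER TENS

    def mende_number(n: int) -> str:
        """Compose a Mende Kikakui number, 1-99 only, per the additive model."""
        if not 1 <= n <= 99:
            raise ValueError("sketch only covers 1-99")
        if n < 10:
            return DIGITS[n]
        if 11 <= n <= 19:
            return DIGITS[n - 10] + TEENS
        tens, units = divmod(n, 10)
        out = DIGITS[tens] + TENS  # 10 itself comes out as ONE + TENS
        return out + DIGITS[units] if units else out

    print([hex(ord(c)) for c in mende_number(10)])  # ['0x1e8c7', '0x1e8d1']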

On Friday, 10 June 2016, Philippe Verdy <verd...@wanadoo.fr> wrote:
> I do not contest that about number 11, and it was not the question !
> The question was about number **10**:
> * ONE+TENS or ONE+TEENS ?
> This is NOT specified clearly in TUS Chapter 19 which speaks about
numbers 1-9 then 11-19 for TEENS, and TENS for numbers 20-99.
> The question is the same about 110,210,...,910:
> * (ONE..NINE)+HUNDREDS+ONE+TENS or (ONE..NINE)+HUNDREDS+ONE+TEENS ?
> For me it seems that both questions will repy with ONE+TENS, not
ONE+TEENS.
>
> 2016-06-10 9:00 GMT+02:00 Andrew Cunningham <lang.supp...@gmail.com>:
>>
>> Hi Phillipe,
>>
>> ONE+TEENS (1E8C7,1E8D0) is definitely the number 11
>>
>> A.
>>
>> On 10 Jun 2016 4:53 pm, "Philippe Verdy" <verd...@wanadoo.fr> wrote:
>>>
>>> Given that there's no digit for zero, you need to append combining
characters to digits 1-9 in order to multiply them by a base
10/100/1,000/10,000/100,000/1,000,000. The system is then additive. I don't
know how zero is represented. Note that for base 10, when the first digit
is 1 (i.e. for numbers 11-19), the combining character is not 1E8D1 (TENS)
but 1E8D0 (TEENS), i.e. the slash-like glyph. But the description says that
TEENS is only for numbers 11-19, not for number 10.
>>> But I agree that there should be a reference in
http://www.unicode.org/charts/PDF/U1E800.pdf, to the description in
http://www.unicode.org/versions/Unicode8.0.0/ch19.pdf (section 19.8, pages
722-723) that would explain how to render 10 (add some rows in table 19-6
for the numbers 10/100/.../1,000,000).
>>> This leaves a hole in the description. I'm not sure that the glyph for
PU is exactly the glyph for 10. Or what is the appropriate sequence:
ONE+TENS (1E8C7,1E8D1) or ONE+TEENS (1E8C7,1E8D0) ? The description is
ambiguous, and probably both sequences should produce the equivalent glyph.
However the letter PU (when meaning number 10) looks more like the glyph
produced by ONE+TEN (1E8C7,1E8D1).
>>> Then how to represent zero ? Probably by a syllable or word meaning
"none" (don't know which it is), or by using European or Arabic digits (as
indicated in Chapter 19).
>>>
>>>
>>> 2016-06-10 8:15 GMT+02:00 Andrew Cunningham <lang.supp...@gmail.com>:
>>>>
>>>> Ok looking at issue again I guess the other alternative is to have a
discontiguous set of numbers. Represent 10 as U+1E8C7 U+1E8D1 and map it
within the font to the PU glyph.
>>>>
>>>> And hope that font developers don't create a glyph based on shape of
 U+1E8C7 and U+1E8D1,  but PU instead.
>>>>
>>>> Andrew
>>>>
>>>> On Friday, 10 June 2016, Andrew Cunningham <lang.supp...@gmail.com>
wrote:
>>>> > Hi,
>>>> > Currently I am doing some work on the Mende Kikakui script, and I
was wondering what the best way was to represent the number 10.
>>>> > In the early proposals for the script there was a glyph and
codepoint specifically for the number 10. When the model for Mende Kikakui
numbers was changed before the finalising of the code block, the number ten
was removed. But using existing digits and numbers we can produce 1-9 and
11 -> but we can not produce the number 10 from digits and numbers.
>>>> > The number ten uses the same glyph as  syllable PU U+1E88E.
>>>> > Should I use U+1E88E to represent both the number 10 and the
syllable PU?
>>>> > Andrew
>>>> >
>>>> > --
>>>> > Andrew Cunningham
>>>> > lang.supp...@gmail.com
>>>> >
>>>> >
>>>> >
>>>>
>>>> --
>>>> Andrew Cunningham
>>>> lang.supp...@gmail.com
>>>>
>>>>
>>>>
>>>
>
>

-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Mende Kikakui Number 10

2016-06-10 Thread Andrew Cunningham
Hi Philippe,

ONE+TEENS (1E8C7,1E8D0) is definitely the number 11

A.
On 10 Jun 2016 4:53 pm, "Philippe Verdy" <verd...@wanadoo.fr> wrote:

> Given that there's no digit for zero, you need to append combining
> characters to digits 1-9 in order to multiply them by a base
> 10/100/1,000/10,000/100,000/1,000,000. The system is then additive. I don't
> know how zero is represented. Note that for base 10, when the first digit
> is 1 (i.e. for numbers 11-19), the combining character is not 1E8D1 (TENS)
> but 1E8D0 (TEENS), i.e. the slash-like glyph. But the description says that
> TEENS is only for numbers 11-19, not for number 10.
>
> But I agree that there should be a reference in
> http://www.unicode.org/charts/PDF/U1E800.pdf, to the description in
> http://www.unicode.org/versions/Unicode8.0.0/ch19.pdf (section 19.8,
> pages 722-723) that would explain how to render 10 (add some rows in table
> 19-6 for the numbers 10/100/.../1,000,000).
>
> This leaves a hole in the description. I'm not sure that the glyph for PU
> is exactly the glyph for 10. Or what is the appropriate sequence:
> ONE+TENS (1E8C7,1E8D1) or ONE+TEENS (1E8C7,1E8D0) ? The description is
> ambiguous, and probably both sequences should produce the equivalent glyph.
> However the letter PU (when meaning number 10) looks more like the glyph
> produced by ONE+TEN (1E8C7,1E8D1).
>
> Then how to represent zero ? Probably by a syllable or word meaning "none"
> (don't know which it is), or by using European or Arabic digits (as
> indicated in Chapter 19).
>
>
>
> 2016-06-10 8:15 GMT+02:00 Andrew Cunningham <lang.supp...@gmail.com>:
>
>> Ok looking at issue again I guess the other alternative is to have a
>> discontiguous set of numbers. Represent 10 as U+1E8C7 U+1E8D1 and map it
>> within the font to the PU glyph.
>>
>> And hope that font developers don't create a glyph based on shape of
>>  U+1E8C7 and U+1E8D1,  but PU instead.
>>
>> Andrew
>>
>>
>> On Friday, 10 June 2016, Andrew Cunningham <lang.supp...@gmail.com>
>> wrote:
>> > Hi,
>> > Currently I am doing some work on the Mende Kikakui script, and I was
>> wondering what the best way was to represent the number 10.
>> > In the early proposals for the script there was a glyph and codepoint
>> specifically for the number 10. When the model for Mende Kikakui numbers
>> was changed before the finalising of the code block, the number ten was
>> removed. But using existing digits and numbers we can produce 1-9 and 11 ->
>> but we can not produce the number 10 from digits and numbers.
>> > The number ten uses the same glyph as  syllable PU U+1E88E.
>> > Should I use U+1E88E to represent both the number 10 and the syllable
>> PU?
>> > Andrew
>> >
>> > --
>> > Andrew Cunningham
>> > lang.supp...@gmail.com
>> >
>> >
>> >
>>
>> --
>> Andrew Cunningham
>> lang.supp...@gmail.com
>>
>>
>>
>>
>


Re: Mende Kikakui Number 10

2016-06-10 Thread Andrew Cunningham
OK, looking at the issue again, I guess the other alternative is to have a
discontiguous set of numbers: represent 10 as U+1E8C7 U+1E8D1 and map it
within the font to the PU glyph.

And hope that font developers don't create a glyph based on the shapes of
U+1E8C7 and U+1E8D1, but use the PU shape instead.

Andrew

On Friday, 10 June 2016, Andrew Cunningham <lang.supp...@gmail.com> wrote:
> Hi,
> Currently I am doing some work on the Mende Kikakui script, and I was
wondering what the best way was to represent the number 10.
> In the early proposals for the script there was a glyph and codepoint
specifically for the number 10. When the model for Mende Kikakui numbers
was changed before the finalising of the code block, the number ten was
removed. But using existing digits and numbers we can produce 1-9 and 11 ->
but we can not produce the number 10 from digits and numbers.
> The number ten uses the same glyph as  syllable PU U+1E88E.
> Should I use U+1E88E to represent both the number 10 and the syllable PU?
> Andrew
>
> --
> Andrew Cunningham
> lang.supp...@gmail.com
>
>
>

-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Joined "ti" coded as "Ɵ" in PDF

2016-05-08 Thread Andrew Cunningham
The t_i instance will depend on the quality of the font. If it's a standard
ligature, there should be a glyph-to-codepoints assignment in the cmap
table or the ToUnicode mapping in the PDF file.

As David indicates, it isn't a Unicode issue.

It is an issue with the font used and/or the tools used.

PDFs have always been problematic, and that isn't going to change anytime
soon. For archivable or accessible PDFs, the person generating them should
select the best tools for the job, test the PDF, and then fix any problems.

Andrew

On Sunday, 8 May 2016, David Perry <hospe...@scholarsfonts.net> wrote:
> I agree that it's a real-world problem -- PDFs really should be
searchable -- but I do not see that it's a Unicode issue.  Unicode defines
the basic building blocks of LATIN SMALL LETTER T and LATIN SMALL LETTER I;
that's its job. Unicode is not responsible for font construction or
creating PDF software.  Furthermore, even if Unicode did want to do
something about it, I can't imagine what that could be -- aside perhaps
from using its bully pulpit to urge PDF creators and font creators to do
their jobs better.
>
> The fact that some PDF apps do not search and copy/paste text correctly
when unencoded characters are given PUA values has been known for many
years.  In the case of Calibri, I looked at the font (version installed on
my Win7 system) and found that the 'ti' ligature is named t_i, which
follows good naming practices, and it does not have a PUA assignment. Given
this, any well-constructed PDF app should be able to decode the ligature
correctly.
>
> David
>
> On 5/6/2016 11:49 AM, Steve Swales wrote:
>>
>> This discussion seems to have fizzled out, but I’m concerned that
>> there’s a real world problem here which is at least partially the
>> concern of the consortium, so let me stir the pot and see if there’s
>> still any meat left.
>>
>> On the current release of MacOS (including the developer beta, for
>> your reference, Peter), if you use Calibri font, for example, in any
>> app (e.g. notes), to write words with “ti” (like
>> internationalization), then press “Print" and “Open PDF in Preview”,
>> you get a PDF document with the joined “ti”.  Subsequently cutting and
>> pasting produces mojibake, and searching the document for words
>> with“ti” doesn’t work, as previously noted.
>>
>> I suppose we can look on this as purely a font handling/MacOS bug, but
>> I’m wondering if we should be providing accommodations or conveniences
>> in Unicode for it to work as desired.
>>
>> -steve
>>
>

-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Joined "ti" coded as "O" in PDF

2016-05-06 Thread Andrew Cunningham
My understanding is that searchability comes down to two factors:

1) the ToUnicode mapping, which maps glyphs in the font or subsetted font
to Unicode codepoints. Mappings take the form of one glyph to one
codepoint, or one glyph to two or more codepoints. Obviously any glyph that
doesn't resolve by default to a codepoint isn't in the mapping, nor does
the mapping handle glyphs that have been visually reordered during
rendering.

2) tagging the PDF and then using the ActualText attribute of each tag, as
in the sketch below.
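In PDF content-stream syntax the second factor looks roughly like this (a
minimal sketch; the glyph code <0123> and font resource /F1 are
placeholders):

    /Span << /ActualText (ti) >> BDC
      BT /F1 11 Tf <0123> Tj ET    % show the ligature glyph
    EMC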

So for some languages, with the right fonts, step one is all that is
needed, and this is fairly standard in PDF generation tools. The font
itself can impact on this, of course.

But for other languages you need to go to the second step.

With the languages I work with, some PDFs just require the visible text
layer; others also need ActualText. For a PDF to be searchable, the search
tools not only need to be able to handle the text layer but also the
ActualText attributes when necessary.

And that all comes down to decisions the tool developer has taken on how to
handle searching when both visible text layers and ActualText attributes
are present.

I have been told on accessibility lists that the PDF specs leave that
implementation detail to the developer, based on their requirements.

So in some cases you need to go the extra step and add ActualText. But you
also need to evaluate your search tools to ensure they do what you expect.

Andrew



On Saturday, 7 May 2016, Steve Swales <st...@swales.us> wrote:
> This discussion seems to have fizzled out, but I’m concerned that there’s
a real world problem here which is at least partially the concern of the
consortium, so let me stir the pot and see if there’s still any meat left.
> On the current release of MacOS (including the developer beta, for your
reference, Peter), if you use Calibri font, for example, in any app (e.g.
notes), to write words with “ti” (like internationalization), then press
“Print" and “Open PDF in Preview”, you get a PDF document with the joined
“ti”.  Subsequently cutting and pasting produces mojibake, and searching
the document for words with“ti” doesn’t work, as previously noted.
> I suppose we can look on this as purely a font handling/MacOS bug, but
I’m wondering if we should be providing accommodations or conveniences in
Unicode for it to work as desired.
> -steve
>
>
> On Mar 21, 2016, at 1:40 AM, Philippe Verdy <verd...@wanadoo.fr> wrote:
> Are those PDF supposed to be searchable inside of them ? For archival
purpose, the PDF are stored in their final form, and search is performed by
creating a database of descriptive metadata. Each time one wants formal
details, they have to read the original the way it was presented (many PDFs
are jsut scanned facsimiles of old documents which originately were not
even in numeric plain-text, they were printed or typewritten, frequently
they include graphics, handwritten signatures, stamped seals...)
> Being able to search plain-text inside a PDF is not the main objective
(and not the priority). The archival however is a top priority (and there's
no money to finance a numerisation and no human resource available to redo
this old work, if needed other contributors will recreate a plain-text
version, possibly with rich-text features, e.g. in Wikisource for old
documents that fall in the public domain).
> PDF/A-1a is meant only for creating new documents from a original
plain-text or rich-text document created with modern word-processing
applications. But this specification will frequently have to be broken, if
there's the need to include handwritten or supplementary elements
(signatures, seals...) whose source is not the original electronic document
but the printed paper over which the annotations were made: it is this
paper document, not the electronic document which is the official final
source (we've got some important legal paper whose original has other marks
including traces of beer or coffee, or partly burnt, the paper itself has
several alterations, but it is the original "as is", and for legal purpose
the only acceptable archival form as a PDF must ignore all the PDF/A-1a
constraints, not meant to represent originals accurately).
> 2016-03-20 20:52 GMT+01:00 Tom Gewecke <t...@bluesky.org>:
>>
>> > On Mar 20, 2016, at 12:24 PM, Asmus Freytag (t) <
asmus-...@ix.netcom.com> wrote:
>> >
>> > Usually, the archive feature pertains only to the fact that you can
reproduce the final form, not to being able to get at the correct source
(plain text backbone) for the document.
>>
>> My understanding is that PDF/A-1a is supposed to be searchable.
>>
>>
>>
>
>
>

-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Non-standard 8-bit fonts still in use

2016-04-30 Thread Andrew Cunningham
Don,

Most African communities I work with in the diaspora are using Unicode,
although 8-bit legacy content is still in use.

Probably the most use I see of legacy encodings is among the Karen
languages. Sgaw Karen users seem to still be using 8-bit fonts; there is a
pseudo-Unicode solution, but 8-bit fonts still dominate.

The problem for Karen is that the default rendering for Unicode fonts isn't
suitable, and locl support in applications has been lagging.

The ideal Unicode font for Myanmar script would have somewhere between 8
and 10 language systems. Cross-platform support is lacking; currently the
best approach is a separate font for each language system.

Andrew

On Friday, 16 October 2015, Don Osborn <d...@bisharat.net> wrote:
> I was surprised to learn of continued reference to and presumably use of
8-bit fonts modified two decades ago for the extended Latin alphabets of
Malian languages, and wondered if anyone has similar observations in other
countries. Or if there have been any recent studies of adoption of Unicode
fonts in the place of local 8-bit fonts for extended Latin (or non-Latin)
in local language computing.
>
> At various times in the past I have encountered the idea that local
languages with extended alphabets in Africa require special fonts (that
region being my main geographic area of experience with multilingual
computing), but assumed that this notion was fading away.
>
> See my recent blog post for a quick and by no means complete discussion
about this topic, which of course has to do with more than just the fonts
themselves:
http://niamey.blogspot.com/2015/10/the-secret-life-of-bambara-arial.html
>
> TIA for any feedback.
>
> Don Osborn
>
>
>

-- 
Andrew Cunningham
lang.supp...@gmail.com


Joined "ti" coded as "Ɵ" in PDF

2016-03-20 Thread Andrew Cunningham
Janusz,

It is all smoke and mirrors.

For English, you have to choose the right font: something simple, with no
advanced features. Disable advanced typographic features in the application
if you can.

Ensure the cmap table in the font is sufficiently comprehensive.

The issues Don raises still exist in PDF/A. You would need to make
fundamental changes to the PDF spec for it to work for any language.

For other languages, especially those in complex scripts, the situation is
more dire, especially when glyphs have been reordered.

The accepted workaround is ActualText. But you don't necessarily need
ActualText; it depends on the font and language.

But the rub is that it is left to implementors to decide if and when the
ActualText is used. All aspects of the document ecosystem need to be looked
at, including which tools can use ActualText instead of the visible text layer.

The PDF/UA spec is probably closer to the mark than the PDF/A spec.

But since most archives have no control over PDF production, authors' or
publishers' font selection, tools used, etc., working with PDFs can be
fairly hit and miss. For languages written in complex scripts, it's usually
a miss rather than a hit.

I rarely see ActualText in PDF files, even in those that need it.

Andrew

On Sunday, 20 March 2016, Janusz S. Bien <jsb...@mimuw.edu.pl> wrote:
> Quote/Cytat - Andrew Cunningham <lang.supp...@gmail.com> (Sun 20 Mar 2016
12:06:29 AM CET):
>
>> Hi Don,
>>
>> Latin is fine if you keep to simple well made fonts and avoid using more
>> sophisticated typographic features available in some fonts.
>>
>> Dumb it down typographically and it works fine. PDF, despite all the
>> current rhetoric coming from PDF software developers, is a preprint
format.
>> Not an archival format.
>
> What about PDF/A, ISO 19005-1:2005 Document Management – Electronic
document file format for long term preservation?
>
> Best regards
>
> Janusz
>
> --
> Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra
Lingwistyki Formalnej)
> Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics
Department)
> jsb...@uw.edu.pl, jsb...@mimuw.edu.pl,
http://fleksem.klf.uw.edu.pl/~jsbien/
>
>

-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Joined "ti" coded as "Ɵ" in PDF

2016-03-19 Thread Andrew Cunningham
Hi Don,

Latin is fine if you keep to simple well made fonts and avoid using more
sophisticated typographic features available in some fonts.

Dumb it down typographically and it works fine. PDF, despite all the
current rhetoric coming from PDF software developers, is a preprint format,
not an archival format.

The PDF format is less than ideal, but it is widely used, often in a way
the format was never really created for. There are alternatives that
preserve the text, but they have never really taken off (compared to PDF)
for various reasons.

Andrew







On Sunday, 20 March 2016, Don Osborn <d...@bisharat.net> wrote:
> Thanks Andrew, Looking at the issue of ToUnicode mapping you mention, why
in the 1-many mapping of ligatures (for fonts that have them) do the "many"
not simply consist of the characters ligated? Maybe that's too simple (my
understanding of the process is clearly inadequate).
>
> The "string of random ASCII characters" (per Leonardo) used in the
Identity H system for hanzi raise other questions: (1) How are the ASCII
characters interpreted as a 1-many sequence representing a hanzi rather
than just a series of 1-1 mappings of themselves? (2) Why not just use the
Unicode code point?
>
> The details may or may not be relevant to the list topic, but as a user
of documents in PDF format, I fail to see the benefit of such obscure
mappings. And as a creator of PDFs ("save as") looking at others' PDFs I've
just encountered with these mappings, I'm wondering how these concerned
about how the font & mapping results turned out as they did. It is certain
that the creators of the documents didn't intend results that would not be
searchable by normal text, but it seems possible their a particular font
choice with these ligatures unwittingly produced these results. If the
latter, the software at the very least should show a caveat about such
mappings when generating PDFs.
>
> Maybe it's unrealistic to expect a simple implication of Unicode in PDFs
(a topic we've discussed before but which I admit not fully grasping).
Recalling I once had some wild results copy/pasting from an N'Ko PDF, and
ended up having to obtain the .docx original to obtain text for insertion
in a blog posting. But while it's not unsurprising to encounter issues with
complex non-Latin scripts from PDFs, I'd gotten to expect predictability
when dealing with most Latin text.
>
> Don
>
>
>
> On 3/17/2016 7:34 PM, Andrew Cunningham wrote:
>
> There are a few things going on.
>
> In the first instance, it may be the font itself that is the source of
the problem.
>
> My understanding is that PDF files contain a sequence of glyphs. A PDF
file will contain a ToUnicode mapping between glyphs and codepoints. This
iseither a 1-1 mapping or a 1-many mapping. The 1-many mapping provides
support for ligatures and variation sequences.
>
> I assume it uses the data in the font's cmap table. If the ligature
isn't  mapped then you will have problems. I guess the problem could be
either the font or the font subsetting and embedding performed when the PDF
is generated.
>
> Although, it is worth noting that in opentype fonts not all glyphs will
have mappings in the cmap file.
>
> The remedy, is to extensively tag the PDF and add ActualText attributes
to the tags.
>
> But the PDF specs leave it up to the developer to decide what happens in
there is both a visible text layer and ActualText. So even in an ideal PDF,
tesults will vary from software to software when copying text or searching
a PDF.
>
> At least thatsmy current understanding.
>
> Andrew
>
> On 18 Mar 2016 7:47 am, "Don Osborn" <d...@bisharat.net> wrote:
>>
>> Thanks all for the feedback.
>>
>> Doug, It may well be my clipboard (running Windows 7 on this particular
laptop). Get same results pasting into Word and EmEditor.
>>
>> So, when I did a web search on "internaƟonal," as previously mentioned,
and come up with a lot of results (mostly PDFs), were those also a
consequence of many not fully Unicode compliant conversions by others?
>>
>> A web search on what you came up with - "InternaƟonal" - yielded many
more (82k+) results, again mostly PDFs, with terms like "interna onal"
(such as what Steve noted) and "interna<onal" and perhaps others (given the
nature of, or how Google interprets, the private use character?).
>>
>> Searching within the PDF document already mentioned, "international"
comes up with nothing (which is a major fail as far as usability).
Searching the PDF in a Firefox browser window, only "internaƟonal" finds
the occurrences of what displays as "international." However after
downloading the document and searching it in Acrobat, only a search for
"internaƟonal" will find

Re: Joined "ti" coded as "Ɵ" in PDF

2016-03-19 Thread Andrew Cunningham
There are a few things going on.

In the first instance, it may be the font itself that is the source of the
problem.

My understanding is that PDF files contain a sequence of glyphs. A PDF file
will contain a ToUnicode mapping between glyphs and codepoints. This is
either a 1-1 mapping or a 1-many mapping; the 1-many mapping provides
support for ligatures and variation sequences.
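Inside the ToUnicode CMap, the 1-many case for a ligature looks roughly
like this (sketch only; the glyph code <0123> is a placeholder):

    1 beginbfchar
    <0123> <00740069>    % ligature glyph -> U+0074 U+0069 ("ti")
    endbfchar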

I assume it uses the data in the font's cmap table. If the ligature isn't
mapped then you will have problems. I guess the problem could be either the
font or the font subsetting and embedding performed when the PDF is
generated.

Although, it is worth noting that in opentype fonts not all glyphs will
have mappings in the cmap file.

The remedy, is to extensively tag the PDF and add ActualText attributes to
the tags.

But the PDF specs leave it up to the developer to decide what happens if
there is both a visible text layer and ActualText. So even in an ideal PDF,
results will vary from software to software when copying text or searching
a PDF.

At least that's my current understanding.

Andrew
On 18 Mar 2016 7:47 am, "Don Osborn"  wrote:

> Thanks all for the feedback.
>
> Doug, It may well be my clipboard (running Windows 7 on this particular
> laptop). Get same results pasting into Word and EmEditor.
>
> So, when I did a web search on "internaƟonal," as previously mentioned,
> and come up with a lot of results (mostly PDFs), were those also a
> consequence of many not fully Unicode compliant conversions by others?
>
> A web search on what you came up with - "InternaƟonal" - yielded many
> more (82k+) results, again mostly PDFs, with terms like "interna onal"
> (such as what Steve noted) and "interna<onal" and perhaps others (given the
> nature of, or how Google interprets, the private use character?).
>
> Searching within the PDF document already mentioned, "international" comes
> up with nothing (which is a major fail as far as usability). Searching the
> PDF in a Firefox browser window, only "internaƟonal" finds the occurrences
> of what displays as "international." However after downloading the document
> and searching it in Acrobat, only a search for "internaƟonal" will find
> what displays as "international."
>
> A separate web search on "Eīects" came up with 300+ results, including
> some GoogleBooks which in the texts display "effects" (as far as I
> checked). So this is not limited to Adobe?
>
> Jörg, With regard to "Identity H," a quick search gives the impression
> that this encoding has had a fairly wide and not so happy impact, even if
> on the surface level it may have facilitated display in a particular style
> of font in ways that no one complains about.
>
> Altogether a mess, from my limited encounter with it. There must have been
> a good reason for or saving grace of this solution?
>
> Don
>
> On 3/17/2016 2:17 PM, Steve Swales wrote:
>
>> Yes, it seems like your mileage varies with the PDF
>> viewer/interpreter/converter.  Text copied from Preview on the Mac replaces
>> the ti ligature with a space.  Certainly not a Unicode problem, per se, but
>> an interesting problem nevertheless.
>>
>> -steve
>>
>> On Mar 17, 2016, at 11:11 AM, Doug Ewell  wrote:
>>>
>>> Don Osborn wrote:
>>>
>>> Odd result when copy/pasting text from a PDF: For some reason "ti" in
 the (English) text of the document at

 http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf
 is coded as "Ɵ". Looking more closely at the original text, it does
 appear that the glyph is a "ti" ligature (which afaik is not coded as
 such in Unicode).

>>> When I copy and paste the PDF text in question into BabelPad, I get:
>>>
>>> InternaƟonal Order and the DistribuƟon of IdenƟty in 1950 (By
 invitaƟon only)

>>> The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use
>>> character.
>>>
>>> Truncating this character to 16 bits, which is a Bad Thing™, yields
>>> U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either
>>> Don's clipboard or the editor he pasted it into is not fully
>>> Unicode-compliant.
>>>
>>> Don's point about using alternative characters to implement ligatures,
>>> thereby messing up web searches, remains valid.
>>>
>>> --
>>> Doug Ewell | http://ewellic.org | Thornton, CO 
>>>
>>>
>>>
>>
>
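
As an aside, the truncation Doug describes above is easy to reproduce; a
quick Python check, using the code point from the message:

    >>> hex(0x10019F & 0xFFFF)
    '0x19f'
    >>> import unicodedata
    >>> unicodedata.name('\u019f')
    'LATIN CAPITAL LETTER O WITH MIDDLE TILDE'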


Re: Windows keyboard restrictions

2015-08-08 Thread Andrew Cunningham
On Saturday, 8 August 2015, Richard Wordingham
richard.wording...@ntlworld.com wrote:

Michael did do a series of blog posts on building TSF based input methods
years ago. Something I tinkered with off and on.

 What we're waiting for is a guide we can follow, or some code we can
 ape.  Such should be, or should have been, available in a Tavultesoft
 Keyman rip-off.


I don't believe in rip-offs, especially when there are free versions and the
enhanced version doesn't cost much.

But that said, there is KMFL on Linux which handles a subset of the Keyman
definition files. And Keith Striebly, before he died, did a port of the
KMFL library to Windows. But I doubt anyone is maintaining it.

But the reality is that the use cases discussed in this and related threads do
not need fairly complex or sophisticated layouts. So KMFL and derivatives
should be fine despite how limited I consider them.

Alternatively, there are a range of input frameworks developed in SE Asia that
would be easy to work with as well.

Alternative input frameworks have been around for years. It's up to us to use
them or not.

I don't see much point bleating about the limitations of the win32 keyboard
model. Just use an alternative input framework, whether it is TSF table-based
input, Keyman, the KMFL port to Windows, or any of the large slather of
input frameworks that are available out there.

Andrew



-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunning...@slv.vic.gov.au
  lang.supp...@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/


Re: Unicode of Death

2015-05-29 Thread Andrew Cunningham
Geez Philippe,

It was tongue in cheek.

A.

On Saturday, 30 May 2015, Philippe Verdy verd...@wanadoo.fr wrote:

 2015-05-28 23:36 GMT+02:00 Andrew Cunningham lang.supp...@gmail.com:

 Not the first time unicode crashes things. There was the google chrome
bug on osx that crashed the tab for any syriac text.

 Unicode crashes things? Unicode has nothing to do in those crashes
caused by bugs in applications that make incorrect assumptions (in fact not
even related to characters themselves but to the supposed behavior of the
layout engine. Programmers and designers for example VERY frequently forget
the constraints for RTL languages and make incorrect assumptions about left
and right sides when sizing objects, or they don't expect that the cursor
will advance backward and forget that some measurements can be negative: if
they use this negative value to compute the size of a bitmap rendering
surface, they'll get out of memory, unchecked null pointers returned, then
they will crash assuming the buffer was effectively allocated.
 These are the same kind of bugs as with the too common buffer overruns
with unchecked assumptions: the code is kept because it works as is in
their limited immediate tests.
 Producing full coverage tests is a difficult and lengthy task, that
programmers not always have the time to do, when they are urged to produce
a workable solution for some clients and then given no time to improve the
code before the same code is distributed to a wider range of clients.
 Commercial staffs do that frequently, they can't even read the technical
limitations even when they are documented by programmers... in addition the
commercial staff like selling softwares that will cause customers to ask
for support... that will be billed ! After that, programmers are
overwhelmed by bug reports and support requests, and have even less time to
design other things that they are working on and still have to produce. QA
tools may help programmers in this case by providing statistics about the
effective costs of producing new software with better quality, and the cost
of supporting it when it contains too many bugs: commercial teams like
those statistics because they can convert them to costs, commercial
margins, and billing rates. (When such QA tools are not used, programmers
will rapidly leave the place, they are fed up by the growing pressure to do
always more in the same time, with also a growing number of urgent
support requests.).
 Those that say Unicode crashes things do the same thing: they make
broad unchecked assumptions about how things are really made or how things
are actually working.


-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunning...@slv.vic.gov.au
  lang.supp...@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/


Re: Unicode of Death

2015-05-28 Thread Andrew Cunningham
Not the first time Unicode crashes things. There was the Google Chrome bug
on OS X that crashed the tab for any Syriac text.

A.

On Friday, 29 May 2015, Bill Poser billpos...@gmail.com wrote:
 No doubt the evil Unicode Consortium is in league with the Trilateral
Commission, the Elders of Zion,and the folks at NASA who faked the moon
landing :)

 On Thu, May 28, 2015 at 7:53 AM, Doug Ewell d...@ewellic.org wrote:

 Unicode is in the news today as some folks with waaay too much time on
 their hands have discovered a string consisting of Latin, Arabic,
 Devanagari, and CJK characters that crashes Apple devices when it
 appears as a pop-up message.

 Although most people seem to identify it correctly as a CoreText bug,
 there are a handful, as you might expect, who attribute it to some shady
 weirdness in Unicode itself. My favorite quote from a Reddit user was
 this:

 Every character you use has a unicode value which tells your phone what
 to display. One of the unicode values is actually never-ending and so
 when the phone tries to read it it goes into an infinite loop which
 crashes it.

 I've read TUS Chapter 4 and UTR #23 and I still can't find the
 never-ending Unicode property.

 Perhaps astonishingly to some, the string displays fine on all my
 Windows devices. Not all apps get the directionality right, but no
 crashes.

 --
 Doug Ewell | http://ewellic.org | Thornton, CO 




-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunning...@slv.vic.gov.au
  lang.supp...@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/


Re: Combined Yorùbá characters with dot below and tonal diacritics

2015-04-12 Thread Andrew Cunningham
On 12/04/2015 7:27 PM, Ilya Zakharevich nospam-ab...@ilyaz.org wrote:

 On Sun, Apr 12, 2015 at 07:07:01AM +0200, Philippe Verdy wrote:


  MSKLC does not provide a way to build another geometry and map geometric
  keys to vkeys (or the reverse).

 Again, this has nothing to do with MSKLC.


If you are compiling a keyboard driver from source, then it has nothing to
do with MSKLC.

But for a general answer, for the average user who needs to develop a
keyboard, MSKLC is very pertinent.

  Note also that (since always), MSKLC generated drivers have never
allowed
  us to change the mapping of scancodes (from hardware keyboards) to
virtual
  keys, aka vkeys, or to WM_SYSKEY (this is hardwired in a lower
internal
  level).

 Wrong.  Look for any French or German keyboard.

Microsoft has a tendency never to change a keyboard or how it operates;
there are a lot of bad design decisions and cruft still in there. Just
because something can be done, doesn't mean it should be done.


  These drivers only map sequences of one or more vkeys (and a few
  supported states, it's not possible to add keyboard states other than
CTRL,
  SHIFT, CAPSLOCK, ALTGR2, and custom states for dead keys)

 How do you think I do it in my layout?


There are Microsoft keyboard layouts that use other states, the Canadian
multilingual keyboard comes to mind, mainly to comply with a Canadian
standard. But Microsoft themselves recommend keeping to the four keyboard
states Philippe lists.

  to only one WM_CHAR.

 I have no idea why you would mix in WM_* stuff into this discussion…


Depending on your perspective it is pertinent or not.

  And it's not possible to change the mapping of vkeys to WM_SYSCHAR
  (this is also hardwired at a lower level).

 I have no clue what you are talking about now…


Andrew


Re: Combined Yorùbá characters with dot below and tonal diacritics

2015-04-11 Thread Andrew Cunningham
Hi Ilya,

The problem with the approach documented below is twofold:

1) the characters required do not all exist as precomposed characters, thus
Microsoft's dead key sequences will not work for Yorùbá (see the sketch below).

2) certain AltGr sequences are not guaranteed to work in all programs.
Some programs treat the AltGr sequence as equivalent to the Alt key
sequence, with program shortcuts overriding keyboard input.

From memory this was a problem we would have with MS Word. Care needs to be
taken selecting AltGr sequences to implement in a keyboard.

And adding frequently typed characters like vowels and tone marks to AltGr
is usually a bad idea. It is easier to move less needed sequences to the
AltGr state, putting frequently typed characters on the normal and shift states.
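
On the first point, a quick Python check (hedged: as far as I recall an
MSKLC dead key can only emit a single UTF-16 code unit, while ẹ plus a
tone mark is still two code points even after NFC):

    >>> import unicodedata
    >>> unicodedata.normalize("NFC", "\u1eb9\u0301")   # ẹ + combining acute
    'ẹ́'
    >>> len(unicodedata.normalize("NFC", "\u1eb9\u0301"))
    2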

Andrew

On Sunday, 12 April 2015, Ilya Zakharevich nospam-ab...@ilyaz.org wrote:
 On Sat, Apr 11, 2015 at 01:19:23AM +0100, Luis de la Orden wrote:
 Thanks for challenging my understanding of dead keys. I have a layout in
my
 Mac that works like a charm to write Yorùbá, Portuguese and Spanish
with
 the UK layout. I am having trouble with the Windows layout and should
have
 mentioned that more clearly. Nevertheless, I was using Microsoft Keyboard
 Layout Creator and assumed that the limitations of the software (or the
 limitations of my knowledge of the software) were the limitations of the
 technology as a whole.

 I see no problem with using MSKLC with Yorùbá.  Just make
   AltGr-e,  AltGr-o,  AltGr-s
 produce
   e̩, o̩, and s̩.
 Then make AltGr--, AltGr-' and AltGr-` into prefix keys (deadkeys)
 converting characters into accented forms.  IIRC, this would work fine
 also with “base keys” producing Unicode clusters (like those above)
 (check in the document below).

 For details, see the corresponding sections of

http://search.cpan.org/~ilyaz/UI-KeyboardLayout/lib/UI/KeyboardLayout.pm
 [I do not think the “standard” keyboard input on Windows is documented
 anywhere else :-( ].

 Hope this helps,
 Ilya


-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunning...@slv.vic.gov.au
  lang.supp...@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/


Re: Avoidance variants

2015-03-25 Thread Andrew Cunningham
Or is it a markup issue rather than something for plain text?


On 26 March 2015 at 13:30, Mark E. Shoulson m...@kli.org wrote:

  So, not much in the way of discussion regarding the TETRAGRAMMATON issue
 I raised the other week.  OK; someone'll eventually get to it I guess.

 Another thing I was thinking about, while toying with Hebrew fonts.
 Often, letters are substituted in _nomina sacra_ in order to avoid writing
 a holy name, much as the various symbols for the tetragrammaton are used.
 And indeed, sometimes they're used in that name too, as I mentioned, usages
 like ידוד or ידוה and so on.  There's an example in the paper that shows
 אלדים instead of אלהים.  Much more common today would be אלקים and in fact
 people frequently even pronounce it that way (when it refers to big-G God,
 in non-sacred contexts.  But for little-g gods, the same word is pronounced
 without the avoidance, because it isn't holy.  It's weird.)

 I wonder if it makes sense maybe to encode not a codepoint, but a variant
 sequence(s) to represent this sort of defaced or altered letter HEH.
 It's still a HEH, it just looks like another letter, right? (QOF or DALET
 or occasionally HET)  That would keep some consistency to the spelling.  On
 the other hand, the spelling with a QOF is already well entrenched in texts
 all over the internet.  But maybe it isn't right.  And what about the use
 of ה׳ or ד׳ for the tetragrammaton?  Are they both HEHs, one altered, or
 is one really a DALET?  Any thoughts?

 (and seriously, what to do about all those tetragrammaton symbols?)

 ~mark

 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode




-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunning...@slv.vic.gov.au
  lang.supp...@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Android 5.1 ships with support for several minority scripts

2015-03-14 Thread Andrew Cunningham
Comment on Cham was informational. What is in the Unicode charts was based on
Eastern Cham only.

Proposals to add the Cham and Arabic characters needed to support Western
Cham are under development.

Testing on Tai Tham will occur ... I was curious as to what the original
design parameters for the font were. It is easier to evaluate a font's
language support knowing what was originally intended.

For instance I do not assume that the Myanmar font was designed to support
all languages that use the Myanmar script.

I can also make assumptions about Latin script coverage and languages that
are supported/unsupported.

Andrew

On Sunday, 15 March 2015, Roozbeh Pournader rooz...@unicode.org wrote:
 Andrew,
 I don't know the answer to your questions unfortunately. You can
investigate the fonts yourself (they are available at
https://code.google.com/p/noto/), or ask for support for Western Cham
(assuming it's already properly encoded at Unicode) at the Noto issue
tracker at https://code.google.com/p/noto/issues/entry.
 On Fri, Mar 13, 2015 at 8:27 PM, Andrew Cunningham lang.supp...@gmail.com
wrote:

 Hi Roozbeh,

 a point of clarification and a question:

 * the Cham font is actually an Eastern Cham font supporting Akhar Thrah,
the Eastern variety of the script.

 Akhar Srak, the Western Cham script, remains unsupported.

 Which languages was the Tai Tham font designed to support? And which
variety of the script?

 Andrew

 On Saturday, 14 March 2015, Roozbeh Pournader rooz...@unicode.org
wrote:
  Android 5.1, released earlier this week, has added support for 25
minority scripts. The wide coverage can be reproduced by almost everybody
for free, thanks to the Noto and HarfBuzz projects, both of which are open
source. (Android itself is open source too.)
  By my count, these are the new scripts added in Android 5.1: Balinese,
Batak, Buginese, Buhid, Cham, Coptic, Glagolitic, Hanunnoo, Javanese, Kayah
Li, Lepcha, Limbu, Meetei Mayek, Ol Chiki, Oriya, Rejang, Saurashtra,
Sundanese, Syloti Nagri, Tagbanwa, Tai Le, Tai Tham, Tai Viet, Thaana, and
Tifinagh.
  (Android 5.0, released last year, had already added the Georgian lari,
complete Unicode 7.0 coverage for Latin, Greek, and Cyrillic, and seven new
scripts: Braille, Canadian Aboriginal Syllabics, Cherokee, Gujarati,
Gurmukhi, Sinhala, and Yi.)
  Note that different Android vendors and carriers may choose to ship
more fonts or less, but Android One phones and most Nexus devices will
support all the above scripts out of the box.
 
  None of this would have been possible without the efforts of Unicode
volunteers who worked hard to encode the scripts in Unicode. Thanks to the
efforts of Unicode, Noto, and HarfBuzz, thousands of communities around the
world would can now read and write their language on smartphones and
tablets for the first time.
 

 --
 Andrew Cunningham
 Project Manager, Research and Development
 (Social and Digital Inclusion)
 Public Libraries and Community Engagement
 State Library of Victoria
 328 Swanston Street
 Melbourne VIC 3000
 Australia

 Ph: +61-3-8664-7430
 Mobile: 0459 806 589
 Email: acunning...@slv.vic.gov.au
   lang.supp...@gmail.com

 http://www.openroad.net.au/
 http://www.mylanguage.gov.au/
 http://www.slv.vic.gov.au/




-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunning...@slv.vic.gov.au
  lang.supp...@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Android 5.1 ships with support for several minority scripts

2015-03-13 Thread Andrew Cunningham
Hi Roozbeh,

a point of clarification and a question:

* the Cham font is actually an Eastern Cham font supporting Akhar Thrah, the
Eastern variety of the script.

Akhar Srak, the Western Cham script, remains unsupported.

Which languages was the Tai Tham font designed to support? And which
variety of the script?

Andrew

On Saturday, 14 March 2015, Roozbeh Pournader rooz...@unicode.org wrote:
 Android 5.1, released earlier this week, has added support for 25
minority scripts. The wide coverage can be reproduced by almost everybody
for free, thanks to the Noto and HarfBuzz projects, both of which are open
source. (Android itself is open source too.)
 By my count, these are the new scripts added in Android 5.1: Balinese,
Batak, Buginese, Buhid, Cham, Coptic, Glagolitic, Hanunnoo, Javanese, Kayah
Li, Lepcha, Limbu, Meetei Mayek, Ol Chiki, Oriya, Rejang, Saurashtra,
Sundanese, Syloti Nagri, Tagbanwa, Tai Le, Tai Tham, Tai Viet, Thaana, and
Tifinagh.
 (Android 5.0, released last year, had already added the Georgian lari,
complete Unicode 7.0 coverage for Latin, Greek, and Cyrillic, and seven new
scripts: Braille, Canadian Aboriginal Syllabics, Cherokee, Gujarati,
Gurmukhi, Sinhala, and Yi.)
 Note that different Android vendors and carriers may choose to ship more
fonts or less, but Android One phones and most Nexus devices will support
all the above scripts out of the box.

 None of this would have been possible without the efforts of Unicode
volunteers who worked hard to encode the scripts in Unicode. Thanks to the
efforts of Unicode, Noto, and HarfBuzz, thousands of communities around the
world can now read and write their language on smartphones and
tablets for the first time.


-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunning...@slv.vic.gov.au
  lang.supp...@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Western Cham in Akhar Jawi

2014-10-27 Thread Andrew Cunningham
Thanks Roozbeh,

I will most likely write a proposal; at the moment I am still mapping
character usage to see if other unencoded characters pop up.

I am also doing the same for the Western Cham script; some of the more recent
reforms (within the past 10 years) in Cambodia don't appear to be encoded.

Andrew

On 28 October 2014 02:26, Roozbeh Pournader rooz...@unicode.org wrote:

 This is the first time I'm seeing the character. I suggest writing a
 Unicode proposal.
 On Oct 26, 2014 10:42 PM, Andrew Cunningham lang.supp...@gmail.com
 wrote:

 Hi all,


 When Western Cham is written in the Arabic script, there is regional
 variation in the Arabic characters used. Two varieties I am looking at use
 a character that I can't see in the Unicode charts, although I may have
 missed it.

 The character is an alef with three dots above (with the dots pointing
 upwards), see the attached images.

 has anyone come across this character used in other contexts?

 Andrew

 --
 Andrew Cunningham
 Project Manager, Research and Development
 (Social and Digital Inclusion)
 Public Libraries and Community Engagement
 State Library of Victoria
 328 Swanston Street
 Melbourne VIC 3000
 Australia

 Ph: +61-3-8664-7430
 Mobile: 0459 806 589
 Email: acunning...@slv.vic.gov.au
   lang.supp...@gmail.com

 http://www.openroad.net.au/
 http://www.mylanguage.gov.au/
 http://www.slv.vic.gov.au/

 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode




-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunning...@slv.vic.gov.au
  lang.supp...@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Western Cham in Akhar Jawi

2014-10-26 Thread Andrew Cunningham
Hi all,


When Western Cham is written in the Arabic script, there is regional
variation in the Arabic characters used. Two varieties I am looking at use
a character that I can't see in the Unicode charts, although I may have
missed it.

The character is an alef with three dots above (with the dots pointing
upwards), see the attached images.

has anyone come across this character used in other contexts?

Andrew

-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunning...@slv.vic.gov.au
  lang.supp...@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Current support for N'Ko

2014-09-29 Thread Andrew Cunningham
On 29/09/2014 11:02 PM, Frédéric Grosshans frederic.grossh...@gmail.com
wrote:

 Le 27/09/2014 01:10, Andrew Cunningham a écrit :

 * NEVER try to copy and paste text from PDF. It is a preprint format and
should be treated as such.

 Well... Having access to the raw text is often useful (for example, to
allow blind people to have access to the content of PDF documents, or to search a
word in a scanned historical document), and cut and pasting text from PDF
often works, even if the “rich text” formating is lost.


The problem is that often the actual text isn't necessarily the same as the
original text used to generate the PDF.

Results will vary according to the fonts used and the tools used to generate
the PDF. Even Adobe Acrobat contains different tools which can give vastly
different results.

It is best to think of PDF as dealing with glyphs rather than characters.

I tend to mainly work with complex scripts, and the results with those are
usually not encouraging. I know there is ActualText, but honestly I don't
ever remember seeing a complex script PDF I could copy and paste from
without post-processing of the text.

The average person creating PDF files has no knowledge of how to achieve
optimal results.

N'Ko is one of the easier scripts to deal with, thankfully.

 In the case of the Ebola FAQs (
https://sites.google.com/site/athinkra/ebola-faqs) discussed here, it
almost worked perfectly on my computer (Ubuntu Linux 14.04) for N’Ko
(diacritics are shifted by one character) and Vai. Of course, the Adlam was
not working (somehow converted to Arabic), bus it was expected, since Adlam
is not (yet?) in Unicode.


 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Current support for N'Ko

2014-09-29 Thread Andrew Cunningham
On 30/09/2014 4:11 AM, David Starner prosfil...@gmail.com wrote:

 On Fri, Sep 26, 2014 at 4:10 PM, Andrew Cunningham
 lang.supp...@gmail.com wrote:
  * NEVER try to copy and paste text from PDF. It is a preprint format and
  should be treated as such.


 I'd try and cut and paste from print if I could. People are going to
 cut and paste from anything if it saves them a little time. If you
 disable cut and pasting from PDF, those who have easy access to OCR
 may just print to image and OCR it to cut and paste. To say don't do
 this is unproductive.


OK, what I should say is that in the best case scenario for complex script text
you can copy and paste and then do post-processing on the extracted text to get
the actual text. Post-processing may involve reordering characters, or
systematic conversions of glyph sequences (a sketch below).
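
A minimal Python sketch of such a systematic conversion, assuming a
hypothetical cleanup table for one particular font:

    # map presentation-form glyphs that leak into the extracted text
    # back to the character sequences they stand for
    FIXES = {
        "\ufb01": "fi",  # U+FB01 LATIN SMALL LIGATURE FI
        "\ufb02": "fl",  # U+FB02 LATIN SMALL LIGATURE FL
    }

    def fix_extracted(text):
        for glyph, chars in FIXES.items():
            text = text.replace(glyph, chars)
        return text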

In the worst case scenario you get utter garbage from which you cannot
reconstruct the original text.

Searching and indexing are even more problematic.

Honestly, for the languages I work with it would be quicker and more accurate
in many cases to use OCR (even at 80% accuracy) than cut and paste from PDF.

As I said in a previous email, results and effectiveness will differ depending
on the fonts used and the PDF generator used.

PDF was designed for preprint, not archival purposes.

 --
 Kie ekzistas vivo, ekzistas espero.
 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Current support for N'Ko

2014-09-26 Thread Andrew Cunningham
Hi Don,

I will give a detailed reply offline to you and Charles. I am slowly
working on notes on web deployment of various languages in my spare time.
I have been held up unpicking the Myanmar script and possible
errata/additions to UTN11.

But N'Ko is on my list of scripts to document.

I will need to look at your pages and unpick them.

But a couple of reflections:

Your blog post is dealing with multiple issues.

* bidi support in HTML5 and CSS3, and to what extent scripts like N'Ko
are taken into account.

* what rendering system is being used by the browser

* what font is being used: OpenType, Graphite, AAT ... this will affect
rendering in browsers. For OpenType, which script tag is being used, which
will affect which OpenType features will be processed.

So getting the font stack right is important, and the font stack will differ
from browser to browser.

I need to check for the existence of a cross-platform N'Ko font.

* NEVER try to copy and paste text from PDF. It is a preprint format and
should be treated as such.

Andrew

On 27/09/2014 12:45 AM, d...@bisharat.net wrote:

 Some observations concerning N'Ko support in browsers may be of interest:


http://niamey.blogspot.com/2014/09/nko-on-web-review-of-experience-with.html

 This is pursuant to reposting a translation in N'Ko of a World Health
Organization FAQ on ebola. That translation was one of several facilitated
by Athinkra LLC, and available at
https://sites.google.com/site/athinkra/ebola-faqs

 Don Osborn
 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Editing Sinhala and Similar Scripts

2014-03-19 Thread Andrew Cunningham
LOL, that's why, if the input framework allows it, it's easier to support
both approaches to backspace, or at least an option to choose one or the
other.

; )

Andrew
On 19/03/2014 11:37 PM, Doug Ewell d...@ewellic.org wrote:

 Richard Wordingham richard dot wordingham at ntlworld dot com wrote:

  Typing is a nightmare.


  When you backspace it destroys multiple keystrokes.


 I suspect this is a widespread and unsolved problem.


 There are two types of people:

 1. those who fully expect Backspace to erase a single keystroke, and feel
 it is a fatal flaw if it erases an entire combination, and

 2. those who fully expect Backspace to erase an entire combination, and
 feel it is a fatal flaw if it erases just a single keystroke.

 Unfortunately, both types exist in significant numbers.

 --
 Doug Ewell | Thornton, CO, USA
 http://ewellic.org | @DougEwell ­
 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Editing Sinhala and Similar Scripts

2014-03-19 Thread Andrew Cunningham
There is also a distinction between editing an existing document that you
opened, as distinct from writing a document, going back to a certain point
in the document and editing that section within the same editing session.

In the first case there is no history; in the second case there may be
history to work with.

Andrew


On 20 March 2014 14:43, Peter Constable peter...@microsoft.com wrote:

 If you click into the existing text in this email and backspace, what
 keystroke will you expect to be erased? Your system has no way of knowing
 what keystroke might have been involved in creating the text.

 What it _can_ make sense to talk about is to say that a user expects
 execution of a particular key sequence, such as pressing a Backspace key,
 to have a particular editing effect on the content of text. Erasing a
 keystroke and keystrokes resulting in edits are different things. One
 makes sense, the other does not.

 It may seem like I'm being pedantic, but I think the distinction is
 important. Our failure is in framing our thinking from years of experience
 (and perhaps some behaviours originally influenced by typewriter and
 teletype technologies) in which a keyboard has a bunch of keys that add
 characters, and variations on that that even include a lot of logic to get
 input keying sequences that can generate tens of thousands of different
 characters; but then one or two keys (delete, backspace) that can only
 operate in very dumb ways. (We've also always assumed that any logic in
 keying behaviours can be conditioned only by the input sequences, but not
 by any existing content, but that steps beyond my earlier point.) These
 constraints in how we think limit possibilities


 Peter


 -Original Message-
 From: Doug Ewell [mailto:d...@ewellic.org]
 Sent: March 19, 2014 9:39 AM
 To: Peter Constable; unicode@unicode.org
 Subject: RE: Editing Sinhala and Similar Scripts

 Peter Constable petercon at microsoft dot com wrote:

  There are two types of people:
 
  1. those who fully expect Backspace to erase a single keystroke
 
  It is nonsensical to talk about erasing a _keystroke_.

 But that's what they expect.

 --
 Doug Ewell | Thornton, CO, USA
 http://ewellic.org | @DougEwell


 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode




-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunning...@slv.vic.gov.au
  lang.supp...@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Editing Sinhala and Similar Scripts

2014-03-19 Thread Andrew Cunningham
On 20 March 2014 15:17, J. Leslie Turriff jlturr...@centurylink.net wrote:

 Perhaps it might be useful to be able to distinguish between an
 editing
 mode and a composition mode:  editing mode would be active when a
 document
 is first loaded into the editor, when the editor has no keystroke history
 to
 consult, and  in this mode the backspace key would merely remove text
 glyph
 by glyph, so to speak, as happens with ASCII text;  composition mode would
 be active when keystrokes have been recorded in a buffer, so that backspace
 could be used to unstroke the original strokes; the unstroke operations
 would mimic the order in which the originals were entered, even if the
 editor
 had optimized the composition.




Although that requires an input framework and application that utilise that
buffer in various ways during composition mode. It is possible, and in
the past I have written a manual and run training on advanced editing for
Dinka language translators on how to utilise such features. But not many
applications support such features.

Andrew

-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunning...@slv.vic.gov.au
  lang.supp...@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)

2014-03-18 Thread Andrew Cunningham
Chris,

Keyman is capable of doing that and a lot more,  but few keyboard layout
developers use it to its full potential.

As an example,  I was asked by Harari teachers here in Melbourne to develop
a set of three keyboard layouts for them and their students.
The three keyboards were for three different orthographies in the following
scripts:
1) Latin
2) Ethiopic
3) Arabic

They wanted all three layouts to work identically,  using the keystrokes
used on the Latin keyboard.

The Ethiopic and Arabic keyboard layouts required extensive remapping of
key sequences to output.

If I was a programmer I could have done something more elegant by building
an external library Keyman could call but as it is we could do a lot inside
the Keyman keyboard layout itself.

For Myanmar script keyboard layouts we allow visual input for the e-vowel
sign and medial Ra,  with the layout handling reordering.

One of the Latin layouts I use supports combining diacritics and reorders
sequences of diacritics to their canonical order regardless of order of
input, assuming a maximum of one diacritic below and two diacritics above
the base character (sketch below).
Analysis and creativity can produce some very effective Keyman layouts.
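
A hedged Python sketch of that reordering logic (a stable sort of each
run of combining marks by canonical combining class; assumes the input
is already fully decomposed):

    import unicodedata

    def canonical_reorder(s):
        out, marks = [], []
        for ch in s:
            if unicodedata.combining(ch):
                marks.append(ch)
            else:
                out.extend(sorted(marks, key=unicodedata.combining))
                marks = []
                out.append(ch)
        out.extend(sorted(marks, key=unicodedata.combining))
        return "".join(out)

    # dot below (class 220) ends up before acute (class 230),
    # whichever was typed first
    assert canonical_reorder("o\u0301\u0323") == "o\u0323\u0301"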

Andrew
 On 18/03/2014 7:23 PM, Christopher Fynn chris.f...@gmail.com wrote:

 MSKLC and KeyMan are fairly crude ways of creating input methods

 For what you want to do, you probably need a memory resident program
 that traps the Latin input from the keyboard, processes the
 (transliterated) input strings converting them into unicode Sinhala
 strings, and then injects these back into the input queue  in place of
 the Latin characters.

 There are a couple of utilities that do this for typing
 transliterated/romanised Tibetan in Windows and getting  Tibetan
 Unicode output.
 http://tise.mokhin.org/
 http://www.thubtenrigzin.fr/denjongtibtype/en.html

 But I think both of these were written in C as they have to do a lot
 of processing which is far beyond what can be accomplished with MSKLC
 and even KeyMan

 - C
 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)

2014-03-18 Thread Andrew Cunningham
I suspect it was a fishing expedition to illustrate how awkward it is to
type on Unicode keyboard layouts versus his system.

I.e. still no clear separation of input and encoding in his responses.
On 19/03/2014 6:39 AM, Doug Ewell d...@ewellic.org wrote:

 Tom, with typo spotted and corrected by Jean-François, seems to have
 found it:

 කාර්‍ය්‍යාලවල යනහ්‍ර
 පඩකහි

 The sequence of code points would thus be:

 0D9A 0DCF 0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA 0DCF 0DBD 0DC0 0DBD 0020
 0DBA 0DB1 0DC4 0DCA 200D 0DBB 0020 0DB4 0DA9 0D9A 0DC4 0DD2

 Naena, is this what you were looking for?

 --
 Doug Ewell | Thornton, CO, USA
 http://ewellic.org | @DougEwell


 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)

2014-03-18 Thread Andrew Cunningham
Different individuals,  groups and communities can bring their own
expectations to input layout designs.
Design is a balance between the capabilities and limitations of the input
framework versus the expectations of the user community around how their
language should work.

I work with multiple operating systems and even more input frameworks.

I have my preferred input frameworks. But ultimately it is a question
of knowing your tools.

For instance, if you compile a keyboard layout from the command line with
MSKLC you can chain dead keys, build against custom locales in Vista and
Win7, or build against unsupported language codes in Win8+.
Andrew
On 19/03/2014 9:13 AM, Tom Gewecke t...@bluesky.org wrote:


 On Mar 18, 2014, at 12:52 PM, Andrew Cunningham wrote:

 I suspect it was a fishing expedition to illustrate how awkward it is to
 type on Unicode keyboard layouts versus his system.


 Interesting question perhaps.  Is it more awkward to type 14 strokes as k
 a a r y y a a l a v a l a  or to type 9 as  ක ා  ර  ්‍ය  ්‍ය  ා  ල  ව  ල ?


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)

2014-03-17 Thread Andrew Cunningham
On 18/03/2014 11:23 AM, Naena Guru naenag...@gmail.com wrote:



 I tried to make a phonetic one to kind of relate to the English keys.
Still, you need to have many shifted keys to get common letters.


No you don't, you just need to understand the possibilities of what your
input framework is capable of and the best way to implement what you want
to achieve.

The Windows input system is probably the most constrained, but for a
good phonetic layout have a look at the Cherokee Phonetic layout on Windows
8+.

Designing a good layout requires using the right tools, knowing the limits
and capabilities of those tools, and using them in creative ways.


 On Mon, Mar 17, 2014 at 11:38 AM, Doug Ewell d...@ewellic.org wrote:

 Naena Guru naenaguru at gmail dot com wrote:

  Making a keyboard [layout] is not hard. You can either edit an
  existing one or make one from scratch. I made the latest Romanized
  Singhala one from scratch. The earlier one was an edit of US-
  International.

 I've made a couple dozen of them myself, with MSKLC.

  When you type a key on the physical keyboard, you generate what is
  called a scan-code of that key so that the keyboard driver knows which
  key was pressed. (During DOS days, we used to catch them to make
  menus.) Now, you assign one or a sequence of Unicode characters you
  want to generate for the keypress.

 Precisely. As Marc Durdin said, you can create a keyboard layout just as
 easily for Unicode characters as for ASCII and Latin-1 characters. You
 can also assign a combination of characters to a single key.

 So it is not true that typing Unicode Sinhala requires you to learn a
 key map that is entirely different from the familiar English keyboard,
 while losing some marks and signs too. Unicode does not prescribe any
 key map. You can have whatever layout you like.

 As Marc also said, if you think there are marks and signs missing from
 Unicode, that is another matter.

 --
 Doug Ewell | Thornton, CO, USA
 http://ewellic.org | @DougEwell



 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Romanized Singhala got great reception in Sri Lanka

2014-03-16 Thread Andrew Cunningham
On 17/03/2014 6:55 AM, Jean-François Colson j...@colson.eu wrote:

 Le 16/03/14 14:10, William_J_G Overington a écrit :



 Is the Romanized Singhala system a way to enter the characters into a
computer using only a QWERTY keyboard?

 It is easy to input (phonetically) using a keyboard layout slightly
altered from QWERTY.

 How is the keyboard altered from QWERTY please?

 Are you publishing the font please?


 In fact, I think he was speaking of the bare American (US) qwerty. An
international version of it should do the job.

 Looking at his site http://lovatasinhala.com/ and making a copy and paste
of the page contents, you see he uses 7-bit ASCII, a few Latin-1 accented
vowels, and a few additional “letters” such as ð, Ð, þ, æ and µ.


He also makes a case distinction,  where upper and lowercase versions of
some characters produce different Sinhala characters.

 Naena Guru’s aim is not to make an input method to type Sinhalese.
 Sinhalese keyboard layouts already exist:
 http://www.microsoft.com/resources/msdn/goglobal/keyboards/kbdsn1.html
 http://www.microsoft.com/resources/msdn/goglobal/keyboards/kbdsw09.html
 http://kaputa.com/uniwriter/apple.gif
 http://www.nongnu.org/sinhala/doc/keymaps/sinhala-keyboard_3.html

 His aim is rather to make an 8-bit font to replace that “difficult” and
“expensive” Unicode compliant Sinhalese.


Creating a new set of difficulties.

Andrew
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


RE: Diacritical marks: Single character or combined character?

2013-12-06 Thread Andrew Cunningham
To add to the other comments, I would add two points:

1. It depends on the language you deal with.
2. It depends on the  input framework you are using.

A number of the languages I deal with use combinations of base characters
and diacritics where some combinations have precomposed forms and others
don't.

When developing keyboard layouts for such languages using simple input
frameworks you have to use combining diacritics or a weird mix of combining
and precomposed.

With more sophisticated input frameworks you have more flexibility and
control.
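
A quick Python illustration of why the distinction matters (the two
spellings differ code point for code point but normalize to the same
thing):

    >>> import unicodedata
    >>> "\u00e9" == "e\u0301"        # precomposed é vs e + combining acute
    False
    >>> unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
    True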

Andrew


Re: Dotted Circle plus Combining Mark as Text

2013-10-20 Thread Andrew Cunningham
I suspect it is a font issue, rather than a renderer issue, but then using
a dotted circle is a convention used in the Unicode charts and in the
Unicode spec. It is not a combination I'd expect a font developer from SE
Asia to necessarily support.

Since publications in SE Asia have their own typographic conventions for
displaying isolated combining marks.

Andrew


Re: Can a single text document use multiple character encodings?

2013-08-30 Thread Andrew Cunningham
I can think of a few websites that mix legacy encoded content within a UTF-8
document.

Often done as a practicality.

Or alternatively mixing Unicode and pseudo-Unicode in the same document.
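
A minimal Python sketch of the problem at the byte level, assuming a
hypothetical stream whose first part is UTF-8 and whose second part is
windows-1252:

    raw = "café ".encode("utf-8") + "café".encode("cp1252")
    # decoding the whole stream as UTF-8 fails on the cp1252 bytes,
    # so each region has to be decoded with its own codec
    text = raw[:6].decode("utf-8") + raw[6:].decode("cp1252")
    assert text == "café café"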

Andrew
On 30/08/2013 11:14 PM, Ilya Zakharevich nospam-ab...@ilyaz.org wrote:

 On Wed, Aug 28, 2013 at 07:07:23PM +, Costello, Roger L. wrote:

  For example, can some text be encoded as UTF-8 while other text is
 encoded as UTF-16 - within the same document?

 I think it is a very interesting question.  A Perl program is
 (obviously) a text document.  On the other hand, in two minutes I
 could deduce a few ways to mix many different encodings into the same
 document.  My current record is 5 different encodings; some of them
 are arbitrary, some of them should satisfy certain compatibility
 requirements (something like
  =cut CR
 and
  =pod CR
 being encoded the same in two encodings).  And, on top of this, is yet
 another way to mix encodings arbitrarily.

 The tricks are threefold:

 ◌ First, a Perl program is actually a mixture of 3 different
   documents: the program stream, the data-for-the-program stream,
   and the documentation stream.  There are certain rules for
   interleaving them (except for DATA which should be at the end!),
   and there are documented way to specify encodings of the
   streams.

 ◌ Second, the string and regular-expression literals are
   “interpreted” by the lexer: there is a way for the program to
   specify a way to “massage” the literals before they are handled
   to interpreter.  This gives yet other ways to have strings
   and/or regular expressions to be in a different encoding.  (Note
   that this may lead to “doubly encoded” phenomena if the
   “ambient” encoding is not “raw”.)

 ◌ Third, there is a way to switch the encoding of a Perl program
   on the fly (at the end-of-line of current encoding).

 To be honest, I should have better tested all this before
 posting — but I did not.  On the practical side, how is this useful?
 Having different encoding for DATA and the program, and/or
 documentation and the program may be quite widely used.  The other
 hacks may have been used at least in the (enormous!) Perl test suite.

 Ilya




Re: Ways to show Unicode contents on Windows?

2013-07-19 Thread Andrew Cunningham
Although writing an IME from scratch is beyond the skill set of quite a few of us.

There are, though, Text Services Framework table-based IMEs, although I did
hear a rumour that support for those may disappear. Not sure if that is
true or not.

But since Windows 8, it has become even more difficult to track what is
happening in terms of input, especially since there are more input frameworks
than there used to be.

One of the reasons I prefer using non-Microsoft tools for complex input
requirements.

The Microsoft typography team has done some very good work. But Microsoft is so
large, things are becoming fragmented.

Interesting tools like Locale Builder were never maintained.

And it is becoming more difficult to develop solutions for lesser used
languages.

It is the nature of the beast, not just an issue with Microsoft and Windows
8, but with internationalisation support in many large projects.

Andrew
On 19/07/2013 5:47 PM, Richard Wordingham richard.wording...@ntlworld.com
wrote:

 On Thu, 18 Jul 2013 17:11:45 -0700
 Ilya Zakharevich nospam-ab...@ilyaz.org wrote:

  Just in case: do you realize that out-of-BMP must be specified via
  LIGATURES section?

 Yes, for 'character' read UTF-16 code element.  Even worse, you can't
 use dead keys outside the BMP, which prevents one using MSKLC for
 typing in natural language in cuneiform orthography.  (Plain text
 Egyptian is no more supported than is plain text calculus.)  However,
 I recall that one can use a simple IME instead.

 Richard.





Re: Ways to show Unicode contents on Windows?

2013-07-18 Thread Andrew Cunningham
Hi Ilya,

That is part of the story. There are tidbits scattered all through
Michael's blog.
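
As a hedged sketch from memory of the LIGATURES approach Ilya describes
below (the key and target character are hypothetical, and the fields are
tab-separated in real .klc files):

    LIGATURE

    //VK_      Mod#   Char0   Char1
    VK_Z       0      d835    dc00    // U+1D400 via surrogate pair on unshifted Z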

On 19/07/2013 11:53 AM, Ilya Zakharevich nospam-ab...@ilyaz.org wrote:

 On Wed, Jul 17, 2013 at 12:04:10AM +0100, Richard Wordingham wrote:
  (LCID); I don't see any way to check what the general .klc file format
  is - the format seemed very delicate when I had to edit it by hand, at
  least, not for the SMP.

 I wonder whether this link is relevant to what you discuss:

   http://blogs.msdn.com/b/michkap/archive/2013/04/16/10233999.aspx

 Myself, I found very few problems with manipulation of .klc files.
 (See the first dozen of Windows GOTCHAS in
   http://search.cpan.org/~ilyaz/UI-KeyboardLayout/lib/UI/KeyboardLayout.pm
 )

 Just in case: do you realize that out-of-BMP must be specified via
 LIGATURES section?  (Put %% instead of the characters, and put in
 LIGATURES: the VK, the modification column, and the “content”: up to 4
 16-bit numbers.)  My sources are in k.ilyaz.org/iz/windows/src.zip.

 Yours,
 Ilya



Re: Ways to show Unicode contents on Windows?

2013-07-16 Thread Andrew Cunningham
Hi Richard,

Yes, you can build against a custom locale in Vista onwards. It requires
editing the source file in a text editor, then building the keyboard layout
from the command line using MSKLC.

Andrew


On 17 July 2013 09:04, Richard Wordingham
richard.wording...@ntlworld.comwrote:

 On Mon, 15 Jul 2013 18:19:34 +1000
 Andrew Cunningham lang.supp...@gmail.com wrote:

  On 15/07/2013 6:02 PM, Christopher Fynn chris.f...@gmail.com
  wrote:

 What MS Office seems to want to do is apply fonts based on the language
   being used - the input language being determined by the keyboard
   or IME currently selected. When using a custom keyboard (e.g. one
   created with MSKLC) or IME  MS Office frequently does not accuratly
   determine the language and consequently overides your font
   selection.

  I am wondering if building an MSKLC layout against a custom locale will get
  around the problem, or would make no difference?

 Can one actually build MSKLC against a custom locale?  The
 documentation on the easiest way of building a custom locale implies
 that it is only available for Windows Vista, whereas I only have XP
 and Windows 7.  The .klc files MSKLC created use a numerical locale ID
 (LCID); I don't see any way to check what the general .klc file format
 is - the format seemed very delicate when I had to edit it by hand, at
 least, not for the SMP.  Neither Akkadian nor Hittite comes up on the
 pick list.  (I might choose Hittite because the cuneiform font I have
 is for Hittite.)  I suppose I might have problems with cuneiform
 because I chose the only Mesopotamian locale available - Iraqi Arabic.

 Richard.




-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunning...@slv.vic.gov.au
  lang.supp...@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/


Re: Ways to show Unicode contents on Windows?

2013-07-15 Thread Andrew Cunningham
On 15/07/2013 6:02 PM, Christopher Fynn chris.f...@gmail.com wrote:



 What MS Office seems to want to do is apply fonts based on the language
 being used - the input language being determined by the keyboard or
 IME currently selected. When using a custom keyboard (e.g. one created
 with MSKLC) or IME  MS Office frequently does not accuratly determine
 the language and consequently overides your font selection.


I am wondering if building an MSKLC layout against a custom locale would get
around the problem, or would make no difference?

Andrew


Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

2013-02-01 Thread Andrew Cunningham
Hi Roger,

The situation is complex. Few applications and web services bother with
normalisation, so what you get, i.e. NFC or NFD or other ... often depends
on which language you are using and what input framework you are using.

Some keyboard layouts will produce NFC output.

some keyboard layouts will not produce either NFC or NFD.

some keyboard layouts will produce NFD.

some keyboard layouts may produce NFD if the typist enters the characters
in the right order, when the language uses multiple combining diacritics and
some of the combining diacritics do not interact typographically.

You need very specific input frameworks supporting constraints and
reordering to guarantee either NFC or NFD for some languages.

And for some languages, different keyboard layouts will produce different
output. I.e. some Vietnamese input tools produce NFC, while others produce
neither NFC nor NFD (example below).
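
A quick Python illustration of the Vietnamese case (ê plus a combining
acute, as some input tools emit it, is neither NFC nor NFD):

    >>> import unicodedata
    >>> s = "\u00ea\u0301"                 # ê + combining acute
    >>> unicodedata.normalize("NFC", s) == "\u1ebf"   # composes to ế
    True
    >>> unicodedata.normalize("NFD", s) == "e\u0302\u0301"
    True
    >>> s in (unicodedata.normalize("NFC", s), unicodedata.normalize("NFD", s))
    False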

Library data is also problematic. Some ILMs will output NFC but this is
not the norm. Usually they will leave it in its internal format. For
MARC21, the character repertoire taken as a whole will produce data that is
neither NFC nor NFD, but if you look at subsets of data by language, a lot
of the data is effectively NFD. But not all.

Andrew
On Feb 2, 2013 1:19 AM, Costello, Roger L. coste...@mitre.org wrote:

 Hi Folks,

 The W3C recommends [1] text sent out over the Internet be in Normalized
 Form C (NFC):

 This document therefore chooses NFC as the
 base for Web-related early normalization.

 So why would one ever generate text in decomposed form (NFD)?

 Do any programming languages output text in NFD? Does Java? Python? C#?
 Perl? JavaScript?

 Do any tools produce text in NFD?

 Should I assume that any text my applications receive will always be
 normalized to NFC form?

 Is NFD dead?

 /Roger

 [1] http://www.w3.org/TR/charmod-norm/#sec-ChoiceNFC





Re: Normalization rate on the Web

2013-01-21 Thread Andrew Cunningham
Hi Denis,

A few thoughts ... library data may be NFC or NFD, but is more likely to
conform to the MARC character repertoire, so isn't exactly NFD.

Vietnamese data is either 1) NFC or 2) neither NFC nor NFD

It would be rare to find Vietnamese data in NFD.

For a range of African languages, mainly ones using diacritics and diacritic
stacking, it may be 1) NFC, 2) NFD or 3) neither NFC nor NFD depending on
the input framework used (a quick way to check below).
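
To answer the "how would one find out" question below for a given string,
a minimal Python sketch (unicodedata.is_normalized needs Python 3.8+):

    import unicodedata

    def forms(s):
        # which normalization forms the string is already in
        return [f for f in ("NFC", "NFD") if unicodedata.is_normalized(f, s)]

    forms("\u1ebf")         # ['NFC']  precomposed ế
    forms("e\u0302\u0301")  # ['NFD']  fully decomposed
    forms("\u00ea\u0301")   # []       neither, as some input tools emit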
On Jan 22, 2013 3:26 AM, Denis Jacquerye moy...@gmail.com wrote:

 Does anybody have any idea of how much of the Web is normalized in NFC
 or NFD? Or how much not normalized?

 How would one find out or try to make a smart guess?

 I know a lot of library catalogue data is in NFD or somewhat
 decomposed. Is there any other field that heavily uses decomposition?

 --
 Denis Moyogo Jacquerye
 African Network for Localisation http://www.africanlocalisation.net/
 Nkótá ya Kongó míbalé --- http://info-langues-congo.1sd.org/
 DejaVu fonts --- http://www.dejavu-fonts.org/





Re: cp1252 decoder implementation

2012-11-20 Thread Andrew Cunningham
Hi

On 21 November 2012 16:42, Philippe Verdy verd...@wanadoo.fr wrote:

 But may be we could ask to Microsoft to map officially C1 controls on the
 remaining holes of windows-1252, to help improve the interoperability in
 HTML5 with a predictable and stable behavior across HTML5 applications. In
 that case the W3C needs not doing anything else and there's no need to
 update the IANA registry.


Not sure what the purpose of or need for this would be. It seems to be a
vision of an ideal world that does not exist.

If such remapping occurred then some legacy content would be potentially
broken.
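
For context, windows-1252 leaves exactly five byte values in the C1 range
unassigned, and the remapping being discussed amounts to decoding those
holes as the C1 controls with the same values. A minimal TypeScript sketch
of that mapping (for illustration only; the names are mine):

  // The five windows-1252 byte values with no assigned character.
  const WINDOWS_1252_HOLES = [0x81, 0x8d, 0x8f, 0x90, 0x9d];

  // Decode an otherwise-undefined windows-1252 byte as the C1 control
  // with the same value, e.g. 0x81 -> U+0081.
  function decodeHole(byte: number): string {
    if (!WINDOWS_1252_HOLES.includes(byte)) {
      throw new Error("byte is assigned in windows-1252");
    }
    return String.fromCharCode(byte);
  }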

Many languages, and many character encodings, did not go through a formal
standardization or registration process. They are thus not officially
supported, and most of the time worked by 1) declaring themselves as
iso-8859-1 or windows-1252 and 2) specifying specific fonts.

Web browsers support a limited number of character encodings, and
redefining and changing how key character encodings work will have
implications for legacy data and for languages currently unsupported by
Unicode or languages with limited practical support from vendors.

OK, not many, but there are a few still out there, and I still do come
across content being created in legacy encodings.

Andrew

Project Manager, Research and Development
Social and Digital Inclusion Unit
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunning...@slv.vic.gov.au
  lang.supp...@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/


Re: [indic] Re: Lack of Complex script rendering support on Android

2011-11-07 Thread Andrew Cunningham
I'd agree with Ed, it's a broader problem than just India, and a problem not
just based on market segments. I use Android devices often, but cannot use
them as a serious tool for work because of what I would classify as serious
limitations in the OS and its internationalisation model. At the moment mobile
devices are still toys, unable to deal with most community languages I need
to work with or support in Australia, including many Latin script languages.

This is a limitation in all mobile OSes, not just Android. I just have to
look through my Facebook and Google+ accounts to see many messages in
African and South East Asian languages that will not display.

I do love the irony of Android devices not being able to display all
content available through Google services.

Andrew


Re: Multiple private agreements (was: RE: Code pages and Unicode)

2011-08-24 Thread Andrew Cunningham
So you will end up with the CSUR AND the registry Philippe is
suggesting AND all the existing uses of the PUA that will not end up in
the CSUR or the other registry.

Sounds like it will be a mess.

It's bad enough dealing with Unicode and pseudo-Unicode in the Myanmar
script; adding the PUA potentially into the mix ... ummm ...

On 25 August 2011 11:55, Philippe Verdy verd...@wanadoo.fr wrote:
 2011/8/24 Doug Ewell d...@ewellic.org:
 As Richard said, and you probably already know, there is no chance that
 UTC will ever do anything with the PUA, especially anything that gives
 the appearance of endorsing its use.  I'm just thankful they haven't
 deprecated it.

 The appearance of endorsing its use would only come if the website
 describing the registry was using a frame using the Unicode logo.

 It can act exactly like the CSUR registry, as an independent project
 (with its own membership and participation policies), that would also
 be helpful for collaborating with liaison members, ISO NB's, or some
 local cultural organizations or collaborative projects.

 The focus of this registry would only be for helping the encoding
 process: registered PUAs or PUA ranges would not survive to finalized
 proposals that were formally proposed and rejected by both the UTC and
 WG2, and abandoned as well by its initial promoters in the registry
 (no new updated proposal), or to proposals that have been finally
 released in the UCS (and there would likely be a short timeframe for
 the death of these registrations, probably not exceeding one year).

 It would be different from the CSUR, because CSUR also focuses on
 supported PUAs that will never be supported in the UCS (for example,
 due to legal reasons, such as copyright which would restrict the
 publication of any representative glyph in the UCS charts), or
 creative/artistic designs

 (For example, I'm still not convinced that Klingon qualifies for
 encoding in the UCS, because of copyright restrictions and absence of
 a formal free licence from right owners; the same would apply to any
 collection of logos, including the logos of national or international
 standard bodies that you can find on lots of manufactured products and
 in their documentation, because the usage of these logos is severely
 restricted and often implies contractual assessments by those
 displaying it on their products or publications; this would also apply
 to corporate logos, even if they are widely used, sometimes with
 permission, but this time because these logos frequently change for
 marketing reasons).






-- 
Andrew Cunningham
Senior Project Manager, Research and Development
Vicnet
State Library of Victoria
Australia

andr...@vicnet.net.au
lang.supp...@gmail.com




Re: ch ligature in a monospace font

2011-06-29 Thread Andrew Cunningham
On 30 June 2011 07:59, Richard Wordingham
richard.wording...@ntlworld.com wrote:
 On Wed, 29 Jun 2011 03:49:42 +
 Peter Constable peter...@microsoft.com wrote:

 That would appear to be a limitation of the input method.

 It is indeed a limitation of X.  I get round it on Ubuntu by using
 IBus and KMFL (Keyman for Linux), which then allows me to use dead keys
 for sequences, something which is (or used to be) beyond MSKLC.


I assume you mean KMFL (Key Manager for Linux), which uses an extremely
old version of the Keyman syntax.

From memory you should be able to get more mileage out of MSKLC than
you seem to have.

Andrew
-- 
Andrew Cunningham
Senior Project Manager, Research and Development
Vicnet
State Library of Victoria
Australia

andr...@vicnet.net.au
lang.supp...@gmail.com




Re: Using Javascript to Detect Script Support in a Browser

2010-06-21 Thread Andrew Cunningham
hi Ed,

On 22 June 2010 11:51, Ed Trager ed.tra...@gmail.com wrote:
 Thanks, Andrew!  I like Keith's approach.

 I have been looking at Lanna a little bit and I am not sure if *any*
 OS shaper currently really has fully implemented correct shaping
 support for Lanna?  In any event, Lanna is quite similar to Myanmar,
 so Keith's approach could be used very successfully.

since there is no specific guidance for developing Lanna or Myanmar
OpenType fonts, I assume that Lanna fonts have been developed using some of
the more generic OpenType features, much like Myanmar, and thus should
shape correctly on the same platforms as Myanmar Unicode fonts do.

I guess the key issue is who your target audience is and what the
oldest OS versions likely to be used on your site are.

Out of curiosity, which OpenType fonts are you using for Lanna?

When experimenting with Myanmar web fonts, I made one big mistake: I
relied on some of the web-based tools for generating web fonts, which
broke the complex rendering. It is best to generate the web fonts with
the available command line tools.

 It might be interesting to see if Keith's approach can be generalized
 a bit to detect whether correct rendering is available for a number of
 those related S and SE Asian scripts: Myanmar, Lanna, Khmer, Kannada,
 etc.

should be possible.


-- 
Andrew Cunningham
Senior Project Manager, Research and Development
Vicnet
State Library of Victoria
Australia

andr...@vicnet.net.au
lang.supp...@gmail.com




Re: Using Javascript to Detect Script Support in a Browser

2010-06-18 Thread Andrew Cunningham
It is an issue that we've struggled with for a while.

EOT, TTF font linking, WOFF and SVG fonts all play a part in a
possible solution.

For my projects I also have to consider whether clients are likely to be
using older operating systems, and thus may not have rendering
support.

So detecting whether appropriate fonts are available isn't enough on its
own.

Keith Stribley used a similar approach, see
http://www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/WebDevelopers/#detect

For Myanmar he compared U+1000 U+1000 to U+1000 U+1039 U+1000,
which not only allowed him to see if an appropriate font was
available, but whether appropriate rendering was occurring.
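
A rough TypeScript sketch of that width-comparison test (assuming a browser
DOM; Keith's actual implementation may differ): if the shaping engine
stacks the U+1039 conjunct, the second string comes out narrower than two
side-by-side letters.

  function myanmarShapingWorks(): boolean {
    const measure = (text: string): number => {
      const span = document.createElement("span");
      span.style.visibility = "hidden"; // keeps layout, stays invisible
      span.style.whiteSpace = "nowrap";
      span.textContent = text;
      document.body.appendChild(span);
      const width = span.getBoundingClientRect().width;
      document.body.removeChild(span);
      return width;
    };
    // KA KA side by side vs. KA + virama + KA, which should stack.
    return measure("\u1000\u1039\u1000") < measure("\u1000\u1000");
  }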

On 17 June 2010 07:29, Ed Trager ed.tra...@gmail.com wrote:
 On Tue, Jun 15, 2010 at 5:52 PM, Marc Durdin

 Couldn't you do this just using font fallback in CSS, and just leave it to 
 the user agent to sort out?  Two examples:

  P { font-family: Code2000, MyCode2000; }
  @font-face { font-family: MyCode2000; src: url('code2000.ttf'); }

 Or:

  P { font-family: MyCode2000; }
  @font-face { font-family: MyCode2000; src: local(Code2000),
  url('code2000.ttf'); }



for the browsers that can handle src local() syntax.

 I cannot conclusively say at this point whether my planned dynamic
 solution is better than a more static let the UA figure it out
 approach, but I'm going to try it and see how it goes.


both approaches have their benefits; it really depends on what you are
trying to achieve. But I suspect that the static solution is more
scalable.





-- 
Andrew Cunningham
Senior Project Manager, Research and Development
Vicnet
State Library of Victoria
Australia

andr...@vicnet.net.au
lang.supp...@gmail.com




Re: OpenType vs TrueType (was current version of unicode-font)

2004-12-04 Thread Andrew Cunningham
Ok,
slight variation on the questions to date.
which OpenType fonts (other than Doulos SIL and Code2000) support the
placement of combining diacritics?

Andrew

Andrew Cunningham
andj_c at iprimus.com.au
andrewc at vicnet.net.au



Re: Combining diacriticals and Cyrillic

2003-07-10 Thread Andrew Cunningham
Hi Vladimir

yes, in theory your answer is Unicode, i.e. Cyrillic plus combining
diacritics.

Although the actual application of the theory will differ from operating 
system to operating system.

I did a quick test on Windows in both word processors and web browsers.
Everything displayed correctly (given certain combinations of fonts and 
applications).

There are two elements that need to be addressed:

1) appropriate fonts. I only know of two that are suitable: Code2000 (v.
1.13) has the appropriate OpenType tables (I believe it uses the
OpenType MarkToBase feature - others on the list will correct me if my
memory is faulty). The second font is Doulos SIL (v 0.6 - Beta). This
font has both OpenType tables and Graphite tables. Graphite is a
rendering system developed by SIL International.

2) You need a rendering system that supports the features. On Windows,
this means that you will need a version of Uniscribe that supports the
use of combining diacritics with Cyrillic characters. Currently none are
available, except for the version in the MS Office 2003 Beta. I did a
quick test using the two fonts above, and the characters displayed
correctly. So from the point of view of word processing, there is a
solution coming. This approach will also work with other applications
that support Uniscribe, although you might have to wait until Microsoft
release a service pack that contains the Uniscribe update.

I assume that Microsoft will update one or more fonts with the necessary 
features when they release Office 2003.

I also tested the text in some Graphite-enabled software (WorldPad
and a Graphite-enabled version of Mozilla). It seemed to work fine as well.



[EMAIL PROTECTED] wrote:

Dear Ladys and Gentlemen,

Currently there is an ongoing effort in Bulgaria trying to resolve an issue concerning the way we write in Bulgarian.

Our problem is: 

Usually a regular Bulgarian user does not need to write accented characters. There is one middle-sized exception to this, but generally we do fine without accented characters. The problem is that in some special cases or more serious linguistic work, one definitely needs to be able to write accented characters (accented vowels).

One of the ideas is to invent a new ASCII-based encoding, containing the accented characters we need. This would introduce additional disorder into the current mess of Cyrillic encodings, and would introduce problems with automated spellcheck.

Generally I believe it would be best to invent a Unicode-based solution.

Such a solution is, for example, combining diacritical signs with the Cyrillic symbols.

I composed a demo page: 
http://v.bulport.com/bugs/opera/426/balhaah_lonex_org/ 

and then made 10-20 shots of the results on Opera and IE on Linux, Windows 98 and Windows XP: 
http://v.bulport.com/bugs/opera/426/balhaah_lonex_org/shots.html 

You can see that this approach yields _quite_ inconsistent and useless results, depending on the font, application and operating system being used.

Finally, I wonder if you could give us some advice: 

1. 
Is it possible somehow to improve this approach? I imagine, e.g., the font could provide prepared combined symbols whenever the application asks for a combined Cyrillic+diacritical, instead of leaving the application to do the combination.

2. 
Do you see another Unicode-based approach to the Bulgarian problem?

3. 
Do you believe the approach should be looked for outside Unicode?

Please excuse me for wasting your time, 
Vladimir, 
Bulgaria




--
Andrew Cunningham
Multilingual Technical Officer
Online Projects Team, Vicnet
State Library of Victoria
328 Swanston Street
Melbourne  VIC  3000
Australia
[EMAIL PROTECTED]

Ph. +61-3-8664-7430
Fax: +61-3-9639-2175
http://www.openroad.net.au/
http://www.libraries.vic.gov.au/
http://www.vicnet.net.au/



Re: 4701

2003-02-03 Thread Andrew Cunningham
From memory, although my memory may be faulty, there are some slight
differences between the animals assigned in the Chinese calendar and
the animals assigned in the Vietnamese calendar.

In the Vietnamese sequence it is goat, while most Chinese sources
indicate sheep (occasionally they say ram, but sheep is most common).

At least that's what I seem to remember. But then there have been so many
firecrackers going off over the three days of Tet that something might
have rattled loose in my memory.

Andrew

Michael Everson wrote:
At 10:19 -0800 2003-02-01, Eric Muller wrote:


Michael Everson wrote:


Happy New Year of the Yáng to everybody! (I can't work out whether 
it's the Year of the Sheep, the Goat, or the Ram.)


Ram.



europe.cnn.com (which I was looking at for other, sadder reasons), says 
Goat. My local Superquinn's (large grocery chain) has had signs on all 
the Chinese food for weeks which says Ram. My Chinese dictionary says 
Sheep.







Re: glyph selection for Unicode in browsers

2002-09-26 Thread Andrew Cunningham

Hi

Tex Texin wrote:
 
 In the case of HTML, XML, CSS, ways to specify typographic preferences
 exist, and language can be expressed via lang. We just need browsers
 and other user agents to make use of the lang information as part of
 font selection.

For me, this is the crux: that browsers have not implemented the CSS
:lang selector.

Things would be easier if we could tie presentation (via CSS) to the
specified language of a document or part of a document.
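
To illustrate one stopgap, a minimal TypeScript sketch (assuming a browser
DOM; the font names are placeholders, not recommendations) that
approximates the :lang() selector by keying fonts off the lang attribute:

  // Apply a per-language font to every element carrying a lang attribute,
  // approximating CSS :lang() where the browser does not support it.
  const fontsByLang: Record<string, string> = {
    vi: "SomeVietnameseFont", // placeholder names, for illustration only
    my: "SomeMyanmarFont",
  };
  document.querySelectorAll<HTMLElement>("[lang]").forEach((el) => {
    const font = fontsByLang[el.lang];
    if (font) el.style.fontFamily = font;
  });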

Andrew

-- 
Andrew Cunningham
Multilingual  Technical Officer
OPT, Vicnet
State Library of Victoria
Australia

[EMAIL PROTECTED]

Ph: +61-3-8664-7001
Fax: +61-3-9639-2175

http://home.vicnet.net.au/~andrewc/
http://www.openroad.net.au/





Re: Can browsers show text? I don't think so

2002-07-06 Thread Andrew Cunningham

I was intending to avoid this whole thread. But considering some of the 
comments that have been made in the thread, I'm forced to make one comment:

I find the naivety displayed on this list, relating to issues about
multilingual PUBLIC internet access, disturbing.

Andrew


Andrew Cunningham
[EMAIL PROTECTED]





Re: Unicode Latin combining diacritics - Looking for real-world example documents

2002-04-01 Thread Andrew Cunningham

Hi Chris,

I'm currently typesetting some Dinka poetry for a friend.

Dinka requires a combining diaeresis with open-o and open-e; info at

http://www.openroad.net.au/languages/african/dinka-2.html

a sample utf-8 web page is at
http://www.openroad.net.au/languages/samples/dinka.html

this poem plus another one are attached as text documents (utf-8)

Additionally, for linguistic purposes you could also add tone markers
(grave and acute), but these aren't used in day-to-day writing.

I'll try to source some Nuer text which uses a macron below.

 Chris Pratley wrote:
 
 Does anyone have real-world documents in Unicode that take advantage
 of Latin Combining Diacritics (U+0300 range and others) to accurately
  represent the text content? If so, I would appreciate links or docs
 mailed to me.
 
 
 
 We’re doing some testing of Latin Diacritic support for IPA and
 African languages, romanizations, etc., and it is (understandably)
 very hard to find any “real” text in languages that require this
 support where the diacritics have not been left out in order to work
 around the lack of software support. (Catch-22!). I’m looking for
 text (especially with stacked diacritics) in IPA, Hausa, Ewe, or other
 West African languages, Mixtec or other Mexican languages, Dinka,
 Nuer, etc. Basically anything that is real-world and shows off typical
 or tricky diacritic combinations. If you could include an image or at
 least a verbal description to show what the display would be if it
 were correct, that would be lovely.
 
 
 
 I’m not promising anything, but I know that there are several (many)
 people on this list who would be interested in having this support in
 Word or other Microsoft products, so now’s your chance to influence
  the outcome – if we’re going to get it done right I need your
 help!
 
 
 
 Thanks in advance,
 
 Chris Pratley
 
 Group Program Manager
 
 Microsoft Word
 
 
 
 Sent with OfficeXP on WindowsXP
 


Yeŋa ba wɛ̈l ɣakɔ̈u
M. A. M.

Yeŋa ba wɛ̈l ɣakɔ̈u
Yeŋa ba wɛ̈l ɣakɔ̈u
Yeŋa bï kɔ̈ɔ̈c ɣakɔ̈u
Yeŋa ba wɛ̈l ɣakɔ̈u.

Na tiëŋ cam ku cuëc,
Ka cïn raan kääc ke ɣɛɛn.
Na liɛɛc ɣakɔ̈u ku ɣanhom,
Ka cïn raan, kääc ke ɣɛn.
Kuatdiɛ̈ adaai të nɛ̈k alei ɣa thïn
Apirika acä wɛ̈l yekɔ̈u.

Yeŋö cï wuɔ̈ɔ̈t ɣa maan
Yeŋö cï wuɔ̈ɔ̈t ɣa maan
Kɔc wäär bïï yanhden
Aa kääc roor të mec
Ku alei anɛ̈k ɣa yanhden.
Yeŋö ye wuɔ̈ɔ̈t yethok mat
Bïk ɣa cɔl adhur kuat ce thok mat.

Yeŋö ye alei lɔ̈ɔ̈r yupic
Yeŋo yen lɔ̈ɔ̈r guɔ̈t köök
Yeŋo ye alei ɣa guɔ̈t pïny wakɔ̈u.
Acï weŋ peec ku peec thɔ̈k
Acï Deŋ peec ku peec Nyankiir
Adhur ɣɛn ke wämäth akën
Ku kɔc ken ye kek yanh tök theek
Ku acïn raan kony ɣɛɛn.

Yeŋa ba wɛ̈l ɣakɔ̈u
Yeŋa ba wɛ̈l ɣakɔ̈u
Apirika, acä wɛ̈l yekɔ̈u
Acä päl alei.
Bï alei ɣa näk bɛɛŋ
Ku ke yic yen anɛ̈k ɣɛɛn
Yiny wïc ɣɛn piɛnydiɛ̈.

Aŋic pinynhom lɔc cï Muɔnyjäŋ thuɔu gam
Ku thou atɛɛr ke pïïr
Na bɔ̈ piɛnydiɛ̈ bei
Ke Nhialic abä bɛn cuɔ̈ɔ̈t
Köŋ cuëëc ë Deŋ Abuk
Aba bɛn cuɔ̈ɔ̈t
Ku abä wɛ̈ɛ̈r bei.
Yic acie tiaam
Yic acie thou
Cɔk run bɛ̈ɛ̈n Loŋär bɛ̈n
Ke Nhialic abä bɛn cuɔ̈ɔ̈t.

Acïn të liu yuɔmkiɛ̈ thïn
Yïn kiir ɣer ku kiir col
Cäk ɣa päl alei
Cäk ɣa päl alei
Yeŋa cït ɣɛɛn
Yeŋa cï yɔŋ yaaŋ yic buɔɔt
Ɣɛn acï cuäny ɣöt ke miëthkiɛ̈ 
Arak thiäär.
Ɣɛn acï nɔ̈k bï ɣa luɔ̈i akuut
Ku acïn raan cï alei thiëëc.

Kɔc ë pinynhom kɔc tɛ̈k yiith
Cäk la dë
Cï alei week dɔ̈m määth
Na week kɔc ye ɣok, ke yanh tök theek
Cï alei week ɣɔɔc
Na cäk jai Jesu,
Ke ɣok aabï rɔ̈m pan nhial ë Kristo.

Wun ë Tiɛɛl acie Mac ku Pan Cïnic Bɛ̈ny ee riääk aköl
Muɔrwël Ater Muɔrwël

Jɔlku muɔ̈l teer
Ku yeku röt nhiaar
Acïn raan ben Muɔnyjäŋ nhiaar
Ee Muɔnyjäŋ yen acï ya anyaar
Yen ajɔl wuɔ̈ɔ̈t thäär.

Ariɔ̈ny kiith wäär cï thiaan
Kek aa jam ka bï Muɔnyjäŋ tiaam
Ku këden acä ye bɛ̈ɛ̈r
Ɣok aa kɔc thiääk
Aköl le ɣɛn rɔ̈m ke keek
Aabï dhiau arak thiäär
Rin ɣɛn ee moc arak thiäär.

Ee raan dhɛ̈l ɣa yen acä ye ŋuään
Ku yen aye cuɔ̈p teer
Ku na cɔk kë ɣa keer
Ke ɣɛn acï kɛt keek.
Thɔndït ee nöök të piiny,
Ɣɛn ee Muɔnyjäŋ.
Ɣɛn ee Muɔnyjäŋ.

Jɔlku muɔ̈l teer
Ku yeku röt theek
Rin puɔ̈n cïnic teer
Yen abï Nhialic ɣok thiee
Ku yen abï Nhialic ɣok röt kueeŋ
Ku yen abï ɣok röt deer.

Na cuk röt ë theek
Ka alei ee ɣo wïc yiic
Ë raan cuai bï keeth
Ku rum käkua bï pïïr ë keek
Ku benku ya dhiau ɣok aacï alei peec
Ku ke tiɛldan cï ɣok ë mat
Yen acï alei ɣok ë theek.

Duɔ̈kkë ye mïth ë röt
Acï kɔc ë leec
Ku ee yï mac theer
Ku acïn Muɔnyjäŋ ye mac theer
Diët Muɔnyjäŋ acie baai keer
Aa mïïth kek acï ɣok yiëk teer
Ku yen acï alei ɣok ë theek
Ku rum käkua bï pïïr ë keek
Ku benku ya dhiau ɣok aacï alei peec
Ku ke tiɛldan cï ɣok ë mat
Yen acï alei ɣok ë theek.

Jɔlku muɔ̈l teer
Alei acï ɣok leel
Buk yanhde ya theek
Ku cuk yäthkua ye theek
Ku na cɔkku yanhden theek
Ka ŋuɔt cïk gam buk nhïïm thöŋ ke keek
Rin alei, ku cɔk yiëk moor
Ka cï yï kɔŋ leec.

Matku ɣo yiic
Matku ɣo yiic
Ku yeku röt deet
Ëtë bï ɣok piir thïn ke keek
Ee 

Re: How to make oo with combining breve/macron over pair?

2002-03-03 Thread Andrew Cunningham

Hi Dan,

At 08:39 PM 3/3/02 -0800, Dan Wood wrote:
 
Hi,

I'm not finding hints of this in any of the FAQ or "where's my character"
docs. I'm trying to create (or find) the oo pair with a combining
macron (0304) and combining breve (0306) over the pair of them together, as
in these images:

http://www.bartleby.com/images/pronunciation/oomacr.gif



is it a combining macron you need, or a combining overline?

from my understanding the overline is supposed to connect on the left and
right, and I'd assume the macron isn't supposed to.

so maybe U+006F U+0305 U+006F U+0305 would suit.


either that or we need two additional combining double diacritics added to
Unicode.
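
For what it's worth, the suggested sequence is trivial to construct (a
TypeScript sketch; whether the overlines actually connect depends on the
font):

  // "oo" with U+0305 COMBINING OVERLINE after each base letter.
  const ooWithOverline = "o\u0305o\u0305";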


Andrew




Re: Unicode Search Engines

2002-02-19 Thread Andrew Cunningham

At 08:13 AM 2/19/02 -0800, Doug Ewell wrote:
Asmus Freytag [EMAIL PROTECTED] wrote:

 So if some language turns out to need
 a with horn in the future, its readers will have to cross its fingers
 that rendering engines become capable of displaying U+0061 U+031B
 properly.

 Support for such arbitrary combination is apparently in the works in
several
 camps - it's needed in African languages for one.

And judging from Marco's unrelated post about Yoruba q-tilde, in which I
*did* see the tilde positioned correctly (more or less) over the q, I
guess support is more advanced than I thought.  Terrific.


Ummm ... it may work for lower case, if you're not fussy about the precise
location of the diacritic. I suspect that the diacritic would overstrike
the uppercase character, though.




===
Andrew Cunningham
Multilingual Technical Officer
Accessibility and Evaluation Unit, VICNET
State Library of Victoria
Australia

http://www.openroad.net.au/

[EMAIL PROTECTED]

+61-3-8664-7001
===




Unicode-Afrique forum

2002-02-03 Thread Andrew Cunningham



Hi everyone, 

thought I'd pass on the info below.

A French language forum discussing the potential of
Unicode for African languages has been launched. Details below.

Andrew

==

Unicode-Afrique
http://groups.yahoo.com/group/Unicode-Afrique/

Unicode probably represents the best chance of promoting computing and
Internet content in African languages. The current plurality of fonts and
of mutually incompatible encoding systems for special or non-Latin
characters prevents true multilingualism in ICT in Africa (and the world).

This e-group exists to: publicize projects in Africa using Unicode;
discuss practical questions and problems with Unicode and character sets
for African languages; and share useful experiences on developing and
using Unicode fonts for African languages. It is therefore not in
competition with the Unicode newsgroup "fr.comp.normes.unicode", nor with
the general discussion lists on ICT in Africa such as
"afrique-informatique".


Re: Problems with viewing Hindi Unicode Page

2002-01-25 Thread Andrew Cunningham


- Original Message -
From: [EMAIL PROTECTED]
 The version of Arial Unicode MS on my system does have layout tables for
 Devanagari. I don't know with what product this version was introduced to
 my system -- I've got Win2K, IE5.5 and Office XP.


I guess the question becomes: which version of Arial Unicode MS?

I suspect that the version of Arial Unicode MS you have must be from Office
XP.


Andj





Re: Inuktitut, Cree, Ojibwe input methods?

2001-10-29 Thread Andrew Cunningham

There is also

CreeKeyUni which uses Tavultesoft Keyman 5

available at

http://www.creeculture.ca/e/language/fonts_kbds.html


Andrew Cunningham



At 03:09 PM 10/29/01 -0800, John Hudson wrote:
At 10:43 10/29/2001, Mark Leisher wrote:

Does anyone have any pointers to keyboard layouts/input methods for these
(or
related) languages?

There is an official Inuktitut keyboard developed for the government, 
language commission and land rights organisations in Nunavut. A driver has 
been made for Windows NT/2000/XP, and my understanding is that Microsoft 
are reviewing this for possible inclusion in the OS.

This keyboard driver is downloadable from
http://www.assembly.nu.ca/test/unicode/, which also has some fonts and
utilities for converting documents using older non-standard encodings to
Unicode.

John Hudson

Tiro Typeworks www.tiro.com
Vancouver, BC  [EMAIL PROTECTED]

Afghan warlord kills own troops, sells drugs,
plays with dead goats - and he's on our side.
National Post headline
Friday, October 19, 2001








Re: Inuktitut, Cree, Ojibwe input methods?

2001-10-29 Thread Andrew Cunningham

Hi Peter and everyone,

I'd be interested in seeing the Keyman file you generated for Eastern Cree.

Most of the keyboards I've seen have been designed for specific languages.
Has anyone come across a single keyboard layout intended to support all of
UCAS? A friend at the National Library of Canada was interested in a single
keyboard layout that their staff and the public could use to access
Unicode-based catalogues and databases. On public workstations it would be
easier to support a single layout, rather than different layouts for
different languages that use Syllabics.

Andj 

Andrew Cunningham
Multilingual Technical Officer
Accessibility and Evaluation Unit, Vicnet 
State Library of Victoria 
Australia









At 10:21 PM 10/29/01 -0600, [EMAIL PROTECTED] wrote: 



On 10/29/2001 04:13:39 PM James Kass wrote:

And, here is a page which illustrates different layouts for Eastern
and Western Syllabics (and has fonts, too):
http://www.knet.on.ca/keyboard.html

I just did a quick Keyman file for one of these layouts (the Eastern
Syllabics layout -- generating Unicode, not the custom encoding of their
fonts), though there are a few symbols in their chart where it's not clear
to me just what they want. Anyway, I'll make it available if anyone wants
to use it.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]









Re: OT Nastaleeq conforming to Unicode

2001-09-06 Thread Andrew Cunningham

Hi Abdul-Majid,

I'd be very interested in hearing more about your font development project.

Andj.



At 10:53 AM 9/6/01 -0700, Majid Bhurgri wrote: 



A few days ago I posted the following message, which was received well and I
received quite a few responses. But as I was on vacation, I only briefly
reviewed some of the messages and somehow, in the meanwhile, all the
messages got deleted before I could respond or even save them.

I apologize for the inconvenience, and request you to kindly resend your
messages to me so that I can respond to you individually.

Thanks & regards.

Abdul-Majid Bhurgri

I have developed a prototype Nastaleeq (Urdu) font of the same quality
as the currently available Nastaleeq fonts used for typesetting, which
also conforms to the Unicode Standards and OpenType specs and as such
works smoothly in MSWindows and multilingual Windows applications (MS
Word, Excel, Access etc.)

Completion of the project needs time and resources. Anyone interested
may contact me at
[EMAIL PROTECTED]














Re: Latin w/ diacritics (was Re: benefits of unicode)

2001-04-16 Thread Andrew Cunningham

Quoting John Hudson [EMAIL PROTECTED]:

 
 Although there has not been any official announcement from Microsoft,
 and 
 no release date, my understanding is that 'generic' shaping is being
 added 
 to Uniscribe. This includes support for diacritic composition using 
 OpenType mark-to-base and mark-to-mark positioning lookups. The font 
 support is already in place (see the OpenType specification v1.3,
 published 
 last week, at http://www.microsoft.com/typography ), and the system
 support 
 is on the way.
 

This is good news, whenever it does finally eventuate.

I'll look at the new spec.


Andrew Cunningham
Multilingual Technical Project Officer
Vicnet, State Library of Victoria
[EMAIL PROTECTED]




Re: Latin w/ diacritics (was Re: benefits of unicode)

2001-04-16 Thread Andrew Cunningham

Quoting James Kass [EMAIL PROTECTED]:

 
 Waiting isn't much of an option, the users need results now.
  Even when the rendering technology catches up, the old 386's
 and such that are in use in places like the Sudan may not be able
 to support an OS capable of using new rendering technology.
 



 Similar circumstances may apply to many of the hundreds or
 thousands of 'Unicode-challenged' writing systems mentioned
 by Peter Constable.
 

actually not Unicode-challenged, since Unicode has a mechanism to support
them; more OS- and software-challenged.

   
 
 Andrew also mentioned custom (8-bit) code pages, which are widely
 used.  Lately, people who haven't considered the lack of alternatives
 have taken to criticizing such practicality, calling it "font-hacks"
 and

actually I don't think they're widely used. But I'd rather not get into
Sudanese politics at the moment.



 so forth. If you do make custom code page web sites, perhaps you
 should consider maintaining duplicate web pages in Unicode.  Even
 though the Unicode pages wouldn't display, they would be handy to
 send as links in response to anyone complaining about non-standard
 code pages.
 

our initial intention was to use a Unicode solution, but we have also
investigated a custom 8-bit code page.

One of the areas that has interested me for a while is the area of language
retention among refugee communities.

My Dinka friends are hoping to develop a trilingual web site (English,
Arabic, Dinka) that would provide information about their culture and
provide resources that can be used to teach their children their own
language. This could be done in print; the reason that they wish to place
the resources online is to provide these resources to other Dinka refugees
who have settled in other countries.

 
 Whether the PUA or custom code pages are used, some kind of
 software which converts to and from Unicode would be
 helpful to assure that users of older hardware can continue
 to communicate with the "modern" world.
 

philosophically I'd prefer not to use the PUA.

It's quite possible that we'll use an 8-bit character set initially, and that
I'll construct Unicode versions for private testing and evaluation.

Since I'm not a programmer, I'm not able to throw together such a utility.
I've seen a number of utilities that allow you to convert between Unicode
and a range of defined character sets and encodings, but I haven't found a
utility that does this and that would also allow you to easily construct
custom mapping tables to use with it.

Andj.


Andrew Cunningham
Multilingual Technical Project Officer
Vicnet, State Library of Victoria
[EMAIL PROTECTED]




Re: benefits of unicode

2001-04-15 Thread Andrew Cunningham

Quoting "Michael (michka) Kaplan" [EMAIL PROTECTED]:

 From: "Andrew Cunningham" [EMAIL PROTECTED]


 
 Well, I guess this is one of those huge "maybe" type questions, since
 there
 is no universal definition of what "supports Unicode x.xx" means. Here
 are
 some sample posers:
 

LOL

yep, I understand and agree ... I suppose that, working predominantly with
community languages in Australia, I tend to get asked more often for those
scripts in Unicode 3.0 that Microsoft don't support yet in any way, shape
or form.

*shrugs*

'tis the weave. One of the inherent problems with working with multilingual
community information. Life would be easier if I was working on the business
side rather than the community side of the field.

 
  and if only they did allow Latin script support in Uniscribe ... but I
  guess support for African languages is extremely low on their list of
  priorities.
 
 I would not ever presume such a thing... what issues in latin scripts
 are
 you referring to? I am not sure Uniscribe is where such a fix would be
 (all
 the issues I know of would involve keyboards and potentially fonts).
 


Let's see ... one problem I'm having at the moment is how to support Dinka
(Southern Sudan) in Unicode on web pages displaying on Windows
95/98/ME/NT4/2000.

Four characters come to mind; each of the four characters can be represented
ideally by a pair of code points ...

U+0254 U+0308
LATIN SMALL LETTER OPEN O  +  COMBINING DIAERESIS

U+0186 U+0308
LATIN CAPITAL LETTER OPEN O  +  COMBINING DIAERESIS

U+025B U+0308
LATIN SMALL LETTER OPEN E  +  COMBINING DIAERESIS

U+0190 U+0308
LATIN CAPITAL LETTER OPEN E  +  COMBINING DIAERESIS
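
Written out as escaped strings, those four sequences look like this (a
TypeScript sketch; the names are mine, and whether the pairs display
correctly is exactly the rendering problem discussed below):

  const dinkaBreathyVowels = {
    openO:    "\u0254\u0308", // ɔ + combining diaeresis
    openOCap: "\u0186\u0308", // Ɔ + combining diaeresis
    openE:    "\u025B\u0308", // ɛ + combining diaeresis
    openECap: "\u0190\u0308", // Ɛ + combining diaeresis
  };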

Also, there is a convention for indicating tone that is not part of the
formal orthography of the language, but is useful in materials designed for
students learning the language. A set of combining diacritics is used to
indicate tone: rising tone indicated by an acute, and falling tone indicated
by a grave.

So U+0254, U+0186, U+025B and U+0190 would have to combine with an acute
and a grave.

All breathy vowels (indicated by a diaeresis) would also have to combine
with a grave or acute, so you'd have a base vowel (a, e, open e, i, o,
open o or u), each with two combining diacritics: the first a diaeresis and
the second an acute or a grave.

Theoretically I know what Unicode characters would be in the data stream; I
could use Keyman, for instance, to input the appropriate characters/vowels
and the combining diacritics. The problem comes with display.

I can cheat ... and create glyphs in the PUA for all the necessary
characters. That would mean that instead of entering U+0254 U+0308, I'd
have the input software input a single code point in the PUA ... a rather
daft approach for future compatibility, since an appropriate code point
sequence already exists (U+0254 U+0308).

In theory this could be handled using glyph substitution: it's possible to
create an OpenType font that uses glyph substitution to render the required
glyphs.

But this is where the problems start. From my understanding, Adobe's
InDesign supports some OpenType font features for the Latin script, but
Microsoft's Uniscribe does not support the Latin script.

Since my knowledge of font rendering technology is rather limited, are you
aware of another way I can render these characters in IE5+ on various
Windows platforms?

I suppose if I restricted myself to fixed width fonts I could create
combining diacritics that would be correctly spaced ... but since I really
need proportional fonts ... I'm not sure how to proceed.

Currently we're using custom character sets (8-bit) that were explicitly
made for the Dinka language.

This problem isn't unique to Dinka; you'll find it exists in other African
and some Australian Aboriginal languages. So the question is ... how should
one handle languages that use combinations of Latin letters and diacritics
and where a precomposed form does not exist?

Andj.



Andrew Cunningham
Multilingual Technical Project Officer
Vicnet, State Library of Victoria
[EMAIL PROTECTED]




Re: benefits of unicode

2001-04-15 Thread Andrew Cunningham

Hi James,

Quoting James Kass [EMAIL PROTECTED]:

 
 Many African adaptations of the Latin script require
 characters which aren't precomposed in Unicode.  
 

yep, you can add a number of Australian Aboriginal languages to that list as
well

 One example of a common problem is with combining 
 diacritics designed for lower case letters.  When the 
 diacritic is used with a capital letter, it appears at the 
 default height and is superimposed on the capital letter 
 rather than appearing above it.  
 
 The new OpenType font specifications enable such 
 combinations to be displayed correctly.
 
 Uniscribe is the mechanism that accesses OpenType 
 features on Microsoft OS.  The version of Uniscribe 
 currently available doesn't yet have support for Latin 
 OpenType features enabled.  I seem to recall having 
 read recently that Latin features would soon be 
 enabled in Uniscribe, perhaps as early as this summer.
 

I hope so; the last comment I remember reading on the VOLT mailing list
seemed to indicate they weren't overly interested in supporting Latin. Hope
they do. And depending on how they do it, it might make Unicode for African
languages possible.

Andj.

Andrew Cunningham
Multilingual Technical Project Officer
Vicnet, State Library of Victoria
[EMAIL PROTECTED]




Re: benefits of unicode

2001-04-13 Thread Andrew Cunningham


- Original Message -
From: Michael (michka) Kaplan [EMAIL PROTECTED]

 It DOES, however, underscore the fact that Unicode support is so much
easier
 than supporting every random code page that the only reasonable way
vendors
 can keep up with every single market is to have a good story for Unicode
 support.


true, personally I'd rather see Microsoft complete their Unicode support
first before doing anything with other character sets ... we're quite a few
years off full support for Unicode 3.0 and 3.1.

And if only they did allow Latin script support in Uniscribe ... but I
guess support for African languages is extremely low on their list of
priorities.

Andj.







[OT] Re: relation between unicode and font

2001-01-05 Thread Andrew Cunningham

Hi everyone,

actually there is a bug in the browsers, or at least in Internet Explorer.
It's been there in versions 4, 5 and 5.5.

Yes, a lot of 8-bit fonts exist. Many of these 8-bit fonts follow Microsoft's
codepages rather than the iso-8859 series, in that they place characters in
the C1 zone.

For instance, if I was creating a Vietnamese page in VISCII encoding, I'd
associate the VISCII fonts with the user defined encoding in the web
browsers. This works fine in Netscape, but doesn't work in Internet
Explorer.

For some reason known only to Microsoft, since version 4 of their browser
... the User Defined slot carries out a similar conversion to the Western
(Windows) encoding ... the characters in the C1 zone are remapped based on
Win-1252 to the appropriate values in Unicode. Why this mapping was ever
applied to the user defined slot, I'll never know.

If you prepare a VISCII web page containing all the lower case Vietnamese
vowels, you'll discover that some of the vowels cannot be displayed in
Internet Explorer at all, while Netscape 4.x passes these through as-is and
will display them.

Unicode is a boon these days ... it means I can create a Vietnamese web page
that can display on Netscape AND Internet Explorer ...

Any custom 8-bit encoding that has characters in the C1 zone may have the
same problem.

Working with multilingual public internet access becomes problematic ... IE
is only suitable for encodings that have inbuilt support in the browser ...
and useless for encodings like VISCII that are transformed by the browser,
making some of the characters undisplayable ...

This is one of the reasons that my industry hasn't widely accepted Internet
Explorer as a default browser. It can't handle the languages we need to use:
community languages rather than commercial languages.

And it is also one of the reasons that we try to encourage the use of Unicode.


Andj

Andrew Cunningham
Multilingual Technical Project Officer
VICNET, State Library of Victoria
Australia

[EMAIL PROTECTED]


- Original Message -
From: Yung-Fong Tang [EMAIL PROTECTED]
To: Unicode List [EMAIL PROTECTED]
Cc: Unicode List [EMAIL PROTECTED]
Sent: Saturday, January 06, 2001 6:29 AM
Subject: Re: relation between unicode and font


 Not really a browser bug. It is a bug in the FONT. Some of the fonts
 basically claim they are designed for a certain encoding in which 0x00-0x7F
 represent ASCII, while the glyphs in the font in those positions have
 non-ASCII shapes. If the font author *lies* to the browser in the
 information encoded in the font, there is nothing the browser (or browser
 developer) can do.

 [EMAIL PROTECTED] wrote:

  On Thu, 4 Jan 2001, sreekant wrote:
 
 
  font face="Tikkana"A B /font is being shown as some telugu
  characters.
 
  That's basically a browser bug, though some people have seen it
  as a method of extending character repertoire. It has absolutely
  nothing to do with Unicode. For an explanation of the fallacy, see
  http://ppewww.ph.gla.ac.uk/%7eflavell/charset/fontface-harmful.html
  http://babel.alis.com/web_ml/html/fontface.html
 






Re: Mixing languages on a Web site

2000-07-01 Thread Andrew Cunningham

Hi Mike

To use Microsoft's Global IME for Japanese on NT4, there is one very
important step you need to do ... install NT4 Japanese support ... there are
a few articles about it in the Microsoft knowledge base ... I have the URLs
at work, don't have them with me at the moment ...

on the Win NT4 CD-ROM there is a folder somewhere called langpacks ... use
Windows Explorer to look in it ... there is a file called japanese.inf ...
right mouse click on it ... a pop up menu will appear ... one of the menu
items is 'install' ... select this ... and it will install NT4's Japanese
language support ... this should be installed before the Global IME for
Japanese ... otherwise it will not work ... at least that's the story ...

ciao

Andrew

Andrew Cunningham
[EMAIL PROTECTED]




- Original Message -
From: Ayers, Mike [EMAIL PROTECTED]
To: Unicode List [EMAIL PROTECTED]
Sent: Saturday, 1 July 2000 3:49
Subject: RE: Mixing languages on a Web site



  From: Michael (michka) Kaplan [mailto:[EMAIL PROTECTED]]
  Sent: Friday, June 30, 2000 4:28 AM
 
  To prove #4 will work, see
 
  http://www.trigeminal.com/samples/provincial.html
 
  Along with 102 other languages, this page includes both Japanese and
  Turkish. UTF-8 is what makes that possible
 
  michka

 I checked it out, and with IE5 I can now view almost all of it.
 There are 5 lines that I cannot view and for which there are no fonts
 available, but otherwise great.  Netscape does not show nearly as many
 (hints?).

 On a possibly entirely unrelated subject, I downloaded Microsoft's
 IMEs for Chinese and Japanese, hoping to learn to use them.  However, I
 cannot figure out how to enable them, and can't locate any helpful info on
 Microsoft's site.  I am running NT4.  Any tips greatly appreciated.


 Thanks,

 /|/|ike