Re: Latin Script

2010-06-28 Thread Asmus Freytag

I'd like to second Mark.

There is a lot of information in the Standard, including the UAXs, and 
the Unicode Character Database that would help answer your questions.


The volunteers associated with the Unicode effort have worked hard 
putting all that information together - so use it, instead of taking up 
their time in repeating it all in personal answers to you.


A./








Re: Latin Script

2010-06-28 Thread Doug Ewell

"Tulasi"  wrote:

> Looks like Unicode did not create any name for any Latin letter/symbol
> with LATIN in its name :-')
>
> Am I correct?


Probably not, if you take into account something like U+2C70 (Ɒ) LATIN 
CAPITAL LETTER TURNED ALPHA, which was added in Unicode 5.2, or U+A78D 
(Ɥ) LATIN CAPITAL LETTER TURNED H, which is being added to Unicode 6.0.
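
Names like these can be checked locally. The following is a minimal sketch,
assuming Python's standard unicodedata module and a Python build whose
Unicode database is at version 6.0 or later (needed for U+A78D). The helper
name describe is made up for illustration. It prints each name followed by
the character itself, in the "name (symbol)" style suggested elsewhere in
this thread.

```python
import unicodedata


def describe(code_point):
    """Format a code point as 'U+XXXX NAME (glyph)'.

    describe() is a hypothetical helper for this sketch, not part of any
    Unicode tool.
    """
    char = chr(code_point)
    # unicodedata.name() raises ValueError for unnamed code points unless a
    # default is supplied, so a placeholder is passed here.
    name = unicodedata.name(char, "<no name>")
    return f"U+{code_point:04X} {name} ({char})"


if __name__ == "__main__":
    # The three code points mentioned in this thread.
    for cp in (0x0278, 0x2C70, 0xA78D):
        print(describe(cp))
```

Whether the glyph actually displays depends on the fonts installed on the
reader's system, not on the name lookup.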


I have no idea what sort of knowledge you are trying to gather with 
this.


--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s




Re: Latin Script

2010-06-28 Thread Mark Davis ☕
See the following for the (*many*) differences between characters with the
Latin script, and those with LATIN in their names.

http://unicode.org/cldr/utility/unicodeset.jsp?a=\p{script:latin}&b=\p{name:/LATIN/}
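
The same comparison can be approximated offline. The sketch below is only an
illustration under stated assumptions: Python's standard library does not
expose the Script property, so the script values are read from a few
hand-typed lines in the layout of the UCD file Scripts.txt (they are not
copied from any particular version of that file), and the function names
parse_scripts and classify are hypothetical. The name test is a plain
substring match, mirroring the /LATIN/ pattern in the URL above.

```python
import unicodedata

# Hand-typed sample lines in the "range ; Script # comment" layout used by
# Scripts.txt. ASSUMPTION: illustrative data only; a real check would read
# the full file published with the Unicode Character Database.
SAMPLE_SCRIPTS_TXT = """\
0041..005A    ; Latin   # basic capital letters
00AA          ; Latin   # FEMININE ORDINAL INDICATOR
271D          ; Common  # LATIN CROSS
"""


def parse_scripts(text):
    """Return {code point: script name} for lines in Scripts.txt layout."""
    scripts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        span, script = (field.strip() for field in line.split(";"))
        first, _, last = span.partition("..")
        # A single code point has no "..", so last is empty and first is reused.
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            scripts[cp] = script
    return scripts


def classify(cp, scripts):
    """Return (script is Latin, name contains LATIN) for one code point."""
    name = unicodedata.name(chr(cp), "")
    return scripts.get(cp) == "Latin", "LATIN" in name


if __name__ == "__main__":
    table = parse_scripts(SAMPLE_SCRIPTS_TXT)
    for cp in (0x0041, 0x00AA, 0x271D):
        print(f"U+{cp:04X}", classify(cp, table))
```

Code points where the two flags disagree are the ones the UnicodeSet
comparison lists as differences.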

I'd suggest taking a more focused approach to learning about the standard,
rather than trying relatively scattershot questions to this list. You might
read through at least the first 3 chapters of the Unicode Standard, plus the
Scripts UAX. These are all online for free at unicode.org.

Mark

— Il meglio è l’inimico del bene —




Re: Latin Script

2010-06-28 Thread Tulasi
Looks like Unicode did not create any name for any Latin letter/symbol
with LATIN in its name :-')

Am I correct?

Is there a mailing list for ISO/IEC ?

> I don't think it's necessary to post these glyphs to the public list.

Better to do like Edward Cherlin, i.e., type the symbol after the name.

e.g., LATIN SMALL LETTER PHI (ɸ)

That way an illiterate like me can quickly see the letter/symbol along
with its name, without additional research.

> The merger between Unicode and ISO 10646 caused a few character names in
> Unicode to be changed to match the 10646 names.

May I know these letters/symbols and their names, please?

Tulasi
PS: Thanks Doug, especially for posting the links


From: Doug Ewell 
Date: Sun, 27 Jun 2010 16:09:41 -0600
Subject: Re: Latin Script
To: Unicode Mailing List 
Cc: Tulasi 

"Tulasi"  wrote:

>> U+00AA FEMININE ORDINAL INDICATOR (which does not contain "LATIN") is
>> considered part of the Latin script, while U+271D LATIN CROSS (which
>> does) is considered common to all scripts.
>
> Can you post both symbols please, thanks?

I can point you to http://www.unicode.org/charts/PDF/U0080.pdf , which
includes a glyph for U+00AA, and
http://www.unicode.org/charts/PDF/U2700.pdf , which includes a glyph for
U+271D.  I don't think it's necessary to post these glyphs to the public
list.

> Trying to know who among ISO and Unicode first created the names' list
> for Latin-script is not an indication of obsession :-')
>
> So among Unicode and ISO/IEC, who first created ISO/IEC 8859-1 &
> ISO/IEC 8859-2 letters/symbols names with each name with LATIN in it?

Most of the characters in the various parts of ISO 8859 were originally
standardized before Unicode or ISO 10646, so the names were probably
either created by the ISO/IEC subcommittees responsible for those parts,
or found in earlier standards and adopted as-is.

The merger between Unicode and ISO 10646 caused a few character names in
Unicode to be changed to match the 10646 names.

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s




RE: Generic Base Letter

2010-06-28 Thread CE Whitehead

Hi Vincent, all!


(For the record, I use IE8, but not quite the same version as you; 
I prefer to keep my actual computer info a little private -- not that it is 
really private, of course; all pages that write cookies know it all.

I've gone into my Regional and Language Options, installed the files for 
complex-script and right-to-left languages, and set the default font to a 
Unicode font.
I've installed almost all updates for IE8 [mainly all security updates; I still 
need to install one that will prevent display of fixed tables in compatibility 
mode, but this is unrelated to the display issue we are talking about].
As for Windows, there are still some updates that I have not installed because 
I've heard of problems with them -- I've installed most -- but the ones I have 
not installed do not affect the display issues in question.)


As for business, I am assuming that the document you sent is normally rendered 
in quirks mode anyway, because it has no document type declaration.  (Correct 
me if this is not right.)
However, that did not mess up my display, and I do not think it should much 
affect the rendering of characters
(correct me if I am wrong; I think the doc type declaration permits css style 
codes, xml features, etc.).

 

However what font are you using as your default?
Individual fonts may have display bugs:
http://home.tiscali.nl/t876506/UnicodeDisplay.html


Otherwise I do not know what to tell you.


I am not sure that there is a problem with the vendor in this case -- though 
maybe there is for your browser version.  There may be a problem with the 
vendor in other instances (I have some bidi issues with IE8; I think I should 
be able to use a dir attribute but am still having to use LRO and RLO 
characters).


As for the motion at Unicode to add the invisible character to the list of 
rejected characters, it seems that the motion was rejected.
So where does that leave Michael Everson et al.'s proposal?  Apparently as a 
proposal without sufficient support; however the character proposed is still 
not a rejected character as far as I can tell.  (Again, correct me if I am 
wrong.)


I don't have a passcode so I cannot see any details myself.
Sorry -- I cannot help here.
 

Best,
C. E. Whitehead
cewcat...@hotmail.com





  

Re: charset parameter in Google Groups (was Re: Indian Rupee Sign to be chosen today)

2010-06-28 Thread Asmus Freytag

On 6/28/2010 11:38 AM, Mark Davis ☕ wrote:



> The problem with slavishly following the charset parameter is that it 
> is often incorrect. However, the charset parameter is a signal into 
> the character detection module, so if the charset is correctly supplied 
> from the message then the results of the detection will be weighted 
> in that direction.


The weighting factor / mechanism may be something that you might look at 
for possible improvement.


Doug raised an interesting argument, i.e. that some values of a charset 
parameter might have a higher probability of being correct than other 
values.


If something is tagged Latin-1 or Windows-1252, the chances are that 
this is merely an unexamined default setting. Most of the other 8859 
values should be much less likely to be such "blind" defaults.


I wonder whether the probability of successful charset assignment 
increases if you were to give these more "specific" charset values a 
higher weight.
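
One way to picture such a weighting is sketched below. Everything in it is
an assumption made for illustration: the function name pick_charset, the
weight constants, and the idea that the detector hands back a dictionary of
per-charset scores. It is not a description of how Google Groups or any
real detector works.

```python
# ASSUMPTION: labels that are often just an unexamined default setting.
BLIND_DEFAULTS = {"iso-8859-1", "windows-1252"}

# ASSUMPTION: made-up multipliers; real values would need tuning on real mail.
DEFAULT_LABEL_WEIGHT = 1.5    # a "blind default" label is weak evidence
SPECIFIC_LABEL_WEIGHT = 4.0   # a more specific label is stronger evidence
FLOOR_SCORE = 0.05            # score used if the detector never proposed it


def pick_charset(detector_scores, declared=None):
    """Combine heuristic scores with the declared charset label.

    detector_scores maps lowercase charset names to a non-negative score
    from some statistical detector (hypothetical input format).
    """
    weighted = dict(detector_scores)
    if declared is not None:
        label = declared.lower()
        if label in BLIND_DEFAULTS:
            weight = DEFAULT_LABEL_WEIGHT
        else:
            weight = SPECIFIC_LABEL_WEIGHT
        weighted[label] = weighted.get(label, FLOOR_SCORE) * weight
    # The highest weighted score wins.
    return max(weighted, key=weighted.get)


if __name__ == "__main__":
    scores = {"iso-8859-1": 0.50, "iso-8859-15": 0.45}
    print(pick_charset(scores, declared="ISO-8859-15"))
```

With these made-up numbers, a message labelled ISO-8859-15 keeps its label
even though the detector slightly prefers ISO-8859-1, while a message
labelled ISO-8859-1 can still be overridden by strong contrary evidence.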


When I played with simple recognition algorithms about 15 years ago, I 
found that some simple methods for crude language detection gave 
signatures that would allow charset detection. Even though these methods 
weren't sophisticated enough to resolve actual languages (esp. among 
closely related languages) they were good enough to narrow things down 
to the point where one could pick or confirm charsets.


For example, significant stretches of German can be written without 
diacritics, and can fool charset detection unless it picks up on the 
statistical patterns for German. With that in hand, the first non-ASCII 
character encountered is then likely to "nail" the charset. Or, absent 
such character, the statistics can be used to confirm that an existing 
charset assignment is plausible. (8859-15 having been deliberately 
designed to be "undetectable" is the exception, unless there's a Euro 
sign in the scanned part of the document...)
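
The 8859-1 / 8859-15 case can be made concrete with Python's built-in
codecs. This is a minimal sketch; the function name differing_bytes is
invented for illustration. It lists the byte values that decode to
different characters under the two charsets, which are the only bytes that
could tell them apart.

```python
def differing_bytes(codec_a="iso-8859-1", codec_b="iso-8859-15"):
    """Return the byte values that the two single-byte codecs decode differently."""
    return [
        b for b in range(256)
        if bytes([b]).decode(codec_a) != bytes([b]).decode(codec_b)
    ]


if __name__ == "__main__":
    for b in differing_bytes():
        latin1 = bytes([b]).decode("iso-8859-1")
        latin9 = bytes([b]).decode("iso-8859-15")
        print(f"0x{b:02X}: {latin1!r} in 8859-1, {latin9!r} in 8859-15")
```

Only eight byte values differ, among them 0xA4, which is the currency sign
in 8859-1 and the euro sign in 8859-15. Text that uses none of these bytes
decodes identically under both labels.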


A./



RE: charset parameter in Google Groups (was Re: Indian Rupee Sign to be chosen today)

2010-06-28 Thread Doug Ewell
Mark Crispin  wrote:

> On Mon, 28 Jun 2010, Mark Davis ☕ wrote:
>> The problem with slavishly following the charset parameter is that it
>> is often incorrect. However, the charset parameter is a signal into
>> the character detection module, so if the charset is correctly supplied
>> from the message then the results of the detection will be weighted
>> in that direction.
> 
> I interpret these two sentences as:
> 
> "The problem with following the standards is that some people don't
> follow the standards.  So instead of following the standards
> ourselves, we will guess if the other guy follows the standards or
> not, no matter how much he claims to follow standards.  Too bad if our
> fix transforms his valid data into garbage."

At the very least, it would be nice if the charset parameter constituted
a much stronger signal into the detection module than it apparently did
in Andreas' case, so that if he says the text is 8859-15, and we already
know that 8859-15 is nearly impossible to distinguish heuristically from
8859-1, the module might as well take his word for it.

I do tend to agree with Mark that the complaint against Google Groups
(with which I am not affiliated) might have been posted with more
civility and less invective.

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s






Re: charset parameter in Google Groups (was Re: Indian Rupee Sign to be chosen today)

2010-06-28 Thread Mark Crispin

On Mon, 28 Jun 2010, Mark Davis ☕ wrote:

> The problem with slavishly following the charset parameter is that it is
> often incorrect. However, the charset parameter is a signal into the
> character detection module, so if the charset is correctly supplied from the
> message then the results of the detection will be weighted in that direction.


I interpret these two sentences as:

"The problem with following the standards is that some people don't follow
the standards.  So instead of following the standards ourselves, we will
guess if the other guy follows the standards or not, no matter how much he
claims to follow standards.  Too bad if our fix transforms his valid data
into garbage."

I have heard that song many times over 30 years.  I have often been the
victim of having my valid data transformed into garbage by the attitude of
standards-violators who think that I can't possibly be following the
standards, therefore my data must be "fixed".  When I protest that my data
was "fixed" into garbage, I get told that my complaint doesn't matter.

I take a hard line these days.  I generally don't like the concept of
"fixing" things; I believe in GIGO (Garbage In, Garbage Out).  More
importantly, I utterly reject VIGO (Valid In, Garbage Out) caused by
ill-considered efforts to create GIVO.

What I don't understand is why the EU doesn't use its regulatory power to
force compliance.  If the EU can tell Britain that it can't sell eggs by
the dozen any more, it can shut down a messaging service in Europe that
does not comply with published standards.

-- Mark --

http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.

Re: Generic Base Letter

2010-06-28 Thread Khaled Hosny
On Mon, Jun 28, 2010 at 03:47:40PM +, Murray Sargent wrote:
> Khaled notes: "There are so many issues with MS implementation(s); for 
> example, you cannot combine arbitrary Arabic diacritical marks on any 
> given base character. I don't think Unicode needs to invent workarounds for 
> broken vendor implementations; interested parties should instead put pressure 
> on that vendor to fix its implementation(s)."
> 
> The MS Office math facility allows combining marks in the range 
> U+0300..U+036F and most in the range U+20D0..U+20F0 to be applied to any base 
> character(s) including complicated mathematical expressions. Such generality 
> is needed in mathematics, since tildes, hats, bars, etc., are displayed over 
> multiple base characters such as the expression a+b. Hebrew and Arabic 
> combining marks aren't currently treated as valid mathematical combining 
> marks, so the sequence U+25CC U+05BC U+05B8 doesn't render as Vincent desires 
> in a math zone. It seems reasonable to allow all Unicode combining marks as 
> accents in math zones.

That would be nice, but we were talking about combining marks in normal,
non-math, text. For example, it is now common practice to use two
consecutive Fatha/Damma/Kasra for a certain form of Arabic tanwin used
in the Koran; however, Uniscribe won't allow this and will always insert a
dotted circle between the two marks. I know this behaviour is
documented, but I fail to see the rationale behind it. Generally
speaking, doing "script spell checking" in the rendering engine is a
lousy idea IMO.

Regards,
 Khaled

-- 
 Khaled Hosny
 Arabic localiser and member of Arabeyes.org team
 Free font developer



RE: Indian Rupee Sign to be chosen today

2010-06-28 Thread John Dlugosz
> 
> I imagine that the Rs symbol (₨) is used outside of India.
> 
> Michael Everson * http://www.evertype.com/
> 
> 

Interestingly, the font being chosen for that character shows it to me as "Rp", 
not "Rs".  That seems to be "Microsoft Sans Serif".  It's also the case in 
Palatino Linotype and Tahoma.





RE: Generic Base Letter

2010-06-28 Thread Murray Sargent
Khaled notes: "There are so many issues with MS implementation(s); for example, 
you cannot combine arbitrary Arabic diacritical marks on any given base 
character. I don't think Unicode needs to invent workarounds for broken vendor 
implementations; interested parties should instead put pressure on that vendor 
to fix its implementation(s)."

The MS Office math facility allows combining marks in the range U+0300..U+036F 
and most in the range U+20D0..U+20F0 to be applied to any base character(s) 
including complicated mathematical expressions. Such generality is needed in 
mathematics, since tildes, hats, bars, etc., are displayed over multiple base 
characters such as the expression a+b. Hebrew and Arabic combining marks aren't 
currently treated as valid mathematical combining marks, so the sequence U+25CC 
U+05BC U+05B8 doesn't render as Vincent desires in a math zone. It seems 
reasonable to allow all Unicode combining marks as accents in math zones.
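
The properties of the characters in that sequence can be inspected with a
short sketch like the one below, which assumes only Python's standard
unicodedata module. The function names are hypothetical, and the range test
simply encodes the two ranges quoted above; it does not model which marks
in U+20D0..U+20F0 are actually accepted ("most", not all).

```python
import unicodedata

# The sequence discussed above: dotted circle + two Hebrew points.
SEQUENCE = "\u25CC\u05BC\u05B8"


def describe_sequence(text):
    """Return (code point, name, general category, combining class) per character."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.name(ch),
            unicodedata.category(ch),
            unicodedata.combining(ch),
        )
        for ch in text
    ]


def in_math_accent_ranges(ch):
    """True if ch falls in one of the two ranges quoted in the message above."""
    cp = ord(ch)
    return 0x0300 <= cp <= 0x036F or 0x20D0 <= cp <= 0x20F0


if __name__ == "__main__":
    for row in describe_sequence(SEQUENCE):
        print(row)
    print([in_math_accent_ranges(ch) for ch in SEQUENCE])
```

The two Hebrew points are nonspacing marks (category Mn) with non-zero
combining classes, yet neither lies in the quoted ranges, which matches the
observation that they are not treated as math accents.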

Murray




Re: Indian Rupee Sign to be chosen today

2010-06-28 Thread Kent Karlsson

On 2010-06-28 10.16, "Michael Everson" wrote:

> On 28 Jun 2010, at 06:41, Mahesh T. Pai wrote:
> 
>> Mahesh T. Pai said on Mon, Jun 28, 2010 at 10:57:53AM +0530,:
>> 
>>> On a serious note -
>>> 
>>> 1. Would a change of glyph shape be considered?
> 
> It would depend on what is chosen by the Government, I should think.

Also note that the character named RUPEE SIGN has a compatibility
decomposition to "Rs" (short for "Rupees" I guess, not for "rupee sign"):
20A8;RUPEE SIGN;Sc;0;ET;<compat> 0052 0073;;;;N;;;;;

This compatibility decomposition cannot be changed, nor can the sample
glyph be changed significantly (i.e. it must continue to look much like
"Rs").
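
The decomposition can be confirmed with a minimal sketch using Python's
standard unicodedata module (the variable names are illustrative only).
Compatibility normalization (NFKC or NFKD) replaces the character with the
two letters, while canonical normalization (NFC) leaves it alone.

```python
import unicodedata

RUPEE_SIGN = "\u20A8"

# The raw decomposition field, as stored in the character database.
decomposition = unicodedata.decomposition(RUPEE_SIGN)

# Compatibility normalization applies the <compat> mapping.
nfkc = unicodedata.normalize("NFKC", RUPEE_SIGN)

# Canonical normalization ignores compatibility mappings.
nfc = unicodedata.normalize("NFC", RUPEE_SIGN)

if __name__ == "__main__":
    print(unicodedata.name(RUPEE_SIGN))
    print(decomposition)
    print(nfkc, nfc == RUPEE_SIGN)
```

Any text that has been through NFKC or NFKD therefore already contains the
letters "Rs" in place of U+20A8, which is one reason the mapping cannot be
changed later.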

Unfortunately, some current fonts erroneously display a glyph for
that character that looks more like "Rp". That is a font error,
and nothing else.

If a "NEW RUPEE SIGN" (or whatever name is preferred) is chosen,
and this goes the same way as the EURO SIGN (i.e. it is highly likely to
be actually used, as was the case when the EURO SIGN was added to Unicode),
and assuming the glyph does not look like "Rs", a new character will need
to be added.

/kent k





"ASCII" emoji in iOS4

2010-06-28 Thread Michael Everson
Strange.

http://www.appleinsider.com/articles/10/06/25/ios_4_includes_emoji_input_on_japanese_keyboard.html

Michael Everson * http://www.evertype.com/





Re: Generic Base Letter

2010-06-28 Thread Khaled Hosny
On Sun, Jun 27, 2010 at 10:00:18PM -0700, Asmus Freytag wrote:
> The one argument that I find convincing is that too many
> implementations seem set to disallow generic combination, relying
> instead on fixed tables of known/permissible combinations.

Only if you consider Microsoft "too many"; AFAIK, only Microsoft's
Uniscribe behaves this way, which is stupid in my opinion.

> In that situation, a formally adopted character with the clearly
> stated semantic of "is expected to actually render with ANY
> combining mark from ANY script" would have an advantage. List-based
> implementations would then know that this character is expected to
> be added to the rendering tables for all marks of all scripts.
> 
> Until and unless that is done, it couldn't be used successfully in
> those environments, but if the proposers could get buy-in from a
> critical mass of vendors of such implementations, this problem could
> be overcome.
> 
> Without such a buy-in, by the way, I would be extremely wary of such
> a proposal, because the table-based nature of these implementations
> would prohibit the use of this new character in the intended way.

There are so many issues with MS implementation(s); for example, you cannot
combine arbitrary Arabic diacritical marks on any given base
character. I don't think Unicode needs to invent workarounds for broken
vendor implementations; interested parties should instead put pressure on
that vendor to fix its implementation(s).

Regards,
 Khaled

-- 
 Khaled Hosny
 Arabic localiser and member of Arabeyes.org team
 Free font developer



Re: Indian Rupee Sign to be chosen today

2010-06-28 Thread Jeroen Ruigrok van der Werven
-On [20100628 10:37], Michael Everson (ever...@evertype.com) wrote:
>I imagine that the Rs symbol (₨) is used outside of India.

CLDR shows that en_PK uses it.

-- 
Jeroen Ruigrok van der Werven  / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Time is merely a residue of Reality...



Re: Indian Rupee Sign to be chosen today

2010-06-28 Thread Michael Everson
Report of deferral: 
http://www.ptinews.com/news/740593_Cabinet-defers-decision-on-rupee-symbol
Michael Everson * http://www.evertype.com/





Re: Indian Rupee Sign to be chosen today

2010-06-28 Thread Michael Everson
On 28 Jun 2010, at 06:41, Mahesh T. Pai wrote:

> Mahesh T. Pai said on Mon, Jun 28, 2010 at 10:57:53AM +0530,:
> 
>> On a serious note -
>> 
>> 1. Would a change of glyph shape be considered?

It would depend on what is chosen by the Government, I should think.

>> 2. What are the origins of that character?
> 
> I feel that an answer is important, because the code chart specifically 
> mentions this as the Indian rupee sign.
> 
> http://unicode.org/charts/PDF/U20A0.pdf

It's OK. All a new character would need is a unique name.

I imagine that the Rs symbol (₨) is used outside of India.

Michael Everson * http://www.evertype.com/





Front of pack nutrition labelling of food

2010-06-28 Thread William_J_G Overington
Some readers might like to know of the following document.
 
http://www.food.gov.uk/multimedia/pdfs/consultationresponse/responsefopnutritionlabeling.pdf
 
My idea about using shape as well as colour is included on page 29.
 
The page from which the document is linked is as follows.
 
http://www.food.gov.uk/consultations/ukwideconsults/2009/fopnutritionlabelling
 
There is Unicode involved in this as there is also a font and a typecase_ pdf 
for the symbols.
 
http://forum.high-logic.com/viewtopic.php?f=10&t=2870
 
The font includes glyphs for three shapes from the Geometric Shapes range. 
 
http://www.unicode.org/charts/PDF/U25A0.pdf
 
In the Unicode chart the three shapes each have the word BLACK in their name.
 
The font also includes the same glyphs mapped to the positions normally used 
for r, g, o, y as well, for convenience when using some graphics packages.
 
The font also includes glyphs for the alphabet A..Z so that the font can 
conveniently be used in those graphics packages that show the name of each font 
using glyphs from the font itself.
 
William Overington
 
28 June 2010