(Informational only: UTF-8 BOM and the real life)

2012-07-25 Thread Steven Atreju
So, dear list, I'm really sorry for the disturbance.
I don't want to start a thread, but I can't help it and thus
want to pass this on to you.

I had problems with my bicycle and sent a mail asking for help.
This is a really large company (www.mifa.de).

  |Received: from ds0501.hostingschmiede.de
  |From: informat...@radservice.net informat...@radservice.net
  |Organization: CC GmbH
  |
  |This is a multi-part message in MIME format
  |
  |Content-Type: text/html; charset=UTF-8
  |Content-Transfer-Encoding: 8bit
  |Content-Disposition: inline

The HTML part is all right.

  |<td style="width:100px;font:normal 11px Arial;vertical-align:top">Empfänger</td>

The text part, however, is UTF-8 that has been converted to UTF-8
once again, which is ridiculous.

  |Content-Type: text/plain; charset=UTF-8
  |Content-Transfer-Encoding: 8bit
  |Content-Disposition: inline
  |
  |Datum:   25.07.2012 15:52:02
  |Absender:informat...@radservice.net
  |---
  |ï»¿

And that was a Unicode BOM that has been converted to UTF-8 and
then been converted to UTF-8 once again.  As you can all see, in the
middle of nowhere.

  |Sehr geehrter Herr Steven,
  |
  |vielen Dank für Ihre E-Mail.

I've sent them a nice mail on UTF-8 BOM and perl(1) programming
in general.  (I can't imagine anything else due to resource
reasons.)

Yes, I also hope this will get better as time goes by.
Yes, consumers should ignore a zero-width no-break space.
It's not visible.
Thanks for your understanding, but I had to send this now.
Good night.

  Steven




Re: (Informational only: UTF-8 BOM and the real life)

2012-07-25 Thread Jukka K. Korpela

2012-07-26 0:19, Steven Atreju wrote:


   |

 And that was a Unicode BOM that has been converted to UTF-8 and
 then been converted to UTF-8 once again.


Apparently the problem is that the data has been doubly encoded: it was 
first encoded into UTF-8, then the resulting bytes were interpreted as 
if they were windows-1252, and the resulting characters were UTF-8 
encoded again. This is of course quite incorrect, but not uncommon.
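
For anyone who wants to reproduce the effect, here is a minimal Python sketch of the path described above. It is an assumption about what the sender's software did, not a confirmed diagnosis, and the function names are made up for illustration. Note that Python's cp1252 codec leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so this sketch raises an error for characters whose UTF-8 form contains one of them.

```python
# Sketch of the assumed double-encoding path:
# text -> UTF-8 bytes -> misread as windows-1252 -> UTF-8 bytes again.

def double_encode(text: str) -> bytes:
    """Return the bytes produced by encoding text to UTF-8 twice,
    with a wrong windows-1252 interpretation in between."""
    utf8_bytes = text.encode("utf-8")
    misread = utf8_bytes.decode("cp1252")  # the faulty step
    return misread.encode("utf-8")


def as_displayed(text: str) -> str:
    """Return what a mail reader shows after decoding the doubly
    encoded bytes as UTF-8 once."""
    return double_encode(text).decode("utf-8")


if __name__ == "__main__":
    # U+FEFF (the BOM) and a word with a non-ASCII letter.
    print(repr(as_displayed("\ufeff")))
    print(as_displayed("für"))
```

Pure ASCII text passes through unchanged, which is why such a bug can go unnoticed until the first non-ASCII character shows up.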



   |vielen Dank für Ihre E-Mail.


So the letter “ü” was munged too, and presumably all non-ASCII data. So 
this is not an argument against using BOM in UTF-8. The BOM was a victim 
of incorrect processing, like every other character outside ASCII. One 
might even argue that the BOM is useful here, too, since it immediately 
signals that there is something wrong, and “ï»¿” is an encoding error 
signature, so to say.
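
To illustrate the "signature" idea, here is a small Python sketch that looks for the mangled BOM and tries to undo one round of the wrong conversion. It assumes the damage was exactly one UTF-8 to windows-1252 misreading; the names BOM_MOJIBAKE, looks_double_encoded and repair are hypothetical, chosen only for this example.

```python
# U+FEFF encoded as UTF-8 (EF BB BF) and misread as windows-1252.
BOM_MOJIBAKE = "\u00ef\u00bb\u00bf"


def looks_double_encoded(text: str) -> bool:
    """Report whether the mangled-BOM signature occurs in the text."""
    return BOM_MOJIBAKE in text


def repair(text: str) -> str:
    """Undo one round of UTF-8 bytes misread as windows-1252.

    Returns the text unchanged if the reverse conversion fails,
    which suggests the text was not damaged in this particular way.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    damaged = BOM_MOJIBAKE + "vielen Dank f\u00c3\u00bcr Ihre E-Mail."
    if looks_double_encoded(damaged):
        # The repaired string starts with a real U+FEFF again.
        print(repr(repair(damaged)))
```

The check is only a heuristic: the three characters could in principle occur in legitimate text, so a real tool would want more evidence before rewriting anything.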


Yucca






Re: (Informational only: UTF-8 BOM and the real life)

2012-07-25 Thread Asmus Freytag

On 7/25/2012 2:45 PM, Jukka K. Korpela wrote:
 One might even argue that the BOM is useful here, too, since it 
 immediately signals that there is something wrong, and “ï»¿” is an 
 encoding error signature, so to say.




+8

A./



CLDR and ICU

2012-07-25 Thread Richard Wordingham
What is the formal relationship between the Common Locale Data
Repository (CLDR) and International Components for Unicode (ICU)?

I ask for two reasons:

I raised a ticket http://unicode.org/cldr/trac/ticket/5092 on a
proposed clarificatory addition to UTS#35 'Locale Data Markup
Language', and it has just been closed as a duplicate of an ICU issue.
As no-one disputes that the problem is an issue relating to LDML, this
seems bizarre.

The ICU implementation of collation tailoring for changed ordering is
bizarre in some complicated cases.  (Life can be complicated.)  Should
UTS#35 be documenting what ICU does, or should Unicode be saying what
ICU should do when implementing a tailoring expressed in LDML?

Richard.



Re: CLDR and ICU

2012-07-25 Thread Ken Whistler

On 7/25/2012 5:01 PM, Richard Wordingham wrote:

 What is the formal relationship between the Common Locale Data
 Repository (CLDR) and International Components for Unicode (ICU)?

 ...

 The ICU implementation of collation tailoring for changed ordering is
 bizarre in some complicated cases.  (Life can be complicated.)  Should
 UTS#35 be documenting what ICU does, or should Unicode be saying what
 ICU should do when implementing a tailoring expressed in LDML?


Well, Unicode should not be saying what anybody should do here.

UTS #35 is owned by the CLDR-TC, not the UTC or the Unicode Consortium
as a whole.

The discussion of the relationship between CLDR and ICU presumably
belongs on the cldr-users list, rather than the unicode list, except insofar
as an issue raised for tailoring of collation in LDML and/or in the
ICU implementation reflects back on something which would need
changing or clarifying in UTS #10.

--Ken




Re: CLDR and ICU

2012-07-25 Thread Mark Davis ☕
Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*



On Wed, Jul 25, 2012 at 5:01 PM, Richard Wordingham 
richard.wording...@ntlworld.com wrote:

 What is the formal relationship between the Common Locale Data
 Repository (CLDR) and International Components for Unicode (ICU)?


ICU is one of the main clients for CLDR data. Because it makes extensive
use of the data, the CLDR group also uses ICU for testing.


 I ask for two reasons:

 I raised a ticket http://unicode.org/cldr/trac/ticket/5092 on a
 proposed clarificatory addition to UTS#35 'Locale Data Markup
 Language', and it has just been closed as a duplicate of an ICU issue.
 As no-one disputes that the problem is an issue relating to LDML, this
 seems bizarre.


It was not closed as a duplicate of an ICU issue. It was closed as a
duplicate. You jumped to the conclusion that it was a duplicate of an ICU
bug.

The reason it was marked as a duplicate is that there had been changes in
the working draft such that the committee believed that the problems cited
in your report had been taken care of. For example, your ticket complains
about [0.0.c.t], but if you look at the working draft (be sure to refresh
your browser; sometimes an old version can hang around for a while), there
is no such text.

If there are still issues that you feel have not been resolved, the ticket
can be reopened with specific comments as to what was not addressed, or you
can open a new ticket for just the remaining items.


 The ICU implementation of collation tailoring for changed ordering is
 bizarre in some complicated cases.  (Life can be complicated.)  Should
 UTS#35 be documenting what ICU does, or should Unicode be saying what
 ICU should do when implementing a tailoring expressed in LDML?


This is a false dichotomy.

The goal for collation is to balance user expectations in terms of
functionality, feasibility, performance, and size. The CLDR committee
certainly takes into account how implementations can use CLDR data; it
would be of little good to have data that required implementations to be
overly bulky or complicated or slow. There will, however, always be room
for improvement.

In many cases there is a change in LDML or CLDR data where ICU and other
clients have to catch up to it; in many cases implementation experience in
ICU (or Windows, or iOS, or...) leads to a proposal for how to handle
something in LDML or CLDR data. In some cases ICU or other clients may have
their own tailorings on top of CLDR; and for that matter, many companies
(such as my company, Google) apply some patches on top of CLDR data.

The same is true for many other Unicode standards and data. The
implementations inform the standard, and are also adapting to changes in it.



 Richard.




RE: Manipulation of System Fonts on Windows 7

2012-07-25 Thread Peter Constable
Changing the primary fonts used throughout the Windows 7 shell is not a 
supported scenario. 

If you were to install a Chinese language pack (available to you if you have an 
Ultimate or Enterprise license), then either Microsoft YaHei (for Simplified) 
or Microsoft JhengHei (for Traditional) would be used for most UI. But, of 
course, the UI would be in Chinese.

Now, if you have the UI displayed in (say) English, then it is not the primary 
fonts that matter for CJK but rather what is used as fallback fonts. If you 
change the system locale setting (the Language for non-Unicode programs -- on 
the Administrative tab in the Regional and Language Options control panel) to 
one of the Chinese options, then the order in which fonts will be used in much 
of the shell will change. So, by default for an English system, the primary UI 
font is Segoe UI, and Meiryo UI will be the first font that gets tried if a UI 
string has CJK; but if you change the system locale to (say) Chinese 
(Simplified, China), then Microsoft YaHei will be the first font used for CJK.

Note that changing system locale will impact what you see in much of the shell 
and in certain text controls used in apps (e.g. the main doc window in 
Notepad), but it won't affect text in all scenarios -- e.g. on an (unstyled) 
web page or in Wordpad.

If you have a font that supports Shavian, there is something you can try to get 
it used as a fallback font, though this is not a scenario that was tested in 
Win7: if you're comfortable making changes in the Windows Registry, then go to 
this key

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback

and add a string entry named "Plane 1" whose value is the name of 
your font (the font family name, not the file name). (There used to be a KB 
article about this mechanism, but I haven't seen it in a long while. Given the 
nature of changes made in certain parts of the text stack in Win7, I can't 
guarantee that it still works.)
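
As a rough illustration of the registry step, the Python sketch below only builds the text of a .reg file for that entry; it does not touch the registry. The helper name build_reg_file and the font family "My Shavian Font" are hypothetical placeholders, and whether Windows 7 honours the entry remains untested, as noted above.

```python
# Key named in the message above; the value name "Plane 1" is also from there.
KEY_PATH = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT"
    r"\CurrentVersion\LanguagePack\SurrogateFallback"
)


def build_reg_file(font_family: str, value_name: str = "Plane 1") -> str:
    """Return .reg file text that sets a string value under KEY_PATH.

    font_family must be the font family name, not the file name.
    """
    # In .reg syntax, backslashes and double quotes inside a string
    # value are escaped with a backslash.
    escaped = font_family.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY_PATH}]",
        f'"{value_name}"="{escaped}"',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # "My Shavian Font" is a placeholder; substitute the real family name.
    print(build_reg_file("My Shavian Font"))
```

Reviewing the generated text before importing it, and exporting the existing key first, keeps the change easy to undo.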



Peter


-----Original Message-----
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Charlie Ruland
Sent: July 22, 2012 1:34 PM
To: Unicode Discussion
Subject: Manipulation of System Fonts on Windows 7

I would like to manipulate system fonts on a Windows 7 computer. More 
precisely, I wish to do the following:

1. Change the font for CJK Unified Ideographs (and CJK punctuation, radicals 
etc.; maybe the CJK Ideographs Extensions as well?) from the current 
Japanese-looking one to one in simplified Chinese style, though of course the 
new system font should also contain traditional characters.

2. Assign a system font for Shavian. Currently boxes/squares are displayed.

What I need is: 1. advice on which fonts to choose and 2. a brief tutorial on 
how to safely change fonts system-wide.

Although I am aware that this request is somewhat off-topic I am sure that some 
people here will be able to give me the hints I am looking for.

Thanks in advance,

Charlie