(Informational only: UTF-8 BOM and the real life)
So, dear list,

i'm really sorry for this distress. I don't want to start any thread, but i can't help it and thus want to pass this through to you. I had problems with my bicycle and sent a mail asking for help. This is a real large company (www.mifa.de).

 |Received: from ds0501.hostingschmiede.de
 |From: informat...@radservice.net informat...@radservice.net
 |Organization: CC GmbH
 |
 |This is a multi-part message in MIME format
 |
 |Content-Type: text/html; charset=UTF-8
 |Content-Transfer-Encoding: 8bit
 |Content-Disposition: inline

The HTML part is all right.

 |<td style="width:100px;font:normal 11px Arial;vertical-align:top">Empfänger</td>

The text part is UTF-8 converted once again to UTF-8. Which is ridiculous.

 |Content-Type: text/plain; charset=UTF-8
 |Content-Transfer-Encoding: 8bit
 |Content-Disposition: inline
 |
 |Datum: 25.07.2012 15:52:02
 |Absender:informat...@radservice.net
 |---
 |ï»¿

And that was an Unicode BOM that has been converted to UTF-8 and then been converted to UTF-8 once again. As you all see - in the middle of nowhere.

 |Sehr geehrter Herr Steven,
 |
 |vielen Dank fÃ¼r Ihre E-Mail.

I've sent them a nice mail on UTF-8 BOM and perl(1) programming in general. (I can't imagine anything else due to resource reasons.)

Yes, i also hope this will get better as time goes by. Yes, consumers should ignore a zero-width non-break space. It's not visual.

Thanks for your understanding, but i had to send this now. Good night.

Steven
Re: (Informational only: UTF-8 BOM and the real life)
2012-07-26 0:19, Steven Atreju wrote:

>  |ï»¿
>
> And that was an Unicode BOM that has been converted to UTF-8 and then been converted to UTF-8 once again.

Apparently the problem is that the data has been doubly encoded: first it was encoded into UTF-8, then the bytes of that UTF-8 data were interpreted as if they were in windows-1252, and then the resulting characters were UTF-8 encoded again. This is of course very incorrect, and not uncommon.

>  |vielen Dank fÃ¼r Ihre E-Mail.

So the letter “ü” was munged too, and presumably all non-ASCII data.

So this is not an argument against using BOM in UTF-8. The BOM was a victim of incorrect processing, like everyone else (outside ASCII).

One might even argue that the BOM is useful here, too, since it immediately signals that there is something wrong, and “ï»¿” is an encoding error signature, so to say.

Yucca
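As a rough illustration of the double encoding described above, the following Python sketch reproduces it with the standard codecs. The function names `double_encode` and `repair` are made up for this example, and Python's `cp1252` codec is assumed to stand in for windows-1252. It is only a sketch of the failure mode, not a reconstruction of what the sender's software actually ran.

```python
def double_encode(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as windows-1252.

    The returned characters are what ends up in the message once they
    are written out as UTF-8 a second time.
    """
    return text.encode("utf-8").decode("cp1252")


def repair(garbled: str) -> str:
    """Undo one round of the misreading.

    Assumption: every character in `garbled` maps back to a cp1252 byte;
    otherwise encode() raises UnicodeEncodeError.
    """
    return garbled.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    bom = "\ufeff"  # U+FEFF, used as the byte order mark
    print(double_encode(bom))            # the three-character "signature"
    print(double_encode("für"))          # the munged German word
    print(repair(double_encode("für")))  # round-trips back to the original
```

The repair step only works while the garbled text still contains exactly the characters that windows-1252 produced; any further lossy processing in between would break it.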
Re: (Informational only: UTF-8 BOM and the real life)
On 7/25/2012 2:45 PM, Jukka K. Korpela wrote:

> One might even argue that the BOM is useful here, too, since it immediately signals that there is something wrong, and “ï»¿” is an encoding error signature, so to say.

+8

A./
CLDR and ICU
What is the formal relationship between the Common Locale Data Repository (CLDR) and International Components for Unicode (ICU)?

I ask for two reasons:

1. I raised a ticket http://unicode.org/cldr/trac/ticket/5092 on a proposed clarificatory addition to UTS#35 'Locale Data Markup Language', and it has just been closed as a duplicate of an ICU issue. As no-one disputes that the problem is an issue relating to LDML, this seems bizarre.

2. The ICU implementation of collation tailoring for changed ordering is bizarre in some complicated cases. (Life can be complicated.) Should UTS#35 be documenting what ICU does, or should Unicode be saying what ICU should do when implementing a tailoring expressed in LDML?

Richard.
Re: CLDR and ICU
On 7/25/2012 5:01 PM, Richard Wordingham wrote:

> What is the formal relationship between the Common Locale Data Repository (CLDR) and International Components for Unicode (ICU)?
>
> ...
>
> The ICU implementation of collation tailoring for changed ordering is bizarre in some complicated cases. (Life can be complicated.) Should UTS#35 be documenting what ICU does, or should Unicode be saying what ICU should do when implementing a tailoring expressed in LDML?

Well, Unicode should not be saying what anybody should do here. UTS #35 is owned by the CLDR-TC, not the UTC or the Unicode Consortium as a whole.

The discussion of the relationship between CLDR and ICU presumably belongs on the cldr-users list, rather than the unicode list, except insofar as an issue raised for tailoring of collation in LDML and/or in the ICU implementation reflects back on something which would need changing or clarifying in UTS #10.

--Ken
Re: CLDR and ICU
Mark
https://plus.google.com/114199149796022210033

— Il meglio è l’inimico del bene —

On Wed, Jul 25, 2012 at 5:01 PM, Richard Wordingham richard.wording...@ntlworld.com wrote:

> What is the formal relationship between the Common Locale Data Repository (CLDR) and International Components for Unicode (ICU)?

ICU is one of the main clients for CLDR data. Because it makes extensive use of the data, the CLDR group also uses ICU for testing.

> I ask for two reasons:
>
> I raised a ticket http://unicode.org/cldr/trac/ticket/5092 on a proposed clarificatory addition to UTS#35 'Locale Data Markup Language', and it has just been closed as a duplicate of an ICU issue. As no-one disputes that the problem is an issue relating to LDML, this seems bizarre.

It was not closed as a duplicate of an ICU issue. It was closed as a duplicate. You jumped to the conclusion that it was a duplicate of an ICU bug.

The reason it was marked as a duplicate is that there had been changes in the working draft such that the committee believed that the problems cited in your report had been taken care of. For example, your ticket complains about [0.0.c.t], but if you look at the working draft (be sure to refresh your browser; sometimes an old version can hang around for a while), there is no such text.

If there are still issues that you feel have not been resolved, the ticket can be reopened with specific comments as to what was not addressed, or you can open a new ticket for just the remaining items.

> The ICU implementation of collation tailoring for changed ordering is bizarre in some complicated cases. (Life can be complicated.) Should UTS#35 be documenting what ICU does, or should Unicode be saying what ICU should do when implementing a tailoring expressed in LDML?

This is a false dichotomy.

The goal for collation is to balance user expectations in terms of functionality, feasibility, performance, and size. The CLDR committee certainly takes into account how implementations can use CLDR data; it would be of little good to have data that required implementations to be overly bulky or complicated or slow.

There will, however, always be room for improvement. In many cases there is a change in LDML or CLDR data where ICU and other clients have to catch up to it; in many cases implementation experience in ICU (or Windows, or iOS, or...) leads to a proposal for how to handle something in LDML or CLDR data. In some cases ICU or other clients may have their own tailorings on top of CLDR; and for that matter, many companies (such as my company, Google) apply some patches on top of CLDR data.

The same is true for many other Unicode standards and data. The implementations inform the standard, and are also adapting to changes in it.

> Richard.
RE: Manipulation of System Fonts on Windows 7
Changing the primary fonts used throughout the Windows 7 shell is not a supported scenario. If you were to install a Chinese language pack (available to you if you have an Ultimate or Enterprise license), then either Microsoft YaHei (for Simplified) or Microsoft JhengHei (for Traditional) would be used for most UI. But, of course, the UI would be in Chinese.

Now, if you have the UI displayed in (say) English, then it is not the primary fonts that matter for CJK but rather what is used as fallback fonts. If you change the system locale setting (the Language for non-Unicode programs -- on the Administrative tab in the Regional and Language Options control panel) to one of the Chinese options, then the order in which fonts will be used in much of the shell will change. So, by default for an English system, the primary UI font is Segoe UI, and Meiryo UI will be the first font that gets tried if a UI string has CJK; but if you change the system locale to (say) Chinese (Simplified, China), then Microsoft YaHei will be the first font used for CJK.

Note that changing system locale will impact what you see in much of the shell and in certain text controls used in apps (e.g. the main doc window in Notepad), but it won't affect text in all scenarios -- e.g. on an (unstyled) web page or in Wordpad.

If you have a font that supports Shavian, there is something you can try to get it used as a fallback font, though this is not a scenario that was tested in Win7: if you're comfortable making changes in the Windows Registry, then go to this key

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback

And add a string entry with the name Plane 1 and a value which is the name of your font (the font family name, not the file name). (There used to be a KB article about this mechanism, but I haven't seen it in a long while. Given the nature of changes made in certain parts of the text stack in Win7, I won't guarantee it would still work.)

Peter

-----Original Message-----
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Charlie Ruland
Sent: July 22, 2012 1:34 PM
To: Unicode Discussion
Subject: Manipulation of System Fonts on Windows 7

I would like to manipulate system fonts on a Windows 7 computer. More precisely, I wish to do the following:

1. Change the font for CJK Unified Ideographs (and CJK punctuation, radicals etc.; maybe the CJK Ideographs Extensions as well?) from the current Japanese-looking one to one in simplified Chinese style, though of course the new system font should also contain traditional characters.

2. Assign a system font for Shavian. Currently boxes/squares are displayed.

What I need is:

1. advice on which fonts to choose and
2. a brief tutorial how to safely change fonts system-wide.

Although I am aware that this request is somewhat off-topic I am sure that some people here will be able to give me the hints I am looking for.

Thanks in advance,
Charlie
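For anyone who wants to try the SurrogateFallback registry entry described in Peter's reply, here is a small Python sketch that only builds the text of a .reg file for it. The key path and the value name "Plane 1" come from the message above; the function name `surrogate_fallback_reg` and the font name "My Shavian Font" are hypothetical placeholders. The sketch does not touch the registry itself, and, as Peter says, there is no guarantee the mechanism still works on Windows 7.

```python
# Registry key quoted from the message above.
KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT"
    r"\CurrentVersion\LanguagePack\SurrogateFallback"
)


def surrogate_fallback_reg(font_family: str, value_name: str = "Plane 1") -> str:
    """Return .reg file text that sets `value_name` to `font_family` under KEY.

    `font_family` is the font family name, not the font file name.
    """
    # Backslashes and double quotes must be escaped inside .reg string values.
    escaped = font_family.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY}]",
        f'"{value_name}"="{escaped}"',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # "My Shavian Font" is a placeholder, not a real font recommendation.
    print(surrogate_fallback_reg("My Shavian Font"))
```

Generating the text first, rather than writing to the registry directly, leaves a file that can be reviewed before it is imported, and the entry can be deleted again by hand if it has no effect.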