Re: ZWNBSP vs. WJ (was: How is NBH (U+0083) Implemented?)

2011-08-05 Thread Doug Ewell
Jukka K. Korpela <jkorpela at cs dot tut dot fi> wrote:

 So? It was, and it still often is, better to use ISO 8859-1 rather
 than Unicode, in situations where there is no tangible benefit, or
 just a small benefit, from using Unicode. For example, many people are
 still conservative about encodings in e-mail, for good reasons, so
 they use ISO 8859-1 or, as you did in your message, windows-1252.

A word about my encoding choices.  My first message on Thursday was
sent from my home PC, using Windows Live Mail, and it used UTF-8 because
I configured Windows Live Mail to do so.  My second message was sent
from my mobile device, and used Windows-1252.  I don't know if there is
a way to tell the device to use UTF-8 for outgoing messages, but I can
say it was not my conscious intent to prefer Windows-1252 over Unicode.

This message is being sent via a Web interface; I guess we'll find out
what encoding it chooses for me.

 On the other hand, this isn’t comparable to ZWNBSP vs. WJ. These
 control characters do the same job in text, as per the standard, so
 the practical question is simply which one is better supported.

ZWNBSP, like WJ, is intended to inhibit breaking between words.  Despite
the other (and original) intended use of U+FEFF at the start of a text
as a byte-order mark, there is a pervasive belief that an initial U+FEFF
means the text should be treated as beginning with some kind of space
character.  This is silly, since there is no concept of "between words"
at the start of a text, but it is nevertheless the way people perceive
things.
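For illustration, here is a quick Python sketch of that dual role (it
relies only on the standard "utf-8" and "utf-8-sig" codecs): the
signature-aware codec strips an initial U+FEFF as a BOM, while a U+FEFF
in the middle of the text, and U+2060 anywhere, survive as ordinary
characters.

# The same code point, U+FEFF, is treated as a BOM only at the start of
# the stream; elsewhere it is just a zero-width no-break space.
data = "\ufefffoo\ufeffbar".encode("utf-8")

plain = data.decode("utf-8")      # keeps the leading U+FEFF
sig = data.decode("utf-8-sig")    # strips only the initial U+FEFF

print([hex(ord(c)) for c in plain[:2]])   # ['0xfeff', '0x66']
print([hex(ord(c)) for c in sig[:2]])     # ['0x66', '0x6f']

# U+2060 WORD JOINER has no BOM role and is never stripped:
wj = "\u2060foo".encode("utf-8").decode("utf-8-sig")
print(hex(ord(wj[0])))                    # '0x2060'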

WJ was introduced to encourage users to separate these two functions. 
If users don't adopt it, the problem will never be solved.  There are
enough issues in Unicode that cannot be fixed due to stability concerns;
it would be nice to be able to fix this one at least.

I still question how many real-world texts use either U+FEFF or U+2060
to achieve this non-breaking behavior.
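One way to get a rough answer would be to scan a corpus of text files
and count U+2060 plus any U+FEFF that is not in the initial (BOM)
position. A sketch of that in Python follows; the command-line file
paths and the UTF-8 assumption are purely illustrative.

import sys

def joiner_usage(path):
    # Count word-joining uses: U+2060 anywhere, and U+FEFF only when it
    # is not the very first character (an initial U+FEFF is almost
    # certainly a BOM rather than a joiner).
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    body = text[1:] if text.startswith("\ufeff") else text
    return body.count("\ufeff"), body.count("\u2060")

for path in sys.argv[1:]:
    zwnbsp, wj = joiner_usage(path)
    print(f"{path}: non-initial ZWNBSP={zwnbsp}, WJ={wj}")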

 ISO 8859-1 and Unicode perform very different jobs, so that using ISO
 8859-1, you limit your character repertoire (at least as regards
 directly representable characters, as opposed to various “escape
 notations”). If you don’t need anything outside ISO 8859-1, the choice
 used to be very simple, though nowadays it has become a little more
 complicated (as e.g. Google Groups seems to munge ISO 8859-1 data in
 quotations but processes UTF-8 properly).

UTF-8 has the property of being easily detected and verified as such,
which solves part of the Google Groups problem (inability to detect
which SBCS is being used).  The other part of the problem is the
practice of using heuristics to override an explicit charset
declaration, but that is a topic for another day.
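As a minimal sketch of that detection property (plain Python, nothing
specific to Google Groups): strictly decoding a byte sequence as UTF-8
doubles as a validity check, whereas a legacy single-byte charset can
never be verified this way, since every byte sequence is "valid"
Latin-1.

def looks_like_utf8(raw: bytes) -> bool:
    # Strict UTF-8 decoding rejects any malformed sequence, so success
    # is strong evidence that the data really is UTF-8.
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("naïve".encode("utf-8")))    # True
print(looks_like_utf8("naïve".encode("latin-1")))  # False: lone 0xEF byte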

 I won’t make any statements about full compliance, but in Microsoft
 Office Word 2007, U+FEFF alias ZWNBSP does its basic job (inside text)
 in most situations, whereas U+2060 alias WJ seems not to be recognized
 at all and appears as some sort of visible box. So to have the job
 done, there is not much of a choice. (Word 2007 fails to honor ZWNBSP
 semantics after EN DASH, which is bad, but that does not make it
 useless in other situations.)

It does always come down to a complaint against Microsoft, doesn't it? 
Unfortunately, Yucca is right here: opening Word 2007 and pasting a
snippet of text with embedded ZWNBSP does display correctly, while the
same experiment with embedded WJ shows a .notdef box.  This seems to be
a font-coverage problem, amplified by Word's silent overriding of user
font choices—changing the font from the default Calibri to DejaVu Sans
(and optionally back to Calibri) makes the display problem go away, but
of course no user could reasonably be expected to go through that.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell






Re: ZWNBSP vs. WJ (was: How is NBH (U+0083) Implemented?)

2011-08-05 Thread Asmus Freytag (w)
The ambiguity of an initial FEFF was not desirable, but this discussion shows 
that certain things can't be so easily fixed by adding characters at a later 
stage.

The more time that elapses between the encoding of the ambiguous character and 
the later fix, the more software, data, and protocols exist that support the 
original character, creating backwards-compatibility issues.

Incidentally, this is exactly what I expected when the WJ was proposed, but 
sentiment in favor of its addition ran high at the time...

The ZWNBSP was present in Unicode 1.0 (1991), while the WJ was added in 3.2 
(2002), about ten years later. We are now an additional ten years down the 
road, and instead of clarifying the issue, the WJ has only muddied the waters.

Somewhere here are lessons to be learned.

A./

