Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-16 Thread Steven Atreju
Doug Ewell d...@ewellic.org wrote:

 |Steven Atreju wrote:
 |
 | If Unicode *defines* that the so-called BOM is in fact a Unicode-
 | indicating tag that MUST be present,
 |
 |But Unicode does not define that.

Nope.  On http://unicode.org/faq/utf_bom.html i read:

  Q: Why do some of the UTFs have a BE or LE in their label,
  such as UTF-16LE?

So it seems to me that the Unicode Consortium takes care of
newbies and those people who work at a very high programming
level, say, PHP, Flash, JavaScript or even no programming at all.
And:

  Q: Is the UTF-8 encoding scheme the same irrespective of whether
  the underlying processor is little endian or big endian?
  ...
  Where a BOM is used with UTF-8, it is only used as an ecoding
  signature to distinguish UTF-8 from other encodings — it has
  nothing to do with byte order.

Fifteen years ago i think i would have put effort in including the
BOM after reading this, for complete correctness!  I'm pretty sure
that i really would have done so.

So, given that this page ranks 3rd when searching for «utf-8 bom»
from within Germany, i would 1) fix the «ecoding» typo and 2)
change this to be less «neutral».  The answer to «Q.» is
simply «Yes.  Software should be capable of stripping an encoded BOM
in UTF, because some softish Unicode processors fail to do so when
converting between different multioctet UTF schemes.  Using BOM
with UTF-8 is not recommended.»
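
The «strip an encoded BOM» advice boils down to a three-byte check.  A
minimal sketch (the helper name is my own, not from any standard API):

```python
# Minimal sketch: strip a leading UTF-8 signature (EF BB BF) if present.
# Mirrors the "software should be capable of stripping" advice above.

UTF8_BOM = b"\xef\xbb\xbf"

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature, if one is present."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data

print(strip_utf8_bom(b"\xef\xbb\xbfhello"))  # b'hello'
print(strip_utf8_bom(b"hello"))              # b'hello'
```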

 | I know that, in Germany, many, many small libraries become closed
 | because there is not enough money available to keep up with the
 | digital race, and even the greater *do* have problems to stay in
 | touch!
 |
 |People like to complain about the BOM, but no libraries are shutting 
 |down because of it. Keeping up with the digital race isn't about 
 |handling two or three bytes at the beginning of a text file, in a way 
 |that has been defined for two decades.

RFC 2279 doesn't mention the BOM.

Looking at my 119,90 German Mark Unicode 3.0 book, there is
indeed talk about the UTF-8 BOM.  We have (2.7, page 28)
«Conformance to the Unicode Standard does not requires the use of
the BOM as such a signature» (typo quoted as printed; or is it not a
typo?), and (13.6, page 324) «..never any questions of byte order
with UTF-8 text, this sequence can serve as signature for .. this
sequence of bytes will be extremely rare at the beginning of text
files in other encodings ... for example []Microsoft Windows[]».

So this is fine.  It seems UTF-16 and UTF-32 were never meant for
data exchange and the BOM was really a byte order indicator for a
consumer that was aware of the encoding but not the byte order.
And UTF-8 got an additional «wohooo - i'm Unicode text» signature
tag, though optional.  I like the term «extremely rare» sooo much!!
:-)

I restart my «rant» UTF-8 filetype thread from the beginning now.
I wonder: was the Unicode Consortium really so lacking in confidence?
Do i really read «UTF-8 will drown in this evil mess of terroristic
charsets, so raise the torch of freedom in this unfriendly
environment!»?
I have downloaded the 6.0 and 6.1 stuff as a PDF and for free (:-.

If you know how to deal with UTF-8, you can deal with UTF-8.
If you don't, no signature ever will help you, no?!

If you don't know the charset of some text that comes from
nowhere, i.e., no container format with meta-information, no
filetype extension with implicit meta-information, as is used on
Mac OS and DOS, then UTF-8 is still very easily identifiable by
itself due to the way the algorithm is designed.  Is it??
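
The self-identification claim can be sketched mechanically: a strict
UTF-8 decode rejects almost any non-UTF-8 byte stream, because multibyte
sequences must follow the 110xxxxx/10xxxxxx bit patterns.  A rough
illustration (pure ASCII also validates, so this only answers «could be
UTF-8», not «is UTF-8»):

```python
# Sketch: heuristic UTF-8 detection by strict decoding.

def looks_like_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("Müller".encode("utf-8")))   # True
print(looks_like_utf8("Müller".encode("latin-1"))) # False: stray 0xFC byte
```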

Tear down the wall!
Tear down the wall!
Tear down the wall!

 |It's about technologies and 
 |standards and platforms and formats that change incompatibly every few 
 |years.

That is of course true.

But what to do with these myriads of aggressive nerds that linger
in these neon-enlightened four square meter boxes, with their
poignant hunger for penthouse windows and four-cylinder
Mercedes-Benz limousines?  I'm asking you.  I've seen photos of
standard committees in palm-covered bays (CSS2?  DOM?  W3M
anyway), i've dropped my subscription to regular IETF discussion
because i can stand only so many dozens of dinner,
hotel-room reservation, laptop-compatible socket in Paris? and
whatever threads (the annual ladies steakhouse meeting!).  So here
you are.  These people have deserved it, and no better.

  Steven



Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-16 Thread Leif Halvard Silli
Steven Atreju, Mon, 16 Jul 2012 13:35:04 +0200:
 Doug Ewell d...@ewellic.org wrote:

 And:
 
   Q: Is the UTF-8 encoding scheme the same irrespective of whether
   the underlying processor is little endian or big endian?
   ...
   Where a BOM is used with UTF-8, it is only used as an ecoding
   signature to distinguish UTF-8 from other encodings — it has
   nothing to do with byte order.
 
 Fifteen years ago i think i would have put effort in including the
 BOM after reading this, for complete correctness!  I'm pretty sure
 that i really would have done so.

I believe that most people who are conscious about inserting the BOM 
do so because, without it, Web browsers (with Chrome as the 
exception, at least whenever the page contains non-ASCII characters) 
are unlikely to sniff a UTF-8 encoded page as UTF-8 encoded. So it 
has nothing to do with complete correctness, but everything to do 
with complete safety.

 So, given that this page ranks 3rd when searching for «utf-8 bom»
 from within Germany, i would 1) fix the «ecoding» typo and 2)
 change this to be less «neutral».  The answer to «Q.» is
 simply «Yes.  Software should be capable of stripping an encoded BOM
 in UTF, because some softish Unicode processors fail to do so when
 converting between different multioctet UTF schemes.  Using BOM
 with UTF-8 is not recommended.»

The current text is much preferable. Also, you put the cart before 
the horse. You place tools over users.

There is one reason to use the UTF-8 BOM which that FAQ point doesn't 
mention, however, and that is that Chrome/Safari/Webkit plus IE treat a 
UTF-8 encoded text/html page with a BOM differently from a UTF-8 encoded 
text/html page without a BOM - even when the page is otherwise properly 
labelled as UTF-8. For the former, the user is not able to 
override the encoding manually. Whereas for a page without the BOM, 
the user can override the encoding and shoot themselves (and others) 
in the foot.

 And UTF-8 got an additional «wohooo - i'm Unicode text» signature
 tag, though optional.  I like the term «extremely rare» sooo much!!
 :-)

What's the problem?

 If you know how to deal with UTF-8, you can deal with UTF-8.
 If you don't, no signature ever will help you, no?!

Do you mean that, instead of the wohoo, one should do more thorough 
sniffing? I have no insight into how reliable such non-BOM sniffing is, 
but I take it that it is much less secure than BOM sniffing. Hence it 
would be risky (?) to deny users the ability to override the encoding 
of a non-BOM-sniffed page. Which, bottom line, means that the BOM has 
an advantage.

 If you don't know the charset of some text, that comes from
 nowhere, i.e., no container format with meta-information, no
 filetype extension with implicit meta-information, as is used on
 Mac OS and DOS, then UTF-8 is still very easily identifieable by
 itself due to the way the algorithm is designed.  Is it??

As I just said in a reply to Doug: Of the Web browsers in current use, 
Chrome is the very best. This is, I think, because it, to a higher 
degree than the competition, assumes UTF-8 whenever it finds non-ASCII 
characters. Clearly, sniffing could improve. At least in the browser 
world. But is that also true for command-line tools?
-- 
Leif H Silli




RE: pre-HTML5 and the BOM

2012-07-16 Thread Doug Ewell
Leif Halvard Silli xn--mlform-iua at xn--mlform-iua dot no wrote:

 So, in a way, the ZWNBSP - or any other non-ASCII character (it would
 in fact be better to use U+200B, to reserve the U+FEFF for its
 designated BOM purpose) could serve as a UTF-8 sniff character not
 only when it is the first character of the document, but also
 elsewhere in documents. And this already happens ...

My normal signature block includes a soft hyphen, U+00AD, which is C2
AD in UTF-8, for test purposes and as a hint that the message is UTF-8.
The Web interface from which I'm sending this particular message may or
may not preserve this character.
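
(Doug's canary works because U+00AD encodes to the two bytes C2 AD in
UTF-8 but to a single byte AD in Latin-1, so a receiver that guesses the
wrong charset is likely to notice.  A quick check of the bytes, Python
used here only for illustration:)

```python
# U+00AD SOFT HYPHEN as an encoding canary: two bytes in UTF-8, one in Latin-1.
soft_hyphen = "\u00ad"
print(soft_hyphen.encode("utf-8").hex())    # c2ad
print(soft_hyphen.encode("latin-1").hex())  # ad
```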

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell ­





Re: pre-HTML5 and the BOM

2012-07-16 Thread Leif Halvard Silli
Doug Ewell, Sat, 14 Jul 2012 15:14:10 -0600:
 Philippe Verdy wrote:
 
 It would break if the only place where to place a BOM is just the
 start of a file. But as I propose, we allow BOMs to occur anywhere to
 specify which encoding to use to decode what follows each one, even
 shell scripts would work [ snip ]

 U+FEFF is specifically defined as having the BOM semantic only when 
 it appears at the beginning of the file or stream. Everywhere else, 
 it can have only the ZWNBSP semantic.

True. That said: Of the Web browsers in current use, Chrome is the very 
best (read: most aggressive) at UTF-8 sniffing. The others hardly sniff 
anything but the BOM. For example, if you make a UTF-8 encoded page 
which contains nothing but ASCII - except a U+FEFF character (or any 
other non-ASCII character) inside the class= attribute of e.g. the 
html element - then Chrome will sniff it as UTF-8 encoded. Whereas IE, 
Webkit, Opera, Firefox will default to ISO-8858-1/Windows-1252.

So, in a way, the ZWNBSP - or any other non-ASCII character (it would 
in fact be better to use U+200B, to reserve the U+FEFF for its 
designated BOM purpose) could serve as a UTF-8 sniff character not 
only when it is the first character of the document, but also elsewhere 
in documents. And this already happens ...

(Maybe we see here a reflection of how Chrome is colored by its 
owner's role as a giant social media content producer/facilitator, 
whereas the other browser vendors are too much stuck in their 
back-compatibility mantra.)
-- 
Leif Halvard Silli



Copyleft

2012-07-16 Thread Jean-François Colson
Recently, the Canadian symbols 🅪 (marque de commerce) and 🅫 (marque 
déposée) have been added to Unicode at U+1F16A and U+1F16B.


Would it be possible to add the copyleft symbol in the neighbourhood?
It looks like a reversed ©. Today, to type it, I use a reversed c with a 
combining enclosing circle ↄ⃝ , but that’s only a loose approximation.
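
(Spelled out, that approximation is the two-character sequence U+2184
LATIN SMALL LETTER REVERSED C followed by U+20DD COMBINING ENCLOSING
CIRCLE; the capital form would use U+2183 instead.  A small demo:)

```python
# The "loose approximation": reversed c (U+2184) plus a combining
# enclosing circle (U+20DD) - two code points, not one symbol.
copyleft_approx = "\u2184\u20dd"
print(copyleft_approx)                           # ↄ⃝
print([hex(ord(ch)) for ch in copyleft_approx])  # ['0x2184', '0x20dd']
```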





Re: pre-HTML5 and the BOM

2012-07-16 Thread Jean-François Colson

Le 14/07/12 23:14, Doug Ewell a écrit :


A related question, though, is why some people think the sky will fall 
if a text file contains loose zero-width no-break spaces. U+FEFF is 
the very model of a default ignorable code point.
I don’t think the sky will fall but I say there still are a few 
programming languages which, in some specific conditions, may produce an 
error when they meet a BOM.





Re: Copyleft

2012-07-16 Thread Leo Broukhis
Ↄ⃝ may be a better approximation.

Leo

On Mon, Jul 16, 2012 at 10:47 AM, Jean-François Colson j...@colson.eu wrote:
 Recently, the Canadian symbols 🅪 (marque de commerce) and 🅫 (marque
 déposée) have been added to Unicode at U+1F16A and U+1F16B.

 Would it be possible to add the copyleft symbol in the neighbourhood?
 It looks like a reversed ©. Today, to type it, I use a reversed c with a
 combining enclosing circle ↄ⃝ , but that’s only a loose approximation.






RE: Copyleft

2012-07-16 Thread Doug Ewell
There was a discussion on this list around May 2000 regarding the
so-called copyleft symbol. There were concerns that it was not really a
symbol with legal standing, like © and ® and ™, but more of a logo,
notably one worn on T-shirts by followers of a sort of social movement.
Eventually it was more or less decided that the combinations with U+20DD
were sufficient.

Obviously, with recent developments in the type of symbols that have
been encoded, the objections expressed in 2000 might no longer apply.

—Doug

 
 Original Message 
Subject: Re: Copyleft
From: Leo Broukhis l...@mailcom.com
Date: Mon, July 16, 2012 4:08 pm
To: Jean-François_Colson j...@colson.eu
Cc: unicode@unicode.org

Ↄ⃝ may be a better approximation.

Leo

On Mon, Jul 16, 2012 at 10:47 AM, Jean-François Colson j...@colson.eu
wrote:
 Recently, the Canadian symbols 🅪 (marque de commerce) and 🅫 (marque
 déposée) have been added to Unicode at U+1F16A and U+1F16B.

 Would it be possible to add the copyleft symbol in the neighbourhood?
 It looks like a reversed ©. Today, to type it, I use a reversed c with a
 combining enclosing circle ↄ⃝ , but that’s only a loose approximation.






Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-16 Thread Doug Ewell
Steven Atreju wrote:

   Q: Is the UTF-8 encoding scheme the same irrespective of whether
   the underlying processor is little endian or big endian?
   ...
   Where a BOM is used with UTF-8, it is only used as an ecoding
   signature to distinguish UTF-8 from other encodings — it has
   nothing to do with byte order.
 ...
 So, given that this page ranks 3rd when searching for «utf-8 bom» from
 within Germany, i would 1) fix the «ecoding» typo and 2) change
 this to be less «neutral».  The answer to «Q.» is simply «Yes.
 Software should be capable of stripping an encoded BOM in UTF, because
 some softish Unicode processors fail to do so when converting
 between different multioctet UTF schemes.  Using BOM with UTF-8 is not
 recommended.»

That's an answer to a different question. Yes, the UTF-8 encoding scheme
is the same irrespective of whether the underlying processor is
little-endian or big-endian. The FAQ question you quoted doesn't address
whether a BOM is desirable for UTF-8. This is one reason I prefer the
term "signature" or "U+FEFF" instead of "BOM" when talking about UTF-8.

 RFC 2279 doesn't note the BOM.

RFC 2279 was superseded by RFC 3629 almost nine years ago. RFC 3629 has
a whole section (6) about the U+FEFF signature.

 Looking at my 119,90.- German Mark Unicode 3.0 book,

The Unicode 3.0 book was an excellent resource, but it was released
almost 12 years ago. Some of it may not reflect the latest information
or recommendations.

 there is indeed talk about the UTF-8 BOM.  We have (2.7, page 28)
 «Conformance to the Unicode Standard does not requires the use of the
 BOM as such a signature» (typo taken plain; or is it no typo?), and
 (13.6, page 324) «..never any questions of byte order with UTF-8 text,
 this sequence can serve as signature for .. this sequence of bytes
 will be extremely rare at the beginning of text files in other
 encodings ... for example []Microsoft Windows[]».

 So this is fine.  It seems UTF-16 and UTF-32 were never meant for data
 exchange and the BOM was really a byte order indicator for a consumer
 that was aware of the encoding but not the byte order.

The part of 13.6 you quoted doesn't make any statement at all about
UTF-16 or UTF-32. Back when Unicode was conceived, the 16-bit format was
the only one envisioned for data exchange.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell ­





Re: pre-HTML5 and the BOM

2012-07-16 Thread Philippe Verdy
2012/7/16 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no:
 html element, then Chrome will sniff it as UTF-8 encoded. Whereas IE,
 Webkit, Opera, Firefox will default to ISO-8858-1/Windows-1252.

Actually ISO 885**9**-1. But we've also been told that, given the C1
controls are simply invalid for HTML, even if a site indicates
ISO-8859-1, it will be interpreted as Windows-1252 (meaning there
will remain a few unassigned byte values that are invalid, causing the
HTML parser to try other encodings if they are found, but not UTF-8,
which will be invalid there too and could as well raise
exceptions). Most of these exceptions however will just be remapped to
the U+FFFD replacement character.

The support of legacy encodings is now more restrictive in HTML5, which
only supports UTF-8 and Windows-1252, plus a few other encodings
(ASCII is now considered an alias of Windows-1252, also for
compatibility reasons, even if strict US-ASCII resources could be
interpreted without changes as UTF-8), and requires explicit encoding
(sniffing no longer works for anything other than UTF-8, via its leading
BOM interpreted as a data signature and not as a character).



Re: pre-HTML5 and the BOM

2012-07-16 Thread Philippe Verdy
2012/7/15 David Starner prosfil...@gmail.com:
 /tmp $ echo -n a > file1
 /tmp $ echo b > file2
 /tmp $ cat file1 file2 > file3
 /tmp $ echo ab | diff -q - file3

Once again the problem is the /bin/cat tool, which is used for
everything and is agnostic about preserving text semantics. Using
another cat that is Unicode aware would solve the problem.

Same thing about diff, which is however only designed to work with text
files and should be Unicode aware by default.

Maybe there should be a new standard in Unix for /ubin/ being present
for Unicode-aware tools and insertable in users' PATH environments if
needed. Allowing migrations to newer standards.

 This is expected behavior, and with if statements is probably done by
 thousands of scripts. Add a hidden BOM at the start of file2 and this
 whole thing breaks, as diff is going to find them different. Again,
 diff is an ancient tool that deals with all sorts of text, quasi-text
 and binary matter, and frankly aBOMb is different from ab. If we're
 building a C file with Unix tools, if a char *c = "ab"; suddenly
 becomes char *c = "BOMab"; i don't know by what semantics you expect
 that to work the same. And the very model of a default ignorable code
 point is likely to be the very model of a bug that will hide in plain
 sight.