Re: Fwd: Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Philippe Verdy
And those early versions of Notepad for 16/32-bit Windows were not even Unicode compliant (the support for Unicode was minimalist, in fact Unicode was only partly supported on top of the old ANSI/OEM APIs; without support for the filesystem, and lots of quirks at the kernel level caused by conver

RE: Fwd: Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Doug Ewell
Steven Atreju wrote: > Funny that a program that cannot handle files larger than 0x7FFF > bytes (last time i've used it, 95B) has such a large impact. Notepad hasn't had this limitation since Windows Me. That was many, many years ago. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.

Fwd: Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Steven Atreju
Original Message Date: Wed, 18 Jul 2012 13:45:59 +0200 From: Steven Atreju To: "Doug Ewell" Subject: Re: UTF-8 BOM (Re: Charset declaration in HTML) Doug Ewell wrote: |For those who haven't yet had enough of this debate yet, here's a link |to an informa

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Martin J. Dürst
Hello Doug, On 2012/07/18 0:35, Doug Ewell wrote: For those who haven't yet had enough of this debate yet, here's a link to an informative blog (with some informative comments) from Michael Kaplan: "Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!)" http://blogs.

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Martin J. Dürst
Hello Philippe, On 2012/07/18 3:37, Philippe Verdy wrote: 2012/7/17 Julian Bradfield: On 2012-07-16, Philippe Verdy wrote: I am also convinced that even Shell interpreters on Linux/Unix should recognize and accept the leading BOM before the hash/bang starting line (which is commonly used for

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Philippe Verdy
2012/7/17 Julian Bradfield : > On 2012-07-16, Philippe Verdy wrote: >> I am also convinced that even Shell interpreters on Linux/Unix should >> recognize and accept the leading BOM before the hash/bang starting >> line (which is commonly used for filetype identification and runtime > The kernel do

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Julian Bradfield
On 2012-07-16, Philippe Verdy wrote: > I am also convinced that even Shell interpreters on Linux/Unix should > recognize and accept the leading BOM before the hash/bang starting > line (which is commonly used for filetype identification and runtime > behavior), without claiming that they don't kno
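
A minimal sketch in Python (not from the thread) of why a leading BOM interferes with the hash/bang line: on Linux a script is recognised by its first two bytes being `#!`, and a UTF-8 signature puts the three bytes EF BB BF in front of them. The function names `has_shebang` and `has_shebang_after_bom` are hypothetical, chosen only for illustration.

```python
import codecs

def has_shebang(data: bytes) -> bool:
    """True if the very first two bytes are '#!' (the check a strict loader makes)."""
    return data[:2] == b"#!"

def has_shebang_after_bom(data: bytes) -> bool:
    """True if '#!' follows a UTF-8 signature (the lenient check proposed in the thread)."""
    return data.startswith(codecs.BOM_UTF8 + b"#!")

script = b"#!/bin/sh\necho hi\n"
with_bom = codecs.BOM_UTF8 + script

print(has_shebang(script))              # strict check on a BOM-less script
print(has_shebang(with_bom))            # strict check on a script with a BOM
print(has_shebang_after_bom(with_bom))  # lenient check on a script with a BOM
```

The strict check fails on the BOM-prefixed file because its first byte is 0xEF, not `#`.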

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Doug Ewell
For those who haven't yet had enough of this debate yet, here's a link to an informative blog (with some informative comments) from Michael Kaplan: "Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!)" http://blogs.msdn.com/b/michkap/archive/2005/01/20/357028.aspx Wh

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Steven Atreju
Philippe Verdy wrote: |2012/7/16 Steven Atreju : |> Fifteen years ago i think i would have put effort in including the |> BOM after reading this, for complete correctness! I'm pretty sure |> that i really would have done so. | |Fifteen years ago I would not have advocated it. Simply becau

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-16 Thread Doug Ewell
Steven Atreju wrote: > Q: Is the UTF-8 encoding scheme the same irrespective of whether > the underlying processor is little endian or big endian? > ... > Where a BOM is used with UTF-8, it is only used as an encoding > signature to distinguish UTF-8 from other encodings - it has > noth
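
To illustrate the FAQ passage quoted above, here is a small Python sketch (not from the thread) that treats the UTF-8 BOM purely as an encoding signature: the fixed byte sequence EF BB BF, which is the same on little-endian and big-endian machines. The helper name `strip_utf8_bom` is hypothetical.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# U+FEFF always encodes to the same three bytes in UTF-8; no byte order is involved.
print("\ufeff".encode("utf-8") == codecs.BOM_UTF8)
print(strip_utf8_bom(codecs.BOM_UTF8 + b"text"))
print(strip_utf8_bom(b"text"))
```

Python's built-in `utf-8-sig` codec does the same stripping while decoding, so `data.decode("utf-8-sig")` accepts input with or without the signature.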

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-16 Thread Philippe Verdy
2012/7/16 Steven Atreju : > Fifteen years ago i think i would have put effort in including the > BOM after reading this, for complete correctness! I'm pretty sure > that i really would have done so. Fifteen years ago I would not have advocated it. Simply because support of UTF-8 was very poor (a

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-16 Thread Leif Halvard Silli
Steven Atreju, Mon, 16 Jul 2012 13:35:04 +0200: > "Doug Ewell" wrote: > And: > > Q: Is the UTF-8 encoding scheme the same irrespective of whether > the underlying processor is little endian or big endian? > ... > Where a BOM is used with UTF-8, it is only used as an encoding > signature

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-16 Thread Steven Atreju
"Doug Ewell" wrote: |Steven Atreju wrote: | |> If Unicode *defines* that the so-called BOM is in fact a Unicode- |> indicating tag that MUST be present, | |But Unicode does not define that. Nope. On http://unicode.org/faq/utf_bom.html i read: Q: Why do some of the UTFs have a BE or LE

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-15 Thread Doug Ewell
Steven Atreju wrote: If Unicode *defines* that the so-called BOM is in fact a Unicode- indicating tag that MUST be present, But Unicode does not define that. I know that, in Germany, many, many small libraries become closed because there is not enough money available to keep up with the digi

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-14 Thread Steven Atreju
Eli Zaretskii wrote: |> Date: Fri, 13 Jul 2012 22:07:54 +0200 |> From: Steven Atreju |> Cc: unicode@unicode.org |> |> this time without reply-in-same-charset and |> encoding=8bit and i bet it comes out as UTF-8 on the other end: | |Yes, it does. ..cheer.. Steven

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Eli Zaretskii
> Date: Fri, 13 Jul 2012 22:07:54 +0200 > From: Steven Atreju > Cc: unicode@unicode.org > > this time without reply-in-same-charset and > encoding=8bit and i bet it comes out as UTF-8 on the other end: Yes, it does.

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Steven Atreju
Philippe Verdy wrote: |2012/7/13 Steven Atreju : |> Philippe Verdy wrote: |> |> |2012/7/12 Steven Atreju : |> |> UTF-8 is a bytestream, not multioctet(/multisequence). |> |Not even. UTF-8 is a text-stream, not made of arbitrary sequences of |> |bytes. It has a lot of internal semantic

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Steven Atreju
Eli Zaretskii wrote: |> For example, this mail is |> written in an UTF-8 enabled vi(1) basically from 1986, in UTF-8 |> encoding («Schöne Überraschung, gelle?» | |No, it isn't: | |Content-Type: text/plain; charset=ISO-8859-1 Oh, it's really terrible. I do have 'reply-in-same-charset'

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Eli Zaretskii
> Date: Fri, 13 Jul 2012 16:04:44 +0200 > From: Steven Atreju > > For example, this mail is > written in an UTF-8 enabled vi(1) basically from 1986, in UTF-8 > encoding («Schöne Überraschung, gelle?» No, it isn't: User-Agent: S-nail <12.5 7/5/10;s-nail-9-g517ac44-dirty> MIME-Version: 1.

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Philippe Verdy
2012/7/13 Steven Atreju : > Philippe Verdy wrote: > > |2012/7/12 Steven Atreju : > |> UTF-8 is a bytestream, not multioctet(/multisequence). > |Not even. UTF-8 is a text-stream, not made of arbitrary sequences of > |bytes. It has a lot of internal semantics and constraints. Some things > |are

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Steven Atreju
Philippe Verdy wrote: |2012/7/12 Steven Atreju : |> UTF-8 is a bytestream, not multioctet(/multisequence). |Not even. UTF-8 is a text-stream, not made of arbitrary sequences of |bytes. It has a lot of internal semantics and constraints. Some things |are very meaningful, some play absolutely

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Philippe Verdy
2012/7/12 Steven Atreju : > UTF-8 is a bytestream, not multioctet(/multisequence). Not even. UTF-8 is a text-stream, not made of arbitrary sequences of bytes. It has a lot of internal semantics and constraints. Some things are very meaningful, some play absolutely no role at all and could even be d

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Philippe Verdy
Right. Unix was unique when it was created as it was built to handle all files as unstructured binary files. The history is a lot different, and text files have always used another paradigm, based on line records. End of lines initially were not really control characters. And even today the Unix-sty

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Steven Atreju
Leif Halvard Silli wrote: |Steven Atreju, Thu, 12 Jul 2012 12:32:46 +0200: | |> In the meanwhile the UTF-8 BOM is in the standard and thus |> contradicts forty years of (well) good (Unix/POSIX) engineering |> and craftsmanship. Where a file is a file and everything is a |> file, holistica

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Julian Bradfield
On 2012-07-12, Steven Atreju wrote: > In the future simple things like '$ cat File1 File2 > File3' will > no longer work that easily. Currently this works *whatever* file, > and even program code that has been written more than thirty years > ago will work correctly. No! You have to modify cont
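
The `cat File1 File2 > File3` concern can be sketched in Python (an illustration, not code from the thread): joining BOM-prefixed files byte for byte leaves a U+FEFF in the middle of the result, while a BOM-aware join drops each file's signature first. The names `concat_bytes` and `concat_text` are hypothetical.

```python
import codecs

def concat_bytes(*parts: bytes) -> bytes:
    """Byte-level concatenation, which is what cat(1) does."""
    return b"".join(parts)

def concat_text(*parts: bytes) -> bytes:
    """BOM-aware concatenation: strip each part's UTF-8 signature before joining."""
    return "".join(p.decode("utf-8-sig") for p in parts).encode("utf-8")

file1 = codecs.BOM_UTF8 + b"one\n"
file2 = codecs.BOM_UTF8 + b"two\n"

naive = concat_bytes(file1, file2)
# Decoding strips only the leading signature; the second one survives as U+FEFF.
print(repr(naive.decode("utf-8-sig")))
print(repr(concat_text(file1, file2).decode("utf-8")))
```

Whether the stray U+FEFF matters depends on the consumer: as a zero-width no-break space it is invisible in most renderers, but byte-oriented tools will see three extra bytes.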

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread David Starner
On Thu, Jul 12, 2012 at 4:06 AM, Leif Halvard Silli wrote: > I guess you get the same problem with UTF-16 files also, then? UTF-16 isn't a text file in the Unix world; it's a binary file. UTF-8 is the only standard Unicode encoding that acts like text to a Unix system, basically because it was de

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Leif Halvard Silli
Steven Atreju, Thu, 12 Jul 2012 12:32:46 +0200: > In the meanwhile the UTF-8 BOM is in the standard and thus > contradicts forty years of (well) good (Unix/POSIX) engineering > and craftsmanship. Where a file is a file and everything is a > file, holistically. Where small tools which do their t

UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Steven Atreju
|> As for editors: If your own editor has no problems with the BOM, then |> what? But I think Notepad can also save as UTF-8 but without the BOM - |> it should be possible to get an option for choosing when you save |> it. | |Perhaps there should be such an option in Notepad, but there is