Re: Fwd: Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Philippe Verdy
And those early versions of Notepad for 16/32-bit Windows were not even Unicode compliant (the support for Unicode was minimalist, in fact Unicode was only partly supported on top of the old ANSI/OEM APIs; without support for the filesystem, and lots of quirks at the kernel level caused by conver

RE: Fwd: Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Doug Ewell
Steven Atreju wrote: > Funny that a program that cannot handle files larger than 0x7FFF > bytes (last time i've used it, 95B) has such a large impact. Notepad hasn't had this limitation since Windows Me. That was many, many years ago. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.

Fwd: Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Steven Atreju
Original Message Date: Wed, 18 Jul 2012 13:45:59 +0200 From: Steven Atreju To: "Doug Ewell" Subject: Re: UTF-8 BOM (Re: Charset declaration in HTML) Doug Ewell wrote: |For those who haven't yet had enough of this debate yet, here's a link |to an informa

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Martin J. Dürst
Hello Doug, On 2012/07/18 0:35, Doug Ewell wrote: For those who haven't yet had enough of this debate yet, here's a link to an informative blog (with some informative comments) from Michael Kaplan: "Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!)" http://blogs.

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Martin J. Dürst
Hello Philippe, On 2012/07/18 3:37, Philippe Verdy wrote: 2012/7/17 Julian Bradfield: On 2012-07-16, Philippe Verdy wrote: I am also convinced that even Shell interpreters on Linux/Unix should recognize and accept the leading BOM before the hash/bang starting line (which is commonly used for

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Philippe Verdy
2012/7/17 Julian Bradfield : > On 2012-07-16, Philippe Verdy wrote: >> I am also convinced that even Shell interpreters on Linux/Unix should >> recognize and accept the leading BOM before the hash/bang starting >> line (which is commonly used for filetype identification and runtime > The kernel do

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Julian Bradfield
On 2012-07-16, Philippe Verdy wrote: > I am also convinced that even Shell interpreters on Linux/Unix should > recognize and accept the leading BOM before the hash/bang starting > line (which is commonly used for filetype identification and runtime > behavior), without claiming that they don't kno

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Doug Ewell
For those who haven't yet had enough of this debate yet, here's a link to an informative blog (with some informative comments) from Michael Kaplan: "Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!)" http://blogs.msdn.com/b/michkap/archive/2005/01/20/357028.aspx Wh

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Steven Atreju
Philippe Verdy wrote: |2012/7/16 Steven Atreju : |> Fifteen years ago i think i would have put effort in including the |> BOM after reading this, for complete correctness! I'm pretty sure |> that i really would have done so. | |Fifteen years ago I would not have advocated it. Simply becau

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-16 Thread Doug Ewell
Steven Atreju wrote: > Q: Is the UTF-8 encoding scheme the same irrespective of whether > the underlying processor is little endian or big endian? > ... > Where a BOM is used with UTF-8, it is only used as an encoding > signature to distinguish UTF-8 from other encodings — it has > noth
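Editor's illustration of the FAQ passage quoted above: a minimal Python sketch, not taken from any message in the thread. It shows that U+FEFF always encodes to the same three bytes in UTF-8, whereas its UTF-16 form depends on byte order. The helper name `has_utf8_signature` is hypothetical.

```python
import codecs

def has_utf8_signature(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

bom = "\ufeff"

# UTF-8 is defined in terms of bytes, so there is only one possible result.
print(bom.encode("utf-8"))      # b'\xef\xbb\xbf'

# UTF-16 is defined in terms of 16-bit units, so the byte order matters.
print(bom.encode("utf-16-le"))  # b'\xff\xfe'
print(bom.encode("utf-16-be"))  # b'\xfe\xff'

print(has_utf8_signature(codecs.BOM_UTF8 + b"hello"))  # True
print(has_utf8_signature(b"hello"))                    # False
```

In UTF-8 the three bytes can therefore only say "this is probably UTF-8"; they carry no byte-order information.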

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-16 Thread Philippe Verdy
2012/7/16 Steven Atreju : > Fifteen years ago i think i would have put effort in including the > BOM after reading this, for complete correctness! I'm pretty sure > that i really would have done so. Fifteen years ago I would not have advocated it. Simply because support of UTF-8 was very poor (a

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-16 Thread Leif Halvard Silli
Steven Atreju, Mon, 16 Jul 2012 13:35:04 +0200: > "Doug Ewell" wrote: > And: > > Q: Is the UTF-8 encoding scheme the same irrespective of whether > the underlying processor is little endian or big endian? > ... > Where a BOM is used with UTF-8, it is only used as an encoding > signature

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-16 Thread Steven Atreju
"Doug Ewell" wrote: |Steven Atreju wrote: | |> If Unicode *defines* that the so-called BOM is in fact a Unicode- |> indicating tag that MUST be present, | |But Unicode does not define that. Nope. On http://unicode.org/faq/utf_bom.html i read: Q: Why do some of the UTFs have a BE or LE

Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-15 Thread Naena Guru
Hey, Philippe, Your input is much appreciated. So, in a nutshell, I don't have to worry. One of these days I need to crunch down (minify) the CSS and JavaScript pages. I left them readily readable so that techs like you could easily read them in place in any browser without having to pretty print.

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-15 Thread Doug Ewell
Steven Atreju wrote: If Unicode *defines* that the so-called BOM is in fact a Unicode- indicating tag that MUST be present, But Unicode does not define that. I know that, in Germany, many, many small libraries become closed because there is not enough money available to keep up with the digi

Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-15 Thread Naena Guru
On Tue, Jul 10, 2012 at 11:58 PM, Leif Halvard Silli < xn--mlform-...@xn--mlform-iua.no> wrote: > Naena Guru, Tue, 10 Jul 2012 01:40:19 -0500: > > > HTML5 assumes UTF-8 as the character set if you do not declare one > > explicitly. My current pages are in HTML 4. > > There is in principle no diffe

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-14 Thread Steven Atreju
Eli Zaretskii wrote: |> Date: Fri, 13 Jul 2012 22:07:54 +0200 |> From: Steven Atreju |> Cc: unicode@unicode.org |> |> this time without reply-in-same-charset and |> encoding=8bit and i bet it comes out as UTF-8 on the other end: | |Yes, it does. ..cheer.. Steven

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Eli Zaretskii
> Date: Fri, 13 Jul 2012 22:07:54 +0200 > From: Steven Atreju > Cc: unicode@unicode.org > > this time without reply-in-same-charset and > encoding=8bit and i bet it comes out as UTF-8 on the other end: Yes, it does.

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Steven Atreju
Philippe Verdy wrote: |2012/7/13 Steven Atreju : |> Philippe Verdy wrote: |> |> |2012/7/12 Steven Atreju : |> |> UTF-8 is a bytestream, not multioctet(/multisequence). |> |Not even. UTF-8 is a text-stream, not made of arbitrary sequences of |> |bytes. It has a lot of internal semantic

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Steven Atreju
Eli Zaretskii wrote: |> For example, this mail is |> written in an UTF-8 enabled vi(1) basically from 1986, in UTF-8 |> encoding («Schöne Überraschung, gelle?» | |No, it isn't: | |Content-Type: text/plain; charset=ISO-8859-1 Oh, it's really terrible. I do have 'reply-in-same-charset'

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Eli Zaretskii
> Date: Fri, 13 Jul 2012 16:04:44 +0200 > From: Steven Atreju > > For example, this mail is > written in an UTF-8 enabled vi(1) basically from 1986, in UTF-8 > encoding («Schöne Überraschung, gelle?» No, it isn't: User-Agent: S-nail <12.5 7/5/10;s-nail-9-g517ac44-dirty> MIME-Version: 1.

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Philippe Verdy
2012/7/13 Steven Atreju : > Philippe Verdy wrote: > > |2012/7/12 Steven Atreju : > |> UTF-8 is a bytestream, not multioctet(/multisequence). > |Not even. UTF-8 is a text-stream, not made of arbitrary sequences of > |bytes. It has a lot of internal semantics and constraints. Some things > |are

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Steven Atreju
Philippe Verdy wrote: |2012/7/12 Steven Atreju : |> UTF-8 is a bytestream, not multioctet(/multisequence). |Not even. UTF-8 is a text-stream, not made of arbitrary sequences of |bytes. It has a lot of internal semantics and constraints. Some things |are very meaningful, some play absolutely

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Philippe Verdy
2012/7/12 Steven Atreju : > UTF-8 is a bytestream, not multioctet(/multisequence). Not even. UTF-8 is a text-stream, not made of arbitrary sequences of bytes. It has a lot of internal semantics and constraints. Some things are very meaningful, some play absolutely no role at all and could even be d

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Philippe Verdy
Right. Unix was unique when it was created as it was built to handle all files as unstructured binary files. The history is a lot different, and text files have always used another paradigm, based on line records. End of lines initially were not really control characters. And even today the Unix-sty

Re: Charset declaration in HTML

2012-07-12 Thread Leif Halvard Silli
Naena Guru, Tue, 10 Jul 2012 01:40:19 -0500: > As I said, I use HTML-Kit (and Tools). Your problem appears to be that HTML-Kit does not directly support UTF-8. But are you aware that you can still work with UTF-8 with it? You only need to use UnicodePad in the Unicode menu of the Tools menu, s

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Steven Atreju
Leif Halvard Silli wrote: |Steven Atreju, Thu, 12 Jul 2012 12:32:46 +0200: | |> In the meanwhile the UTF-8 BOM is in the standard and thus |> contradicts forty years of (well) good (Unix/POSIX) engineering |> and craftsmanship. Where a file is a file and everything is a |> file, holistica

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Julian Bradfield
On 2012-07-12, Steven Atreju wrote: > In the future simple things like '$ cat File1 File2 > File3' will > no longer work that easily. Currently this works *whatever* file, > and even program code that has been written more than thirty years > ago will work correctly. No! You have to modify cont
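Editor's illustration of the concatenation question raised above: a minimal Python sketch, not code from the thread. It assumes every input is UTF-8 and shows that byte-wise joining (what `cat` does) leaves each file's signature in the output, while a BOM-aware join drops it. The function name `concat_utf8` and the sample strings are hypothetical.

```python
import codecs

def concat_utf8(chunks):
    """Join UTF-8 byte strings, dropping a leading signature from each one."""
    out = bytearray()
    for chunk in chunks:
        if chunk.startswith(codecs.BOM_UTF8):
            chunk = chunk[len(codecs.BOM_UTF8):]
        out += chunk
    return bytes(out)

# Two "files" as Notepad might save them, each starting with EF BB BF.
file1 = codecs.BOM_UTF8 + "eins\n".encode("utf-8")
file2 = codecs.BOM_UTF8 + "zwei\n".encode("utf-8")

# Byte-wise concatenation keeps both signatures; the second one ends up
# in the middle of the text as a stray U+FEFF.
naive = file1 + file2
print(naive.decode("utf-8").count("\ufeff"))  # 2

# The BOM-aware join produces plain text without any U+FEFF.
print(concat_utf8([file1, file2]))  # b'eins\nzwei\n'
```

Whether the output should itself start with a signature is a separate policy choice that this sketch leaves out.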

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread David Starner
On Thu, Jul 12, 2012 at 4:06 AM, Leif Halvard Silli wrote: > I guess you get the same problem with UTF-16 files also, then? UTF-16 isn't a text file in the Unix world; it's a binary file. UTF-8 is the only standard Unicode encoding that acts like text to a Unix system, basically because it was de

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Leif Halvard Silli
Steven Atreju, Thu, 12 Jul 2012 12:32:46 +0200: > In the meanwhile the UTF-8 BOM is in the standard and thus > contradicts forty years of (well) good (Unix/POSIX) engineering > and craftsmanship. Where a file is a file and everything is a > file, holistically. Where small tools which do their t

UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Steven Atreju
|> As for editors: If your own editor have no problems with the BOM, then |> what? But I think Notepad can also save as UTF-8 but without the BOM - |> there should be possible to get an option for choosing when you save |> it. | |Perhaps there should be such an option in Notepad, but there is

Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-11 Thread Doug Ewell
Leif Halvard Silli wrote: As for editors: If your own editor have no problems with the BOM, then what? But I think Notepad can also save as UTF-8 but without the BOM - there should be possible to get an option for choosing when you save it. Perhaps there should be such an option in Notepad, bu

Re: Charset declaration in HTML

2012-07-11 Thread Leif Halvard Silli
Philippe Verdy, Wed, 11 Jul 2012 14:15:39 +0200: > 2012/7/11 Jean-François Colson >> If your document only contains >> >> <?php >> header("location:http://unicode.org"); >> ?> >> >> but you save it with a BOM, the BOM will be sent and you’ll get an >> error message like >> >> Warning: Cannot modify

Re: Charset declaration in HTML

2012-07-11 Thread Jean-François Colson
On 11/07/12 14:15, Philippe Verdy wrote: 2012/7/11 Jean-François Colson: If your document only contains <?php header("location:http://unicode.org"); ?> but you save it with a BOM, the BOM will be sent and you’ll get an error message like Warning: Cannot modify hea

Re: Charset declaration in HTML

2012-07-11 Thread Philippe Verdy
2012/7/11 Jean-François Colson > If your document only contains > > <?php > header("location:http://unicode.org"); > ?> > > but you save it with a BOM, the BOM will be sent and you’ll get an error > message like > > Warning: Cannot modify header information - headers already sent by > (output started

Re: Charset declaration in HTML

2012-07-11 Thread Jean-François Colson
On 11/07/12 06:32, Philippe Verdy wrote: 2012/7/10 Naena Guru: I wanted to see how hard it is to edit a page in Notepad. So I made a copy of my LIYANNA page and replaced the character entities I used for Unicode Sinhala, accented Pali and Sanskrit with

Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-10 Thread Leif Halvard Silli
Philippe Verdy, Wed, 11 Jul 2012 07:36:56 +0200: > 2012/7/11 Leif Halvard Silli: >> In VIM, you set or unset the BOM via the commands >> >> set bomb >> set nobomb > > Should these command specify if your computer will explode when saving > the file ? > > :'o Probably signals the

Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-10 Thread Philippe Verdy
2012/7/11 Leif Halvard Silli : > it. Else you can use the free Notepad++. And many others. In VIM, you > set or unset the BOM via the commands > > set bomb > set nobomb Should these command specify if your computer will explode when saving the file ? :'o set bom set nobom Sorry,

Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-10 Thread Leif Halvard Silli
Naena Guru, Tue, 10 Jul 2012 01:40:19 -0500: > HTML5 assumes UTF-8 as the character set if you do not declare one > explicitly. My current pages are in HTML 4. There is in principle no difference between what HTML5-parsers assume and what HTML4-parsers assume: All of them default to the default

Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-10 Thread Philippe Verdy
2012/7/10 Naena Guru > I wanted to see how hard it is to edit a page in Notepad. So I made a copy > of my LIYANNA page and replaced the character entities I used for Unicode > Sinhala, accented Pali and Sanskrit with their raw letters. Notepad forced > me to save the file in UTF-8 format. I ran i

Re: Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-09 Thread Naena Guru
Thank you Otto. Sorry for delay in replying. I spent the entire Sunday replying Jaques twins. You are absolutely right about choice between ISO-8859-1 and UTF-8. I shouldn't have said 'using ISO-8859-1 is advantageous over UTF-8' It is efficient if your pages are written in a language that uses s

Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-04 Thread Otto Stolz
Hello Naena Guru, on 2012-07-04, you wrote: The purpose of declaring the character set as iso-8859-1 than utf-8 is to avoid doubling and trebling the size of the page by utf-8. I think, if you have characters outside iso-8859-1 and declare the page as such, you get Character-not-found for those
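Editor's illustration of the page-size argument above: a minimal Python sketch, not code from the thread. It compares the byte cost of a character written raw in UTF-8 with the same character written as a decimal numeric character reference in an ASCII or ISO-8859-1 page. The function names `as_ncr` and `sizes` are hypothetical, and only decimal references of the form `&#NNNN;` are considered.

```python
def as_ncr(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)

def sizes(text: str) -> dict:
    """Byte counts for raw UTF-8 versus ASCII with character references."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "ncr": len(as_ncr(text).encode("ascii")),
    }

# U+0D85 SINHALA LETTER AYANNA: three bytes in UTF-8,
# seven bytes when written as the reference "&#3461;".
print(as_ncr("\u0d85"))  # &#3461;
print(sizes("\u0d85"))   # {'utf-8': 3, 'ncr': 7}

# Pure ASCII text costs the same either way.
print(sizes("abc"))      # {'utf-8': 3, 'ncr': 3}
```

Under these assumptions, a page that needs characters outside ISO-8859-1 does not get smaller by declaring iso-8859-1, because each such character then has to be spelled out as a longer reference.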