Re: Corrigendum #9 clarifies noncharacter usage in Unicode

2013-02-21 Thread Steven Atreju
  ..
  The UTF-8, UTF-16, UTF-32  BOM FAQ
  http://www.unicode.org/faq/utf_bom.html
  has also been updated for clarity,

Very nice, but i wonder why the paragraph on noncharacters can be
found under UTF-16 instead of under some generic, non-Microsoft
specific topic.
Thanks

  Steven



Re: Capitalization in German

2013-02-21 Thread Steven Atreju
 |I was trying to follow this:

All that you say, but…

 |Karl May for many decades, so it's not like it's a new thing.

For me there is only one drunken criminal, and that is Hemingway
and his famous book, «Captain Iglo and the fish sticks».

 |I am not sure I can follow the discourse structure of your argument.

It seems i'm a bit off topic.

  Steven




Re: Capitalization in German

2013-02-20 Thread Steven Atreju
 | Here is a minimal pair to illustrate that point:
 | Er hat in Moskau liebe Genossen.
 | Er hat in Moskau Liebe genossen.
 | which translates to:
 | In Moscow, he has dear comrades.
 | In Moscow, he has enjoyed love.
 | 
 | A classic joke is this pair of newspaper headlines:
 | 
 |   Der Gefangene floh
 |   Der gefangene Floh
 | 
 | which translates to
 | 
 |   The Prisoner Escaped
 |   The Caught Flea
 |
 |So in this case, the imaginary newspapers made use of written forms 
 |that they perhaps would not have used orally, if instead of newspapers 
 |they had been Radio channels. 
 |
 |The general subject here is the fact that “outer” things, such as the
 |(effect of the) “look” of the language, affect the “inner” things,
 |namely how we use the language.

Better fit for Germany would be »how we are supposed to use the
language«, *though*.
..Just because today it was announced that Otfried Preußler
(..search Preussler..) died (at the age of 89)..
..Sigh..
A terrible movement in the German language is the killing of »evil words«
in children's books.
A few years ago Astrid Lindgren's »Pippi Langstrumpf« was forbidden
to use the word »Neger« (Negro), after a (i think long) trial
fought by the family.  And a few weeks ago the word »Negerlein«
(diminutive form of »Neger«) was axed from Otfried
Preußler's »Die kleine Hexe« (The Little Witch) [1, German].
That is, future generations will read clean books.
And the mentioned website may have good contents, but a »Graf
Ortho« is poor given that the German »Sesamstrasse« has had
a »Graf Zahl« (Count Number) for decades.  Well, it may be easier
for kids to overcome their inhibition level like that.
*In the seventies young girls thought on their own [2,3]*!
And at the same time Germany has become the 3rd largest
arms-exporting country.  As a side note, the german garbage
collection is called »Veolia«.
For the clean-away in us.
Had to be said.

  Steven

[1] 
http://www.berliner-zeitung.de/literatur/sprachstreit-um--die-kleine-hexe--die--boesen-woerter--von-otfried-preussler,10809200,21450236.html
[2] http://www.youtube.com/watch?v=EshRTkJ7M6A
[3] http://www.youtube.com/watch?v=utSnv-0_tzY (120 seconds, please!)




Re: I missed my self-imposed deadline for the Mayan numeral proposal

2012-12-25 Thread Steven Atreju
Philippe Verdy verd...@wanadoo.fr wrote:

 [.]
 |then the real catastrophe occured 394 years ago, in 1618, just because of
 |the conquest of America by Spanish troops : which meant a massive death of
 |lots of Amerindians (most of them due to imported infections, to which

Terrible and ridiculous little selfish infections.

 |Amerindians were not protected, but also due to the end of development of
 |the Mayan civilization caused by their internal wars, their concentration

One won't believe how stupid these natives were.
Ritual torture of the *own* people.
Also Polynesians and Aborigines, Indians and some african
natives -- all foolish enough to torture themselves instead of
others.
Fortunately these times are over, thanks to the missionaries who
altruistically spent their lives spreading the light all over the
planet.

It must be said though that it seems as if countermovements rise
on the very sphere, a motion that is absolutely inconceivable.
And it'll possibly be a bit frightening to participate in the
further developments of events.
Good to know to be on the right side.

On the other hand it is quite spirited that Unicode covers the
language of so many cultures.  Even of some that didn't use
written text on their own, and originally.
As a matter of fact sometimes something similar to what can be
called a face shines through.
Rare events can be savoured much more intensely.

  Steven



Re: Tool to convert characters to character names

2012-12-20 Thread Steven Atreju
Martin J. Dürst due...@it.aoyama.ac.jp wrote:

 |I'm looking for a (preferably online) tool that converts Unicode 
 |characters to Unicode character names. Richard Ishida's tools 
 |(http://rishida.net/tools/conversion/) do a lot of conversions, but not 
 |names.

For what it's worth, that sounds like a perfect task for a standard
perl(1):

  $ perl -e 'use charnames (); print charnames::viacode(0x1), "\n"'
  START OF HEADING

For you Unicoders the Unicode::Tussle stuff may also be of
interest [1][2].  (Just because i haven't seen it mentioned
on this list yet.)
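
For readers without perl at hand, the same kind of lookup can be sketched
in Python with the standard unicodedata module.  This is only an
illustration and not something posted in the thread; the helper name
char_names is made up.  One difference to the perl output above: Python
has no name for control characters such as U+0001, so the sketch falls
back to a placeholder there.

```python
import unicodedata

def char_names(text):
    """Return (code point, name) pairs for every character in text."""
    result = []
    for ch in text:
        # unicodedata.name() raises ValueError for characters without a
        # name (e.g. controls) unless a default value is supplied.
        name = unicodedata.name(ch, "<no name>")
        result.append((f"U+{ord(ch):04X}", name))
    return result

for code, name in char_names("\u00df\u25ca"):
    print(code, name)
```

The two sample characters are the sharp s and the lozenge, both of which
come up elsewhere in these threads.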

 |Regards,   Martin.

Keep on going,

  Steven

[1] http://search.cpan.org/~bdfoy/Unicode-Tussle-1.03/lib/Unicode/Tussle.pm
[2] https://github.com/briandfoy/Unicode-Tussle




Re: Rijksmuseum launches Rijksstudio

2012-11-01 Thread Steven Atreju
Jeroen Ruigrok van der Werven asmo...@in-nomine.org wrote:

 |For those of you that do research for orthographies and the likes based on
 |historical pieces, the Dutch Rijksmuseum has recently launched their
 |Rijksstudio. You can search through their entire collection of high
 |resolution images and create your own curated collections (which can be
 |shared, et cetera).
 |
 |https://www.rijksmuseum.nl/en/rijksstudio
 |
 |I am sure this might help some people with proposals.

Really fantastic.  'Should rework that Flash or what it is since
i got a harlequin statue when i wanted to get a close up of
Vincent van Gogh..
Shame upon him who thinks evil upon it.  And though..

 |-- 
 |Jeroen Ruigrok van der Werven asmodai(-at-)in-nomine.org / asmodai
 |イェルーン ラウフロック ヴァン デル ウェルヴェン
 |http://www.in-nomine.org/ | GPG: 2EAC625B
 |Man imagines that it is death he fears; but what he fears is the \
 |unforeseen,
 |the explosion. What man fears is himself...

  Steven




Re: Rijksmuseum launches Rijksstudio

2012-11-01 Thread Steven Atreju
Jeroen Ruigrok van der Werven asmo...@in-nomine.org wrote:

 |-On [20121101 11:48], Steven Atreju (snatr...@googlemail.com) wrote:
 |Really fantastic.  'Should rework that Flash or what it is since
 |i got a harlequin statue when i wanted to get a close up of
 |Vincent van Gogh..
 |
 |At least with my Chrome browsing I only encountered HTML and no flash. The
 |closeups seem to use a Google Maps-like tile approach.
 |
 |-- 
 |Jeroen Ruigrok van der Werven asmodai(-at-)in-nomine.org / asmodai
 |イェルーン ラウフロック ヴァン デル ウェルヴェン
 |http://www.in-nomine.org/ | GPG: 2EAC625B
 |Man imagines that it is death he fears; but what he fears is the \
 |unforeseen,
 |the explosion. What man fears is himself...

Yes, i'm using Opera, it's so nice European and from a smaller
company.
But don't mind, German Telekom complains if someone uses something
else than Internet Explorer (for online MMS viewing!), and those
guys won't even work with Opera.  So i need to start Safari if my
Mother-In-Law sends me photos ...

Rembrandt, Vermeer, van Gogh.

It would be at par with the decline of culture anywhere else if
i would start to compare this with metadata that flutters around
somewhere, just like signatures that suddenly show up in regular
data.

  Steven




Re: U+25CA LOZENGE - why is it in the Mac OS Roman character set (and therefore widespread in current fonts)?

2012-08-15 Thread Steven Atreju
On Tue, Aug 14, 2012 at 12:48 PM, Karl Pentzlin
karl-pentz...@acssoft.de wrote:
 Am Montag, 13. August 2012 um 20:53 schrieb Hans Aberg:

 HA The German WP mentions that in the context of the now
 HA discontinued Bildschirmtext, it was called Raute:
 HA   https://de.wikipedia.org/wiki/Doppelkreuz_(Satzzeichen)
 HA   https://de.wikipedia.org/wiki/Bildschirmtext

 HA But otherwise, Raute is the same as English lozenge:
 HA   https://de.wikipedia.org/wiki/Raute_(Symbol)

 In fact, I have heavily edited these Wikipedia articles in the last days,
 Before, they show a mess of Doppelkreuz, Raute, and Nummernzeichen

Exzellent: ein Mi-ß-ssstand weniger!
(Roughly: with Sisyphus on a list!)

 [.]
 Now, after discussing this with several people, I learned that this
 scheme was too academic, as in fact everybody seems to call the #
 Raute. The word Raute otherwise is unused in colloquial German.

I'm not everybody.  Uff.

 You learn in math lessons that there is a geometric form called
 Rhombus (lozenge) which also can be called Raute, but in the class

Talking about the german school system is like opening a can of worms,

 Rhombus is the preferred term. Raute also is the preferred term in

and teachers really need some »Feedback«.  But even if an additional
study of psychology would be required to become teacher, it would most
likely be treated as an additional learning climax only.
Or philosophy, just the same.

 heraldics, but used by the general public only when referring to the
 pattern of the Bavarian flag. (Besides, Raute is used in the name

(I didn't know that at all, but maybe i just didn't understand enough bavarian?)

 of some herbs, like Ruta graveolens, but also only by specialists.)
[.]
 Thus, when the # came as a new character to the general public
 with the keypad telephone in the 1970s, together with a name Raute
 which sounds not unknown and not really wrong, thus it got its way
 into the general public together with the # (which, as said, was
 formerly not used in Germany).

I like your term which sounds not unknown very much indeed.
But i think it came up in the eighties for the public at large, i.e., the
raging current which makes a mountain out of a molehill.
And it was *plain terror* because Btx was doomed to succeed.
(Which it didn't, the potato-trick didn't work.
Why, oh why?  I don't understand, too.)

 Raute is e.g. used by customer services which you call when you have
 a question regarding your mobile phone, and you are told to press the
 lower right key on your telephone keypad.

 On the other hand, as far as I know now (and a DIN officer confirmed
 me this), there is no German standard which uses the term Raute.

But Raute is plain wrong!  Should the usual »aufwändige Albträume«
(roughly: jmmmense efffords) be globbered with it?
I wouldn't like that, and it follows why..

 Thus, I probably will use the term Doppelkreuz but have to remark
 that I address the character commonly called Raute. As the

Oh, please do so.  I remember Klaus von Dohnanyi saying »Plauderstübchen«
when Christiansen was about to talk about Chatrooms, and it saved a sunday
and is still remembered, and positively.

[.]

Thank you very much.

 - Karl

  Steven




Re: German »Raute« (was: U+25CA LOZENGE)

2012-08-14 Thread Steven Atreju
Hi all,

Philippe Verdy verd...@wanadoo.fr wrote:
 |2012/8/13 Otto Stolz otto.st...@uni-konstanz.de:
 | Hello,
 |
 | am 2012-08-13 20:48, schrieb Leif Halvard Silli:
 |
 | The word 'Raute' reminds of the Norwegian 'rute' - and my Norwegian
 | book on etymology assumes that 'rute' is derived from 'Raute'. The
 | Norwegian 'rute' may refer to a cell in a (data) table or in a square
 | board for chess. Such a 'rute' is of course a square. Perhaps German
 | 'Raute' has a similar possibility of being interpreted as square?
 |[.]
 |
 | In German, »Raute« is a synonym of »Rhombus«, i. e.
 | an equilateral quadrilateral. Hence, every »Raute«
 | is a »Quadrat« (square), but not vice versa.
 | (A square has also four equal angles.)
 |
 |Correction:
 |* Every »Quadrat« (square) is a »Raute« (Rhombus), a Rhombus/Raute
 |being not restricted to right angles.

According to the german »Duden« ([0],[1]) a »Quadrat« has four
angles of 90 degrees, whereas a Raute is described as a
»schiefwinkliges gleichseitiges Viereck«, an «oblique-angled
equilateral parallelogram».
Of course ,

 |* Every »Raute« (Rhombus) is also a lozenge,[.]

And i would think that the other way is the more common one, i.e.,
Rhombus (Raute), because the geometrical form is »rhombisch« and
it forms a »Rhomboid«.

  Steven

[0] http://en.wikipedia.org/wiki/Duden
[1] http://www.duden.de

P.S.:
Yes, germans; but i wouldn't count Btx since no one had it anyway..
That reminded me of the then minister of post Schwarz-Schilling,
related by marriage to Sonnenschein batteries, and i always
wondered why a small company without much research could gain lots
of orders from major companies like Volkswagen..  But that ended
in 1992 once he resigned, too.
Unfortunately www.dict.cc shows a big relationship in between
Raute/rhomb and Doppelkreuz/hash.
I don't know if that means much though.  Just one more vespiary.




Re: (Informational only: UTF-8 BOM and the real life)

2012-07-30 Thread Steven Atreju
Leif H Silli xn--mlform-...@xn--mlform-iua.no wrote:

 |We now have some data that indicates that what Unicode says about the UTF-8 
 |BOM is worded in a way that is possible to misunderstand. I support you in 

Yeah! Yeah! Yeah!, that is good to read black on #FCFCF9.

 |Steven replied:
 |
 |In XML 1.0 the BOM is in fact described as a signature regardless of 
 | which unicode encoding it is used with:
 |
 |  |http://www.w3.org/TR/xml/#charencoding
 |
 | Yes, simply spoken out and clarified like that, and everybody
 | knows what to deal with.
 |
 | And btw., my local copy of XML 1.1 (Second Edition, thus current)
 | doesn't include this paragraph (in the referenced 4.3.3):
 |
 |   |If the replacement text of an external entity is to begin with
 |   |the character U+FEFF, and no text declaration is present, then
 |   |a Byte Order Mark MUST be present, whether the entity is encoded
 |   |in UTF-8 or UTF-16.
 |
 |I think you must reread. I find the same signature sentence in XML 1.1:
 |
 |http://www.w3.org/TR/xml11/#charencoding
 | 
 | But i don't see the big picture of all those markup standards, i
 | just have them in case my own work raises some questions..
 |
 |We now have some data that indicates that what Unicode says about the UTF-8 
 |BOM is worded in a way that is possible to misunderstand. I support you in 
 |that Unicode should be more explicit about the fact that
 |
 |* it is neutral about the BOM in UTF-8 (currently it is possible to read it 
 |as if Unicode advises against the BOM)
 |
 |* The BOM is an encoding signature - for both UTF-8 and UTF-16.
 |--
 |leif halvard silli 



Re: (Informational only: UTF-8 BOM and the real life)

2012-07-30 Thread Steven Atreju
Doug Ewell d...@ewellic.org wrote:

 |Steven Atreju wrote:
 |
 |^Z as an EOF marker for text files was part of the MS-DOS legacy from
 |CP/M, where all files were written to a multiple of the disk block size
 |(I think 128 for CP/M and 512 for MS-DOS 1.x), and there had to be some
 |way to tell where the real text content ended. New stream-based I/O
 |calls in MS-DOS 2.0 made this mechanism unnecessary. Unix systems had no
 |legacy from CP/M, so they never had this problem.
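
The block-size problem described above can be sketched in a few lines of
Python.  This is a hedged illustration only: the thread says that files
were written in multiples of the block size and that a marker was needed,
while the assumption that the unused tail is filled with ^Z bytes, and
the names pad_to_block and real_text, are made up for the example.

```python
CTRL_Z = b"\x1a"      # the ^Z end-of-text marker
BLOCK_SIZE = 128      # block size recalled above for CP/M

def pad_to_block(text: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Pad text with ^Z up to the next multiple of block_size (assumed)."""
    remainder = len(text) % block_size
    if remainder == 0:
        return text
    return text + CTRL_Z * (block_size - remainder)

def real_text(stored: bytes) -> bytes:
    """Return the content before the first ^Z, or everything if none."""
    end = stored.find(CTRL_Z)
    return stored if end == -1 else stored[:end]

stored = pad_to_block(b"Hello\r\n")
print(len(stored), real_text(stored))
```

A reader that stops at the first ^Z recovers the text, but it would also
cut off any binary data that happens to contain the byte 0x1A.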

I'm learning in this thread.
(And CP/M was that thing that Microsoft bought cheap to sell it
expensively the very next day to IBM as their consumer box OS.?!
Well, money must be made and sometimes you have to break an egg
to make an omelette.  Sure thing.  Providence really matters.)

 | I.e., this is why we do have this messy text OR binary file I/O
 | distinction like O_BINARY (for open(2)), b (for fopen(3)) or
 | binmode (perl(1)).  Because without those a text file will see
 | End-Of-File at the ^Z, not at the real end of the file.
 |
 |The reason for the text/binary distinction on DOS and Windows is
 |conversion between Unix-standard LF and Windows (DOS, CP/M)-standard

Eh, no, here you are mistaken i think.  Line endings are a
different problem.  There may be I/O libraries which take this
flag into account even for those, but i've not seen such an
approach yet.  Seems dangerous to me, if there were.

(The perfect approach to handle the newline problem is somewhat
costly at runtime.  But this is good for the power industry and
the hardware producers, is it.  I remember that i've seen a tree
implementation test-comparison in the german computer magazine c´t
about a decade ago, it compared a C++ and a Java version of the
very same program, and the Java version was faster than the
full-instance-datatype in Nodedatatype template C++ version due
to the memory allocator!  Microsoft and Intel still had that
«Wintel» alliance back then.  I think that was the
in-between-the-lines tenor, if i recall correctly.
But i'm slowly running out of anti-Microsoftisms in this thread.)

 |CRLF. It might be true that library calls to read a file in text mode
 |will stop at ^Z, but Notepad and Wordpad don't. I know the library
 |doesn't automatically write ^Z. Almost nobody in the MS world uses the
 |^Z convention on purpose any more; many don't even know about it.

I've only seen the Cygwin *code* (very well over a decade ago).
(Well, those were the I/O streams.  You really wouldn't have
wanted to see what was necessary for select(2)..
An operating system without select(2) is simply not imaginable.)

 | (Which raises the immediate question why the Microsoft programmers did
 | not embed the meta information in this section at the end of the file.
 | But i don't really want to know.)
 |
 |See above. The intent of ^Z was never to distinguish data from metadata,
 |as with the Mac data and resource forks.
 |
 |But of course none of this has anything to do with U+FEFF.

Not so.

 | So do the programmers have to face the same conditions?  I don't
 | really think so.  They prefer driving plain text readers up the wall.
 | Successfully.

This seems to have lost its context..

 |Again, we don't really have this kind of evil intent, though it's often
 |fun and convenient for people to imagine we do.

.. hmm ...

 |But of course none of this has anything to do with U+FEFF.

Not so.

 |--
 |Doug Ewell | Thornton, Colorado, USA
 |http://www.ewellic.org | @DougEwell

  Steven




Re: (Informational only: UTF-8 BOM and the real life)

2012-07-30 Thread Steven Atreju
Rick McGowan r...@unicode.org wrote:

 |No. That wasn't CP/M... It was a different OS.

Oh yes, according to Wikipedia my remembrance was wrong.  Sorry.

Doug Ewell d...@ewellic.org wrote:

 |Steven Atreju wrote:
 |
 | I'm learning in this thread.
 | (And CP/M was that thing that Microsoft bought cheap to sell it
 | expensively the very next day to IBM as their consumer box OS.?!
 |
 |This history isn't correct either, but I'm not going to bother going 

It was remembrance acquired by reading only, anyway.

 |The approach offered by common libraries isn't perfect, doesn't claim to 
 |be, and doesn't need to be. It converts between LF and CRLF, and maybe 
 |also handles bare CR (I don't remember). This is computationally 
 |trivial.

This is trivial.
And refuted by reality.
I was about to write about our internal approach, using a class
Newline and a class NewlineFind which also has a classify().  Note
that we have solutions to recognize the type of newline of a file,
and note even more that the file will be written in exactly the
same way, too, and as necessary.
In this stormy context here this is absolutely remarkable.
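
What recognizing the newline type of a file could look like can be
sketched as follows.  This is not the internal Newline/NewlineFind code
mentioned above, which was never posted; the function name
classify_newlines and its return values are assumptions made for the
example.

```python
import re

def classify_newlines(data: bytes) -> str:
    """Return 'crlf', 'lf', 'cr', 'mixed' or 'none' for the given bytes."""
    # CRLF is listed first so that it is not counted as a CR plus an LF.
    kinds = set(re.findall(rb"\r\n|\r|\n", data))
    if not kinds:
        return "none"
    if len(kinds) > 1:
        return "mixed"
    return {b"\r\n": "crlf", b"\n": "lf", b"\r": "cr"}[kinds.pop()]

print(classify_newlines(b"first\r\nsecond\r\n"))
```

A writer that wants to leave the file as it found it can use the result
to pick the line terminator when saving.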

But you're terribly right, because compared to Unicode
normalization and collation these processor cache heaters are
really trivial tasks.  The thing i hate the most is that zero-copy
linewise reading seems to be history, because practically I/O
libraries always have a(t least one) layer that has to be passed,
for the text-encoding.  Well.

 | But this is good for the power industry and
 | the hardware producers, is it.
 |
 |Please, no more conspiracy theories.

And i would be pleased if you would not tear up sentences of mine
and out of their context, shall i ever have to say something useful
on this list again.

 |Doug Ewell | Thornton, Colorado, USA
 |http://www.ewellic.org | @DougEwell

Thanks and Ciao,

  Steven




Re: (Informational only: UTF-8 BOM and the real life)

2012-07-28 Thread Steven Atreju
Leif H Silli xn--mlform-...@xn--mlform-iua.no wrote:

 |Steven Atreju on 28/7/'12,  0:22:
 | Doug Ewell wrote:
 |
 |  | Well, i still see a bug in the Unicode Standard here.
 |  | Whereas for the multioctet UTFs there is «The BOM is not
 |  | considered part of the content of the text» (Conformance, 3.10,
 |  | D98, D101), i cannot find any such clarifying text for its usage
 |  | as a signature.
 |  |
 |  |There really isn't as much difference between using U+FEFF as a byte 
 |  |order mark and using it as a signature as this makes it seem. The 
 |  |definitions you quote have to do with whether U+FEFF is treated as a 
 |  |BOM/signature or as a zero-width no-break space.
 | 
 | I really think that a clarification in equal spirit to those of
 | D98 and D101 (but maybe with different content :) would be an
 | improvement of the Unicode Standard.
 [.]
 |
 |I agree with Doug that there is no enormous diff between BOM and encoding
 |signature. In XML 1.0 the BOM is in fact described as a signature regardless
 |of which unicode encoding it is used with:
 |
 |http://www.w3.org/TR/xml/#charencoding

Yes, simply spoken out and clarified like that, and everybody
knows what to deal with.
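
Treating the BOM purely as a signature can be sketched like this.  It is
a minimal illustration, assuming that only the UTF-8 and UTF-16
signatures discussed here matter; the function name sniff_signature is
made up, and UTF-32 is deliberately ignored (its little-endian mark
begins with the same two bytes as the UTF-16 one).

```python
import codecs

# Signature bytes and the encoding each one announces.
SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def sniff_signature(data: bytes):
    """Return (encoding, data without signature) or (None, data)."""
    for bom, encoding in SIGNATURES:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data

encoding, payload = sniff_signature(b"\xef\xbb\xbf<doc/>")
print(encoding, payload)
```

The signature is consumed before the payload is handed on, so it never
becomes part of the content.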

And btw., my local copy of XML 1.1 (Second Edition, thus current)
doesn't include this paragraph (in the referenced 4.3.3):

  |If the replacement text of an external entity is to begin with
  |the character U+FEFF, and no text declaration is present, then
  |a Byte Order Mark MUST be present, whether the entity is encoded
  |in UTF-8 or UTF-16.

But i don't see the big picture of all those markup standards, i
just have them in case my own work raises some questions..

 |Also, whether UTF-16 is one or two encodings is a definition question.
 |(Microsoft at one time defined it as two encodings.)
 |--
 |Leif Halvard Silli

  Steven.




Re: (Informational only: UTF-8 BOM and the real life)

2012-07-27 Thread Steven Atreju
Asmus Freytag asm...@ix.netcom.com wrote:

 |On 7/25/2012 2:45 PM, Jukka K. Korpela wrote:
 | . One might even argue that the BOM is useful here, too, since it 
 | immediately signals that there is something wrong, and “” is an 
 | encoding error signature, so to say.
 |
 |
 |+8
 |
 |A./

Well, i still see a bug in the Unicode Standard here.
Whereas for the multioctet UTFs there is «The BOM is not
considered part of the content of the text» (Conformance, 3.10,
D98, D101), i cannot find any such clarifying text for its usage
as a signature.  (Even though that kind of usage itself is promoted
at more and more places of the standard, which will remain a
mystery to me.)
Thus--if this terrible thing really has to be swallowed--the standard
should add «The signature is not..», too.
Thanks,

  Steven




Re: (Informational only: UTF-8 BOM and the real life)

2012-07-27 Thread Steven Atreju
Leif H Silli xn--mlform-...@xn--mlform-iua.no wrote:

 |Asmus Freytag on 26/7/'12,  1:10
 | On 7/25/2012 2:45 PM, Jukka K. Korpela wrote:
 | . One might even argue that the BOM is useful here, too, since it 
 | immediately signals that there is something wrong, and “” is an 
 | encoding error signature, so to say.
 | 
 | +8
 |
 |I agree. But the real issue here is that e-mail application's inability to
 |send a decent plain text version of the HTML message. In fact, I don't think
 |there are many email apps that are good at that. (But Thunderbird used to be
 |good.)
 |--
 |Leif H Silli  

No, the real issue is that the programmers are duds.
Or they were unsure about it all...
Anyway, i've told them they were duds, and as i didn't get any
response so far, i was right.

  Steven




Re: (Informational only: UTF-8 BOM and the real life)

2012-07-27 Thread Steven Atreju
Doug Ewell d...@ewellic.org wrote:

 |As a programmer, I can attest that we are no more receptive to being 
 |called duds than any other professionals. Constructive suggestions 
 |focused on the end product, instead of the competence of the person, 
 |might get a response.

You're of course right.  The tone was rude.

 |Steven Atreju wrote:
 |
 | Well, i still see a bug in the Unicode Standard here.
 | Whereas for the multioctet UTFs there is «The BOM is not
 | considered part of the content of the text» (Conformance, 3.10,
 | D98, D101), i cannot find any such clarifying text for its usage
 | as a signature.
 |
 |There really isn't as much difference between using U+FEFF as a byte 
 |order mark and using it as a signature as this makes it seem. The 
 |definitions you quote have to do with whether U+FEFF is treated as a 
 |BOM/signature or as a zero-width no-break space.

I don't understand what you are saying here.
And i do think that more people are uncertain about whether this
has been left off intentionally (which i personally would assume
given the assumed grade of the people involved and the amount of
time that the standard exists and has been reviewed).

I really think that a clarification in equal spirit to those of
D98 and D101 (but maybe with different content :) would be an
improvement of the Unicode Standard.

Once more i want to point out that on Unix/POSIX systems the file
content can be seen as a whole, and i hope and think that this
will not change.  This situation is completely different than on
Windows, which had textfiles with appended (separated by ^Z or so)
meta information that was invisible in normal text editors already
in the nineties (or even earlier, but i don't know).

I.e., this is why we do have this messy text OR binary file I/O
distinction like O_BINARY (for open(2)), b (for fopen(3)) or
binmode (perl(1)).  Because without those a text file will see
End-Of-File at the ^Z, not at the real end of the file.  (Which
raises the immediate question why the Microsoft programmers did not
embed the meta information in this section at the end of the file.
But i don't really want to know.)
Anyway.  On Unix a UTF-8 file *will* show the BOM, because it is
file content.  I.e.:

  |?0%0[tmp]$ hexdump -C text
  |00000000  ef bb bf 49 20 64 6f 6e  27 74 20 77 61 6e 74 20  |...I don't want |
  |00000010  74 6f 20 73 65 65 20 79  6f 75 2c 20 65 76 65 72  |to see you, ever|
  |00000020  21 0a 53 68 65 20 70 75  74 20 6f 6e 20 68 65 72  |!.She put on her|
  |00000030  20 63 6f 61 74 20 61 6e  64 20 6c 65 66 74 2e 0a  | coat and left..|
  |00000040

is shown (because even bad english is displayed) as

  |?0%0[tmp]$ v text
  |U+FEFFI don't want to see you, ever!
  |She put on her coat and left.

in an UTF-8 locale and

  |?0%0[tmp]$ LESSCHARSET=ascii v text 
  |EFBBBFI don't want to see you, ever!
  |She put on her coat and left.

otherwise.  And i like that, because it is the truth.  But it of
course implies that it will show up exactly like this wherever the
signature occurs.
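
The same point can be shown from a program's side.  The following Python
sketch uses the 64 bytes of the hexdump above; the choice of Python and
of its "utf-8" and "utf-8-sig" codecs is an illustration only and not
part of the original mail.

```python
# The 64 bytes from the hexdump: UTF-8 signature plus two lines of text.
raw = (b"\xef\xbb\xbfI don't want to see you, ever!\n"
       b"She put on her coat and left.\n")

as_content = raw.decode("utf-8")      # signature stays as U+FEFF
stripped = raw.decode("utf-8-sig")    # leading signature is removed

print(repr(as_content[0]))            # the first character is U+FEFF
print(repr(stripped[0]))              # the first character is 'I'
```

With the plain "utf-8" codec the signature is ordinary file content, just
as in the pager; only a consumer that explicitly asks for it to be
dropped gets the text without it.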

 | No, the real issue is that the programmers are duds.
 | Or they were unsure about it all...
 | Anyway, i've told them they were duds, and as i didn't get any
 | response so far, i was right.
 |
 |As a programmer, I can attest that we are no more receptive to being 
 |called duds than any other professionals. Constructive suggestions 
 |focused on the end product, instead of the competence of the person, 
 |might get a response.

So i apologize again.  I want to state however that the company
in question is heavily automatized and full of robots.  People
have to face Modern Times.  At least in the manufacturing.  (Why
do i own a bicycle of them?  Because people get jobs there, which
they would not have otherwise, *there*.  But real craftsmanship
products, like those from http://www.manufactum.de, or old Rolls
Royce or whatever, are of course preferable.)
So do the programmers have to face the same conditions?  I don't really
think so.  They prefer driving plain text readers up the wall.
Successfully.

 |--
 |Doug Ewell | Thornton, Colorado, USA
 |http://www.ewellic.org | @DougEwell

  Steven




(Informational only: UTF-8 BOM and the real life)

2012-07-25 Thread Steven Atreju
So, dear list, i'm really sorry for this distress.
I don't want to start any thread, but i can't help it and thus
want to pass this through to you.

I had problems with my bicycle and sent a mail asking for help.
This is a real large company (www.mifa.de).

  |Received: from ds0501.hostingschmiede.de
  |From: informat...@radservice.net informat...@radservice.net
  |Organization: CC GmbH
  |
  |This is a multi-part message in MIME format
  |
  |Content-Type: text/html; charset=UTF-8
  |Content-Transfer-Encoding: 8bit
  |Content-Disposition: inline

The HTML part is all right.

  |<td style="width:100px;font:normal 11px Arial;vertical-align:top">Empfänger</td>

The text part is UTF-8 converted once again to UTF-8.
Which is ridiculous.

  |Content-Type: text/plain; charset=UTF-8
  |Content-Transfer-Encoding: 8bit
  |Content-Disposition: inline
  |
  |Datum:   25.07.2012 15:52:02
  |Absender:informat...@radservice.net
  |---
  |

And that was a Unicode BOM that has been converted to UTF-8 and
then been converted to UTF-8 once again.  As you all see - in the
middle of nowhere.
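
How such a doubly converted BOM comes about can be reproduced in a few
lines of Python.  This is a sketch under an assumption: the mail does not
say which 8-bit charset the sending software misread the UTF-8 bytes as,
so Latin-1 is used here for illustration.

```python
bom_utf8 = "\ufeff".encode("utf-8")      # the three signature bytes
misread = bom_utf8.decode("latin-1")     # wrongly read as three characters
double = misread.encode("utf-8")         # encoded a second time: six bytes

print(bom_utf8.hex(" "), "->", double.hex(" "))

# The damage is reversible as long as nothing else touched the bytes.
repaired = double.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == "\ufeff")
```

The six resulting bytes no longer form a signature that any consumer
would recognize, so they show up as three visible characters in the text.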

  |Sehr geehrter Herr Steven,
  |
  |vielen Dank für Ihre E-Mail.

I've sent them a nice mail on UTF-8 BOM and perl(1) programming
in general.  (I can't imagine anything else due to resource
reasons.)

Yes, i also hope this will get better as time goes by.
Yes, consumers should ignore a zero-width no-break space.
It's not visual.
Thanks for your understanding, but i had to send this now.
Good night.

  Steven




Re: pre-HTML5 and the BOM

2012-07-18 Thread Steven Atreju
Except that the internet is almost unusable without cookies
and scripting, lynx(1) works very well, too, if the ncursesw
library is linked against (and the terminal font supports
Unicode characters).  Funny that it writes garbage for

 |htmlbodypä.ü.ö./p/body/html

but uses UTF-8 by default for

 |htmlbodypä.ü.ö./p/body/html

Hypertext offers a lot of possibilities to declare the charset,
and until then an agnostic 8-bit parser will do fine except
for multioctet charsets.

  Steven




Fwd: Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-18 Thread Steven Atreju
 Original Message 
Date: Wed, 18 Jul 2012 13:45:59 +0200
From: Steven Atreju snatr...@googlemail.com
To: Doug Ewell d...@ewellic.org
Subject: Re: UTF-8 BOM (Re: Charset declaration in HTML)

Doug Ewell wrote:

 |For those who haven't yet had enough of this debate yet, here's a link
 |to an informative blog (with some informative comments) from Michael
 |Kaplan:
 |
 |Every character has a story #4: U+feff (alternate title: UTF-8 is the
 |BOM, dude!)
 |http://blogs.msdn.com/b/michkap/archive/2005/01/20/357028.aspx
 |
 |What should be interesting is that this blog dates to January 2005,
 |seven and a half years ago, and yet includes the following:
 |
 |But every 4-6 months another huge thread on the Unicode List gets
 |started about how bad the BOM is for UTF-8 and how it breaks UNIX tools
 |that have been around and able to support UTF-8 without change for
 |decades and about how Microsoft is evil for shipping Notepad that causes
 |all of these problems and how neither the W3C nor Unicode would have
 |ever supported a UTF-8 BOM if Microsoft did not have Notepad doing it,
 |and so on, and so on.
 |
 |And here we are again.

Interesting, thanks for the pointer.  I didn't know that.
Funny that a program that cannot handle files larger than 0x7FFF
bytes (last time i've used it, 95B) has such a large impact.
And sorry for the noise, then.

  Steven



Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Steven Atreju
Philippe Verdy verd...@wanadoo.fr wrote:

 |2012/7/16 Steven Atreju snatr...@googlemail.com:
 | Fifteen years ago i think i would have put effort in including the
 | BOM after reading this, for complete correctness!  I'm pretty sure
 | that i really would have done so.
 |
 |Fifteen years ago I would not have advocated it. Simply because
 |support of UTF-8 was very poor (and there were even differences of
 |interpretations between the ISO/IEC definition and the Unicode
 |definition, notably differences for the conformance requirements).
 |This is no longer the case.
 |
 | So, given that this page ranks 3 when searching for «utf-8 bom»
 | from within Germany i would 1), fix the «ecoding» typo and 2)
 | would change this to be less «neutral».  The answer to «Q.» is
 | simply «Yes.  Software should be capable to strip an encoded BOM
 | in UTF, because some softish Unicode processors fail to do so when
 | converting in between different multioctet UTF schemes.  Using BOM
 | with UTF-8 is not recommended.»
 |
 |  | I know that, in Germany, many, many small libraries become closed
 |  | because there is not enough money available to keep up with the
 |  | digital race, and even the greater *do* have problems to stay in
 |  | touch!
 |  |
 |  |People like to complain about the BOM, but no libraries are shutting
 |  |down because of it. Keeping up with the digital race isn't about
 |  |handling two or three bytes at the beginning of a text file, in a way
 |  |that has been defined for two decades.
 |
 | RFC 2279 doesn't note the BOM.
 |
 | Looking at my 119,90.- German Mark Unicode 3.0 book, there is
 | indeed talk about the UTF-8 BOM.  We have (2.7, page 28)
 | «Conformance to the Unicode Standard does not requires the use of
 | the BOM as such a signature» (typo taken plain; or is it no
 | typo?), and (13.6, page 324) «..never any questions of byte order
 | with UTF-8 text, this sequence can serve as signature for .. this
 | sequence of bytes will be extremely rare at the beginning of text
 | files in other encodings ... for example []Microsoft Windows[]».
 |
 | So this is fine.  It seems UTF-16 and UTF-32 were never meant for
 | data exchange and the BOM was really a byte order indicator for a
 | consumer that was aware of the encoding but not the byte order.
 | And UTF-8 got an additional «wohooo - i'm Unicode text» signature
 | tag, though optional.  I like the term «extremely rare» sooo much!!
 | :-)
 |
 |No need to rant. There's the evidence that the role of BOM in UTF-8
 |has been to help the migration from legacy charsets to Unicode, to
 |avoid mojibake. And this role is still important. As UTF-8 became
 |prominent in interchanges, and the need for migration from older
 |encodings largely augmented, this small signature has helped knowing
 |which files were converted or not, even if there was no meta data
 |(meta data is frequently dropped as soon as the resource is no longer
 |on a web server, but stored in a file of a local filesystem).
 |
 |As there are still a lot of local resources using other encodings, the
 |signature really helps managing the local contents. And more and more
 |applications will recognize this signature automatically to avoid
 |using the default legacy encodings of the local system (something they
 |still do in absence of meta data and of the BOM) : you no longer need
 |to use a menu in apps to select the proper encoding (most often it is
 |not available, or requires restarting the application or cancelling an
 |ongoing transaction, and still frequently we still have to manage the
 |situation where resources using legacy local encodings and those in
 |UTF-8 are mixed in the application).
 |
 |The BOM is then extremely useful in a transition that will durate
 |several decennials (or more) each time that resource is not strictly
 |bound to the 7-bit US-ASCII subset.

I disagree, disagree, disagree :).

 |I am also convinced that even Shell interpreters on Linux/Unix should
 |recognize and accept the leading BOM before the hash/bang starting
 |line (which is commonly used for filetype identification and runtime
 |behavior), without claiming that they don't know what to do to run the
 |file or which shell interpreter to use.

Please let it be as agnostic as it is.
While watching the parade i've noticed that some standard Renault
trucks did not have a soot filter.  That's a complete no-go.  We
were shocked.

 |PHP itself should be allowed to use it as well (but unfortunately it
 |still does not have the concept of tracking the effective encoding to
 |parse its scripts simply).
 |
 |Yes this requires modifying the database of filetype signatures, but
 |this type of update has always been necessary since long for handling
 |more and more filetypes (see for example the frequent updates and the
 |growth of the /etc/magic database used by the Unix/Linux tool
 |file).

But i'm lucky that you mention this tool, since i've forgotten to
do so in my last post.  It appeared first in 1973

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-16 Thread Steven Atreju
Doug Ewell d...@ewellic.org wrote:

 |Steven Atreju wrote:
 |
 | If Unicode *defines* that the so-called BOM is in fact a Unicode-
 | indicating tag that MUST be present,
 |
 |But Unicode does not define that.

Nope.  On http://unicode.org/faq/utf_bom.html i read:

  Q: Why do some of the UTFs have a BE or LE in their label,
  such as UTF-16LE?

So it seems to me that the Unicode Consortium takes care of
newbies and those people who work at a very high programming
level, say, PHP, Flash, JavaScript or even no programming at all.
And:

  Q: Is the UTF-8 encoding scheme the same irrespective of whether
  the underlying processor is little endian or big endian?
  ...
  Where a BOM is used with UTF-8, it is only used as an ecoding
  signature to distinguish UTF-8 from other encodings — it has
  nothing to do with byte order.

Fifteen years ago i think i would have put effort in including the
BOM after reading this, for complete correctness!  I'm pretty sure
that i really would have done so.

So, given that this page ranks 3 when searching for «utf-8 bom»
from within Germany i would 1), fix the «ecoding» typo and 2)
would change this to be less «neutral».  The answer to «Q.» is
simply «Yes.  Software should be capable to strip an encoded BOM
in UTF, because some softish Unicode processors fail to do so when
converting in between different multioctet UTF schemes.  Using BOM
with UTF-8 is not recommended.»

 | I know that, in Germany, many, many small libraries become closed
 | because there is not enough money available to keep up with the
 | digital race, and even the greater *do* have problems to stay in
 | touch!
 |
 |People like to complain about the BOM, but no libraries are shutting 
 |down because of it. Keeping up with the digital race isn't about 
 |handling two or three bytes at the beginning of a text file, in a way 
 |that has been defined for two decades.

RFC 2279 doesn't note the BOM.

Looking at my 119,90.- German Mark Unicode 3.0 book, there is
indeed talk about the UTF-8 BOM.  We have (2.7, page 28)
«Conformance to the Unicode Standard does not requires the use of
the BOM as such a signature» (typo taken plain; or is it no
typo?), and (13.6, page 324) «..never any questions of byte order
with UTF-8 text, this sequence can serve as signature for .. this
sequence of bytes will be extremely rare at the beginning of text
files in other encodings ... for example []Microsoft Windows[]».

So this is fine.  It seems UTF-16 and UTF-32 were never meant for
data exchange and the BOM was really a byte order indicator for a
consumer that was aware of the encoding but not the byte order.
And UTF-8 got an additional «wohooo - i'm Unicode text» signature
tag, though optional.  I like the term «extremely rare» sooo much!!
:-)

I restart my «rant» UTF-8 filetype thread from the beginning now.
I wonder: was the Unicode Consortium really so unconfident?  Do i
really read «UTF-8 will drown in this evil mess of terroristic
charsets, so rise the torch of freedom in this unfriendly
environment!»?
I have downloaded the 6.0 and 6.1 stuff as a PDF and for free (:-.

If you know how to deal with UTF-8, you can deal with UTF-8.
If you don't, no signature ever will help you, no?!

If you don't know the charset of some text, that comes from
nowhere, i.e., no container format with meta-information, no
filetype extension with implicit meta-information, as is used on
Mac OS and DOS, then UTF-8 is still very easily identifiable by
itself due to the way the algorithm is designed.  Is it??
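
A minimal sketch of that claim, in Python (my choice of language,
nothing from the thread): text that is not UTF-8 almost never
passes a strict UTF-8 decode, so the byte structure alone serves as
a check.  The function name looks_like_utf8 and the sample strings
are assumptions made up for illustration.  Note that pure ASCII
passes too, which is by design, and that a pass is only strong
evidence, not proof.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a strictly valid UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same text in two encodings (hypothetical sample data).
utf8_sample = "Schöne Überraschung".encode("utf-8")
latin1_sample = "Schöne Überraschung".encode("iso-8859-1")

print(looks_like_utf8(utf8_sample))     # True
print(looks_like_utf8(latin1_sample))   # False
print(looks_like_utf8(b"plain ASCII"))  # True
```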

Tear down the wall!
Tear down the wall!
Tear down the wall!

 |It's about technologies and 
 |standards and platforms and formats that change incompatibly every few 
 |years.

That is of course true.

But what to do with these myriads of aggressive nerds that linger
in these neon-enlightened four square meter boxes, with their
poignant hunger for penthouse windows and four-cylinder
Mercedes-Benz limousines?  I'm asking you.  I've seen photos of
standard committees in palm-covered bays (CSS2?  DOM?  W3M
anyway), i've dropped my subscription to regular IETF discussion
because i can stand only so and so many dozens of dinner,
hotel-room reservation, laptop-compatible socket in Paris? and
whatever threads (the annual ladies steakhouse meeting!).  So here
you are.  These people have deserved it, and no better.

  Steven



Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-14 Thread Steven Atreju
Eli Zaretskii e...@gnu.org wrote:

 | Date: Fri, 13 Jul 2012 22:07:54 +0200
 | From: Steven Atreju snatr...@googlemail.com
 | Cc: unicode@unicode.org
 | 
 | this time without reply-in-same-charset and
 | encoding=8bit and i bet it comes out as UTF-8 on the other end:
 |
 |Yes, it does.

..cheer..

  Steven



Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Steven Atreju
Philippe Verdy verd...@wanadoo.fr wrote:

 |2012/7/12 Steven Atreju snatr...@googlemail.com:
 | UTF-8 is a bytestream, not multioctet(/multisequence).
 |Not even. UTF-8 is a text-stream, not made of arbitrary sequences of
 |bytes. It has a lot of internal semantics and constraints. Some things
 |are very meaningful, some play absolutely no role at all and could
 |even be discarded from digital signature schemes (this includes
 |ignoring BOMs wherever they are, and ignoring the encoding effectively
 |used in checksum algorithms, whose first step will be to uniformize
 |and canonicalize the encoding into a single internal form before
 |processing).
 |The effective binary encoding of text streams should NOT play any
 |semantic role (all UTFs should completely be equivalent on the text
 |interface, the bytestream low level is definitely not suitable for
 |handling text and should not play any role in any text parser or
 |collator).

I don't understand what you are saying here.
UTF-8 is a data interchange format, a text-encoding.
It is not a filetype!

A BOM is a byte-order-mark, used to signal different host endiannesses.
There are BOM-less UTF-16{LE,BE} and UTF-32{LE,BE}.  A BOM is
necessarily not an encoding-indicator.  Encoding a byteorder mark
in a byte-oriented data stream must necessarily be born from
either a misunderstanding, laziness or ignorance.  But if it is
part of the content, it is part of the content, and thus belongs
to it.  You cannot simply truncate user data content at will?!
And automatically??
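
To make the byte-order role concrete, here is a small Python sketch
(Python and its codec names are my assumption, not something used
in the thread).  With the generic "utf-16" decoder the BOM selects
the byte order and is consumed; with the BOM-less "utf-16-le" label
the very same three code units stay part of the content as U+FEFF.

```python
import codecs

text = "A"
le_bomless = text.encode("utf-16-le")  # b'A\x00'
be_bomless = text.encode("utf-16-be")  # b'\x00A'
le_with_bom = codecs.BOM_UTF16_LE + le_bomless

# Generic label: the BOM tells the decoder the byte order, then vanishes.
decoded_generic = le_with_bom.decode("utf-16")

# Explicit -LE label: the byte order is already known, so the BOM is content.
decoded_explicit = le_with_bom.decode("utf-16-le")

print(repr(decoded_generic))   # 'A'
print(repr(decoded_explicit))  # '\ufeffA'
```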

 |The history is a lot
 |different, and text files have always used another paradigm, based on
 |line records. End of lines initially were not really control
 |characters. And even today the Unix-style end of lines (as advertized
 |on other systems now with the C language) is not using the
 |international standard (CR+LF, which was NOT a Microsoft creation for
 |DOS or Windows).

CR+LF seems to originate in teletypewriters (my restricted
knowledge, sorry).  CR+LF is used in a lot of internet protocols.
Unix has used \n (U+000A) to indicate End-Of-Line in text files for
a long time.  This seems logical to me, because there is no cursor
to transport to the left margin of the screen, which was the
purpose of a Carriage-Return (unless the content of the text file
is about to be interpreted by a terminal directly, but for that
the terminal must be so configured; see POSIX,
http://pubs.opengroup.org/onlinepubs/9699919799/, Base
Definitions, 11. General Terminal Interface).

 |Maybe you would think that cat utf8file1.txt utf8file2.txt >
 |utf8file.txt would create problems. For plain text-files, this is no
 |longer a problem, even if there are extra BOMs in the middle, playing
 |as no-ops.
 |now try cat utf8file1.txt utf16file2.txt > unknownfile.txt and it
 |will not work. It will not work as well each time you'll have text
 |files using various SBCS or DBCS encodings (there's never been any
 |standard encoding in the Unix filesystem, simply because the
 |convention was never stored in it; previous filesystems DID have the
 |way to track the encoding by storing metadata; even NTFS could track
 |the encoding, without guessing it from the content).
 |Nothing in fact prohibits Unix to have support of filesystems
 |supporting out-of-band metadata. But for now, you have to assume that
 |the cat tool is only usable to concatenate binary sequences, in
 |arbitrary orders : it is not properly a tool to handle text files.

If there is a file, you can simply look at it.  Use less(1) or any
other pager to view it as text, use hexdump(1) or od(1) or
whatever to view it in a different way.  You can do that.  It is
not that you can't do that -- no dialog will appear to state that
there is no application registered to handle a filetype; you look
at the content, at a glance.  You can use cat(1) to concatenate
whatever files, and the result will be the exact concatenation of
the exact content of the files you've passed.  And you can
concatenate as long as you want.  For example, this mail is
written in an UTF-8 enabled vi(1) basically from 1986, in UTF-8
encoding («Schöne Überraschung, gelle?» -- works from my point of
view), and the next paragraph is inserted plain from a file from
1971 (http://minnie.tuhs.org/cgi-bin/utree.pl):


  K. Thompson

 D. M. Ritchie




November 3, 1971
  INTRODUCTION


This manual gives complete descriptions of all the publicly available features
of UNIX.


This worked well, and there is no magic involved here, ASCII in
UTF-8, just fine.  The metadata is in your head.  (Nothing will
help otherwise!)  For metadata, special file formats exist, i.e.,
SGML.  Or text based approaches from which there are some, but i
can't remember one at a glance ;}.  Anyway, such things are used
for long-time archiving textdata.  Though the
http://www.bitsavers.org/ have chosen a very different way for
historic data.  Metadata in a filesystem is not really something
for me

Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Steven Atreju
Eli Zaretskii e...@gnu.org wrote:

 | For example, this mail is
 | written in an UTF-8 enabled vi(1) basically from 1986, in UTF-8
 | encoding («Schöne Überraschung, gelle?»
 |
 |No, it isn't:
 |
 |Content-Type: text/plain; charset=ISO-8859-1

Oh, it's really terrible.  I do have 'reply-in-same-charset' set,
so as to not trouble people with less sophisticated mailers, and
it seems the mailer maybe ignores the 'sendcharsets=utf-8' if the
same-charset ISO-8859-1 is sufficient to represent the content!
What a considerate mailer!  So, for you

  41744 S+ 0.0   1288  ttys004   9:29pm   0:00.06 nail -f 
  41753 S+ 0.0756.)ttys004   9:35pm   0:00.03 vi -c set sw=2 
/tmp/RepfAbEB

and righteousness, this time without reply-in-same-charset and
encoding=8bit and i bet it comes out as UTF-8 on the other end:

  «Böses Erwachen folgte pünktlich, prompt und ohne Verzögerung.»

(And, just in case -- the Google translation is a bit too rude.)

  Steven



Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-13 Thread Steven Atreju
Philippe Verdy verd...@wanadoo.fr wrote:

 |2012/7/13 Steven Atreju snatr...@googlemail.com:
 | Philippe Verdy verd...@wanadoo.fr wrote:
 |
 |  |2012/7/12 Steven Atreju snatr...@googlemail.com:
 |  | UTF-8 is a bytestream, not multioctet(/multisequence).
 |  |Not even. UTF-8 is a text-stream, not made of arbitrary sequences of
 |  |bytes. It has a lot of internal semantics and constraints.
 |  |The effective binary encoding of text streams should NOT play any
 |  |semantic role (all UTFs should completely be equivalent on the text
 |  |interface, the bytestream low level is definitely not suitable for
 |  |handling text and should not play any role in any text parser or
 |  |collator).
 |
 | I don't understand what you are saying here.
 | UTF-8 is a data interchange format, a text-encoding.
 | It is not a filetype!
 |
 |Not only ! It is a format which is unambiguously bound to a text
 |filetype, even if this file type may not be intended to be interpreted
 |by humans (e.g. program sources or rich text formats like HTML)
 |
 | A BOM is a byte-order-mark, used to signal different host endiannesses.[...]
 |
 |I'm on this list since long enough to know all this already. And i've
 |not contradicted this role. However this is not prescriptive for

Sure, i know the former and i bet there has been a lot of discussion.

 |anything else than text file types (whatever they are). For example
 |BOMs have absolutely no role for encoding binary images, even if they
 |include internal multibyte numeric fields.

Well, it boils down to that, does it.  If Unicode *defines* that
the so-called BOM is in fact a Unicode-indicating tag that MUST
be present, then it is very clear what has to happen for, say,
'$ cat tagless tagged > out' (in an UTF-8 environment).  I don't
agree with that though due to the reasons i tried to put in
English words, but this is solely my problem.  Another approach
would be an explicit UTF-8-BOM charset.  Or, of course,
deprecating the -BE/-LE versions.

I don't agree with just about anything you say about automatic
metadata provision.  I know that, in Germany, many, many small
libraries become closed because there is not enough money
available to keep up with the digital race, and even the greater
*do* have problems to stay in touch!  I've mentioned bitsavers
already, but this is a drop in the bucket, almost rhetoric.  In
other countries the situation is worse.

  Steven



UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Steven Atreju
 | As for editors: If your own editor have no problems with the BOM, then
 | what? But I think Notepad can also save as UTF-8 but without the BOM -
 | there should be possible to get an option for choosing when you save
 | it.
 |
 |Perhaps there should be such an option in Notepad, but there isn't. The 
 |decision to have Notepad always write the signature to UTF-8 files, and 
 |always rely on it to read them, has been documented to death.
 |The bottom line is, there are zillions of editors available for Windows, 
 |many of them free,
... 
 |and people who want to create or modify UTF-8 files 
 |which will be consumed by a process that is intolerant of the signature 
 |should not use Notepad.  That goes for HTML (pre-5) pages, Unix shell 
 |scripts, and others.

In the meanwhile the UTF-8 BOM is in the standard and thus
contradicts forty years of (well) good (Unix/POSIX) engineering
and craftsmanship.  Where a file is a file and everything is a
file, holistically.  Where small tools which do their thing well
can be plugged together to achieve complex tasks.  Unicode is
very, very important.  Really.

In the future simple things like '$ cat File1 File2 > File3' will
no longer work that easily.  Currently this works with *whatever* file,
and even program code that has been written more than thirty years
ago will work correctly.  No!  You have to modify content to get it
right!!
Unicode is very, very important.  Really.
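
What happens to the cat(1) case can be sketched like this.  It is
only an illustration in Python, with two made-up in-memory "files"
instead of real ones; the byte concatenation is what cat does, and
the "utf-8-sig" codec is Python's BOM-aware UTF-8 decoder.  Only a
leading signature is stripped, the second one survives in the
middle of the text as U+FEFF.

```python
import codecs

bom = codecs.BOM_UTF8  # the three bytes EF BB BF

# Two hypothetical files, each saved with a UTF-8 signature.
file1 = bom + "first\n".encode("utf-8")
file2 = bom + "second\n".encode("utf-8")

# cat(1) semantics: exact concatenation of the exact bytes.
joined = file1 + file2

# A BOM-aware decoder removes the signature at the start only.
text = joined.decode("utf-8-sig")
stray = text.count("\ufeff")

print(repr(text))  # 'first\n\ufeffsecond\n'
print(stray)       # 1
```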

Adding the UTF-8 BOM is an incarnation of the malicious evil.
(In at least its incarnations ignorance and foolishness, and what
about pretension.)
Tomorrow is Friday, 13th.  Good luck.

P.S.:
I couldn't respond to the thread because the Digest doesn't include
the message-ID.

P.S., 2.:
Microsoft and IBM are involved in the standard groups that have been
undermined.

  Steven



Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Steven Atreju
Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no wrote:

 |Steven Atreju, Thu, 12 Jul 2012 12:32:46 +0200:
 |
 | In the meanwhile the UTF-8 BOM is in the standard and thus
 | contradicts forty years of (well) good (Unix/POSIX) engineering
 | and craftsmanship.  Where a file is a file and everything is a
 | file, holistically.  Where small tools which do their thing well
 | can be plugged together to achieve complex tasks.  Unicode is
 | very, very important.  Really.
 | 
 | In the future simple things like '$ cat File1 File2 > File3' will
 | no longer work that easily.
 |
 |I guess you get the same problem with UTF-16 files also, then?
 |-- 
 |Leif Halvard Silli

UTF-8 is a bytestream, not multioctet(/multisequence).  This is
a perfectly valid data interchange format (IMHO).  The embedded
BOM in UTF-8 streams seems to serve the purpose of enabling
automatic encoding detection.  To handle that, data inspection is
required, and also user-chosen locale settings (LC_CTYPE,
LC_COLLATE..) must be forcefully overwritten.  This _/\_can_/\_;
be the wrong thing, can it.  Especially behind the back of someone.

I did like ISO 10646 more in respect to the clear 31 bit
statement, yes.  UTF-16 is a multisequence, so that a character
can consist of multiple codepoints which in turn can consist of
multiple UTF-16 instances.  This is harder to handle than having
some UTF-32 integers around, where one integer transports one
codepoint.  I don't really understand why one gives up the 1:1
relationship of codepoint-storage, especially if that doesn't
gain 1:1 relationship on the storage-character side.  Why not
UTF-8 directly, then.  Solely MHO.
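
A rough Python sketch of the storage relationship (the helper name
code_units and the sample characters are my own, picked for
illustration): it counts how many code units one codepoint needs in
each encoding form.  Only UTF-32 keeps one unit per codepoint, and
even there a user-perceived character can still take more than one
codepoint once combining marks are involved.

```python
def code_units(text: str, encoding: str, unit_size: int) -> int:
    """Number of code units of unit_size bytes needed to encode text."""
    return len(text.encode(encoding)) // unit_size


# U+0041, U+00F6, and U+1D11E (outside the Basic Multilingual Plane).
samples = ["A", "\u00f6", "\U0001d11e"]

for s in samples:
    print(
        f"U+{ord(s):04X}",
        code_units(s, "utf-8", 1),
        code_units(s, "utf-16-le", 2),
        code_units(s, "utf-32-le", 4),
    )
# U+0041 1 1 1
# U+00F6 2 1 1
# U+1D11E 4 2 1

# One perceived character, two codepoints: o + COMBINING DIAERESIS.
combined = "o\u0308"
print(len(combined), code_units(combined, "utf-32-le", 4))  # 2 2
```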

'Nothing against UTF-32 as a memory representation from my side.
Or, if it's your real desire, UTF-16.  For data interchange i
prefer bytes.  Besides it is pretty clear that the Unix/POSIX
tools have to be adjusted for real Unicode awareness
(normalization and combining and working on the result).  Why is
there a need to embed completely useless information in a file.
You have to special-case this.  Like running

  $ < nice-windows-file.txt iconv -f UTF-16 -t UTF-8 | some-work

or something.  Stripping the BOM silently may change the checksum.
UTF-8 BOM is horrible in normal data interchange.  It may be ok for
XHTML or XML where some standard uses a fallback encoding, but
then again.  Ach.  ¡Viva la Revolución!
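
The checksum remark can be checked with a few lines of Python (a
sketch only; the payload is made up and SHA-256 is just one
possible checksum, picked by me): the same text with and without
the three signature bytes hashes to different values, so a tool
that strips the BOM silently hands over data that no longer matches
a digest computed on the original.

```python
import codecs
import hashlib

# Hypothetical payload; any UTF-8 text would do.
payload = "Hasta la Victoria Siempre\n".encode("utf-8")
with_bom = codecs.BOM_UTF8 + payload

digest_plain = hashlib.sha256(payload).hexdigest()
digest_bom = hashlib.sha256(with_bom).hexdigest()

print(len(with_bom) - len(payload))  # 3
print(digest_plain == digest_bom)    # False
```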

¡Hasta la Victoria Siempre!

  Steven




Re: Combining latin small letters with diacritics

2012-03-26 Thread Steven Atreju
Denis Jacquerye wrote [2012-03-26 13:35+0200]:
 The fact [.] doesn't make it any saner.

 The same could be said [.]
 
 Denis Moyogo Jacquerye

Are you trying to say that extra tables and exact additional
knowledge besides UnicodeData.txt should not be necessary?
In the end you wanna make it a 32-bit standard.

--
Steven