Re: charset parameter in Google Groups

2010-07-12 Thread Mark Crispin

On Mon, 12 Jul 2010, Philippe Verdy wrote:

> Traditionally, the MIME types are only given in lowercase, so if you
> had written "text/plain; charset=windows-1252", it would have been
> correctly detected.


Nonsense.  Pure, unadulterated nonsense.

I helped write the MIME RFCs, and I can assure you that uppercase MIME
types are permitted and used by long-standing production software.
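For illustration, a minimal Python sketch of case-insensitive handling, using the standard-library email package. The sample message is made up for this example; the only point is that a parser can normalise an uppercase media type rather than reject it.

```python
from email import message_from_string

# Hypothetical message; only the Content-Type header matters here.
RAW = (
    "From: someone@example.org\n"
    "Content-Type: TEXT/PLAIN; charset=ISO-8859-15\n"
    "\n"
    "body\n"
)

def content_type_and_charset(raw):
    """Return (media type, charset) as normalised by the email parser."""
    msg = message_from_string(raw)
    return msg.get_content_type(), msg.get_content_charset()

print(content_type_and_charset(RAW))
```

Both values come back lowercased, so "TEXT/PLAIN" and "text/plain" are treated alike.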


> I don't know which strange email program you use that generates this
> form of MIME types, even if they should be interpreted ignoring case.


At least one standard email program used by millions of users does use
uppercase for the MIME types.


[Nonsensical and irrelevant babble about Windows-1252 deleted]


Software that does not recognize ISO-8859-15 is broken.

-- Mark --

http://panda.com/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.



Re: charset parameter in Google Groups

2010-07-12 Thread Philippe Verdy
The problem in this message is probably not in the specified charset
(windows-1252) but in the way the MIME type is specified just before
it: "TEXT/PLAIN".

Traditionally, the MIME types are only given in lowercase, so if you
had written "text/plain; charset=windows-1252", it would have been
correctly detected. Google must have thought that the MIME header tag
was unknown and ignored it completely, trying to guess the charset
instead. But given that there's very little text in the message, the
guessing algorithm is more likely to fail.

One of the servers may have transcoded the message from windows-1252 to
UTF-8, but forgot to change the MIME type. Then the Google page just
detects the windows-1252 encoding that is still present in the MIME
header, and then displays the message according to it.
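A short Python sketch of that failure mode, assuming the bytes really are UTF-8 while the header still says windows-1252. The helper name misread is invented for this illustration.

```python
def misread(text, actual="utf-8", declared="windows-1252"):
    """Encode text with the actual charset, then decode it with the declared one."""
    return text.encode(actual).decode(declared, errors="replace")

print(misread("café"))       # the é turns into a two-character pair
print(misread("it\u2019s"))  # the curly apostrophe turns into three characters
```

Each non-ASCII character comes out as two or three unrelated characters, which is the usual sign that UTF-8 data is being displayed under a single-byte charset label.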

I don't know which strange email program you use that generates this
form of MIME types, even if they should be interpreted ignoring case.

And I'm not even sure that Google is the culprit here: it may have
been caused by a relaying SMTP server not operated by Google but by
an ISP that transcoded the message without changing the MIME
header accordingly.

Or it may have been caused by your own SMTP agent before the message
was even sent to the Internet (and I think that this is probably the
cause of the problem here, because I seriously doubt that a Google
SMTP relay, or an SMTP relay used by an ISP, would alter the internal
encoding of the message in transit, even if that is allowed for
plain-text content).

Note that "windows-1252" is likely to have been transcoded by an SMTP
server running on Unix/Linux with broken software that assumes only
standard ISO charsets should be admitted, ignoring the fact
that windows-1252 is now becoming the preferred charset over
ISO-8859-1 (but not over UTF-8).

Here I see absolutely no relation with the ISO-8859-15 charset (which
is used by almost nobody, whereas windows-1252 is far superior
for compatibility as it fully preserves the ISO-8859-1 charset, and
just maps additional characters within the code area previously
reserved in ISO-8859-1 for C1 controls that have never been used in
plain-text emails). Even HTML5 recognizes this fact: ISO-8859-1 is
being superseded by windows-1252 for practical reasons.

And there's no reason to ignore this charset, just because it contains
the term "windows", given that the registration made by Microsoft was
made in an open way (no problem caused by the trademark citation,
Microsoft does not want us to display a trademark symbol and a notice
about its owner), and Google is certainly not deciding to
discard/ignore this widely used charset.

Philippe.

> Message of 07/07/10 18:04
> From: "Andreas Prilop"
> To: unicode@unicode.org
> Cc:
> Subject: Re: charset parameter in Google Groups
>
>
> On Tue, 6 Jul 2010, John Dlugosz wrote:
>
> > I often see glyphs where typesetter chars like curved
> > apostrophes were supposed to be, or characteristic
> > UTF-8-as-Latin-1 pairs, in web pages.
> >
> > I've seen the charset meta tag overridden with header values
> > from the server, without regard to what's actually in the file.
>
> This means that *your* software (browser) behaves *exactly*
> in the way I expect for Google, too -- nothing else:
>
>   Recognize the encoding information (charset) of the document
>   and respect it.
>
> If the document has
> charset=ISO-8859-15
> then you SHALL apply this charset value.
>
> You SHALL NOT look whether the author has a Chinese name,
> whether the document was published in Japan, etc. etc.
>
> Is this clear now?




Re: charset parameter in Google Groups

2010-07-07 Thread Asmus Freytag

Andreas,

I think we all realize your frustration with well-meaning software.
Because tags can be wrong through no fault of the human originating the
document, I fully understand that Google might want to attempt to improve
the user experience in such situations.


The problem is that doing so should not come at the expense of authors 
who correctly tag their documents and whose servers preserve their tags 
and don't mess with them. That your message was broken exposed a bug in 
Google's implementation. And that was acknowledged as well.


I have not seen any design details of the algorithm that Google uses 
(when correctly implemented) so I can't comment on whether it strikes 
the correct balance between honoring tags in the "normal" case, where 
they should be presumed to be correctly applied, vs. detecting the case 
of clearly erroneous tags and doing something about them so users aren't 
stuck when documents are mis-tagged.


However, in principle, I support the development of tools that can 
handle imperfect input - after all, you as a human reader also don't 
reject language that isn't perfectly spelled or that violates some of 
the grammatical rules.


There's a benefit to these kinds of tools, but, as you keep reminding 
us, there's a cost (which needs to be minimized). This cost is similar 
to that of a spell-checker rejecting a correctly spelled word. Still we 
are better off with them than without them.


For that reason, I think you will find few takers for your somewhat 
absolutist position, whereas you would get more sympathy if you were 
simply reminding everyone of the dangers of poorly implemented solutions 
that can break correctly tagged data.


A./



Re: charset parameter in Google Groups

2010-07-07 Thread Andreas Prilop
On Wed, 7 Jul 2010, Shawn Steele wrote:

> however, in general, perhaps not your specific case,
> the charset tag on the web cannot be 100% reliably trusted,
> regardless of what the RFCs say.

You do not understand what I mean!

You have missed my point completely!

You DO NOT understand me!



RE: charset parameter in Google Groups

2010-07-07 Thread Shawn Steele
I meant the author of the web page, the question being that "wouldn't a web 
site author realize their mistake when they looked at the web site they just 
made and it was broken."  It's clear that if you owned this web site you'd do 
something different, and it would probably work for you :)

Obviously I have no clue what the exact situation is at Google, however, in 
general, perhaps not your specific case, the charset tag on the web cannot be 
100% reliably trusted, regardless of what the RFCs say.  Perhaps, in the 
future, enough content will be corrected to sway the balance.  Likely it'll get 
worse though :(, as content is blindly copied between pages without regard to 
charset markup... unless UTF-8 gets more momentum.  If the web suddenly 
switched to being 100% perfect about recognizing charset tags, all the software 
vendors would suddenly get millions of complaints about "hey, why'd my favorite 
web page stop working?"  It's really hard to shift everyone to the "right" 
solution, whether it works or not.

It's possible your message might be more successful if sent in UTF-8?  (I have 
no clue, but it might be worth trying).

-Shawn



From: unicode-bou...@unicode.org [unicode-bou...@unicode.org] on behalf of 
Andreas Prilop [prilop4...@trashmail.net]
Sent: Wednesday, July 07, 2010 8:42 AM
To: unicode@unicode.org
Subject: Re: charset parameter in Google Groups

On Tue, 6 Jul 2010, Shawn Steele wrote:

> "Often" the author seems to use the same code page
> they were expecting as a system default, so it can appear
> to work for them even when it's wrong.

I am the author of this news message:
http://groups.google.co.uk/group/de.etc.sprache.deutsch/msg/c4b913eb39cb875b?dmode=source

Could someone please explain to me why groups.google is not smart
enough to display the special, non-ASCII characters correctly?

--
 From the New World:
 http://www.google.co.uk/search?ie=ISO-8859-2&q=Dvofi%E1k



Re: charset parameter in Google Groups

2010-07-07 Thread Andreas Prilop
On Tue, 6 Jul 2010, Shawn Steele wrote:

> "Often" the author seems to use the same code page
> they were expecting as a system default, so it can appear
> to work for them even when it's wrong.

I am the author of this news message:
http://groups.google.co.uk/group/de.etc.sprache.deutsch/msg/c4b913eb39cb875b?dmode=source

Could someone please explain to me why groups.google is not smart
enough to display the special, non-ASCII characters correctly?

-- 
 From the New World:
 http://www.google.co.uk/search?ie=ISO-8859-2&q=Dvofi%E1k




Re: charset parameter in Google Groups

2010-07-07 Thread Andreas Prilop
On Tue, 6 Jul 2010, John Dlugosz wrote:

> I often see glyphs where typesetter chars like curved
> apostrophes were supposed to be, or characteristic
> UTF-8-as-Latin-1 pairs, in web pages.
>
> I've seen the charset meta tag overridden with header values
> from the server, without regard to what's actually in the file.

This means that *your* software (browser) behaves *exactly*
in the way I expect for Google, too -- nothing else:

  Recognize the encoding information (charset) of the document
  and respect it.

If the document has
charset=ISO-8859-15
then you SHALL apply this charset value.

You SHALL NOT look whether the author has a Chinese name,
whether the document was published in Japan, etc. etc.

Is this clear now?

-- 
 From the New World:
 http://www.google.co.uk/search?ie=ISO-8859-2&q=Dvofi%E1k



RE: charset parameter in Google Groups

2010-07-06 Thread Shawn Steele
> Do you really think that the author of a webpage or message will not 
> notice?

"Often" the author seems to use the same code page they were expecting as a 
system default, so it can appear to work for them even when it's wrong.

-Shawn




RE: charset parameter in Google Groups

2010-07-06 Thread John Dlugosz
> Do you really think that the author of a webpage or message
> will not notice?

Apparently so, in cases we notice and complain about.  I often see glyphs 
where typesetter chars like curved apostrophes were supposed to be, or 
characteristic UTF-8-as-Latin-1 pairs, in web pages.

They don't tend to write conforming HTML by any means, so that might have to do 
with it.  Without a DOCTYPE, anything goes, right?  I've seen the charset meta 
tag overridden with header values from the server, without regard to what's 
actually in the file.

--John









Re: charset parameter in Google Groups

2010-07-02 Thread Jukka K. Korpela

John Burger wrote:


> Asmus distinguishes between two kinds of cases: The first is guessing
> the charset incorrectly in a way that completely degrades the text,
> e.g. 8859-1 vs. 8859-2.  The second is a more subtle kind of mistake,
> and arguably much less objectionable, e.g., 8859-1 vs. 1252, or the
> "smart quotes" problem.


I'd say the key distinction is between protocol-incorrect behavior (like 
ignoring a character encoding specified properly, due to some “heuristics”) 
and error handling. If a document is declared as ISO-8859-1 encoded, it is 
protocol-incorrect to treat it as anything else, if all the octets are 
defined in ISO-8859-1 and allowed in the data format. However, an HTML 4.01 
document declared as ISO-8859-1 encoded and containing, say, octet 80 
(hexadecimal) is by definition malformed. A browser may decide to refuse to 
display it at all (not a good decision in practice) or to perform some error 
correction, like interpreting the data as windows-1252 encoded instead.



> I like this distinction, and would point out that we can probably
> quantify this into a continuum,


No, I think this requires discretion. Incorrect behavior vs. error handling 
(which may vary, though strong arguments may favor one or another approach).


If you ask me, error recovery should be signalled to the end user, though 
perhaps discretely (pun intended) in cases where it seems “obvious”.
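A minimal Python sketch of that distinction: honour the ISO-8859-1 declaration when every octet is acceptable, otherwise recover and report that recovery happened. Choosing windows-1252 as the fallback is one possible error-correction policy, not the only one, and the function name is invented here.

```python
C1_RANGE = range(0x80, 0xA0)  # octets not allowed in HTML 4.01 text

def decode_declared_latin1(data: bytes):
    """Decode data declared as ISO-8859-1; return (text, recovered)."""
    if not any(b in C1_RANGE for b in data):
        # Well-formed: the declaration is honoured as given.
        return data.decode("iso-8859-1"), False
    # Malformed input: recover by reading it as windows-1252 (an assumed
    # policy). Python's cp1252 leaves a few octets undefined, so those are
    # replaced rather than raising an error.
    return data.decode("cp1252", errors="replace"), True

text, recovered = decode_declared_latin1(b"\x93quoted\x94")
print(text, "(recovered)" if recovered else "")
```

The second return value is what a user interface could use to signal, quietly or not, that error recovery took place.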


--
Yucca, http://www.cs.tut.fi/~jkorpela/ 





Re: charset parameter in Google Groups

2010-07-02 Thread Andreas Prilop
On Thu, 1 Jul 2010, John Burger wrote:

> If you have never encountered a web page in which the charset
> parameter encoded in the page (or in the HTTP headers) did not
> accurately reflect the "real charset", as indicated by the actual
> data in the page

How is it possible that you noticed that?

It's because your software indeed respects the charset parameter
and displays the document accordingly.

I'm asking for nothing else.



Re: charset parameter in Google Groups

2010-07-02 Thread John Burger
Asmus distinguishes between two kinds of cases: The first is guessing  
the charset incorrectly in a way that completely degrades the text,  
e.g. 8859-1 vs. 8859-2.  Second is a more subtle kind of mistakes, and  
arguably much less objectionable, e.g., 8859-1 vs. 1252, or the "smart  
quotes" problem.


I like this distinction, and would point out that we can probably  
quantify this into a continuum, in the sense that most of the code  
points in 8859-1 and 1252 are equivalent, while fewer are so in 8859-1  
and 8859-2.  (If we wished, we could refine this further by assigning  
different penalties for showing the wrong glyph for an alphabetic  
character than for punctuation.)
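As a rough sketch of that quantification, the following Python counts how many of the 256 byte values decode to the same character under two charsets. It relies on Python's own codec tables, which is an assumption about what counts as "the" definition of each charset; the function names are invented for this example.

```python
def decode_byte(value, charset):
    """Decode a single byte value, or return None if the charset leaves it undefined."""
    try:
        return bytes([value]).decode(charset)
    except UnicodeDecodeError:
        return None

def agreement(a, b):
    """Number of byte values that both charsets map to the same character."""
    count = 0
    for value in range(256):
        char_a = decode_byte(value, a)
        if char_a is not None and char_a == decode_byte(value, b):
            count += 1
    return count

print(agreement("iso-8859-1", "cp1252"))
print(agreement("iso-8859-1", "iso-8859-2"))
```

A per-character penalty, as suggested above, could replace the simple count by summing a weight for each mismatching byte value.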



> If I were to design a charset-verifier, I would distinguish between
> these two cases. If something came tagged with a region-specific
> charset, I would honor that, unless I found strong evidence of the
> "this can't be right" nature. In some cases, to collect such evidence
> would require significant statistics. The rule here should be "do no
> harm", that is, destroying a document by incorrectly changing a true
> charset should receive a much higher penalty than failing to detect a
> broken charset. (That way, you don't penalize people who live by the
> rules :).


I have always thought that the "right way" to deal with determining  
the correct charset of a document is to treat it as a statistical  
classification problem.  Given a collection of documents as training  
data, we could extract features including the following:


- "suggested" charset, document type, and other information from
  metadata, such as HTTP Content-Type, HTML tags, email headers, etc.
- various statistical signatures from the text itself, e.g. ngrams
- top-level domain of the originating web site
- anything else we can think of

We can then apply one of many possible multi-class algorithms  
developed by the machine learning community to this training set.   
Such an algorithm would learn how to weight the different features so  
as to tag the most documents correctly.  (For some of these algorithms  
we would have to tag each document in the training set with the "real"  
charset, but there are also semi-supervised and unsupervised  
algorithms that would discover the most consistent assignment, if we  
were unable or unwilling to correctly tag everything in our dataset.)


I have always assumed that Google, or someone, must already be doing  
this sort of thing (although perhaps not on Google Groups!).


Asmus' comments made me realize that the machine learning approach I  
outline above can be taken even further: there are many classification  
algorithms that can be trained with different penalties for different  
kinds of mistakes.  These penalties could be determined by hand, or  
could come from quantifying the potential degradation as I describe  
above.  This provides a natural and principled way to require far more  
evidence for overriding 8859-1 with 8859-2 than with 1252, for example.
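A small Python sketch of the penalty idea, for a document tagged iso-8859-1. The posterior probabilities would come from whatever classifier is used; here they are simply passed in, and every number in the COST table is an illustrative assumption rather than a measured value.

```python
# Hypothetical penalties, COST[(true_charset, chosen_charset)], for a
# document tagged iso-8859-1. Wrongly overriding to 8859-2 is made far
# more expensive than wrongly overriding to 1252.
COST = {
    ("iso-8859-1", "windows-1252"): 1,
    ("windows-1252", "iso-8859-1"): 1,
    ("iso-8859-1", "iso-8859-2"): 10,
    ("iso-8859-2", "iso-8859-1"): 3,
    ("windows-1252", "iso-8859-2"): 10,
    ("iso-8859-2", "windows-1252"): 10,
}

def expected_cost(choice, posterior):
    """Expected penalty of choosing `choice`, given P(true charset)."""
    return sum(p * COST[(true, choice)]
               for true, p in posterior.items() if true != choice)

def choose_charset(posterior):
    """Pick the charset with the lowest expected penalty."""
    return min(posterior, key=lambda c: expected_cost(c, posterior))

# The same 60% evidence is enough to switch to 1252 but not to 8859-2.
print(choose_charset({"iso-8859-1": 0.4, "windows-1252": 0.6, "iso-8859-2": 0.0}))
print(choose_charset({"iso-8859-1": 0.4, "windows-1252": 0.0, "iso-8859-2": 0.6}))
```

With these assumed penalties the tag is overridden in the first case and kept in the second, even though the evidence against it is equally strong.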


- John D. Burger
  MITRE




Re: charset parameter in Google Groups

2010-07-01 Thread Asmus Freytag

On 7/1/2010 11:29 AM, John Burger wrote:

> Andreas Prilop wrote:
>
>>> The problem with slavishly following the charset parameter is
>>> that it is often incorrect.
>>
>> I wonder how you could draw such a conclusion. In order to make
>> such a statement, there must be some other (god-given?) parameter,
>> which is the "real charset".
>
> If you have never encountered a web page in which the charset
> parameter encoded in the page (or in the HTTP headers) did not
> accurately reflect the "real charset", as indicated by the actual data
> in the page, then your experience differs sharply from mine, and from
> everyone else I have ever met.



Let's unravel this.

First, there are qualitative vs. quantitative arguments. Yes, mis-tagging 
occurs (for all the reasons Shawn gave in his reply). But Andreas' point 
was that for languages needing more than ASCII, there's a nice 
corrective. If many (most) viewers now base their display on charset, 
then more documents would be expected to be correctly tagged for those 
types of text, because they tend to degrade dramatically otherwise and 
users (authors) would take action to correct the situation. The example 
of this is reading a text as 8859-1 when it is 8859-2 (Eastern European).


This is different from the issue of selecting the correct 
charset, if it only affects some special symbols (copyright, punctuation 
marks, the euro sign). In these cases, the text degrades in much more 
subtle ways, and usually remains readable. I would expect that the 
incidence of mis-tagging in such a situation is larger. The example for 
this is reading a text as 8859-1 when it was 1252 (Windows code page 
with extra characters not in ISO 8859-1 - Shawn mentioned this case as 
well).


If I were to design a charset-verifier, I would distinguish between 
these two cases. If something came tagged with a region-specific 
charset, I would honor that, unless I found strong evidence of the "this 
can't be right" nature. In some cases, to collect such evidence would 
require significant statistics. The rule here should be "do no harm", 
that is, destroying a document by incorrectly changing a true charset 
should receive a much higher penalty than failing to detect a broken 
charset. (That way, you don't penalize people who live by the rules :).


When it comes to a document tagged with 8859-1, I might relax this 
slightly, as that tag is one of the common default tags and is more 
likely to have been applied blindly.


When it comes to deciding whether something is Windows code page or a 
true ISO charset, the bar can be set lower - one is a superset of the 
other usually, and detecting any characters in the superset should 
trigger a reassignment. Unlike the other case, the "penalties" for 
getting this wrong are much less severe.


A./



RE: charset parameter in Google Groups

2010-07-01 Thread Shawn Steele
> > The problem with slavishly following the charset parameter is that it 
> > is often incorrect.

> I wonder how you could draw such a conclusion. In order to make such
> a statement, there must be some other (god-given?) parameter, which is the 
> "real charset".

> Each and every program (webbrowser, newsreader, e-mailer ...)

Actually, historically, that's not quite right.  NOW they do (if they're 
behaving), but in the past they often just used whatever the system code page 
is.  Even worse, people would write in one local code page, stick it on an 
en-US server, and then "test" it on the same source machine (same locale), so 
then it "worked", but only for them.  Once it gets read by a different machine 
it doesn't work.

Even worse, either the editing software, or the server, might mistag the code 
page because they were trying to fill in missing information.  And there was a 
common abuse of the ISO code pages for what were really windows code page 
encoded data.

So, now, in theory, and in well-behaved environments, the taggings are much 
more accurate, however it can be difficult to distinguish correctly tagged data 
from mis-tagged data.  Using UTF-8 helps a ton, because it's pretty obvious 
that it's UTF-8.
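A quick Python sketch of why UTF-8 is comparatively easy to recognise: text in a legacy single-byte code page that contains non-ASCII characters rarely happens to form valid UTF-8 sequences. The function name is invented for this illustration, and the check is a heuristic, not a proof.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data contains non-ASCII bytes and is well-formed UTF-8."""
    if data.isascii():
        return False  # pure ASCII gives no evidence either way
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True

print(looks_like_utf8("Dvořák".encode("utf-8")))
print(looks_like_utf8("Dvořák".encode("iso-8859-2")))
```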

Anyway, I have no clue what Google's doing, however mis-tagging of data is a 
common problem in the industry, and a great reason to use Unicode.  Some 
countries have an even bigger problem due to variations in implementations of 
their commonly used code pages, and extensions which may, or may not, always be 
supported.  It's also part of why you occasionally see things like badly marked 
up rich quotes on major news sites, even now.

-Shawn





Re: charset parameter in Google Groups

2010-07-01 Thread John Burger

Andreas Prilop wrote:

>> The problem with slavishly following the charset parameter is
>> that it is often incorrect.
>
> I wonder how you could draw such a conclusion. In order to make
> such a statement, there must be some other (god-given?) parameter,
> which is the "real charset".



If you have never encountered a web page in which the charset  
parameter encoded in the page (or in the HTTP headers) did not  
accurately reflect the "real charset", as indicated by the actual data  
in the page, then your experience differs sharply from mine, and from  
everyone else I have ever met.


- John D. Burger
  MITRE




Re: charset parameter in Google Groups

2010-07-01 Thread Andreas Prilop
On Mon, 28 Jun 2010, Mark Davis wrote:

> I'll overlook the lack of civility, since I can understand
> that kind of frustration when something doesn't work.

Well, I have been aware of this problem/bug for many years now:
 http://groups.google.co.uk/group/sci.lang/msg/eb55255e1925350f

Over the years I tried again and again and again to write
to Google, for example with such forms as
http://www.google.com/support/contact/bin/request.py?page=&contact_type=suggestion_t&master=suggestion_t&Action.Search=Continue
But nothing happened.

How to file a bug report with Google?

> This is the first I've heard of this as a problem with Google Groups.
> I filed a bug against Groups for this issue; I'll see what they find out.

Thank you!

> BTW, does the same thing happen if you send your email in UTF-8?

No. On the contrary, groups.google tries UTF-8 always:
 http://groups.google.co.uk/group/pl.test/msg/359af83289a00e8e
This message contains Latin-2 characters with charset=ISO-8859-2.

> The problem with slavishly following the charset parameter is
> that it is often incorrect.

I wonder how you could draw such a conclusion. In order to make
such a statement, there must be some other (god-given?) parameter,
which is the "real charset".

Each and every program (webbrowser, newsreader, e-mailer ...)
reads the charset parameter and displays the document
(webpage, e-mail message, news message) accordingly.
Do you really think that the author of a webpage or message
will not notice?
Do you really think that all these programs (including
the author's tools) render the document incorrectly
and only Google knows better?

How do you come to this conclusion?


I admit that the situation is different with the LANG
attribute in HTML. That's because writing, e.g.
  
has almost no practical implications and the text
might well be French.


But when we have  charset=ISO-8859-7 , all of the author's
programs (I believe) will display the document in Greek.

How can you think that only Google knows better
to "correct" this into  charset=ISO-8859-1 ?


You can still make your guesses when a charset parameter
is completely missing.



Re: charset parameter in Google Groups

2010-06-30 Thread Jon Hanna

António MARTINS-Tuválkin wrote:
>> If the EU can tell Britain that it can't sell eggs by the dozen any
>> more,
>
> Yesterday I bought a dozen eggs (2 racks of 6, set 2×3) here in Portugal.
> This must be an incredibly new regulation.


The Daily Mail isn't as easily available in Portugal. It's one of 
several EU regulations that exist in the world the Daily Mail writes 
about. Sort of like the way to J K Rowling and her fans there's a rule 
against people under seventeen doing magic outside of Hogwarts, but in 
the real world saying that is just nonsense.





(OT) Myths, was Re: charset parameter in Google Groups

2010-06-30 Thread Julian Bradfield
On 2010-06-29, António MARTINS-Tuválkin  wrote:
> On 2010.06.28, 20:48, Mark Crispin  wrote:
...
>> If the EU can tell Britain that it can't sell eggs by the dozen any 
>> more, 
> Yesterday I bought a dozen eggs (2 racks of 6, set 2×3) here in Portugal. 
> This must be an incredibly new regulation.

Just for the record, the regulations in question are not yet agreed, and
they don't say you can't sell by the dozen. What they do say is that
the weight of the contents must be shown. This is arguably silly, and
arguably sensible - it's not obviously either.

> I guess it is like the UK — you can get arrested for flying the Union 
> Jack afloat (even in a row boat on the Serpentine), lest you be 
> mistaken for a RN Admiral or the Queen, but you can rob and steal and 
> lie and still get (and keep) knighthood.

Neither of these is strictly true, either - well, except the lie bit. These
days, knights convicted of serious offences are usually un-knighted.
Unfortunately, neither lying (out of court) nor catastrophic financial
mismanagement are actually offences.
As for flags, the laws for shipping specify those flags that can be
flown to indicate that you are a British ship, and forbid the flying
of other national colours. The Union Flag, like the White Ensign, is a flag
specified for government vessels. So flying your Union Flag on a
row-boat on the Serpentine is technically illegal, and also tactless,
since you're in a Royal Park, but I'd be surprised if anybody cares.
Most, if not all countries, will have similar laws for shipping, as
the flag of nationality is an important concept in international
maritime law.

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.




Re: charset parameter in Google Groups

2010-06-29 Thread António MARTINS-Tuválkin
On 2010.06.28, 20:48, Mark Crispin  wrote:

> I have often been the victim of having my valid data transformed into 
> garbage by the attitude of standards-violators who think that I can't 
> possibly be following the standards, therefore my data must be "fixed". 
>  When I protest that my data was "fixed" into garbage, I get told that 
> my complaint doesn't matter.

+1, Mark. I actually lost a job for being “cheeky” — insisting on doing 
things properly when creating a website version in Arabic. The company’s 
“wizard” was stuck in 8859 garbage and didn’t even want to hear about 
Unicode (because it allows for viruses and phishing, he heard); his hacks 
to show "õ" and "ظ" on the same page with SPAN tags and piped fonts were 
painful to withstand. That company eventually lost that particular client 
(jjj.ppvnc.cg) to a better outfit, after I was already away in greener 
pastures.

> If the EU can tell Britain that it can't sell eggs by the dozen any 
> more, 

Yesterday I bought a dozen eggs (2 racks of 6, set 2×3) here in Portugal. 
This must be an incredibly new regulation.

> it can shut down a messaging service in Europe that does not comply 
> with published standards.

I guess it is like the UK — you can get arrested for flying the Union 
Jack afloat (even in a row boat on the Serpentine), lest you be 
mistaken for a RN Admiral or the Queen, but you can rob and steal and 
lie and still get (and keep) knighthood.

-- 
António MARTINS-Tuválkin.
  |  ()|
 Não me invejo de quem tem ||
PT-1500-111 LISBOA   carros, parelhas e montes  |
+351 934 821 700, +351 212 463 477   só me invejo de quem bebe  |
ICQ:193279138  http://tuvalkin.web.pt/   a água em todas as fontes  |
-
De sable uma fonte e bordadura escaqueada de jalde e goles, por timbre a 
bandeira, por mote o 1º verso acima, e por grito de guerra "Mi rajtas!".
-





Re: charset parameter in Google Groups (was Re: Indian Rupee Sign to be chosen today)

2010-06-28 Thread Asmus Freytag

On 6/28/2010 11:38 AM, Mark Davis ☕ wrote:



> The problem with slavishly following the charset parameter is that it
> is often incorrect. However, the charset parameter is a signal into
> the character detection module, so if the charset is correctly supplied
> from the message then the results of the detection will be weighted
> in that direction.


The weighting factor / mechanism may be something that you might look at 
for possible improvement.


Doug raised an interesting argument, i.e. that some values of a charset 
parameter might have a higher probability of being correct than other 
values.


If something is tagged Latin-1 or Windows-1252, the chances are that 
this is merely an unexamined default setting. Most of the other 8859 
values should be much less likely to be such "blind" defaults.


I wonder whether the probability of successful charset assignment 
increases if you were to give these more "specific" charset values a 
higher weight.
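One way to picture such a weighting, as a Python sketch. The detector scores, the set of "blind default" tags, and both weight values are invented for illustration; nothing here describes how Google's module actually works.

```python
# Tags assumed to be common unexamined defaults (an assumption for this sketch).
BLIND_DEFAULTS = {"iso-8859-1", "windows-1252"}

def tag_weight(declared):
    """Trust a specific tag more than a likely default; values are made up."""
    return 1.5 if declared.lower() in BLIND_DEFAULTS else 4.0

def pick_charset(declared, detector_scores):
    """Boost the declared charset's detector score, then take the best."""
    declared = declared.lower()
    weighted = dict(detector_scores)
    weighted[declared] = weighted.get(declared, 0.0) * tag_weight(declared)
    return max(weighted, key=weighted.get)

# A specific tag survives a mildly contrary detector; a default tag does not.
print(pick_charset("ISO-8859-2", {"iso-8859-1": 0.5, "iso-8859-2": 0.3}))
print(pick_charset("ISO-8859-1", {"iso-8859-1": 0.3, "windows-1252": 0.5}))
```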


When I played with simple recognition algorithms about 15 years ago, I 
found that some simple methods for crude language detection gave 
signatures that would allow charset detection. Even though these methods 
weren't sophisticated enough to resolve actual languages (esp. among 
closely related languages) they were good enough to narrow things down 
to the point where one could pick or confirm charsets.


For example, significant stretches of German can be written without 
diacritics, and can fool charset detection unless it picks up on the 
statistical patterns for German. With that in hand, the first non-ASCII 
character encountered is then likely to "nail" the charset. Or, absent 
such character, the statistics can be used to confirm that an existing 
charset assignment is plausible. (8859-15 having been deliberately 
designed to be "undetectable" is the exception, unless there's a Euro 
sign in the scanned part of the document...)
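To see how little there is to detect, this Python sketch lists the byte values at which ISO-8859-1 and ISO-8859-15 disagree, using Python's codec tables. The function name is invented for this example.

```python
def differing_bytes(a="iso-8859-1", b="iso-8859-15"):
    """Byte values that the two charsets decode to different characters."""
    return [value for value in range(256)
            if bytes([value]).decode(a) != bytes([value]).decode(b)]

for value in differing_bytes():
    latin1 = bytes([value]).decode("iso-8859-1")
    latin9 = bytes([value]).decode("iso-8859-15")
    print(f"0x{value:02X}: {latin1!r} vs {latin9!r}")
```

Only a handful of positions differ, and the characters at those positions are rare in most running text, so a document without any of them reads identically under either label.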


A./



RE: charset parameter in Google Groups (was Re: Indian Rupee Sign to be chosen today)

2010-06-28 Thread Doug Ewell
Mark Crispin  wrote:

> On Mon, 28 Jun 2010, Mark Davis ☕ wrote:
>> The problem with slavishly following the charset parameter is that it
>> is often incorrect. However, the charset parameter is a signal into
>> the character detection module, so if the charset is correctly supplied
>> from the message then the results of the detection will be weighted
>> in that direction.
> 
> I interpret these two sentences as:
> 
> "The problem with following the standards is that some people don't
> follow the standards.  So instead of following the standards
> ourselves, we will guess if the other guy follows the standards or
> not, no matter how much he claims to follow standards.  Too bad if our
> fix transforms his valid data into garbage."

At the very least, it would be nice if the charset parameter constituted
a much stronger signal into the detection module than it apparently did
in Andreas' case, so that if he says the text is 8859-15, and we already
know that 8859-15 is nearly impossible to distinguish heuristically from
8859-1, the module might as well take his word for it.

I do tend to agree with Mark that the complaint against Google Groups
(with which I am not affiliated) might have been posted with more
civility and less invective.

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s






Re: charset parameter in Google Groups (was Re: Indian Rupee Sign to be chosen today)

2010-06-28 Thread Mark Crispin

On Mon, 28 Jun 2010, Mark Davis ☕ wrote:

> The problem with slavishly following the charset parameter is that it is
> often incorrect. However, the charset parameter is a signal into the
> character detection module, so if the charset is correctly supplied from
> the message then the results of the detection will be weighted in that
> direction.


I interpret these two sentences as:

"The problem with following the standards is that some people don't follow
the standards.  So instead of following the standards ourselves, we will
guess if the other guy follows the standards or not, no matter how much he
claims to follow standards.  Too bad if our fix transforms his valid data
into garbage."

I have heard that song many times over 30 years.  I have often been the
victim of having my valid data transformed into garbage by the attitude of
standards-violators who think that I can't possibly be following the
standards, therefore my data must be "fixed".  When I protest that my data
was "fixed" into garbage, I get told that my complaint doesn't matter.

I take a hard line these days.  I generally don't like the concept of
"fixing" things; I believe in GIGO (Garbage In, Garbage Out).  More
importantly, I utterly reject VIGO (Valid In, Garbage Out) caused by
ill-considered efforts to create GIVO.

What I don't understand is why the EU doesn't use its regulatory power to
force compliance.  If the EU can tell Britain that it can't sell eggs by
the dozen any more, it can shut down a messaging service in Europe that
does not comply with published standards.

-- Mark --

http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.