Re: charset parameter in Google Groups
On Mon, 12 Jul 2010, Philippe Verdy wrote:
> Traditionally, the MIME types are only given in lowercase, so if you had written "text/plain; charset=windows-1252", it would have been correctly detected.

Nonsense. Pure, unadulterated nonsense. I helped write the MIME RFCs, and I can assure you that uppercase MIME types are permitted and used by long-standing production software.

> I don't know which strange email program you use that generates this form of MIME types, even if they should be interpreted ignoring case.

At least one standard email program used by millions of users uses uppercase for the MIME types.

[Nonsensical and irrelevant babble about Windows-1252 deleted]

Software that does not recognize ISO-8859-15 is broken.

-- Mark --
http://panda.com/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
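For readers who want to check the case question themselves: RFC 2045 specifies that media type, subtype and parameter names are not case-sensitive. The following is only a minimal sketch using Python's standard `email` package; the sample message is invented for illustration and is not the message discussed in this thread.

```python
from email import message_from_string

# Hypothetical message with the uppercase media type under discussion.
RAW_MESSAGE = (
    "MIME-Version: 1.0\n"
    "Content-Type: TEXT/PLAIN; charset=ISO-8859-15\n"
    "\n"
    "Hello\n"
)

message = message_from_string(RAW_MESSAGE)

# Both accessors return lower-cased values, so an uppercase header
# is handled exactly like a lowercase one.
content_type = message.get_content_type()
charset = message.get_content_charset()

print(content_type)  # text/plain
print(charset)       # iso-8859-15
```

A receiver that compares these normalised values never needs to care how the sender capitalised the header.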
Re: charset parameter in Google Groups
The problem in this message is probably not in the specified charset (windows-1252) but in the way the MIME type is specified just before it: "TEXT/PLAIN". Traditionally, the MIME types are only given in lowercase, so if you had written "text/plain; charset=windows-1252", it would have been correctly detected.

Google must have thought that the MIME header tag was unknown, and it ignored it completely, trying to guess the charset instead. But given that there's very little text in the message, the guessing algorithm is more likely to fail.

One of the servers may have transcoded the message from windows-1252 to UTF-8, but forgot to change the MIME type. Then the Google page just detects the windows-1252 encoding that is still present in the MIME header, and then displays the message according to it.

I don't know which strange email program you use that generates this form of MIME types, even if they should be interpreted ignoring case. And I'm not even sure that Google is the culprit here: it may have been caused by a relaying SMTP server not operated by Google but by an ISP that transcoded the message without changing the MIME header accordingly. Or it may have been caused by your own specific SMTP agent before it was even sent to the Internet (and I think that this is probably the cause of the problem here, because I seriously doubt that a Google SMTP relay, or an SMTP relay used by an ISP, would alter the internal encoding of the message during transit, even if that is allowed for plain-text contents).

Note that "windows-1252" is likely to have been transcoded by an SMTP server running on Unix/Linux with a broken version, thinking that only standard ISO charsets should be admitted, but still ignoring the fact that windows-1252 is now becoming the preferred charset over ISO-8859-1 (but not over UTF-8).

Here I see absolutely no relation with the ISO-8859-15 charset (which is used by almost nobody, when windows-1252 is far superior for compatibility as it fully preserves the ISO-8859-1 charset, and just maps additional characters within the code area previously reserved in ISO-8859-1 for C1 controls that have never been used in plain-text emails). Even HTML5 recognizes this fact: ISO-8859-1 is being deprecated in favor of windows-1252 for practical reasons.

And there's no reason to ignore this charset just because it contains the term "windows", given that the registration made by Microsoft was made in an open way (no problem caused by the trademark citation; Microsoft does not want us to display a trademark symbol and a notice about its owner), and Google is certainly not deciding to discard/ignore this widely used charset.

Philippe.

> Message of 07/07/10 18:04
> From: "Andreas Prilop"
> To: unicode@unicode.org
> Cc:
> Subject: Re: charset parameter in Google Groups
>
> On Tue, 6 Jul 2010, John Dlugosz wrote:
>
> > I often see glyphs where typesetter chars like curved apostrophes were supposed to be, or characteristic UTF-8-as-Latin-1 pairs, in web pages.
> >
> > I've seen the charset meta tag overridden with header values from the server, without regard to what's actually in the file.
>
> This means that *your* software (browser) behaves *exactly* in the way I expect for Google, too -- nothing else:
>
> Recognize the encoding information (charset) of the document and respect it.
>
> If the document has
> charset=ISO-8859-15
> then you SHALL apply this charset value.
>
> You SHALL NOT look whether the author has a Chinese name, whether the document was published in Japan, etc. etc.
>
> Is this clear now?
Re: charset parameter in Google Groups
Andreas, I think we all realize your frustration with well-meaning software. Because tags can be wrong for no fault of the human originating the document, I fully understand that Google might want to attempt to improve the user experience in such situations. The problem is that doing so should not come at the expense of authors who correctly tag their documents and whose servers preserve their tags and don't mess with them. That your message was broken exposed a bug in Google's implementation. And that was acknowledged as well. I have not seen any design details of the algorithm that Google uses (when correctly implemented) so I can't comment on whether it strikes the correct balance between honoring tags in the "normal" case, where they should be presumed to be correctly applied, vs. detecting the case of clearly erroneous tags and doing something about them so users aren't stuck when documents are mis-tagged. However, in principle, I support the development of tools that can handle imperfect input - after all, you as a human reader also don't reject language that isn't perfectly spelled or that violates some of the grammatical rules. There's a benefit to these kinds of tools, but, as you keep reminding us, there's a cost (which needs to be minimized). This cost is similar to that of a spell-checker rejecting a correctly spelled word. Still we are better off with them than without them. For that reason, I think you will find few takers for your somewhat absolutist position, whereas you would get more sympathy if you were simply reminding everyone of the dangers of poorly implemented solutions that can break correctly tagged data. A./
Re: charset parameter in Google Groups
On Wed, 7 Jul 2010, Shawn Steele wrote: > however, in general, perhaps not your specific case, > the charset tag on the web cannot be 100% reliably trusted, > regardless of what the RFCs say. You do not understand what I mean! You have missed my point completely! You DO NOT understand me!
RE: charset parameter in Google Groups
I meant the author of the web page, the question being that "wouldn't a web site author realize their mistake when they looked at the web site they just made and it was broken." It's clear that if you owned this web site you'd do something different, and it would probably work for you :) Obviously I have no clue what the exact situation is at Google, however, in general, perhaps not your specific case, the charset tag on the web cannot be 100% reliably trusted, regardless of what the RFCs say. Perhaps, in the future, enough content will be corrected to sway the balance. Likely it'll get worse though :(, as content is blindly copied between pages without regard to charset markup... unless UTF-8 gets more momentum. If the web suddenly switched to being 100% perfect about recognizing charset tags, all the software vendors would suddenly get millions of complaints about "hey, why'd my favorite web page stop working?" It's really hard to shift everyone to the "right" solution, whether it works or not. It's possible your message might be more successful if sent in UTF-8? (I have no clue, but it might be worth trying). -Shawn From: unicode-bou...@unicode.org [unicode-bou...@unicode.org] on behalf of Andreas Prilop [prilop4...@trashmail.net] Sent: Wednesday, July 07, 2010 8:42 AM To: unicode@unicode.org Subject: Re: charset parameter in Google Groups On Tue, 6 Jul 2010, Shawn Steele wrote: > "Often" the author seems to use the same code page > they were expecting as a system default, so it can appear > to work for them even when it's wrong. I am the author of this news message: http://groups.google.co.uk/group/de.etc.sprache.deutsch/msg/c4b913eb39cb875b?dmode=source Please explain someone to me why groups.google is not smart enough to display the special, non-ASCII characters correctly. -- From the New World: http://www.google.co.uk/search?ie=ISO-8859-2&q=Dvofi%E1k
Re: charset parameter in Google Groups
On Tue, 6 Jul 2010, Shawn Steele wrote: > "Often" the author seems to use the same code page > they were expecting as a system default, so it can appear > to work for them even when it's wrong. I am the author of this news message: http://groups.google.co.uk/group/de.etc.sprache.deutsch/msg/c4b913eb39cb875b?dmode=source Please explain someone to me why groups.google is not smart enough to display the special, non-ASCII characters correctly. -- From the New World: http://www.google.co.uk/search?ie=ISO-8859-2&q=Dvofi%E1k
Re: charset parameter in Google Groups
On Tue, 6 Jul 2010, John Dlugosz wrote:
> I often see glyphs where typesetter chars like curved apostrophes were supposed to be, or characteristic UTF-8-as-Latin-1 pairs, in web pages.
>
> I've seen the charset meta tag overridden with header values from the server, without regard to what's actually in the file.

This means that *your* software (browser) behaves *exactly* in the way I expect for Google, too -- nothing else:

Recognize the encoding information (charset) of the document and respect it.

If the document has
charset=ISO-8859-15
then you SHALL apply this charset value.

You SHALL NOT look whether the author has a Chinese name, whether the document was published in Japan, etc. etc.

Is this clear now?

-- From the New World: http://www.google.co.uk/search?ie=ISO-8859-2&q=Dvofi%E1k
RE: charset parameter in Google Groups
> Do you really think that the author of a webpage or message will not > notice? "Often" the author seems to use the same code page they were expecting as a system default, so it can appear to work for them even when it's wrong. -Shawn
RE: charset parameter in Google Groups
> Do you really think that the author of a webpage or message will not notice?

Apparently so, in cases we notice and complain about. I often see glyphs where typesetter chars like curved apostrophes were supposed to be, or characteristic UTF-8-as-Latin-1 pairs, in web pages. They don't tend to write conforming HTML by any means, so that might have to do with it. Without a DOCTYPE, anything goes, right?

I've seen the charset meta tag overridden with header values from the server, without regard to what's actually in the file.

--John
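The "UTF-8-as-Latin-1 pairs" mentioned above are easy to reproduce. This is a small illustrative sketch (the helper name `misread` is made up here): text is encoded in its real charset and then decoded with the wrong one.

```python
def misread(text: str, actual: str, assumed: str) -> str:
    """Encode text in its real charset, then decode it with the wrong one."""
    return text.encode(actual).decode(assumed, errors="replace")


# One accented letter becomes a two-character pair.
print(misread("café", "utf-8", "iso-8859-1"))    # cafÃ©

# A curly apostrophe becomes three characters under windows-1252.
print(misread("it’s", "utf-8", "windows-1252"))  # itâ€™s
```

The second case is the typical "smart quotes" damage: the text stays readable, which is probably why it so often goes uncorrected.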
Re: charset parameter in Google Groups
John Burger wrote:
> Asmus distinguishes between two kinds of cases: The first is guessing the charset incorrectly in a way that completely degrades the text, e.g. 8859-1 vs. 8859-2. Second is a more subtle kind of mistake, and arguably much less objectionable, e.g., 8859-1 vs. 1252, or the "smart quotes" problem.

I'd say the key distinction is between protocol-incorrect behavior (like ignoring a character encoding specified properly, due to some “heuristics”) and error handling. If a document is declared as ISO-8859-1 encoded, it is protocol-incorrect to treat it as anything else, if all the octets are defined in ISO-8859-1 and allowed in the data format. However, an HTML 4.01 document declared as ISO-8859-1 encoded and containing, say, octet 80 (hexadecimal) is by definition malformed. A browser may decide to refuse to display it at all (not a good decision in practice) or to perform some error correction, like interpreting the data as windows-1252 encoded instead.

> I like this distinction, and would point out that we can probably quantify this into a continuum,

No, I think this requires discretion. Incorrect behavior vs. error handling (which may vary, though strong arguments may favor one or another approach). If you ask me, error recovery should be signalled to the end user, though perhaps discretely (pun intended) in cases where it seems “obvious”.

-- Yucca, http://www.cs.tut.fi/~jkorpela/
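The distinction between honoring the declaration and recovering from an error can be sketched as follows. This is an illustration only: the function name is hypothetical, and treating the whole range 0x80 to 0x9F as "not allowed" is an assumption that generalises the single octet 80 given as the example above.

```python
from typing import Tuple


def decode_declared_latin1(data: bytes) -> Tuple[str, bool]:
    """Decode data declared as ISO-8859-1.

    Returns (text, recovered). `recovered` is True only when the data
    contained octets in 0x80-0x9F (assumed invalid here) and was decoded
    as windows-1252 instead, so that a caller can signal this to the user.
    """
    has_c1_octets = any(0x80 <= octet <= 0x9F for octet in data)
    if not has_c1_octets:
        # Well-formed input: the declaration is simply honored.
        return data.decode("iso-8859-1"), False
    # Error recovery, not a silent override of a valid declaration.
    return data.decode("windows-1252", errors="replace"), True


print(decode_declared_latin1(b"caf\xe9"))      # ('café', False)
print(decode_declared_latin1(b"\x93hi\x94"))   # ('“hi”', True)
```

Returning the flag alongside the text keeps the decision to notify the reader separate from the decoding itself.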
Re: charset parameter in Google Groups
On Thu, 1 Jul 2010, John Burger wrote: > If you have never encountered a web page in which the charset > parameter encoded in the page (or in the HTTP headers) did not > accurately reflect the "real charset", as indicated by the actual > data in the page How is it possible that you noticed that? It's because your software indeed respects the charset parameter and displays the document accordingly. I'm asking for nothing else.
Re: charset parameter in Google Groups
Asmus distinguishes between two kinds of cases: The first is guessing the charset incorrectly in a way that completely degrades the text, e.g. 8859-1 vs. 8859-2. Second is a more subtle kind of mistake, and arguably much less objectionable, e.g., 8859-1 vs. 1252, or the "smart quotes" problem.

I like this distinction, and would point out that we can probably quantify this into a continuum, in the sense that most of the code points in 8859-1 and 1252 are equivalent, while fewer are so in 8859-1 and 8859-2. (If we wished, we could refine this further by assigning different penalties for showing the wrong glyph for an alphabetic character than for punctuation.)

> If I were to design a charset-verifier, I would distinguish between these two cases. If something came tagged with a region-specific charset, I would honor that, unless I found strong evidence of the "this can't be right" nature. In some cases, to collect such evidence would require significant statistics. The rule here should be "do no harm", that is, destroying a document by incorrectly changing a true charset should receive a much higher penalty than failing to detect a broken charset. (That way, you don't penalize people who live by the rules :).

I have always thought that the "right way" to deal with determining the correct charset of a document is to treat it as a statistical classification problem. Given a collection of documents as training data, we could extract features including the following:

- "suggested" charset, document type, and other information from metadata, such as HTTP Content-Type, HTML tags, email headers, etc.
- various statistical signatures from the text itself, e.g. ngrams
- top-level domain of the originating web site
- anything else we can think of

We can then apply one of many possible multi-class algorithms developed by the machine learning community to this training set. Such an algorithm would learn how to weight the different features so as to tag the most documents correctly. (For some of these algorithms we would have to tag each document in the training set with the "real" charset, but there are also semi-supervised and unsupervised algorithms that would discover the most consistent assignment, if we were unable or unwilling to correctly tag everything in our dataset.) I have always assumed that Google, or someone, must already be doing this sort of thing (although perhaps not on Google Groups!).

Asmus' comments made me realize that the machine learning approach I outline above can be taken even further: there are many classification algorithms that can be trained with different penalties for different kinds of mistakes. These penalties could be determined by hand, or could come from quantifying the potential degradation as I describe above. This provides a natural and principled way to require far more evidence for overriding 8859-1 with 8859-2 than with 1252, for example.

- John D. Burger
MITRE
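One crude way to quantify the "potential degradation" between two charsets is to count how many byte values in the upper half decode to different characters. The sketch below is only an illustration of that idea, with hypothetical function names; it ignores how often each byte actually occurs in real text, which a serious penalty would have to take into account.

```python
def differing_bytes(charset_a: str, charset_b: str) -> int:
    """Count byte values 0x80-0xFF that decode differently in the two charsets."""
    count = 0
    for value in range(0x80, 0x100):
        raw = bytes([value])
        # Undefined bytes decode to U+FFFD rather than raising an error.
        if raw.decode(charset_a, errors="replace") != raw.decode(charset_b, errors="replace"):
            count += 1
    return count


def override_penalty(tagged: str, candidate: str) -> float:
    """Fraction of the upper half that would be shown wrongly by a bad override."""
    return differing_bytes(tagged, candidate) / 128


print(differing_bytes("iso-8859-1", "windows-1252"))  # 32
print(override_penalty("iso-8859-1", "windows-1252"))  # 0.25
print(override_penalty("iso-8859-1", "iso-8859-2"))
```

Under this measure, overriding 8859-1 with 8859-2 carries a larger penalty than overriding it with 1252, which matches the ordering argued for above.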
Re: charset parameter in Google Groups
On 7/1/2010 11:29 AM, John Burger wrote:
> Andreas Prilop wrote:
> > > The problem with slavishly following the charset parameter is that it is often incorrect.
> >
> > I wonder how you could draw such a conclusion. In order to make such a statement, there must be some other (god-given?) parameter, which is the "real charset".
>
> If you have never encountered a web page in which the charset parameter encoded in the page (or in the HTTP headers) did not accurately reflect the "real charset", as indicated by the actual data in the page, then your experience differs sharply from mine, and from everyone else I have ever met.

Let's unravel this.

First, there's qualitative vs. quantitative arguments. Yes, mis-tagging occurs (for all the reasons Shawn gave in his reply). But Andreas' point was that for languages needing more than ASCII, there's a nice corrective. If many (most) viewers now base their display on charset, then more documents would be expected to be correctly tagged for those types of text, because they tend to degrade dramatically otherwise and users (authors) would take action to correct the situation. The example of this is reading a text as 8859-1 when it is 8859-2 (Eastern European).

This is different from the issue of selecting the correct charset if it only affects some special symbols (copyright, punctuation marks, the euro sign). In these cases, the text degrades in much more subtle ways, and usually remains readable. I would expect that the incidence of mis-tagging in such a situation is larger. The example for this is reading a text as 8859-1 when it was 1252 (Windows code page with extra characters not in ISO 8859-1; Shawn mentioned this case as well).

If I were to design a charset-verifier, I would distinguish between these two cases. If something came tagged with a region-specific charset, I would honor that, unless I found strong evidence of the "this can't be right" nature. In some cases, to collect such evidence would require significant statistics. The rule here should be "do no harm", that is, destroying a document by incorrectly changing a true charset should receive a much higher penalty than failing to detect a broken charset. (That way, you don't penalize people who live by the rules :).

When it comes to a document tagged with 8859-1, I might relax this slightly, as that tag is one of the common default tags and is more likely to have been applied blindly.

When it comes to deciding whether something is Windows code page or a true ISO charset, the bar can be set lower - one is a superset of the other usually, and detecting any characters in the superset should trigger a reassignment. Unlike the other case, the "penalties" for getting this wrong are much less severe.

A./
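A "do no harm" verifier along these lines might look like the following sketch. Everything here is an assumption made for illustration: the function name, the idea that some external detector supplies a guess with a confidence between 0 and 1, and the threshold numbers themselves.

```python
C1_RANGE = range(0x80, 0xA0)

# Hypothetical policy tables; the numbers are placeholders, not measurements.
SUPERSET_OF = {"iso-8859-1": "windows-1252"}
DEFAULT_BAR = 0.95                    # strong evidence needed to override a tag
RELAXED_BAR = {"iso-8859-1": 0.80}    # common default tag, slightly lower bar


def choose_charset(tagged: str, data: bytes, guess: str = "", confidence: float = 0.0) -> str:
    """Return the charset to use, honoring the tag unless evidence is strong."""
    tagged = tagged.lower()

    # Cheap case: any octet outside the ISO charset but inside its superset
    # triggers reassignment, because getting this wrong costs little.
    superset = SUPERSET_OF.get(tagged)
    if superset and any(octet in C1_RANGE for octet in data):
        return superset

    # Expensive case: only override when the detector clears the bar.
    bar = RELAXED_BAR.get(tagged, DEFAULT_BAR)
    if guess and guess.lower() != tagged and confidence >= bar:
        return guess.lower()

    return tagged


print(choose_charset("ISO-8859-1", b"\x93quoted\x94"))                # windows-1252
print(choose_charset("ISO-8859-2", b"Dvo\xf8\xe1k", "iso-8859-1", 0.6))  # iso-8859-2
```

The asymmetry is the point of the design: a region-specific tag survives a moderately confident detector, while the superset switch needs only a single tell-tale octet.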
RE: charset parameter in Google Groups
> > The problem with slavishly following the charset parameter is that it is often incorrect.
>
> I wonder how you could draw such a conclusion. In order to make such a statement, there must be some other (god-given?) parameter, which is the "real charset".
>
> Each and every program (webbrowser, newsreader, e-mailer ...)

Actually, historically, that's not quite right. NOW they do (if they're behaving), but in the past they often just used whatever the system code page is. Even worse, people would write in one local code page, stick it on an en-US server, and then "test" it on the same source machine (same locale), so then it "worked", but only for them. Once it gets read by a different machine it doesn't work.

Even worse, either the editing software, or the server, might mistag the code page because they were trying to fill in missing information. And there was a common abuse of the ISO code pages for what were really windows code page encoded data.

So, now, in theory, and in well-behaved environments, the taggings are much more accurate, however it can be difficult to distinguish correctly tagged data from mis-tagged data. Using UTF-8 helps a ton, because it's pretty obvious that it's UTF-8.

Anyway, I have no clue what Google's doing, however mis-tagging of data is a common problem in the industry, and a great reason to use Unicode. Some countries have an even bigger problem due to variations in implementations of their commonly used code pages, and extensions which may, or may not, always be supported.

It's also part of why you occasionally see things like badly marked up rich quotes on major news sites, even now.

-Shawn
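The remark that UTF-8 is "pretty obvious" can be illustrated with a strict decode: text in a single-byte charset that contains non-ASCII characters rarely forms valid UTF-8 sequences. A minimal sketch, with a hypothetical helper name:

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data decodes as strict UTF-8 and has at least one non-ASCII byte.

    Pure ASCII is reported as False because it gives no evidence either way.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return any(octet >= 0x80 for octet in data)


print(looks_like_utf8("Dvořák".encode("utf-8")))       # True
print(looks_like_utf8("Dvořák".encode("iso-8859-2")))  # False
print(looks_like_utf8(b"plain ASCII"))                 # False
```

No comparable structural test exists between two single-byte charsets, since every byte sequence is formally valid in most of them.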
Re: charset parameter in Google Groups
Andreas Prilop wrote: The problem with slavishly following the charset parameter is that it is often incorrect. I wonder how you could draw such a conclusion. In order to make such a statement, there must be some other (god-given?) parameter, which is the "real charset". If you have never encountered a web page in which the charset parameter encoded in the page (or in the HTTP headers) did not accurately reflect the "real charset", as indicated by the actual data in the page, then your experience differs sharply from mine, and from everyone else I have ever met. - John D. Burger MITRE
Re: charset parameter in Google Groups
On Mon, 28 Jun 2010, Mark Davis wrote:
> I'll overlook the lack of civility, since I can understand that kind of frustration when something doesn't work.

Well, I have been aware of this problem/bug for many years now:
http://groups.google.co.uk/group/sci.lang/msg/eb55255e1925350f
Over the years I tried again and again and again to write to Google, for example with such forms as
http://www.google.com/support/contact/bin/request.py?page=&contact_type=suggestion_t&master=suggestion_t&Action.Search=Continue
But nothing happened. How to file a bug report with Google?

> This is the first I've heard of this as a problem with Google Groups. I filed a bug against Groups for this issue; I'll see what they find out.

Thank you!

> BTW, does the same thing happen if you send your email in UTF-8?

No. On the contrary, groups.google tries UTF-8 always:
http://groups.google.co.uk/group/pl.test/msg/359af83289a00e8e
This message contains Latin-2 characters with charset=ISO-8859-2.

> The problem with slavishly following the charset parameter is that it is often incorrect.

I wonder how you could draw such a conclusion. In order to make such a statement, there must be some other (god-given?) parameter, which is the "real charset".

Each and every program (webbrowser, newsreader, e-mailer ...) reads the charset parameter and displays the document (webpage, e-mail message, news message) accordingly. Do you really think that the author of a webpage or message will not notice? Do you really think that all these programs (including the author's tools) render the document incorrectly and only Google knows better? How do you come to this conclusion?

I admit that the situation is different with the LANG attribute in HTML. That's because writing, e.g. has almost no practical implications and the text might well be French. But when we have charset=ISO-8859-7 , all of the author's programs (I believe) will display the document in Greek. How can you think that only Google knows better to "correct" this into charset=ISO-8859-1 ?

You can still make your guesses when a charset parameter is completely missing.
Re: charset parameter in Google Groups
António MARTINS-Tuválkin wrote: If the EU can tell Britain that it can't sell eggs by the dozen any more, Yesterday I bought a dozen eggs (2 racks of 6, set 2×3) here in Portugal. This must be an incredibly new regulation. The Daily Mail isn't as easily available in Portugal. It's one of several EU regulations that exist in the world the Daily Mail writes about. Sort of like the way to J K Rowling and her fans there's a rule against people under seventeen doing magic outside of Hogwarts, but in the real world saying that is just nonsense.
(OT) Myths, was Re: charset parameter in Google Groups
On 2010-06-29, António MARTINS-Tuválkin wrote:
> On 2010.06.28, 20:48, Mark Crispin wrote:
...
> > If the EU can tell Britain that it can't sell eggs by the dozen any more,
>
> Yesterday I bought a dozen eggs (2 racks of 6, set 2×3) here in Portugal. This must be an incredibly new regulation.

Just for the record, the regulations in question are not yet agreed, and they don't say you can't sell by the dozen. What they do say is that the weight of the contents must be shown. This is arguably silly, and arguably sensible - it's not obviously either.

> I guess it is like the UK: you can get arrested for flying the Union Jack afloat (even in a row boat on the Serpentine), lest you be mistaken for a RN Admiral or the Queen, but you can rob and steal and lie and still get (and keep) knighthood.

Neither of these is strictly true, either - well, except the lie bit. These days, knights convicted of serious offences are usually un-knighted. Unfortunately, neither lying (out of court) nor catastrophic financial mismanagement is actually an offence.

As for flags, the laws for shipping specify those flags that can be flown to indicate that you are a British ship, and forbid the flying of other national colours. The Union Flag, like the White Ensign, is a flag specified for government vessels. So flying your Union Flag on a row-boat on the Serpentine is technically illegal, and also tactless, since you're in a Royal Park, but I'd be surprised if anybody cares. Most, if not all countries, will have similar laws for shipping, as the flag of nationality is an important concept in international maritime law.
Re: charset parameter in Google Groups
On 2010.06.28, 20:48, Mark Crispin wrote:
> I have often been the victim of having my valid data transformed into garbage by the attitude of standards-violators who think that I can't possibly be following the standards, therefore my data must be "fixed". When I protest that my data was "fixed" into garbage, I get told that my complaint doesn't matter.

+1, Mark. I actually lost a job for being “cheeky”, that is, for insisting on doing things properly when creating a website version in Arabic. The company’s “wizard” was stuck in 8859 garbage and didn’t even want to hear about Unicode (because it allows for viruses and phishing, he heard); his hacks to show "õ" and "ظ" on the same page with SPAN tags and piped fonts were painful to withstand. That company eventually lost that particular client (jjj.ppvnc.cg) to a better outfit, after I was already away in greener pastures.

> If the EU can tell Britain that it can't sell eggs by the dozen any more,

Yesterday I bought a dozen eggs (2 racks of 6, set 2×3) here in Portugal. This must be an incredibly new regulation.

> it can shut down a messaging service in Europe that does not comply with published standards.

I guess it is like the UK: you can get arrested for flying the Union Jack afloat (even in a row boat on the Serpentine), lest you be mistaken for a RN Admiral or the Queen, but you can rob and steal and lie and still get (and keep) knighthood.

-- António MARTINS-Tuválkin.
http://tuvalkin.web.pt/
Re: charset parameter in Google Groups (was Re: Indian Rupee Sign to be chosen today)
On 6/28/2010 11:38 AM, Mark Davis ☕ wrote:
> The problem with slavishly following the charset parameter is that it is often incorrect. However, the charset parameter is a signal into the character detection module, so the charset is correctly supplied from the message then the results of the detection will be weighted that direction.

The weighting factor / mechanism may be something that you might look at for possible improvement.

Doug raised an interesting argument, i.e. that some values of a charset parameter might have a higher probability of being correct than other values. If something is tagged Latin-1 or Windows-1252, the chances are that this is merely an unexamined default setting. Most of the other 8859 values should be much less likely to be such "blind" defaults. I wonder whether the probability of successful charset assignment increases if you were to give these more "specific" charset values a higher weight.

When I played with simple recognition algorithms about 15 years ago, I found that some simple methods for crude language detection gave signatures that would allow charset detection. Even though these methods weren't sophisticated enough to resolve actual languages (esp. among closely related languages), they were good enough to narrow things down to the point where one could pick or confirm charsets.

For example, significant stretches of German can be written without diacritics, and can fool charset detection unless it picks up on the statistical patterns for German. With that in hand, the first non-ASCII character encountered is then likely to "nail" the charset. Or, absent such a character, the statistics can be used to confirm that an existing charset assignment is plausible. (8859-15 having been deliberately designed to be "undetectable" is the exception, unless there's a Euro sign in the scanned part of the document...)

A./
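Why 8859-15 is so hard to tell apart from 8859-1 can be seen by listing the byte values on which the two charsets disagree. The sketch below uses Python's built-in codecs; the helper names are made up for illustration.

```python
def differing_positions(charset_a: str, charset_b: str) -> list:
    """Byte values that decode to different characters in the two charsets."""
    return [
        value
        for value in range(256)
        if bytes([value]).decode(charset_a) != bytes([value]).decode(charset_b)
    ]


POSITIONS = differing_positions("iso-8859-1", "iso-8859-15")


def can_tell_apart(data: bytes) -> bool:
    """True only if data uses one of the few positions where the charsets differ."""
    return any(octet in POSITIONS for octet in data)


print([hex(p) for p in POSITIONS])
# ['0xa4', '0xa6', '0xa8', '0xb4', '0xb8', '0xbc', '0xbd', '0xbe']
print(bytes([0xA4]).decode("iso-8859-15"))              # €
print(can_tell_apart("Grüße".encode("iso-8859-15")))    # False
print(can_tell_apart("5 €".encode("iso-8859-15")))      # True
```

Ordinary German text such as "Grüße" is byte-for-byte identical in both charsets, so without a Euro sign or one of the other seven positions a detector has nothing to go on and the declared charset is the only evidence available.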
RE: charset parameter in Google Groups (was Re: Indian Rupee Sign to be chosen today)
Mark Crispin wrote: > On Mon, 28 Jun 2010, Mark Davis ☕ wrote: >> The problem with slavishly following the charset parameter is that it >> is often incorrect. However, the charset parameter is a signal into >> the character detection module, so the charset is correctly supplied >> from the message then the results of the detection will be weighted >> that direction. > > I interpret these two sentences as: > > "The problem with following the standards is that some people don't > follow the standards. So instead of following the standards > ourselves, we will guess if the other guy follows the standards or > not, no matter how much he claims to follow standards. Too bad if our > fix transforms his valid data into garbage." At the very least, it would be nice if the charset parameter constituted a much stronger signal into the detection module than it apparently did in Andreas' case, so that if he says the text is 8859-15, and we already know that 8859-15 is nearly impossible to distinguish heuristically from 8859-1, the module might as well take his word for it. I do tend to agree with Mark that the complaint against Google Groups (with which I am not affiliated) might have been posted with more civility and less invective. -- Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
Re: charset parameter in Google Groups (was Re: Indian Rupee Sign to be chosen today)
On Mon, 28 Jun 2010, Mark Davis ☕ wrote: The problem with slavishly following the charset parameter is that it is often incorrect. However, the charset parameter is a signal into the character detection module, so the charset is correctly supplied from the message then the results of the detection will be weighted that direction. I interpret these two sentences as: "The problem with following the standards is that some people don't follow the standards. So instead of following the standards ourselves, we will guess if the other guy follows the standards or not, no matter how much he claims to follow standards. Too bad if our fix transforms his valid data into garbage." I have heard that song many times over 30 years. I have often been the victim of having my valid data transformed into garbage by the attitude of standards-violators who think that I can't possibly be following the standards, therefore my data must be "fixed". When I protest that my data was "fixed" into garbage, I get told that my complaint doesn't matter. I take a hard line these days. I generally don't like the concept of "fixing" things; I believe in GIGO (Garbage In, Garbage Out). More importantly, I utterly reject VIGO (Valid In, Garbage Out) caused by ill-considered efforts to create GIVO. What I don't understand is why the EU doesn't use its regulatory power to force compliance. If the EU can tell Britain that it can't sell eggs by the dozen any more, it can shut down a messaging service in Europe that does not comply with published standards. -- Mark -- http://panda.com/mrc Democracy is two wolves and a sheep deciding what to eat for lunch. Liberty is a well-armed sheep contesting the vote.