Re: [Python-Dev] Encoding detection in the standard library?
Bill Janssen: Since the site that receives the POST doesn't necessarily have access to the Web page that originally contained the form, that's not really helpful. However, POSTs can use the MIME type multipart/form-data for non-Latin-1 content, and should. That contains facilities for indicating the encoding and other things as well. Yup, but DrProject (the target application) also serves as a relay and archive for email. We have no control over the agent used for composition, and AFAIK there's no standard way to include encoding information. Thanks, Greg ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Encoding detection in the standard library?
On 2008-04-23 07:26, Terry Reedy wrote: Martin v. Löwis [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] | I certainly agree that if the target set of documents is small enough it | | Ok. What advantage would you (or somebody working on a similar project) | gain if chardet was part of the standard library? What if it was not | chardet, but some other algorithm? It seems to me that since there is not a 'correct' algorithm but only competing heuristics, encoding detection modules should be made available via PyPI and only be considered for stdlib after a best of breed emerges with community support. +1 Though in practice, determining the best of breed often becomes a problem (see e.g. the JSON implementation discussion). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Apr 23 2008) Python/Zope Consulting and Support ...http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/ Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611
Re: [Python-Dev] Encoding detection in the standard library?
[EMAIL PROTECTED] writes: When a web browser POSTs data, there is no standard way of communicating which encoding it's using. That's just not true. Web browsers should and do use the encoding of the web page that originally contained the form. I wonder if the discussion is confusing two different things. Take a look at http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13.4. There are two prescribed ways of sending form data: application/x-www-form-urlencoded, which can only be used with ASCII data, and multipart/form-data. "The content type multipart/form-data should be used for submitting forms that contain files, non-ASCII data, and binary data." It's true that the page containing the form may specify which of these two formats to use, but the character encodings are determined by the choice. For web forms, I always encode the pages in UTF-8, and that always works. Should work, if you use the multipart/form-data format. Bill
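The charset carried by such a MIME part can be read with Python's standard email machinery; a minimal sketch (the header value here is invented for illustration, not captured traffic):

```python
from email.message import Message

# A hypothetical Content-Type header, as a multipart/form-data part
# might carry it; the value is made up for this example.
part = Message()
part["Content-Type"] = 'text/plain; charset="utf-8"'

charset = part.get_content_charset()  # parses out the charset parameter
print(charset)
```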
Re: [Python-Dev] Encoding detection in the standard library?
For web forms, I always encode the pages in UTF-8, and that always works. Should work, if you use the multipart/form-data format. Right - I was implicitly assuming that. Regards, Martin
Re: [Python-Dev] Encoding detection in the standard library?
On 2008-04-21 23:31, Martin v. Löwis wrote: This is useful when you get a hunk of data which _should_ be some sort of intelligible text from the Big Scary Internet (say, a posted web form or email message), and you want to do something useful with it (say, search the content). I don't think that should be part of the standard library. People will mistake what it tells them for certain. +1 I also think that it's better to educate people to add (correct) encoding information to their text data, rather than give them a guess mechanism... http://chardet.feedparser.org/docs/faq.html#faq.yippie chardet is based on the Mozilla algorithm, and at least in my experience that algorithm doesn't work too well. The Mozilla algorithm may work for Asian encodings, since those encodings are usually also bound to a specific language (so you can use character and word frequency analysis), but for encodings which can encode far more than just a single language (e.g. UTF-8 or Latin-1), the correct detection rate is rather low. The problem becomes even more difficult when leaving the normal text domain or when mixing languages in the same text, e.g. when trying to detect source code with comments using a non-ASCII encoding. The trick of just passing the text through a codec and seeing whether it round-trips also doesn't necessarily help: Latin-1, for example, will always round-trip, since Latin-1 is a subset of Unicode. IMHO, more research has to be done in this area before a standard module can be added to Python's stdlib... and who knows, perhaps we'll be lucky and by then everyone will be using UTF-8 anyway :-)
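Lemburg's round-trip caveat is easy to demonstrate: Latin-1 assigns a character to every byte value, so decoding with it can never fail, while UTF-8 can actually reject input. A quick sketch:

```python
data = "Grüße".encode("utf-8")  # some arbitrary non-ASCII bytes

# Latin-1 maps all 256 byte values to code points, so this always
# succeeds -- it just silently produces mojibake for non-Latin-1 input:
as_latin1 = data.decode("latin-1")

# UTF-8 has invalid byte sequences, so a successful decode carries real
# information, while a failure rules the encoding out:
try:
    b"\xc3\x28".decode("utf-8")  # 0xC3 must be followed by a continuation byte
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_latin1, utf8_ok)
```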
Re: [Python-Dev] Encoding detection in the standard library?
IMHO, more research has to be done in this area before a standard module can be added to Python's stdlib... and who knows, perhaps we'll be lucky and by then everyone will be using UTF-8 anyway :-) I walked over to our computational linguistics group and asked. This is often combined with language guessing (which uses a similar approach, but using characters instead of bytes), and apparently can usually be done with high confidence. Of course, they're usually looking at clean texts, not random stuff. I'll see if I can get some references and report back -- most of the research on this was done in the '90s. Bill
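A toy version of the character n-gram approach Bill mentions might look like this. The "profiles" below are trained on two made-up sample sentences purely for illustration; real systems train on large corpora, and encoding detection applies the same idea to byte n-grams instead of characters:

```python
from collections import Counter

def trigrams(s):
    # Count overlapping 3-character substrings.
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def similarity(text, profile):
    # Overlap between the text's trigrams and the profile's trigrams.
    grams = trigrams(text)
    return sum(min(grams[g], n) for g, n in profile.items())

# Tiny made-up "profiles"; real detectors train these on large corpora.
profiles = {
    "english": trigrams("the quick brown fox jumps over the lazy dog the end"),
    "german":  trigrams("der schnelle braune fuchs springt über den faulen hund"),
}

def guess_language(text):
    return max(profiles, key=lambda lang: similarity(text, profiles[lang]))

print(guess_language("the cat and the hat"))
```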
Re: [Python-Dev] Encoding detection in the standard library?
On 22-Apr-08, at 12:30 AM, Martin v. Löwis wrote: IMO, encoding estimation is something that many web programs will have to deal with Can you please explain why that is? Web programs should not normally have the need to detect the encoding; instead, it should always be specified - unless you are talking about browsers specifically, which need to support web pages that specify the encoding incorrectly. Two cases come immediately to mind: email and web forms. When a web browser POSTs data, there is no standard way of communicating which encoding it's using. There are some hints which make it easier (accept-charset attributes, the encoding used to send the page to the browser), but no guarantees. Email is a smaller problem, because it usually has a helpful content-type header, but that's no guarantee. Now, at the moment, the only data I have to support this claim is my experience with DrProject in non-English locations. If I'm the only one who has had these sorts of problems, I'll go back to Unicode for Dummies. so it might as well be built in; I would prefer the option to run `text = input.decode('guess')` (or something similar) than relying on an external dependency or, worse yet, using a hand-rolled algorithm. Ok, let me try differently then. Please feel free to post a patch to bugs.python.org, and let other people rip it apart. For example, I don't think it should be a codec, as I can't imagine it working on streams. As things frequently are, it seems like this is a much larger problem than I originally believed. I'll go back and take another look at the problem, then come back if new revelations appear.
Re: [Python-Dev] Encoding detection in the standard library?
When a web browser POSTs data, there is no standard way of communicating which encoding it's using. That's just not true. Web browsers should and do use the encoding of the web page that originally contained the form. There are some hints which make it easier (accept-charset attributes, the encoding used to send the page to the browser), but no guarantees. Not true. The latter is guaranteed (unless you assume bugs - but if you do, can you present a specific browser that has that bug?) Email is a smaller problem, because it usually has a helpful content-type header, but that's no guarantee. Then assume windows-1252. Mailers that don't use MIME for non-ASCII characters mostly died 10 years ago; those people who continue to use them can likely accept occasional mojibake (or else they would have switched long ago). Now, at the moment, the only data I have to support this claim is my experience with DrProject in non-English locations. If I'm the only one who has had these sorts of problems, I'll go back to Unicode for Dummies. For web forms, I always encode the pages in UTF-8, and that always works. For email, I once added encoding processing to pipermail (the Mailman archiver), and that also always works. I'll go back and take another look at the problem, then come back if new revelations appear. Good luck! Martin
Re: [Python-Dev] Encoding detection in the standard library?
On 22-Apr-08, at 3:31 AM, M.-A. Lemburg wrote: I don't think that should be part of the standard library. People will mistake what it tells them for certain. +1 I also think that it's better to educate people to add (correct) encoding information to their text data, rather than give them a guess mechanism... That is a fallacious alternative: the programmers who need encoding detection are not the same people who are omitting encoding information. I only have a small opinion on whether charset detection should appear in the stdlib, but I am somewhat perplexed by the arguments in this thread. I don't see how inclusion in the stdlib would make people more inclined to think that the algorithm is always correct. In terms of the need for this functionality: Martin wrote: Can you please explain why that is? Web programs should not normally have the need to detect the encoding; instead, it should always be specified - unless you are talking about browsers specifically, which need to support web pages that specify the encoding incorrectly. Any program that needs to examine the contents of documents/feeds/whatever on the web needs to deal with incorrectly-specified encodings (which, sadly, is rather common). The set of programs that need this functionality is probably the same set that needs BeautifulSoup--I think that set is larger than just browsers *grin* -Mike
Re: [Python-Dev] Encoding detection in the standard library?
[CCing python-dev again] On 2008-04-22 12:38, Greg Wilson wrote: I don't think that should be part of the standard library. People will mistake what it tells them for certain. [etc] These are all good arguments, but the fact remains that we can't control our inputs (e.g., we're archiving mail messages sent to lists managed by DrProject), and some of those inputs *don't* tell us how they're encoded. Under those circumstances, what would you recommend? I haven't done much research into this, but in general, I think it's better to: * first look at other characteristics of a text message, e.g. language, origin, topic, etc., * then narrow down the number of encodings which could apply, * rank them to try to avoid ambiguities, and * then see what percentage of the text you can decode using each of the encodings in reverse ranking order (i.e. more specialized encodings should be tested first, latin-1 last).
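Lemburg's last two steps can be sketched as a simple ranked-decode loop; the candidate list below is just an example ordering (one strict multi-byte encoding, one language-specific encoding, then the catch-all):

```python
def decode_ranked(data, candidates=("utf-8", "shift_jis", "latin-1")):
    """Try candidate encodings from most specific to most permissive.

    latin-1 goes last because it never fails, so it only serves as a
    catch-all once the stricter encodings have been ruled out.
    """
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Unreachable as long as latin-1 (which never fails) is in the list.
    raise ValueError("no candidate encoding matched")

text, enc = decode_ranked("Grüße".encode("utf-8"))
print(text, enc)
```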
Re: [Python-Dev] Encoding detection in the standard library?
Can you please explain why that is? Web programs should not normally have the need to detect the encoding; instead, it should always be specified - unless you are talking about browsers specifically, which need to support web pages that specify the encoding incorrectly. Any program that needs to examine the contents of documents/feeds/whatever on the web needs to deal with incorrectly-specified encodings That's not true. Most programs that need to examine the contents of a web page don't need to guess the encoding. In most such programs, the encoding can be hard-coded if the declared encoding is not correct. Most such programs *know* what page they are webscraping, or else they couldn't extract the information out of it that they want to get at. As for feeds - can you give examples of incorrectly encoded ones? (I don't ever use feeds, so I honestly don't know whether they are typically encoded incorrectly. I've heard they are often XML, in which case I strongly doubt they are incorrectly encoded.) As for whatever - can you give specific examples? (which, sadly, is rather common). The set of programs that need this functionality is probably the same set that needs BeautifulSoup--I think that set is larger than just browsers *grin* Again, can you give *specific* examples that are not web browsers? Programs needing BeautifulSoup may still not need encoding guessing, since they still might be able to hard-code the encoding of the web page they want to process. In any case, I'm very skeptical that a general guess-encoding module would do a meaningful thing when applied to incorrectly encoded HTML pages. Regards, Martin
Re: [Python-Dev] Encoding detection in the standard library?
On 2008-04-22 18:33, Bill Janssen wrote: The 2002 paper "A language and character set determination method based on N-gram statistics" by Izumi Suzuki, Yoshiki Mikami, Ario Ohsato and Yoshihide Chubachi seems to me a pretty good way to go about this. Thanks for the reference. Looks like the existing research on this just hasn't made it into the mainstream yet. Here's their current project: http://www.language-observatory.org/ Looks like they are focusing more on language detection. Another interesting paper using n-grams: "Language Identification in Web Pages" by Bruno Martins and Mário J. Silva http://xldb.fc.ul.pt/data/Publications_attach/ngram-article.pdf And one using compression: "Text Categorization Using Compression Models" by Eibe Frank, Chang Chui, Ian H. Witten http://portal.acm.org/citation.cfm?id=789742 They're looking at LSEs, language-script-encoding triples; a script is a way of using a particular character set to write in a particular language. Their system has these requirements: R1. the response must be either a correct answer or unable to detect, where unable to detect includes other than registered [the registered set of LSEs]; R2. applicable to multi-LSE texts; R3. never accept a wrong answer, even when the program does not have enough data on an LSE; and R4. applicable to any LSE text. So, no wrong answers. The biggest disadvantage would seem to be that the registration data for a particular LSE is kind of bulky; on the order of 10,000 shift-codons, each of three bytes, about 30K uncompressed. http://portal.acm.org/ft_gateway.cfm?id=772759&type=pdf For a server-based application that doesn't sound too large. Unless you're using a very broad scope, I don't think that you'd need more than a few hundred LSEs for a typical application - nothing you'd want to put in the Python stdlib, though.
Re: [Python-Dev] Encoding detection in the standard library?
When a web browser POSTs data, there is no standard way of communicating which encoding it's using. That's just not true. Web browsers should and do use the encoding of the web page that originally contained the form. Since the site that receives the POST doesn't necessarily have access to the Web page that originally contained the form, that's not really helpful. However, POSTs can use the MIME type multipart/form-data for non-Latin-1 content, and should. That contains facilities for indicating the encoding and other things as well. Bill
Re: [Python-Dev] Encoding detection in the standard library?
Unless you're using a very broad scope, I don't think that you'd need more than a few hundred LSEs for a typical application - nothing you'd want to put in the Python stdlib, though. I tend to agree with this (and I'm generally in favor of putting everything in the standard library!). For those of us doing document-processing applications (Martin, it's not just about Web browsers), this would be a very useful package to have up on PyPI. Bill
Re: [Python-Dev] Encoding detection in the standard library?
On 22-Apr-08, at 2:16 PM, Martin v. Löwis wrote: Any program that needs to examine the contents of documents/feeds/whatever on the web needs to deal with incorrectly-specified encodings That's not true. Most programs that need to examine the contents of a web page don't need to guess the encoding. In most such programs, the encoding can be hard-coded if the declared encoding is not correct. Most such programs *know* what page they are webscraping, or else they couldn't extract the information out of it that they want to get at. I certainly agree that if the target set of documents is small enough it is possible to hand-code the encoding. There are many applications, however, that need to examine the content of an arbitrary, or at least non-small, set of web documents. To name a few such applications: - web search engines - translation software - document/bookmark management systems - other kinds of document analysis (market research, SEO, etc.) As for feeds - can you give examples of incorrectly encoded ones? (I don't ever use feeds, so I honestly don't know whether they are typically encoded incorrectly. I've heard they are often XML, in which case I strongly doubt they are incorrectly encoded.) I also don't have much experience with feeds. My statement is based on the fact that chardet, the tool that has been cited most in this thread, was written specifically for use with the author's feed-parsing package. As for whatever - can you give specific examples? Not that I can substantiate. Documents and feeds cover a lot of what is on the web--I was only trying to make the point that on the web, whenever an encoding can be specified, it will be specified incorrectly for a significant chunk of exemplars. (which, sadly, is rather common). The set of programs that need this functionality is probably the same set that needs BeautifulSoup--I think that set is larger than just browsers *grin* Again, can you give *specific* examples that are not web browsers?
Programs needing BeautifulSoup may still not need encoding guessing, since they still might be able to hard-code the encoding of the web page they want to process. Indeed, if it is only one site it is pretty easy to work around. My main use of Python is processing and analyzing hundreds of millions of web documents, so it is pretty easy to see applications (which I have listed above). I think that libraries like Mark Pilgrim's FeedParser and BeautifulSoup are possible consumers of guessing as well. In any case, I'm very skeptical that a general guess-encoding module would do a meaningful thing when applied to incorrectly encoded HTML pages. Well, it does. I wish I could easily provide data on how often it is necessary over the whole web, but that would be somewhat difficult to generate. I can say that it is much more important to be able to parse all the different kinds of encoding _specification_ on the web (Content-Type/Content-Encoding headers, meta http-equiv tags, etc.), and the malformed cases of these. I can also think of good arguments for excluding encoding detection for maintenance reasons: is every case of the algorithm guessing wrong a bug that needs to be fixed in the stdlib? That is an unbounded commitment. -Mike
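As an illustration of how messy that specification parsing gets, here is a deliberately simple sketch of pulling a declared charset out of an HTML meta tag; real libraries (feedparser, BeautifulSoup's UnicodeDammit) handle far more malformed variants than this single regex does:

```python
import re

# A rough sketch: find a charset declaration anywhere inside a <meta> tag.
META_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)

def declared_charset(html_bytes):
    m = META_RE.search(html_bytes)
    return m.group(1).decode("ascii").lower() if m else None

page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=ISO-8859-1"></head>')
print(declared_charset(page))
```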
Re: [Python-Dev] Encoding detection in the standard library?
Yup, but DrProject (the target application) also serves as a relay and archive for email. We have no control over the agent used for composition, and AFAIK there's no standard way to include encoding information. Greg, Internet-compliant email actually has well-specified mechanisms for including encoding information; see RFCs 2047 and 2231. There's no need to guess; you can just look. Bill
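For the compliant case, the stdlib can already do the looking; a minimal sketch using decode_header from the email package (the Subject value is an invented example of an RFC 2047 encoded-word):

```python
from email.header import decode_header, make_header

# An RFC 2047 encoded-word as it might appear in a Subject: header.
raw = "=?iso-8859-1?q?Gr=FC=DFe?="

parts = decode_header(raw)         # list of (bytes, charset) pairs
subject = str(make_header(parts))  # decoded Unicode text
print(subject)
```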
Re: [Python-Dev] Encoding detection in the standard library?
Bill Janssen writes: Internet-compliant email actually has well-specified mechanisms for including encoding information; see RFCs 2047 and 2231. There's no need to guess; you can just look. You must be very special to get only compliant email. About half my colleagues use RFC 2047 to encode Japanese file names in MIME attachments (a MUST NOT behavior according to RFC 2047), and a significant fraction of the rest end up with binary Shift JIS or EUC or MacRoman in there. And those are just the most widespread violations I can think of off the top of my head. Not to mention that I find this: =?X-UNKNOWN?Q?Martin_v=2E_L=F6wis?= [EMAIL PROTECTED], in the header I got from you. (I'm not ragging on you, I get Martin's name wrong a significant portion of the time myself. :-( )
Re: [Python-Dev] Encoding detection in the standard library?
Martin v. Löwis writes: In any case, I'm very skeptical that a general guess-encoding module would do a meaningful thing when applied to incorrectly encoded HTML pages. That depends on whether you can get meaningful information about the language from the fact that you're looking at the page. In the browser context, for one, 99.44% of users are monolingual, so you only have to distinguish among the encodings for their language. In this context, a two-stage process of determining a category of encoding (e.g., ISO 8859, ISO 2022 7-bit, ISO 2022 8-bit multibyte, UTF-8, etc.), and then picking an encoding from the category according to a user-specified configuration, has served Emacs/MULE users very well for about 20 years. It does *not* work in a context where multiple encodings from the same category are in use (e.g., the email folder of a Polish Gastarbeiter in Berlin). Nonetheless it is pretty useful for user agents like mail clients, web browsers, and editors.
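The first stage of that two-stage scheme can be sketched in a few lines; the categories and checks below are simplifications of what Emacs/MULE actually does, chosen just to show the shape of the idea:

```python
def encoding_category(data: bytes) -> str:
    """Classify a byte stream into a broad encoding family.

    A second stage (user preference) would then pick a concrete
    encoding within the family.
    """
    if b"\x1b" in data:              # escape sequences => ISO 2022 family
        return "iso-2022"
    if all(b < 0x80 for b in data):  # pure 7-bit
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "8-bit"               # e.g. ISO 8859-x, KOI8, windows-125x

print(encoding_category("Grüße".encode("utf-8")))
print(encoding_category("Grüße".encode("latin-1")))
```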
Re: [Python-Dev] Encoding detection in the standard library?
Guido van Rossum writes: To the contrary, an encoding-guessing module is often needed, and guessing can be done with a pretty high success rate. Other Unicode libraries (e.g. ICU) contain guessing modules. I suppose the API could return two values: the guessed encoding and a confidence indicator. Note that the locale settings might figure in the guess. Not locale settings, but user configuration. A Bayesian detector (CodeBayes? hi, Skip!) might be a good way to go for servers, while a simple language preference might really up the probability for user agents.
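An API along the lines Guido sketches might look like this; the heuristics and confidence numbers below are invented purely to show the shape of the interface (real detectors such as chardet or ICU derive confidence from statistical models):

```python
def guess_encoding(data: bytes):
    """Return (encoding, confidence) -- a toy illustration of the API."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8", 1.0         # a BOM is as certain as it gets
    try:
        data.decode("ascii")
        return "ascii", 0.9
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8", 0.8         # valid non-ASCII UTF-8 is a strong signal
    except UnicodeDecodeError:
        return "windows-1252", 0.3  # weak fallback guess

encoding, confidence = guess_encoding("Grüße".encode("utf-8"))
print(encoding, confidence)
```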
Re: [Python-Dev] Encoding detection in the standard library?
Yup, but DrProject (the target application) also serves as a relay and archive for email. We have no control over the agent used for composition, and AFAIK there's no standard way to include encoding information. That's not at all the case. MIME has defined that in full detail since 1993. Regards, Martin
Re: [Python-Dev] Encoding detection in the standard library?
I certainly agree that if the target set of documents is small enough it is possible to hand-code the encoding. There are many applications, however, that need to examine the content of an arbitrary, or at least non-small, set of web documents. To name a few such applications: - web search engines - translation software I question whether these really are many programs. Web search engines and translation software have many more challenges to master, and they are fairly special-cased, so I would expect they need to find their own answer to character set detection anyway (see Bill Janssen's answer on machine translation, also). - document/bookmark management systems - other kinds of document analysis (market research, SEO, etc.) Not sure what specifically you have in mind; however, I expect that these also have their own challenges. For example, I would expect that MS Word documents are frequent. You don't need character set detection there (Word is all Unicode), but you need an API to look into the structure of .doc files. Not that I can substantiate. Documents and feeds cover a lot of what is on the web--I was only trying to make the point that on the web, whenever an encoding can be specified, it will be specified incorrectly for a significant chunk of exemplars. I firmly believe this assumption is false. If the encoding comes out of software (which it often does), it will be correct most of the time. It's incorrect only if the content editor has to type it. Indeed, if it is only one site it is pretty easy to work around. My main use of Python is processing and analyzing hundreds of millions of web documents, so it is pretty easy to see applications (which I have listed above). Ok. What advantage would you (or somebody working on a similar project) gain if chardet was part of the standard library? What if it was not chardet, but some other algorithm?
I can also think of good arguments for excluding encoding detection for maintenance reasons: is every case of the algorithm guessing wrong a bug that needs to be fixed in the stdlib? That is an unbounded commitment. Indeed, that's what I meant with my initial remark. People will expect that it works correctly - with the consequences of both unknowingly proceeding with an incorrect response, and complaining when they find out that it produced an incorrect answer. For chardet specifically, my usual standard-library remark applies: it can't become part of the standard library unless the original author contributes it, anyway. I would then hope that he or a group of people would volunteer to maintain it, with the threat of removing it from the stdlib again if these volunteers go away and too many problems show up. Regards, Martin
Re: [Python-Dev] Encoding detection in the standard library?
Martin v. Löwis [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] | I certainly agree that if the target set of documents is small enough it | | Ok. What advantage would you (or somebody working on a similar project) | gain if chardet was part of the standard library? What if it was not | chardet, but some other algorithm? It seems to me that since there is not a 'correct' algorithm but only competing heuristics, encoding detection modules should be made available via PyPI and only be considered for stdlib after a best of breed emerges with community support.
[Python-Dev] Encoding detection in the standard library?
Is there some sort of text encoding detection module in the standard library? And, if not, is there any reason not to add one?

After some googling, I've come across this: http://mail.python.org/pipermail/python-3000/2006-September/003537.html But I can't find any changes that resulted from that thread.
Re: [Python-Dev] Encoding detection in the standard library?
David> Is there some sort of text encoding detection module in the standard library? And, if not, is there any reason not to add one?

No, there's not. I suspect the fact that you can't correctly determine the encoding of a chunk of text 100% of the time militates against it.

Skip
Re: [Python-Dev] Encoding detection in the standard library?
David Wolever schrieb:
> Is there some sort of text encoding detection module in the standard library? And, if not, is there any reason not to add one?

You cannot detect the encoding unless it's explicitly defined through a header (e.g. the UTF BOM). It's technically impossible in the general case. The best you can do is make an educated guess.

Christian
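As a concrete illustration of the one case Christian allows for - an explicit header - here is a minimal sketch that checks a byte string for a leading Unicode byte order mark, using the BOM constants from the stdlib `codecs` module (the function name is just for illustration):

```python
import codecs

# Order matters: the UTF-32 BOMs begin with the same bytes as the
# UTF-16 ones, so they must be checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def encoding_from_bom(data):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Only this BOM case is certain; everything past it is the "educated guess" the rest of the thread discusses.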
Re: [Python-Dev] Encoding detection in the standard library?
On Mon, 21 Apr 2008 17:50:43 +0100, Michael Foord [EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED] wrote:
>> David> Is there some sort of text encoding detection module in the standard library? And, if not, is there any reason not to add one?
>>
>> No, there's not. I suspect the fact that you can't correctly determine the encoding of a chunk of text 100% of the time militates against it.
>
> The only approach I know of is a heuristic based approach, e.g. http://www.voidspace.org.uk/python/articles/guessing_encoding.shtml (which was 'borrowed' from docutils in the first place).

This isn't the only approach, although you're right that in general you have to rely on heuristics. See the charset detection features of ICU: http://www.icu-project.org/userguide/charsetDetection.html I think OSAF's PyICU exposes these APIs: http://pyicu.osafoundation.org/

Jean-Paul
Re: [Python-Dev] Encoding detection in the standard library?
[EMAIL PROTECTED] wrote:
> David> Is there some sort of text encoding detection module in the standard library? And, if not, is there any reason not to add one?
>
> No, there's not. I suspect the fact that you can't correctly determine the encoding of a chunk of text 100% of the time militates against it.
>
> Skip

The only approach I know of is a heuristic based approach, e.g. http://www.voidspace.org.uk/python/articles/guessing_encoding.shtml (which was 'borrowed' from docutils in the first place).

Michael Foord
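The docutils-style heuristic Michael links to is essentially a prioritized cascade of trial decodes. A minimal sketch of the idea (the candidate list here is illustrative, not docutils' exact one):

```python
def guess_decode(data, encodings=('ascii', 'utf-8', 'latin-1')):
    """Try each candidate encoding in priority order; return (text, encoding).

    latin-1 maps every possible byte, so with it as the last resort this
    always succeeds -- which is exactly why the result is a guess,
    not a detection.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('no candidate encoding matched')
```

The order encodes the heuristic: strict encodings first (they fail loudly on a mismatch), permissive ones last.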
Re: [Python-Dev] Encoding detection in the standard library?
On 21-Apr-08, at 12:44 PM, [EMAIL PROTECTED] wrote:
> David> Is there some sort of text encoding detection module in the standard library? And, if not, is there any reason not to add one?
>
> No, there's not. I suspect the fact that you can't correctly determine the encoding of a chunk of text 100% of the time militates against it.

Sorry, I wasn't very clear about what I was asking. I was thinking about making an educated guess -- just like chardet (http://chardet.feedparser.org/). This is useful when you get a hunk of data which _should_ be some sort of intelligible text from the Big Scary Internet (say, a posted web form or email message), and you want to do something useful with it (say, search the content).
Re: [Python-Dev] Encoding detection in the standard library?
To the contrary, an encoding-guessing module is often needed, and guessing can be done with a pretty high success rate. Other Unicode libraries (e.g. ICU) contain guessing modules. I suppose the API could return two values: the guessed encoding and a confidence indicator. Note that the locale settings might figure in the guess.

On Mon, Apr 21, 2008 at 10:28 AM, Georg Brandl [EMAIL PROTECTED] wrote:
> Christian Heimes schrieb:
>> David Wolever schrieb:
>>> Is there some sort of text encoding detection module in the standard library? And, if not, is there any reason not to add one?
>>
>> You cannot detect the encoding unless it's explicitly defined through a header (e.g. the UTF BOM). It's technically impossible. The best you can do is an educated guess.
>
> Exactly, and in light of that, I'm -1 on such a standard module. We've enough issues with modules implementing (apparently) fully specified standards. :)
>
> Georg

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
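A toy sketch of the (encoding, confidence) API Guido suggests. The candidate encodings and confidence numbers here are illustrative assumptions, not anything ICU or chardet actually returns:

```python
import codecs

def guess_encoding(data):
    """Return (encoding, confidence) for a byte string.

    Illustrative only: a BOM is certain, valid UTF-8 is a strong hint,
    and latin-1 is a weak last resort that can never fail.
    """
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig', 1.0
    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        # Every byte string is valid latin-1, so this says very little.
        return 'latin-1', 0.3
    # Pure ASCII is valid in many encodings, hence the lower confidence.
    return ('ascii', 0.7) if text.isascii() else ('utf-8', 0.9)
```

A caller can then set its own threshold, e.g. fall back to a locale-derived default whenever the confidence is below 0.5.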
Re: [Python-Dev] Encoding detection in the standard library?
Michael> The only approach I know of is a heuristic based approach, e.g. http://www.voidspace.org.uk/python/articles/guessing_encoding.shtml (which was 'borrowed' from docutils in the first place).

Yes, I implemented a heuristic approach for the Musi-Cal web server. I was able to rely on domain knowledge to guess correctly almost all the time. The heuristic was that almost all form submissions came from the US, and the rest that didn't came from Western Europe. Python could never embed such a narrow-focused heuristic into its core distribution.

Skip
Re: [Python-Dev] Encoding detection in the standard library?
Guido> Note that the locale settings might figure in the guess.

Alas, locale settings in a web server have little or nothing to do with the locale settings of the client submitting the form.

Skip
Re: [Python-Dev] Encoding detection in the standard library?
At 1:14 PM -0400 4/21/08, David Wolever wrote:
> On 21-Apr-08, at 12:44 PM, [EMAIL PROTECTED] wrote:
>> David> Is there some sort of text encoding detection module in the standard library? And, if not, is there any reason not to add one?
>>
>> No, there's not. I suspect the fact that you can't correctly determine the encoding of a chunk of text 100% of the time militates against it.
>
> Sorry, I wasn't very clear about what I was asking. I was thinking about making an educated guess -- just like chardet (http://chardet.feedparser.org/). This is useful when you get a hunk of data which _should_ be some sort of intelligible text from the Big Scary Internet (say, a posted web form or email message), and you want to do something useful with it (say, search the content).

Feedparser.org's chardet can't guess 'latin1', so it should be used as a last resort, just as the docs say.

-- TonyN.:' mailto:[EMAIL PROTECTED] ' http://www.georgeanelson.com/
Re: [Python-Dev] Encoding detection in the standard library?
On 21-Apr-08, at 5:31 PM, Martin v. Löwis wrote:
>> This is useful when you get a hunk of data which _should_ be some sort of intelligible text from the Big Scary Internet (say, a posted web form or email message), and you want to do something useful with it (say, search the content).
>
> I don't think that should be part of the standard library. People will mistake what it tells them for certain.

As Oleg mentioned, if the method is called something like 'guess_encoding', I think we could live with clear consciences. IMO, encoding estimation is something that many web programs will have to deal with, so it might as well be built in; I would prefer the option to run `text = input.decode('guess')` (or something similar) to relying on an external dependency or, worse yet, using a hand-rolled algorithm.
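The `decode('guess')` spelling can in fact be prototyped with the stdlib codec machinery. In this sketch the cascade (UTF-8, then latin-1) is a placeholder heuristic, and the codec name 'guess' is only David's suggestion, not a real registered codec:

```python
import codecs

def _guess_decode(data, errors='strict'):
    """Decode by trying UTF-8 first, then latin-1 (which never fails)."""
    b = bytes(data)
    for enc in ('utf-8', 'latin-1'):
        try:
            return b.decode(enc), len(b)
        except UnicodeDecodeError:
            continue

def _guess_encode(text, errors='strict'):
    raise LookupError("'guess' is a decode-only codec")

def _search(name):
    # Codec search function: called by the codecs machinery on lookup.
    if name == 'guess':
        return codecs.CodecInfo(_guess_encode, _guess_decode, name='guess')
    return None

codecs.register(_search)
```

After registration, `some_bytes.decode('guess')` works anywhere a codec name is accepted, which is exactly the convenience being argued for (and, per Martin's later objection, the stream side is left unimplemented here).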
[Python-Dev] Encoding detection in the standard library?
David Wolever wrote:
> IMO, encoding estimation is something that many web programs will have to deal with, so it might as well be built in; I would prefer the option to run `text = input.decode('guess')` (or something similar) to relying on an external dependency or, worse yet, using a hand-rolled algorithm.

The (still draft) HTML5 spec is trying to get error correction standardized, so it includes all sorts of "if this fails, do X" rules. Encoding detection will be standardized, so there will be an external standard that we can reference: http://dev.w3.org/html5/spec/Overview.html#determining

Note that this portion of the spec is probably not stable yet, as there was some new analysis on which wrong answers provided better results on real-world web pages. E.g.: http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2008-March/014127.html http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2008-March/014190.html

There was also a recent analysis of how many characters it takes to sniff successfully X% of the time on today's web, though I'm not finding it at the moment.

-jJ
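The HTML5 "determining the character encoding" algorithm starts with a BOM check and then prescans the first bytes of the document for a meta charset declaration. A heavily simplified sketch of just the meta-prescan step (the real algorithm handles comments, `http-equiv` parsing, and many more edge cases):

```python
import re

# Very loose approximation of the HTML5 prescan's meta matching.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)

def prescan_html_encoding(head):
    """Look for a charset declaration in the first 1024 bytes.

    Returns the declared encoding name (lowercased) or None.
    """
    m = _META_CHARSET.search(head[:1024])
    if m:
        return m.group(1).decode('ascii').lower()
    return None
```

The 1024-byte window mirrors the spec's bounded prescan; anything the prescan misses falls through to frequency-based sniffing in real browsers.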
Re: [Python-Dev] Encoding detection in the standard library?
> IMO, encoding estimation is something that many web programs will have to deal with

Can you please explain why that is? Web programs should not normally have the need to detect the encoding; instead, it should always be specified - unless you are talking about browsers specifically, which need to support web pages that specify the encoding incorrectly.

> so it might as well be built in; I would prefer the option to run `text = input.decode('guess')` (or something similar) to relying on an external dependency or, worse yet, using a hand-rolled algorithm.

Ok, let me try differently then. Please feel free to post a patch to bugs.python.org, and let other people rip it apart. For example, I don't think it should be a codec, as I can't imagine it working on streams.

Regards, Martin