Re: trying to strip out non ascii.. or rather convert non ascii

2013-11-01 Thread Mark Lawrence
On 01/11/2013 09:00, wxjmfa...@gmail.com wrote: I'll ask again, would you please read, digest and action this https://wiki.python.org/moin/GoogleGroupsPython -- Python is the second best programming language in the world. But the best has yet to be invented. Christian Tismer Mark Lawrence -

Re: trying to strip out non ascii.. or rather convert non ascii

2013-11-01 Thread wxjmfauth
On Friday, 1 November 2013 08:16:36 UTC+1, Steven D'Aprano wrote: > On Thu, 31 Oct 2013 03:33:15 -0700, wxjmfauth wrote: > > On Thursday, 31 October 2013 08:10:18 UTC+1, Steven D'Aprano wrote: > > >> I'm glad that you know so much better than Google, Bing, Yahoo, and > > >> othe

Re: trying to strip out non ascii.. or rather convert non ascii

2013-11-01 Thread Steven D'Aprano
On Thu, 31 Oct 2013 03:33:15 -0700, wxjmfauth wrote: > On Thursday, 31 October 2013 08:10:18 UTC+1, Steven D'Aprano wrote: >> I'm glad that you know so much better than Google, Bing, Yahoo, and other >> search engines. When I search for "mispealled" Google gives me: [...] > As far as I know, I

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-31 Thread Tim Chase
On 2013-10-30 19:28, Roy Smith wrote: > For example, it's reasonable to consider any vowel (or string of > vowels, for that matter) to be closer to another vowel than to a > consonant. A great example is the word, "bureaucrat". As far as > I'm concerned, it's spelled {b, vowels, r, vowels, c, r,
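The "bureaucrat" idea quoted above (treat any run of vowels as interchangeable) can be sketched in a few lines of Python. This is only an illustration of that one heuristic, not anything posted in the thread: the function name `vowel_skeleton`, the `*` placeholder, and the choice of `aeiou` as the vowel set are all assumptions made for the example.

```python
import re

# Hypothetical helper: collapse every run of vowels into one placeholder,
# so words that differ only in their vowels get the same key.
_VOWEL_RUN = re.compile(r"[aeiou]+")

def vowel_skeleton(word):
    """Return a lowercase key with each vowel run replaced by '*'."""
    return _VOWEL_RUN.sub("*", word.lower())

# A correct spelling and a common misspelling share one skeleton.
print(vowel_skeleton("bureaucrat"))
print(vowel_skeleton("beaurocrat"))
print(vowel_skeleton("bureaucrat") == vowel_skeleton("beaurocrat"))
```

A key this coarse will also merge unrelated words, so it is probably only useful as a first filter before a stricter comparison.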

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-31 Thread wxjmfauth
On Thursday, 31 October 2013 08:10:18 UTC+1, Steven D'Aprano wrote: > On Wed, 30 Oct 2013 01:49:28 -0700, wxjmfauth wrote: > >> The right solution to that is to treat it no differently from other fuzzy > >> searches. A good search engine should be tolerant of spelling errors

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-31 Thread Mark Lawrence
On 31/10/2013 07:10, Steven D'Aprano wrote: On Wed, 30 Oct 2013 01:49:28 -0700, wxjmfauth wrote: The right solution to that is to treat it no differently from other fuzzy searches. A good search engine should be tolerant of spelling errors and alternative spellings for any letter, not just thos

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-31 Thread Steven D'Aprano
On Wed, 30 Oct 2013 01:49:28 -0700, wxjmfauth wrote: >> The right solution to that is to treat it no differently from other fuzzy >> searches. A good search engine should be tolerant of spelling errors and >> alternative spellings for any letter, not just those with diacritics. >> Ideally, a

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread Roy Smith
In article , Michael Torrie wrote: > On 10/30/2013 10:08 AM, wxjmfa...@gmail.com wrote: > > My comment had nothing to do with Python, it was a > > general comment. A diacritical mark just makes a letter > > a different letter; a "ï" and a "i" are "as > > different" as a "a" from a "z". A diacri

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread Terry Reedy
On 10/30/2013 12:08 PM, wxjmfa...@gmail.com wrote: From a unicode perspective. Unicode.org "knows", these chars are very important, that's the reason why they exist in two forms, precomposed and composed forms. Only some chars have both forms. I believe the precomposed forms are partly a histo
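The two forms mentioned here can be inspected with the standard `unicodedata` module. The sketch below is an illustration added for this archive, not code from the thread; the sample characters (é, and q with a combining acute) are chosen for the example.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two strings render alike but are different code point sequences.
print(precomposed == decomposed)

# NFC composes where a precomposed form exists; NFD decomposes.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)

# "Only some chars have both forms": q with an acute accent has no
# precomposed code point, so NFC leaves it as two code points.
q_acute = "q\u0301"
print(len(unicodedata.normalize("NFC", q_acute)))
```

Normalizing both sides to the same form before comparing is the usual way to make the two spellings compare equal.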

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread wxjmfauth
On Wednesday, 30 October 2013 18:54:05 UTC+1, Michael Torrie wrote: > On 10/30/2013 10:08 AM, wxjmfa...@gmail.com wrote: > > My comment had nothing to do with Python, it was a > > general comment. A diacritical mark just makes a letter > > a different letter; a "ï" and a "i" are "as

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread Michael Torrie
On 10/30/2013 10:08 AM, wxjmfa...@gmail.com wrote: > My comment had nothing to do with Python, it was a > general comment. A diacritical mark just makes a letter > a different letter; a "ï" and a "i" are "as > different" as a "a" from a "z". A diacritical mark > is more than a simple ornamentation.

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread Ned Batchelder
On 10/30/13 12:08 PM, wxjmfa...@gmail.com wrote: On Wednesday, 30 October 2013 13:44:47 UTC+1, Ned Batchelder wrote: On 10/30/13 4:49 AM, wxjmfa...@gmail.com wrote: On Tuesday, 29 October 2013 06:24:50 UTC+1, Steven D'Aprano wrote: On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote: On 201

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread Mark Lawrence
On 30/10/2013 16:08, wxjmfa...@gmail.com wrote: Would you please read, digest and action this https://wiki.python.org/moin/GoogleGroupsPython TIA.

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread wxjmfauth
On Wednesday, 30 October 2013 13:44:47 UTC+1, Ned Batchelder wrote: > On 10/30/13 4:49 AM, wxjmfa...@gmail.com wrote: > > On Tuesday, 29 October 2013 06:24:50 UTC+1, Steven D'Aprano wrote: > > >> On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote: > > >>> On 2013-10-2

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread Mark Lawrence
On 30/10/2013 08:13, wxjmfa...@gmail.com wrote: On Wednesday, 30 October 2013 03:17:21 UTC+1, Chris Angelico wrote: On Wed, Oct 30, 2013 at 2:56 AM, Mark Lawrence wrote: You've stated above that logically unicode is badly handled by the fsr. You then provide a trivial timing example. WT

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread Ned Batchelder
On 10/30/13 4:49 AM, wxjmfa...@gmail.com wrote: On Tuesday, 29 October 2013 06:24:50 UTC+1, Steven D'Aprano wrote: On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote: On 2013-10-28 07:01, wxjmfa...@gmail.com wrote: Simply ignoring diacritics won't get you very far. Right. As an example, t

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread Mark Lawrence
On 30/10/2013 01:33, Piet van Oostrum wrote: Mark Lawrence writes: Please provide hard evidence to support your claims or stop posting this ridiculous nonsense. Give us real world problems that can be reported on the bug tracker, investigated and resolved. I think it is much better just to

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread wxjmfauth
On Tuesday, 29 October 2013 06:24:50 UTC+1, Steven D'Aprano wrote: > On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote: > > On 2013-10-28 07:01, wxjmfa...@gmail.com wrote: > > >>> Simply ignoring diacritics won't get you very far. > > >> Right. As an example, these four French

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread wxjmfauth
On Wednesday, 30 October 2013 03:17:21 UTC+1, Chris Angelico wrote: > On Wed, Oct 30, 2013 at 2:56 AM, Mark Lawrence wrote: > > You've stated above that logically unicode is badly handled by the fsr. You > > then provide a trivial timing example. WTF??? > His idea of bad handl

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-29 Thread Chris Angelico
On Wed, Oct 30, 2013 at 2:56 AM, Mark Lawrence wrote: > You've stated above that logically unicode is badly handled by the fsr. You > then provide a trivial timing example. WTF??? His idea of bad handling is "oh how terrible, ASCII and BMP have optimizations". He hates the idea that it could be

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-29 Thread Piet van Oostrum
Mark Lawrence writes: > Please provide hard evidence to support your claims or stop posting this > ridiculous nonsense. Give us real world problems that can be reported > on the bug tracker, investigated and resolved. I think it is much better just to ignore this nonsense instead of asking for

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-29 Thread Mark Lawrence
On 29/10/2013 19:16, wxjmfa...@gmail.com wrote: On Tuesday, 29 October 2013 16:52:49 UTC+1, Tim Chase wrote: On 2013-10-29 08:38, wxjmfa...@gmail.com wrote: import timeit timeit.timeit("a = 'hundred'; 'x' in a") 0.12621293837694095 timeit.timeit("a = 'hundreij'; 'x' in a") 0.2641155

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-29 Thread wxjmfauth
On Tuesday, 29 October 2013 16:52:49 UTC+1, Tim Chase wrote: > On 2013-10-29 08:38, wxjmfa...@gmail.com wrote: > > >>> import timeit > > >>> timeit.timeit("a = 'hundred'; 'x' in a") > > 0.12621293837694095 > > >>> timeit.timeit("a = 'hundreij'; 'x' in a") > > 0.26411553466961735

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-29 Thread Mark Lawrence
On 29/10/2013 15:38, wxjmfa...@gmail.com wrote: It's okay folks I'll snip all the double spaced google crap as the poster is clearly too bone idle to follow the instructions that have been repeatedly posted here asking for people not to post double spaced google crap. On Tuesday, 29 October 20

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-29 Thread Tim Chase
On 2013-10-29 08:38, wxjmfa...@gmail.com wrote: > >>> import timeit > >>> timeit.timeit("a = 'hundred'; 'x' in a") > 0.12621293837694095 > >>> timeit.timeit("a = 'hundreij'; 'x' in a") > 0.26411553466961735 That reads to me as "If things were purely UCS4 internally, Python would normally take 0
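The quoted one-liners time the assignment together with the membership test and take a single measurement each. A slightly more careful way to run the same kind of comparison is sketched below, assuming the goal is only to compare `'x' in a` for an ASCII string against a string containing a non-Latin-1 character. The helper name `time_membership`, the sample strings, and the repeat counts are assumptions for this sketch; no timings are claimed here, since they depend on the machine and the Python build.

```python
import timeit

def time_membership(haystack, needle="x", number=100_000, repeat=5):
    """Best-of-`repeat` time, in seconds, for `number` runs of `needle in haystack`."""
    # The strings are built in setup so only the membership test is timed.
    setup = f"a = {haystack!r}; n = {needle!r}"
    runs = timeit.repeat("n in a", setup=setup, number=number, repeat=repeat)
    # The minimum is the measurement least disturbed by other activity.
    return min(runs)

ascii_time = time_membership("hundred")
wide_time = time_membership("hundred\u20ac")  # contains the euro sign
print(f"ASCII string:     {ascii_time:.6f} s")
print(f"non-ASCII string: {wide_time:.6f} s")
```

Note that the two sample strings differ in length as well as in content, just as the strings in the quoted session do, so the comparison is indicative at best.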

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-29 Thread wxjmfauth
On Tuesday, 29 October 2013 06:22:27 UTC+1, Steven D'Aprano wrote: > On Mon, 28 Oct 2013 07:01:16 -0700, wxjmfauth wrote: > > And of course, logically, they are very, very badly handled with the > > Flexible String Representation. > I'm reminded of Cato the Elder, the Roman sen

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-28 Thread Steven D'Aprano
On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote: > On 2013-10-28 07:01, wxjmfa...@gmail.com wrote: >>> Simply ignoring diacritics won't get you very far. >> >> Right. As an example, these four French words: cote, côte, coté, côté. > > Distinct words with distinct meanings, sure. > > But

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-28 Thread Steven D'Aprano
On Mon, 28 Oct 2013 07:01:16 -0700, wxjmfauth wrote: > And of course, logically, they are very, very badly handled with the > Flexible String Representation. I'm reminded of Cato the Elder, the Roman senator who would end every speech, no matter the topic, with "Ceterum censeo Carthaginem esse

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-28 Thread Tim Chase
On 2013-10-28 07:01, wxjmfa...@gmail.com wrote: >> Simply ignoring diacritics won't get you very far. > > Right. As an example, these four French words: > cote, côte, coté, côté. Distinct words with distinct meanings, sure. But when a naïve (naive? ☺) person or one without the easy ability to e

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-28 Thread Mark Lawrence
On 28/10/2013 14:01, wxjmfa...@gmail.com wrote: Just as a reminder. There are 1272 characters considered as Latin characters (how to count them is not a simple task), and if my knowledge is correct, they cover and/or are here to cover the 17 languages, to be exact, the 17 European language

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-28 Thread wxjmfauth
On Sunday, 27 October 2013 04:21:46 UTC+1, Nobody wrote: > Simply ignoring diacritics won't get you very far. Right. As an example, these four French words: cote, côte, coté, côté. > Most languages which use diacritics have standard conversions, e.g. > ö -> oe, which are

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-27 Thread Mark Lawrence
On 27/10/2013 01:11, Roy Smith wrote: In article , Dennis Lee Bieber wrote: Compared to Baudot, both ASCII and EBCDIC were probably considered wondrous. Wonderous, indeed. Why would anybody ever need more than one case of the alphabet? It's almost as absurd as somebody wanting to put fun

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread Nobody
On Sat, 26 Oct 2013 20:41:58 -0500, Tim Chase wrote: > I'd be just as happy if Python provided a "sloppy string compare" > that ignored case, diacritical marks, and the like. Simply ignoring diacritics won't get you very far. Most languages which use diacritics have standard conversions, e.g. ö ->
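A language-specific conversion of the kind described here can be sketched with `str.translate`. The table below follows the conventional German transliteration (ö to oe and so on); the table contents and the name `transliterate_german` are assumptions made for this example and were not posted in the thread.

```python
import unicodedata

# Hypothetical table for the conventional German transliteration.
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

def transliterate_german(text):
    """Replace German umlauts and eszett with their ASCII spellings."""
    # Compose first so that 'o' + combining diaeresis is also matched.
    return unicodedata.normalize("NFC", text).translate(GERMAN_MAP)

print(transliterate_german("Köln"))
print(transliterate_german("Straße"))
```

The same character may need a different replacement in another language, which is why a single language-neutral table is unlikely to be right for every input.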

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread Chris Angelico
On Sun, Oct 27, 2013 at 1:05 PM, Steven D'Aprano wrote: > On Sat, 26 Oct 2013 21:11:55 -0400, Roy Smith wrote: > >> In article , >> Dennis Lee Bieber wrote: >> >>> Compared to Baudot, both ASCII and EBCDIC were probably considered >>> wondrous. >> >> Wonderous, indeed. Why would anybody ever ne

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread Tim Chase
On 2013-10-26 21:54, Roy Smith wrote: > In article , > Tim Chase wrote: >> I'd be just as happy if Python provided a "sloppy string compare" >> that ignored case, diacritical marks, and the like. > > The problem with putting fuzzy matching in the core language is > that there is no general agree

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread Steven D'Aprano
On Sat, 26 Oct 2013 21:11:55 -0400, Roy Smith wrote: > In article , > Dennis Lee Bieber wrote: > >> Compared to Baudot, both ASCII and EBCDIC were probably considered >> wondrous. > > Wonderous, indeed. Why would anybody ever need more than one case of > the alphabet? It's almost as absurd a

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread Roy Smith
In article , Tim Chase wrote: > I'd be just as happy if Python provided a "sloppy string compare" > that ignored case, diacritical marks, and the like. The problem with putting fuzzy matching in the core language is that there is no general agreement on how it's supposed to work. There are, h

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread Tim Chase
On 2013-10-26 22:24, Steven D'Aprano wrote: > Why on earth would you want to throw away perfectly good > information? The main reason I've needed to do it in the past is for normalization of search queries. When a user wants to find something containing "pingüino", I want to have those results c
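The search-query normalization described above could look roughly like the following. It is a minimal sketch, assuming that "matching" means comparing keys with case and combining marks removed; the names `search_key` and `titles` and the sample data are made up for the illustration.

```python
import unicodedata

def search_key(text):
    """Build a comparison key: casefolded, with combining marks removed."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical documents to search.
titles = ["El pingüino emperador", "Pinguino de Humboldt", "Penguin facts"]

query = "PINGUINO"
hits = [t for t in titles if search_key(query) in search_key(t)]
print(hits)
```

The original strings are kept for display and only the keys are compared, so no information is thrown away in the stored data.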

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread Roy Smith
In article , Dennis Lee Bieber wrote: > Compared to Baudot, both ASCII and EBCDIC were probably considered > wondrous. Wonderous, indeed. Why would anybody ever need more than one case of the alphabet? It's almost as absurd as somebody wanting to put funny little marks on top of their vowel

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread Steven D'Aprano
On Sat, 26 Oct 2013 16:11:25 -0400, bruce wrote: > hi.. > > getting some files via curl, and want to convert them from what i'm > guessing to be unicode. > > I'd like to convert a string like this:: href="ShowRatings.jsp?tid=1312168">Alcántar, Iliana > > to:: > Alcantar, > Iliana > > where I

Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread MRAB
On 26/10/2013 21:11, bruce wrote: hi.. getting some files via curl, and want to convert them from what i'm guessing to be unicode. I'd like to convert a string like this:: Alcántar, Iliana to:: Alcantar, Iliana where I convert the " á " to " a" which appears to be a shift of 128, but I'm not

trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread bruce
hi.. getting some files via curl, and want to convert them from what i'm guessing to be unicode. I'd like to convert a string like this:: Alcántar, Iliana to:: Alcantar, Iliana where I convert the " á " to " a" which appears to be a shift of 128, but I'm not sure how to accomplish this.. I've
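One common way to get from "Alcántar" to "Alcantar" is to decompose the text and drop the combining marks, using the standard `unicodedata` module. The sketch below is an illustration, not a reply from the thread, and the helper name `strip_accents` is an assumption. The "shift of 128" happens to hold for "á" (code point 225) and "a" (97), but not in general: "é" (233) and "e" (101) differ by 132, so subtracting a constant is not a reliable approach.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining marks removed after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Alcántar, Iliana"))

# The constant-offset idea works for one letter and fails for another.
print(ord("á") - ord("a"), ord("é") - ord("e"))
```

Characters with no decomposition, such as "ø" or "ß", pass through unchanged, so the result is not guaranteed to be pure ASCII.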