Re: trying to strip out non ascii.. or rather convert non ascii

2013-11-01 Thread Mark Lawrence

On 01/11/2013 09:00, wxjmfa...@gmail.com wrote:

I'll ask again, would you please read, digest and action this 
https://wiki.python.org/moin/GoogleGroupsPython


--
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-11-01 Thread wxjmfauth
On Friday, 1 November 2013 08:16:36 UTC+1, Steven D'Aprano wrote:
> On Thu, 31 Oct 2013 03:33:15 -0700, wxjmfauth wrote:
> 
> > On Thursday, 31 October 2013 08:10:18 UTC+1, Steven D'Aprano wrote:
> >> I'm glad that you know so much better than Google, Bing, Yahoo, and
> >> other search engines. When I search for "mispealled" Google gives me:
> [...]
> > As far as I know, I recognized my mistake. I had more text processing
> > systems in mind than search engines.
> 
> Yes, you have, I acknowledge that now. I see now that at the time I made
> my response to you, you had already replied recognising your error.
> Unfortunately I had not seen that. So in that case, I withdraw my
> comments and apologize.
> 
> > I can even tell you, I am really stupid. I wrote pure Unicode software
> > to sort French or German strings.
> >
> > Pure unicode == independent from any locale.
> 
> Unfortunately it is not that simple. The same code point can have
> different meanings in different languages, and should be treated
> differently when sorting. The natural Unicode sort order satisfies very
> few European languages, including English. A few examples:
> 
> * Swedish ä is a distinct letter of the alphabet, appearing
>   after z: "a b c z ä" is sorted according to Swedish rules.
>   But in German ä is considered to be the letter 'a' plus an
>   umlaut, and is collated after 'a': "a ä b c z" is sorted
>   according to German rules.
> 
> * In German ö is considered to be a variant of o, equivalent
>   to 'oe', while in Finnish ö is a distinct letter which
>   cannot be expanded to 'oe', and which appears at the end
>   of the alphabet.
> 
> * Similarly, in modern English æ is a ligature of ae, while in
>   Danish and Norwegian it is a distinct letter of the alphabet
>   appearing after z: in English dictionaries, "Æsir" will be
>   found with other "A" words, often expanded to "Aesir", while
>   in Norwegian it will be found after "Z" words.
> 
> * Most European languages convert uppercase I to lowercase i,
>   but Turkish has distinct letters for dotted and dotless I.
>   According to Turkish rules, lowercase(I) is ı and uppercase(i)
>   is İ.
> 
> While it is true that the Unicode character set is independent of locale,
> for natural processing of characters, it isn't enough to just use Unicode.
> 
> --
> Steven


I'm aware of all the points you gave. That's why
I wrote "French or German strings".

The hard task is not on the Unicode or sorting side;
it is in the creation of the key(s) used for sorting.

E.g. cote, côte, coté, côté. French editors do not all
sort these words (diacritics) in the same way.
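A minimal sketch of one way to build such a key (purely illustrative, not
production code): compare on the unaccented letters first, and break ties
on the accents scanned from the end of the word, which is the tie-break
French dictionaries traditionally use:

import unicodedata

def french_key(word):
    # Primary key: the word with all combining marks stripped.
    nfd = unicodedata.normalize("NFD", word.lower())
    base = "".join(c for c in nfd if not unicodedata.combining(c))
    # Secondary key: one accent slot per letter, compared from the
    # end of the word backwards.
    accents = []
    for c in nfd:
        if unicodedata.combining(c):
            accents[-1] = c      # attach the mark to the preceding letter
        else:
            accents.append("")   # unaccented letter
    return base, tuple(reversed(accents))

print(sorted(["côté", "cote", "coté", "côte"], key=french_key))
# -> ['cote', 'côte', 'coté', 'côté']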

jmf

PS A *real* case to test the FSR.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-11-01 Thread Steven D'Aprano
On Thu, 31 Oct 2013 03:33:15 -0700, wxjmfauth wrote:

> On Thursday, 31 October 2013 08:10:18 UTC+1, Steven D'Aprano wrote:

>> I'm glad that you know so much better than Google, Bing, Yahoo, and
>> other
>> search engines. When I search for "mispealled" Google gives me:
[...]
> As far as I know, I recognized my mistake. I had more text processing
> systems in mind than search engines.

Yes, you have, I acknowledge that now. I see now that at the time I made 
my response to you, you had already replied recognising your error. 
Unfortunately I had not seen that. So in that case, I withdraw my 
comments and apologize.


> I can even tell you, I am really stupid. I wrote pure Unicode software
> to sort French or German strings.
> 
> Pure unicode == independent from any locale.

Unfortunately it is not that simple. The same code point can have 
different meanings in different languages, and should be treated 
differently when sorting. The natural Unicode sort order satisfies very 
few European languages, including English. A few examples:

* Swedish ä is a distinct letter of the alphabet, appearing 
  after z: "a b c z ä" is sorted according to Swedish rules.
  But in German ä is considered to be the letter 'a' plus an
  umlaut, and is collated after 'a': "a ä b c z" is sorted 
  according to German rules.

* In German ö is considered to be a variant of o, equivalent
  to 'oe', while in Finnish ö is a distinct letter which 
  cannot be expanded to 'oe', and which appears at the end
  of the alphabet.

* Similarly, in modern English æ is a ligature of ae, while in
  Danish and Norwegian it is a distinct letter of the alphabet
  appearing after z: in English dictionaries, "Æsir" will be 
  found with other "A" words, often expanded to "Aesir", while
  in Norwegian it will be found after "Z" words.

* Most European languages convert uppercase I to lowercase i, 
  but Turkish has distinct letters for dotted and dotless I. 
  According to Turkish rules, lowercase(I) is ı and uppercase(i)
  is İ.


While it is true that the Unicode character set is independent of locale, 
for natural processing of characters, it isn't enough to just use Unicode.
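For instance, Python's locale module exposes the platform's collation
rules; a minimal sketch (the locale names are platform-dependent and must
be installed, so treat this as illustrative):

import locale

words = ["a", "b", "c", "z", "ä"]

locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")   # Swedish rules
print(sorted(words, key=locale.strxfrm))             # ä collates after z

locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")   # German rules
print(sorted(words, key=locale.strxfrm))             # ä collates next to a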


-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-31 Thread Tim Chase
On 2013-10-30 19:28, Roy Smith wrote:
> For example, it's reasonable to consider any vowel (or string of
> vowels, for that matter) to be closer to another vowel than to a
> consonant.  A great example is the word, "bureaucrat".  As far as
> I'm concerned, it's spelled {b, vowels, r, vowels, c, r, a, t}.  It
> usually takes me three or four tries to get auto-correct to even
> recognize what I'm trying to type and fix it for me.

[glad I'm not the only one who has trouble spelling "bureaucrat"]

Steven D'Aprano wisely mentioned elsewhere in the thread that "The
right solution to that is to treat it no differently from other fuzzy
searches. A good search engine should be tolerant of spelling errors
and alternative spellings for any letter, not just those with
diacritics."

Often the Levenshtein distance is used for calculating closeness, and
the off-the-shelf algorithm assigns a cost of one per difference
(addition, change, or removal).  It doesn't sound like it would be
that hard[1] to assign varying costs based on what character was
added/changed/removed.  A diacritic might have a cost of N while a
similar character (vowel->vowel or consonant->consonant, or
consonant-cluster shift) might have a cost of 2N, and a totally
arbitrary character shift might have a cost of 3N (or higher).
Unfortunately, the Levenshtein algorithm is already O(M*N) slow and
can't be reasonably precalculated without knowing both strings, so
this just ends up heaping additional lookups/comparisons atop
already-slow code.
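A minimal sketch of that weighted-cost idea (the weights and the cost
function here are made up purely for illustration):

import unicodedata

def weighted_levenshtein(a, b, sub_cost, indel_cost=3):
    # Standard dynamic-programming edit distance, but substitutions are
    # priced by sub_cost() instead of a flat 1.
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * indel_cost]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else sub_cost(ca, cb)
            cur.append(min(prev[j] + indel_cost,       # deletion
                           cur[j - 1] + indel_cost,    # insertion
                           prev[j - 1] + cost))        # substitution
        prev = cur
    return prev[-1]

VOWELS = set("aeiouy")

def sub_cost(x, y):
    x, y = x.lower(), y.lower()
    base = lambda c: unicodedata.normalize("NFKD", c)[0]
    if base(x) == base(y):
        return 1            # same letter, different diacritic: cheapest
    if (base(x) in VOWELS) == (base(y) in VOWELS):
        return 2            # vowel-for-vowel or consonant-for-consonant
    return 3                # arbitrary substitution

print(weighted_levenshtein("naïve", "niave", sub_cost))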

-tkc

[1]
http://en.wikipedia.org/wiki/Levenshtein_distance#Possible_modifications
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-31 Thread wxjmfauth
On Thursday, 31 October 2013 08:10:18 UTC+1, Steven D'Aprano wrote:
> On Wed, 30 Oct 2013 01:49:28 -0700, wxjmfauth wrote:
> 
> >> The right solution to that is to treat it no differently from other
> >> fuzzy searches. A good search engine should be tolerant of spelling
> >> errors and alternative spellings for any letter, not just those with
> >> diacritics. Ideally, a good search engine would successfully match all
> >> three of "naïve", "naive" and "niave", and it shouldn't rely on special
> >> handling of diacritics.
> > 
> > This is a non sense. The purpose of a diacritical mark is to make a
> > letter a different letter. If a tool is supposed to match an ô, there is
> > absolutely no reason to match something else.
> 
> I'm glad that you know so much better than Google, Bing, Yahoo, and other
> search engines. When I search for "mispealled" Google gives me:
> 
> Showing results for misspelled
> Search instead for mispealled
> 
> But I see now that this is nonsense and there is *absolutely no reason*
> to match something other than the ecaxt wrods I typed.
> 
> Perhaps you should submit a bug report to Google:
> 
> "When I mistype a word, Google correctly gives me the search results I
> wanted, instead of the wrong results I didn't want."
> 
> --
> Steven


As far as I know, I recognized my mistake. I had more
text processing systems in mind than search engines.

I can even tell you, I am really stupid. I wrote pure
Unicode software to sort French or German strings.

Pure unicode == independent from any locale.

jmf
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-31 Thread Mark Lawrence

On 31/10/2013 07:10, Steven D'Aprano wrote:

> On Wed, 30 Oct 2013 01:49:28 -0700, wxjmfauth wrote:
> 
> >> The right solution to that is to treat it no differently from other
> >> fuzzy searches. A good search engine should be tolerant of spelling
> >> errors and alternative spellings for any letter, not just those with
> >> diacritics. Ideally, a good search engine would successfully match all
> >> three of "naïve", "naive" and "niave", and it shouldn't rely on special
> >> handling of diacritics.
> > 
> > This is a non sense. The purpose of a diacritical mark is to make a
> > letter a different letter. If a tool is supposed to match an ô, there is
> > absolutely no reason to match something else.
> 
> I'm glad that you know so much better than Google, Bing, Yahoo, and other
> search engines. When I search for "mispealled" Google gives me:
> 
>  Showing results for misspelled
>  Search instead for mispealled
> 
> But I see now that this is nonsense and there is *absolutely no reason*
> to match something other than the ecaxt wrods I typed.
> 
> Perhaps you should submit a bug report to Google:
> 
> "When I mistype a word, Google correctly gives me the search results I
> wanted, instead of the wrong results I didn't want."



I'm sorry Steven but you're completely out of your depth here.  Please 
bow down to the superior intellect of jmf, where jm is for Joseph McCarthy.


--
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-31 Thread Steven D'Aprano
On Wed, 30 Oct 2013 01:49:28 -0700, wxjmfauth wrote:

>> The right solution to that is to treat it no differently from other
>> fuzzy
>> searches. A good search engine should be tolerant of spelling errors
>> and
>> alternative spellings for any letter, not just those with diacritics.
>> Ideally, a good search engine would successfully match all three of
>> "naïve", "naive" and "niave", and it shouldn't rely on special handling
>> of diacritics.
> 
> This is a non sense. The purpose of a diacritical mark is to make a
> letter a different letter. If a tool is supposed to match an ô, there is
> absolutely no reason to match something else.


I'm glad that you know so much better than Google, Bing, Yahoo, and other 
search engines. When I search for "mispealled" Google gives me:

Showing results for misspelled
Search instead for mispealled


But I see now that this is nonsense and there is *absolutely no reason* 
to match something other than the ecaxt wrods I typed.

Perhaps you should submit a bug report to Google:

"When I mistype a word, Google correctly gives me the search results I 
wanted, instead of the wrong results I didn't want."



-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread Roy Smith
In article ,
 Michael Torrie  wrote:

> On 10/30/2013 10:08 AM, wxjmfa...@gmail.com wrote:
> > My comment had nothing to do with Python, it was a
> > general comment. A diacritical mark just makes a letter
> > a different letter; a "ï " and a "i" are "as
> > diferent" as a "a" from a "z". A diacritical mark
> > is more than a simple ornementation.
> 
> That's nice, but you didn't actually read what Ned said (or the OP).
> The OP doesn't care that "ï " and a "i" are as different as "a" and "z".
> For the purposes of his search he wants them treated as the same
> letter.  A fuzzy searching treats them all the same.

That's one definition of fuzzy.  But, there's nothing that says you 
can't build a fuzzy matching algorithm which considers some mismatches 
to be worse than others.

For example, it's reasonable to consider any vowel (or string of vowels, 
for that matter) to be closer to another vowel than to a consonant.  A 
great example is the word, "bureaucrat".  As far as I'm concerned, it's 
spelled {b, vowels, r, vowels, c, r, a, t}.  It usually takes me three 
or four tries to get auto-correct to even recognize what I'm trying to 
type and fix it for me.

Likewise for pairs like {c, s}, {j, g}, {v, w}, and so on.

In that spirit, I would think that a, á, and â would all be considered 
more conservative replacements for each other than they would be for k, 
x, or z.
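A tiny sketch of that idea (purely illustrative): bucket characters into
coarse classes before comparing, so that "bureaucrat" and a vowel-mangled
"burocrat" collapse to the same shape:

import re
import unicodedata

def shape(word):
    # Strip diacritics, lowercase, then collapse any run of vowels to "V".
    base = "".join(c for c in unicodedata.normalize("NFKD", word.lower())
                   if not unicodedata.combining(c))
    return re.sub(r"[aeiouy]+", "V", base)

print(shape("bureaucrat"), shape("burocrat"))   # both "bVrVcrVt"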
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread Terry Reedy

On 10/30/2013 12:08 PM, wxjmfa...@gmail.com wrote:


 From a unicode perspective.
Unicode.org "knows", these chars a very important, that's
the reason why they exist in two forms, precomposed and
composed forms.


Only some chars have both forms. I believe the precomposed forms are 
partly a historical accident of what precomposed forms were in the 
various latin-1 sets.


--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread wxjmfauth
On Wednesday, 30 October 2013 18:54:05 UTC+1, Michael Torrie wrote:
> On 10/30/2013 10:08 AM, wxjmfa...@gmail.com wrote:
> > My comment had nothing to do with Python, it was a
> > general comment. A diacritical mark just makes a letter
> > a different letter; a "ï " and a "i" are "as
> > different" as a "a" from a "z". A diacritical mark
> > is more than a simple ornamentation.
> 
> That's nice, but you didn't actually read what Ned said (or the OP).
> The OP doesn't care that "ï " and a "i" are as different as "a" and "z".
> For the purposes of his search he wants them treated as the same
> letter.  A fuzzy search treats them all the same. For example, a
> search for "Godel, Escher, Bach" should find "Gödel, Escher, Bach" just
> fine.  Even though "o" and "ö" are different characters.  And lo and
> behold Google actually does this!  Try it.  It's nice for those of us
> who want to find something and our US keyboards don't have the right marks.
> 
> https://www.google.ca/search?q=godel+escher+bach
> 
> After all this nonsense, that's what the original poster is looking for
> (I think... can't be sure since it's been so many days now).  Seems to
> me a python module does this quite nicely:
> 
> https://pypi.python.org/pypi/Unidecode


Ok. You are right. I recognize my mistake. Independently
from the original poster's task, I did not understand it
that way.

Let's say it depends on the context: for a general
search engine, it's good that diacritics are ignored.
For, let's say, a text processing system, it's good
to have only precise matches. That does not mean other
matching possibilities may not exist.

jmf

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread Michael Torrie
On 10/30/2013 10:08 AM, wxjmfa...@gmail.com wrote:
> My comment had nothing to do with Python, it was a
> general comment. A diacritical mark just makes a letter
> a different letter; a "ï " and a "i" are "as
> diferent" as a "a" from a "z". A diacritical mark
> is more than a simple ornementation.

That's nice, but you didn't actually read what Ned said (or the OP).
The OP doesn't care that "ï " and a "i" are as different as "a" and "z".
For the purposes of his search he wants them treated as the same
letter.  A fuzzy search treats them all the same. For example, a
search for "Godel, Escher, Bach" should find "Gödel, Escher, Bach" just
fine.  Even though "o" and "ö" are different characters.  And lo and
behold Google actually does this!  Try it.  It's nice for those of us
who want to find something and our US keyboards don't have the right marks.

https://www.google.ca/search?q=godel+escher+bach

After all this nonsense, that's what the original poster is looking for
(I think... can't be sure since it's been so many days now).  Seems to
me a python module does this quite nicely:

https://pypi.python.org/pypi/Unidecode
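For instance (assuming Unidecode is installed, e.g. via pip install unidecode):

from unidecode import unidecode

print(unidecode("Gödel, Escher, Bach"))   # -> Godel, Escher, Bach
print(unidecode("Alcántar, Iliana"))      # -> Alcantar, Iliana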
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread Ned Batchelder

On 10/30/13 12:08 PM, wxjmfa...@gmail.com wrote:

On Wednesday, 30 October 2013 13:44:47 UTC+1, Ned Batchelder wrote:

On 10/30/13 4:49 AM, wxjmfa...@gmail.com wrote:


On Tuesday, 29 October 2013 06:24:50 UTC+1, Steven D'Aprano wrote:

On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:

On 2013-10-28 07:01, wxjmfa...@gmail.com wrote:

Simply ignoring diactrics won't get you very far.

Right. As an example, these four French words : cote, côte, coté, côté
.

Distinct words with distinct meanings, sure.
But when a naïve (naive? ☺) person or one without the easy ability to
enter characters with diacritics searches for "cote", I want to return
possible matches containing any of your 4 examples.  It's slightly
fuzzier if they search for "coté", in which case they may mean "coté" or
they might mean be unable to figure out how to add a hat and want to
type "côté". Though I'd rather get more results, even if it has some
that only match fuzzily.

The right solution to that is to treat it no differently from other fuzzy
searches. A good search engine should be tolerant of spelling errors and
alternative spellings for any letter, not just those with diacritics.
Ideally, a good search engine would successfully match all three of
"naïve", "naive" and "niave", and it shouldn't rely on special handling
of diacritics.

--
This is a non sense. The purpose of a diacritical mark is to
make a letter a different letter. If a tool is supposed to
match an ô, there is absolutely no reason to match something
else.
jmf



jmf, Tim Chase described his use case, and it seems reasonable to me.

I'm not sure why you would describe it as nonsense.



--Ned.



My comment had nothing to do with Python, it was a
general comment. A diacritical mark just makes a letter
a different letter; a "ï " and a "i" are "as
diferent" as a "a" from a "z". A diacritical mark
is more than a simple ornementation.


Yes, we understand that.  Tim outlined a need that had to do with users' 
informal typing.  In his case, he needs to deal with that sloppiness.  
You can't simply insist that users be more precise.


Unicode is a way to represent text, and text gets used in many different 
ways.  Each of us has to acknowledge that our text needs may be 
different than someone else's.  jmf, I'm guessing from your comments 
over the last few months that you are doing detailed linguistic work 
with corpora in many languages.  That work leads to one style of Unicode 
use.  In your domain, it is "nonsense" to ignore diacriticals.


Other people do different kinds of work with Unicode, and that leads to 
different needs.  In Tim's system, it is important to ignore 
diacriticals.  You might not have a use personally for Tim's system.  
That doesn't make it nonsense.


--Ned.

 From a unicode perspective.
Unicode.org "knows", these chars a very important, that's
the reason why they exist in two forms, precomposed and
composed forms.

 From a software perspective.
Luckily for the end users, all the serious software
are considering all these chars in an equal way. They
are all belonging to the BMP plane. An "Ą" is treated
as an "ê", same memory consumption, same performance,
==> very smooth software.

jmf



--
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread Mark Lawrence

On 30/10/2013 16:08, wxjmfa...@gmail.com wrote:

Would you please read, digest and action this 
https://wiki.python.org/moin/GoogleGroupsPython


TIA.

--
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread wxjmfauth
On Wednesday, 30 October 2013 13:44:47 UTC+1, Ned Batchelder wrote:
> On 10/30/13 4:49 AM, wxjmfa...@gmail.com wrote:
> > On Tuesday, 29 October 2013 06:24:50 UTC+1, Steven D'Aprano wrote:
> >> On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:
> >>> On 2013-10-28 07:01, wxjmfa...@gmail.com wrote:
> >>>>> Simply ignoring diactrics won't get you very far.
> >>>> Right. As an example, these four French words : cote, côte, coté, côté
> >>>> .
> >>> Distinct words with distinct meanings, sure.
> >>> But when a naïve (naive? ☺) person or one without the easy ability to
> >>> enter characters with diacritics searches for "cote", I want to return
> >>> possible matches containing any of your 4 examples.  It's slightly
> >>> fuzzier if they search for "coté", in which case they may mean "coté" or
> >>> they might mean be unable to figure out how to add a hat and want to
> >>> type "côté". Though I'd rather get more results, even if it has some
> >>> that only match fuzzily.
> >>
> >> The right solution to that is to treat it no differently from other fuzzy
> >> searches. A good search engine should be tolerant of spelling errors and
> >> alternative spellings for any letter, not just those with diacritics.
> >> Ideally, a good search engine would successfully match all three of
> >> "naïve", "naive" and "niave", and it shouldn't rely on special handling
> >> of diacritics.
> >
> > --
> > This is a non sense. The purpose of a diacritical mark is to
> > make a letter a different letter. If a tool is supposed to
> > match an ô, there is absolutely no reason to match something
> > else.
> >
> > jmf
> 
> jmf, Tim Chase described his use case, and it seems reasonable to me.
> I'm not sure why you would describe it as nonsense.
> 
> --Ned.



My comment had nothing to do with Python, it was a
general comment. A diacritical mark just makes a letter
a different letter; a "ï " and a "i" are "as
different" as a "a" from a "z". A diacritical mark
is more than a simple ornamentation.

From a unicode perspective:
Unicode.org "knows" these chars are very important; that's
the reason why they exist in two forms, precomposed and
decomposed.

From a software perspective:
Luckily for the end users, all serious software
considers all these chars in an equal way. They
all belong to the BMP plane. An "Ą" is treated
like an "ê": same memory consumption, same performance
==> very smooth software.

jmf

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread Mark Lawrence

On 30/10/2013 08:13, wxjmfa...@gmail.com wrote:

> On Wednesday, 30 October 2013 03:17:21 UTC+1, Chris Angelico wrote:
> > On Wed, Oct 30, 2013 at 2:56 AM, Mark Lawrence  wrote:
> > > You've stated above that logically unicode is badly handled by the fsr.  You
> > > then provide a trivial timing example.  WTF???
> > 
> > His idea of bad handling is "oh how terrible, ASCII and BMP have
> > optimizations". He hates the idea that it could be better in some
> > areas instead of even timings all along. But the FSR actually has some
> > distinct benefits even in the areas he's citing - watch this:
> > 
> > >>> import timeit
> > >>> timeit.timeit("a = 'hundred'; 'x' in a")
> > 0.3625614428649451
> > >>> timeit.timeit("a = 'hundreij'; 'x' in a")
> > 0.6753936603674484
> > >>> timeit.timeit("a = 'hundred'; 'ģ' in a")
> > 0.25663261671525106
> > >>> timeit.timeit("a = 'hundreij'; 'ģ' in a")
> > 0.3582399439035271
> > 
> > The first two examples are his examples done on my computer, so you
> > can see how all four figures compare. Note how testing for the
> > presence of a non-Latin1 character in an 8-bit string is very fast.
> > Same goes for testing for non-BMP character in a 16-bit string. The
> > difference gets even larger if the string is longer:
> > 
> > >>> timeit.timeit("a = 'hundred'*1000; 'x' in a")
> > 10.083378194714726
> > >>> timeit.timeit("a = 'hundreij'*1000; 'x' in a")
> > 18.656413035735
> > >>> timeit.timeit("a = 'hundreij'*1000; 'ģ' in a")
> > 18.436268855399135
> > >>> timeit.timeit("a = 'hundred'*1000; 'ģ' in a")
> > 2.8308718007456264
> > 
> > Wow! The FSR speeds up searches immensely! It's obviously the best
> > thing since sliced bread!
> > 
> > ChrisA
> 
> -
> 
> It is not obvious to make comparisons with all these
> methods and characters (lookup depending on the position
> in the table, ...). The only thing that can be done and
> observed is the tendency between the subsets the FSR
> artificially creates.
> One can use the best algorithms to adjust bytes; it is
> very hard to escape from the fact that if one manipulates
> two strings with different internal representations, it
> is necessary to find a way to have a "common internal
> coding" prior to manipulations.
> It seems to me that this FSR, with its "negative logic",
> is always attempting to "optimize" with the worst
> case instead of "optimizing" with the best case.
> This kind of effect shows up on the memory side.
> Compare utf-8, which has a memory optimization on
> a per code point basis, with the FSR, which has an
> optimization based on subsets (one of its purposes).
> 
> >>> # FSR
> >>> sys.getsizeof( ('a'*1000) + 'z')
> 1026
> >>> sys.getsizeof( ('a'*1000) + '€')
> 2040
> >>> # utf-8
> >>> sys.getsizeof( (('a'*1000) + 'z').encode('utf-8'))
> 1018
> >>> sys.getsizeof( (('a'*1000) + '€').encode('utf-8'))
> 1020
> 
> jmf



How do these figures compare to the ones quoted here 
https://mail.python.org/pipermail/python-dev/2011-September/113714.html ?


--
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread Ned Batchelder


On 10/30/13 4:49 AM, wxjmfa...@gmail.com wrote:

On Tuesday, 29 October 2013 06:24:50 UTC+1, Steven D'Aprano wrote:

On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:




On 2013-10-28 07:01, wxjmfa...@gmail.com wrote:

Simply ignoring diactrics won't get you very far.

Right. As an example, these four French words : cote, côte, coté, côté
.

Distinct words with distinct meanings, sure.
But when a naïve (naive? ☺) person or one without the easy ability to
enter characters with diacritics searches for "cote", I want to return
possible matches containing any of your 4 examples.  It's slightly
fuzzier if they search for "coté", in which case they may mean "coté" or
they might mean be unable to figure out how to add a hat and want to
type "côté". Though I'd rather get more results, even if it has some
that only match fuzzily.



The right solution to that is to treat it no differently from other fuzzy

searches. A good search engine should be tolerant of spelling errors and

alternative spellings for any letter, not just those with diacritics.

Ideally, a good search engine would successfully match all three of

"naïve", "naive" and "niave", and it shouldn't rely on special handling

of diacritics.




--

This is a non sense. The purpose of a diacritical mark is to
make a letter a different letter. If a tool is supposed to
match an ô, there is absolutely no reason to match something
else.

jmf



jmf, Tim Chase described his use case, and it seems reasonable to me.  
I'm not sure why you would describe it as nonsense.


--Ned.
--
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread Mark Lawrence

On 30/10/2013 01:33, Piet van Oostrum wrote:

Mark Lawrence  writes:


Please provide hard evidence to support your claims or stop posting this
ridiculous nonsense.  Give us real world problems that can be reported
on the bug tracker, investigated and resolved.


I think it is much better just to ignore this nonsense instead of asking for 
evidence you know you will never get.



A good point, but note he doesn't have the courage to reply to me but 
always to others.  I guess he spends a lot of time clucking, not because 
he's run out of supplies, but because he's simply a chicken.


--
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread wxjmfauth
On Tuesday, 29 October 2013 06:24:50 UTC+1, Steven D'Aprano wrote:
> On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:
> 
> > On 2013-10-28 07:01, wxjmfa...@gmail.com wrote:
> >>> Simply ignoring diactrics won't get you very far.
> >> 
> >> Right. As an example, these four French words : cote, côte, coté, côté
> >> .
> > 
> > Distinct words with distinct meanings, sure.
> > 
> > But when a naïve (naive? ☺) person or one without the easy ability to
> > enter characters with diacritics searches for "cote", I want to return
> > possible matches containing any of your 4 examples.  It's slightly
> > fuzzier if they search for "coté", in which case they may mean "coté" or
> > they might mean be unable to figure out how to add a hat and want to
> > type "côté". Though I'd rather get more results, even if it has some
> > that only match fuzzily.
> 
> The right solution to that is to treat it no differently from other fuzzy 
> searches. A good search engine should be tolerant of spelling errors and 
> alternative spellings for any letter, not just those with diacritics. 
> Ideally, a good search engine would successfully match all three of 
> "naïve", "naive" and "niave", and it shouldn't rely on special handling 
> of diacritics.
--

This is a non sense. The purpose of a diacritical mark is to
make a letter a different letter. If a tool is supposed to
match an ô, there is absolutely no reason to match something
else.

jmf

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-30 Thread wxjmfauth
On Wednesday, 30 October 2013 03:17:21 UTC+1, Chris Angelico wrote:
> On Wed, Oct 30, 2013 at 2:56 AM, Mark Lawrence  
> wrote:
> > You've stated above that logically unicode is badly handled by the fsr.  You
> > then provide a trivial timing example.  WTF???
> 
> His idea of bad handling is "oh how terrible, ASCII and BMP have
> optimizations". He hates the idea that it could be better in some
> areas instead of even timings all along. But the FSR actually has some
> distinct benefits even in the areas he's citing - watch this:
> 
> >>> import timeit
> >>> timeit.timeit("a = 'hundred'; 'x' in a")
> 0.3625614428649451
> >>> timeit.timeit("a = 'hundreij'; 'x' in a")
> 0.6753936603674484
> >>> timeit.timeit("a = 'hundred'; 'ģ' in a")
> 0.25663261671525106
> >>> timeit.timeit("a = 'hundreij'; 'ģ' in a")
> 0.3582399439035271
> 
> The first two examples are his examples done on my computer, so you
> can see how all four figures compare. Note how testing for the
> presence of a non-Latin1 character in an 8-bit string is very fast.
> Same goes for testing for non-BMP character in a 16-bit string. The
> difference gets even larger if the string is longer:
> 
> >>> timeit.timeit("a = 'hundred'*1000; 'x' in a")
> 10.083378194714726
> >>> timeit.timeit("a = 'hundreij'*1000; 'x' in a")
> 18.656413035735
> >>> timeit.timeit("a = 'hundreij'*1000; 'ģ' in a")
> 18.436268855399135
> >>> timeit.timeit("a = 'hundred'*1000; 'ģ' in a")
> 2.8308718007456264
> 
> Wow! The FSR speeds up searches immensely! It's obviously the best
> thing since sliced bread!
> 
> ChrisA

-


It is not obvious to make comparisons with all these
methods and characters (lookup depending on the position
in the table, ...). The only thing that can be done and
observed is the tendency between the subsets the FSR
artificially creates.
One can use the best algorithms to adjust bytes; it is
very hard to escape from the fact that if one manipulates
two strings with different internal representations, it
is necessary to find a way to have a "common internal
coding" prior to manipulations.
It seems to me that this FSR, with its "negative logic",
is always attempting to "optimize" with the worst
case instead of "optimizing" with the best case.
This kind of effect shows up on the memory side.
Compare utf-8, which has a memory optimization on
a per code point basis, with the FSR, which has an
optimization based on subsets (one of its purposes).

>>> # FSR
>>> sys.getsizeof( ('a'*1000) + 'z')
1026
>>> sys.getsizeof( ('a'*1000) + '€')
2040
>>> # utf-8
>>> sys.getsizeof( (('a'*1000) + 'z').encode('utf-8'))
1018
>>> sys.getsizeof( (('a'*1000) + '€').encode('utf-8'))
1020

jmf

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-29 Thread Chris Angelico
On Wed, Oct 30, 2013 at 2:56 AM, Mark Lawrence  wrote:
> You've stated above that logically unicode is badly handled by the fsr.  You
> then provide a trivial timing example.  WTF???

His idea of bad handling is "oh how terrible, ASCII and BMP have
optimizations". He hates the idea that it could be better in some
areas instead of even timings all along. But the FSR actually has some
distinct benefits even in the areas he's citing - watch this:

>>> import timeit
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.3625614428649451
>>> timeit.timeit("a = 'hundreij'; 'x' in a")
0.6753936603674484
>>> timeit.timeit("a = 'hundred'; 'ģ' in a")
0.25663261671525106
>>> timeit.timeit("a = 'hundreij'; 'ģ' in a")
0.3582399439035271

The first two examples are his examples done on my computer, so you
can see how all four figures compare. Note how testing for the
presence of a non-Latin1 character in an 8-bit string is very fast.
Same goes for testing for non-BMP character in a 16-bit string. The
difference gets even larger if the string is longer:

>>> timeit.timeit("a = 'hundred'*1000; 'x' in a")
10.083378194714726
>>> timeit.timeit("a = 'hundreij'*1000; 'x' in a")
18.656413035735
>>> timeit.timeit("a = 'hundreij'*1000; 'ģ' in a")
18.436268855399135
>>> timeit.timeit("a = 'hundred'*1000; 'ģ' in a")
2.8308718007456264

Wow! The FSR speeds up searches immensely! It's obviously the best
thing since sliced bread!

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-29 Thread Piet van Oostrum
Mark Lawrence  writes:

> Please provide hard evidence to support your claims or stop posting this
> ridiculous nonsense.  Give us real world problems that can be reported
> on the bug tracker, investigated and resolved.

I think it is much better just to ignore this nonsense instead of asking for 
evidence you know you will never get.
-- 
Piet van Oostrum 
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-29 Thread Mark Lawrence

On 29/10/2013 19:16, wxjmfa...@gmail.com wrote:

> On Tuesday, 29 October 2013 16:52:49 UTC+1, Tim Chase wrote:
> > On 2013-10-29 08:38, wxjmfa...@gmail.com wrote:
> > > >>> import timeit
> > > >>> timeit.timeit("a = 'hundred'; 'x' in a")
> > > 0.12621293837694095
> > > >>> timeit.timeit("a = 'hundreij'; 'x' in a")
> > > 0.26411553466961735
> > 
> > That reads to me as "If things were purely UCS4 internally, Python
> > would normally take 0.264... seconds to execute this test, but core
> > devs managed to optimize a particular (lower 127 ASCII characters
> > only) case so that it runs in less than half the time."
> > 
> > Is this not what you intended to demonstrate?  'cuz that sounds
> > like a pretty awesome optimization to me.
> > 
> > -tkc
> 
> That's very naive. In fact, what happens is just the opposite.
> The "best case" with the FSR is worse than the "worst case"
> without the FSR.
> 
> And this is just without counting the effect that this poor
> Python is spending its time switching from one internal
> representation to another, not forgetting the fact
> that this has to be tested every time.
> The more unicode manipulations one applies, the more time
> it demands.
> 
> Two tasks that come to my mind: re and normalization.
> It's very interesting to observe what happens when one
> normalizes latin text and polytonic Greek text, both with
> plenty of diacritics.
> 
> Something different, based on my previous example.
> 
> What is a European user supposed to think, when she/he
> sees she/he can be "penalized" by such an amount,
> simply by using non-ascii characters for a product
> which is supposed to be "unicode compliant"?
> 
> jmf



Please provide hard evidence to support your claims or stop posting this 
ridiculous nonsense.  Give us real world problems that can be reported 
on the bug tracker, investigated and resolved.


--
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-29 Thread wxjmfauth
On Tuesday, 29 October 2013 16:52:49 UTC+1, Tim Chase wrote:
> On 2013-10-29 08:38, wxjmfa...@gmail.com wrote:
> > >>> import timeit
> > >>> timeit.timeit("a = 'hundred'; 'x' in a")
> > 0.12621293837694095
> > >>> timeit.timeit("a = 'hundreij'; 'x' in a")
> > 0.26411553466961735
> 
> That reads to me as "If things were purely UCS4 internally, Python
> would normally take 0.264... seconds to execute this test, but core
> devs managed to optimize a particular (lower 127 ASCII characters
> only) case so that it runs in less than half the time."
> 
> Is this not what you intended to demonstrate?  'cuz that sounds
> like a pretty awesome optimization to me.
> 
> -tkc



That's very naive. In fact, what happens is just the opposite.
The "best case" with the FSR is worse than the "worst case"
without the FSR.

And this is just without counting the effect that this poor
Python is spending its time switching from one internal
representation to another, not forgetting the fact
that this has to be tested every time.
The more unicode manipulations one applies, the more time
it demands.

Two tasks that come to my mind: re and normalization.
It's very interesting to observe what happens when one
normalizes latin text and polytonic Greek text, both with
plenty of diacritics.
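One quick way to run that comparison yourself (the sample strings below
are illustrative only; absolute timings will of course vary by machine
and Python version):

import timeit

samples = {
    "ascii": "cote " * 200,
    "latin": "côté " * 200,            # Latin text with diacritics
    "greek": "ᾅγιος ὁδός " * 200,      # polytonic Greek, stacked diacritics
}
for name, text in samples.items():
    t = timeit.timeit("unicodedata.normalize('NFD', s)",
                      setup="import unicodedata; s = %r" % text,
                      number=10000)
    print(name, t)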



Something different, based on my previous example.

What is a European user supposed to think, when she/he
sees she/he can be "penalized" by such an amount,
simply by using non-ascii characters for a product
which is supposed to be "unicode compliant"?

jmf

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-29 Thread Mark Lawrence

On 29/10/2013 15:38, wxjmfa...@gmail.com wrote:

It's okay folks I'll snip all the double spaced google crap as the 
poster is clearly too bone idle to follow the instructions that have 
been repeatedly posted here asking for people not to post double spaced 
google crap.



> On Tuesday, 29 October 2013 06:22:27 UTC+1, Steven D'Aprano wrote:
> > On Mon, 28 Oct 2013 07:01:16 -0700, wxjmfauth wrote:
> > 
> > > And of course, logically, they are very, very badly handled with the
> > > Flexible String Representation.
> > 
> > I'm reminded of Cato the Elder, the Roman senator who would end every
> > speech, no matter the topic, with "Ceterum censeo Carthaginem esse
> > delendam" ("Furthermore, I consider that Carthage must be destroyed").
> > 
> > But at least he had the good grace to present that as an opinion, instead
> > of repeating a falsehood as if it were a fact.
> > 
> > --
> > Steven
> 
> --
> 
> >>> import timeit
> >>> timeit.timeit("a = 'hundred'; 'x' in a")
> 0.12621293837694095
> >>> timeit.timeit("a = 'hundreij'; 'x' in a")
> 0.26411553466961735
> 
> If you understand the coding of characters, Unicode
> and what this FSR does, it is child's play to produce gazillions
> of examples like this.
> 
> (Notice the usage of a Dutch character instead of a boring €).
> 
> jmf



You've stated above that logically unicode is badly handled by the fsr. 
 You then provide a trivial timing example.  WTF???


--
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-29 Thread Tim Chase
On 2013-10-29 08:38, wxjmfa...@gmail.com wrote:
> >>> import timeit
> >>> timeit.timeit("a = 'hundred'; 'x' in a")  
> 0.12621293837694095
> >>> timeit.timeit("a = 'hundreij'; 'x' in a")  
> 0.26411553466961735

That reads to me as "If things were purely UCS4 internally, Python
would normally take 0.264... seconds to execute this test, but core
devs managed to optimize a particular (lower 127 ASCII characters
only) case so that it runs in less than half the time."

Is this not what you intended to demonstrate?  'cuz that sounds
like a pretty awesome optimization to me.

-tkc



-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-29 Thread wxjmfauth
On Tuesday, 29 October 2013 06:22:27 UTC+1, Steven D'Aprano wrote:
> On Mon, 28 Oct 2013 07:01:16 -0700, wxjmfauth wrote:
> 
> > And of course, logically, they are very, very badly handled with the
> > Flexible String Representation.
> 
> I'm reminded of Cato the Elder, the Roman senator who would end every 
> speech, no matter the topic, with "Ceterum censeo Carthaginem esse 
> delendam" ("Furthermore, I consider that Carthage must be destroyed").
> 
> But at least he had the good grace to present that as an opinion, instead 
> of repeating a falsehood as if it were a fact.
> 
> -- 
> Steven

--

>>> import timeit
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.12621293837694095
>>> timeit.timeit("a = 'hundreij'; 'x' in a")
0.26411553466961735

If you understand the coding of characters, Unicode
and what this FSR does, it is child's play to produce gazillions
of examples like this.

(Notice the usage of a Dutch character instead of a boring €).

jmf


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-28 Thread Steven D'Aprano
On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:

> On 2013-10-28 07:01, wxjmfa...@gmail.com wrote:
>>> Simply ignoring diactrics won't get you very far.
>> 
>> Right. As an example, these four French words : cote, côte, coté, côté
>> .
> 
> Distinct words with distinct meanings, sure.
> 
> But when a naïve (naive? ☺) person or one without the easy ability to
> enter characters with diacritics searches for "cote", I want to return
> possible matches containing any of your 4 examples.  It's slightly
> fuzzier if they search for "coté", in which case they may mean "coté" or
> they might mean be unable to figure out how to add a hat and want to
> type "côté". Though I'd rather get more results, even if it has some
> that only match fuzzily.

The right solution to that is to treat it no differently from other fuzzy 
searches. A good search engine should be tolerant of spelling errors and 
alternative spellings for any letter, not just those with diacritics. 
Ideally, a good search engine would successfully match all three of 
"naïve", "naive" and "niave", and it shouldn't rely on special handling 
of diacritics.



-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-28 Thread Steven D'Aprano
On Mon, 28 Oct 2013 07:01:16 -0700, wxjmfauth wrote:

> And of course, logically, they are very, very badly handled with the
> Flexible String Representation.

I'm reminded of Cato the Elder, the Roman senator who would end every 
speech, no matter the topic, with "Ceterum censeo Carthaginem esse 
delendam" ("Furthermore, I consider that Carthage must be destroyed").

But at least he had the good grace to present that as an opinion, instead 
of repeating a falsehood as if it were a fact.




-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-28 Thread Tim Chase
On 2013-10-28 07:01, wxjmfa...@gmail.com wrote:
>> Simply ignoring diactrics won't get you very far.
> 
> Right. As an example, these four French words :
> cote, côte, coté, côté .

Distinct words with distinct meanings, sure.

But when a naïve (naive? ☺) person or one without the easy ability
to enter characters with diacritics searches for "cote", I want to
return possible matches containing any of your 4 examples.  It's
slightly fuzzier if they search for "coté", in which case they may
mean "coté" or they might mean be unable to figure out how to
add a hat and want to type "côté". Though I'd rather get more
results, even if it has some that only match fuzzily.

Circumflexually-circumspectly-yers,

-tkc


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-28 Thread Mark Lawrence

On 28/10/2013 14:01, wxjmfa...@gmail.com wrote:


Just as a reminder: there are 1272 characters considered
as Latin characters (how to count them is not a simple
task), and if my knowledge is correct, they cover
and/or are here to cover 17 languages, to be exact,
the 17 European languages based on a Latin alphabet which
cannot be covered with iso-8859-1.

And of course, logically, they are very, very badly handled
with the Flexible String Representation.

jmf



Please provide us with evidence to back up your statement.

--
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-28 Thread wxjmfauth
On Sunday, 27 October 2013 04:21:46 UTC+1, Nobody wrote:
> Simply ignoring diactrics won't get you very far.

Right. As an example, these four French words :
cote, côte, coté, côté .

> Most languages which use diactrics have standard conversions, e.g.
> ö -> oe, which are likely to be used by anyone familiar with the
> language e.g. when using software (or a keyboard) which can't handle
> diactrics.

I'm quite comfortable with Unicode, esp. with the
Latin blocks.
Except for this German case (I remember very old typewriters),
what are the other languages that allow this kind of
conversion?

Just as a reminder: there are 1272 characters considered
as Latin characters (how to count them is not a simple
task), and if my knowledge is correct, they cover
and/or are here to cover 17 languages, to be exact,
the 17 European languages based on a Latin alphabet which
cannot be covered with iso-8859-1.

And of course, logically, they are very, very badly handled
with the Flexible String Representation.

jmf

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-27 Thread Mark Lawrence

On 27/10/2013 01:11, Roy Smith wrote:

In article ,
  Dennis Lee Bieber  wrote:


Compared to Baudot, both ASCII and EBCDIC were probably considered
wondrous.


Wonderous, indeed.  Why would anybody ever need more than one case of
the alphabet?  It's almost as absurd as somebody wanting to put funny
little marks on top of their vowels.



True indeed but it gets worse.  For example those silly Spanish speaking 
types consider this ñ a letter in its own right and not a funny little 
mark on top of a consonant :)


--
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread Nobody
On Sat, 26 Oct 2013 20:41:58 -0500, Tim Chase wrote:

> I'd be just as happy if Python provided a "sloppy string compare"
> that ignored case, diacritical marks, and the like.

Simply ignoring diacritics won't get you very far.

Most languages which use diacritics have standard conversions, e.g.
ö -> oe, which are likely to be used by anyone familiar with the
language e.g. when using software (or a keyboard) which can't handle
diacritics.

OTOH, others (particularly native English speakers) may simply discard the
diacritic. So to be of much use, a fuzzy match needs to handle either
possibility.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread Chris Angelico
On Sun, Oct 27, 2013 at 1:05 PM, Steven D'Aprano
 wrote:
> On Sat, 26 Oct 2013 21:11:55 -0400, Roy Smith wrote:
>
>> In article ,
>>  Dennis Lee Bieber  wrote:
>>
>>> Compared to Baudot, both ASCII and EBCDIC were probably considered
>>> wondrous.
>>
>> Wonderous, indeed.  Why would anybody ever need more than one case of
>> the alphabet?  It's almost as absurd as somebody wanting to put funny
>> little marks on top of their vowels.
>
> Vwls? Wh wst tm wrtng dwn th vwls?

There's really no reason to; you can always provide them by
their entities!

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread Tim Chase
On 2013-10-26 21:54, Roy Smith wrote:
> In article ,
>  Tim Chase  wrote:
>> I'd be just as happy if Python provided a "sloppy string compare"
>> that ignored case, diacritical marks, and the like.
> 
> The problem with putting fuzzy matching in the core language is
> that there is no general agreement on how it's supposed to work.
> 
> There are, however, third-party libraries which do fuzzy matching.
> One popular one is jellyfish
> (https://pypi.python.org/pypi/jellyfish/0.1.2).

Bookmarking and archiving your email for future reference.

> Don't expect you can just download and use it right out of the box,
> however. You'll need to do a little thinking about which of the
> several algorithms it includes makes sense for your application.

I'd be content with a baseline that denormalizes and then strips out
combining diacritical marks, something akin to MRAB's

  from unicodedata import normalize
  "".join(c for c in normalize("NFKD", s) if ord(c) < 0x80)

and tweaking it if that was insufficient.

Thanks for the link to Jellyfish.

-tkc



-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread Steven D'Aprano
On Sat, 26 Oct 2013 21:11:55 -0400, Roy Smith wrote:

> In article ,
>  Dennis Lee Bieber  wrote:
> 
>> Compared to Baudot, both ASCII and EBCDIC were probably considered
>> wondrous.
> 
> Wonderous, indeed.  Why would anybody ever need more than one case of
> the alphabet?  It's almost as absurd as somebody wanting to put funny
> little marks on top of their vowels.

Vwls? Wh wst tm wrtng dwn th vwls?



-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread Roy Smith
In article ,
 Tim Chase  wrote:

> I'd be just as happy if Python provided a "sloppy string compare"
> that ignored case, diacritical marks, and the like.

The problem with putting fuzzy matching in the core language is that 
there is no general agreement on how it's supposed to work.

There are, however, third-party libraries which do fuzzy matching.  One 
popular one is jellyfish (https://pypi.python.org/pypi/jellyfish/0.1.2).  
Don't expect you can just download and use it right out of the box, 
however. You'll need to do a little thinking about which of the several 
algorithms it includes makes sense for your application.
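A taste of what's in there (function names per the jellyfish docs;
double-check them against the version you actually install):

import jellyfish

print(jellyfish.levenshtein_distance("naive", "niave"))   # edit distance
print(jellyfish.jaro_distance("naive", "niave"))          # similarity in [0, 1]
print(jellyfish.soundex("naive"))                         # phonetic key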

So, for example, you probably expect U+004E (Latin Capital Letter N) to 
match U+006E (Latin Small Letter N).  But, what about these (all cribbed 
from Wikipedia):

U+00D1  Ñ  Latin Capital Letter N with tilde
U+00F1  ñ  Latin Small Letter N with tilde
U+0143  Ń  Latin Capital Letter N with acute
U+0144  ń  Latin Small Letter N with acute
U+0145  Ņ  Latin Capital Letter N with cedilla
U+0146  ņ  Latin Small Letter N with cedilla
U+0147  Ň  Latin Capital Letter N with caron
U+0148  ň  Latin Small Letter N with caron
U+0149  ŉ  Latin Small Letter N preceded by apostrophe
U+014A  Ŋ  Latin Capital Letter Eng
U+014B  ŋ  Latin Small Letter Eng
U+019D  Ɲ  Latin Capital Letter N with left hook
U+019E  ƞ  Latin Small Letter N with long right leg
U+01CA  Ǌ  Latin Capital Letter NJ
U+01CB  ǋ  Latin Capital Letter N with Small Letter J
U+01CC  ǌ  Latin Small Letter NJ
U+0235  ȵ  Latin Small Letter N with curl

I can't even begin to guess if they should match for your application.
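For what it's worth, the NFKD normalize-and-strip trick mentioned
elsewhere in the thread only folds some of them (a quick, purely
illustrative check):

import unicodedata

def ascii_fold(s):
    return "".join(c for c in unicodedata.normalize("NFKD", s)
                   if ord(c) < 0x80)

for ch in "ÑñŃńŅņŇňŉŊŋǊǋǌ":
    print("U+%04X %s -> %r" % (ord(ch), ch, ascii_fold(ch)))

# The tilde/acute/cedilla/caron forms fold to plain N/n, ŉ keeps only its
# trailing n, Ǌ/ǋ/ǌ become "NJ"/"Nj"/"nj", while Ŋ and ŋ have no
# decomposition at all and simply vanish.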
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread Tim Chase
On 2013-10-26 22:24, Steven D'Aprano wrote:
> Why on earth would you want to throw away perfectly good
> information? 

The main reason I've needed to do it in the past is for normalization
of search queries.  When a user wants to find something containing
"pingüino", I want to have those results come back even if they type
"pinguino" in the search box.

For the same reason searches are often normalized to ignore case.
The difference between "Polish" and "polish" is visually just
capitalization, but most folks don't think twice about

  if term.upper() in datum.upper():
      it_matches()

I'd be just as happy if Python provided a "sloppy string compare"
that ignored case, diacritical marks, and the like.

  unicode_haystack1 = u"pingüino"
  unicode_haystack2 = u"¡Miré un pingüino!"
  needle = u"pinguino"
  if unicode_haystack1.sloppy_equals(needle):
      it_matches()
  if unicode_haystack2.sloppy_contains(needle):
      it_contains()

As a matter of fact, I'd even be happier if Python did the heavy
lifting, since I wouldn't have to think about whether I want my code
to force upper-vs-lower for the comparison. :-)
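In the meantime, a rough stand-in for the hypothetical sloppy_contains
(Python 3, casefold plus NFKD-strip; sloppy_contains is the wished-for
name above, not a real str method):

import unicodedata

def _fold(s):
    # Case-fold, decompose, then drop the combining marks.
    s = unicodedata.normalize("NFKD", s.casefold())
    return "".join(c for c in s if not unicodedata.combining(c))

def sloppy_contains(haystack, needle):
    return _fold(needle) in _fold(haystack)

print(sloppy_contains("¡Miré un pingüino!", "pinguino"))   # True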

-tkc






-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread Roy Smith
In article ,
 Dennis Lee Bieber  wrote:

> Compared to Baudot, both ASCII and EBCDIC were probably considered
> wondrous.

Wonderous, indeed.  Why would anybody ever need more than one case of 
the alphabet?  It's almost as absurd as somebody wanting to put funny 
little marks on top of their vowels.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread Steven D'Aprano
On Sat, 26 Oct 2013 16:11:25 -0400, bruce wrote:

> hi..
> 
> getting some files via curl, and want to convert them from what i'm
> guessing to be unicode.
> 
> I'd like to convert a string like this::  <a href="ShowRatings.jsp?tid=1312168">Alcántar, Iliana
> 
> to::
> Alcantar,
> Iliana
> 
> where I convert the
> " á " to " a"

Why on earth would you want to throw away perfectly good information? 
It's 2013, not 1953, and if you're still unable to cope with languages 
other than English, you need to learn new skills.

(Actually, not even English, since ASCII doesn't even support all the 
characters used in American English, let alone British English. ASCII was 
broken from the day it was invented.)

Start by getting some understanding:

http://www.joelonsoftware.com/articles/Unicode.html


Then read this post from just over a week ago:

https://mail.python.org/pipermail/python-list/2013-October/657827.html



-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread MRAB

On 26/10/2013 21:11, bruce wrote:

hi..

getting some files via curl, and want to convert them from what i'm
guessing to be unicode.

I'd like to convert a string like this::
Alcántar,
Iliana

to::
Alcantar,
Iliana

where I convert the
" á " to " a"

which appears to be a shift of 128, but I'm not sure how to accomplish this..

I've tested using the different decode/encode functions using
utf-8/ascii with no luck.

I've reviewed stack overflow, as well as a few other sites, but
haven't hit the aha moment.

pointers/comments would be welcome.


Why do you want to do that?

The short answer is that you should accept that these days you should
be using Unicode, not ASCII.

The longer answer is that you could normalise the Unicode codepoints to
the NFKD form and then discard any codepoints outside the ASCII range:


>>> import unicodedata
>>> t = unicodedata.normalize("NFKD", "Alcántar")
>>> "".join(c for c in t if ord(c) < 0x80)
'Alcantar'

The disadvantage, of course, is that it'll throw away a whole lot of
codepoints that can't be 'converted'.

Have a look at Unidecode:

http://pypi.python.org/pypi/Unidecode

--
https://mail.python.org/mailman/listinfo/python-list


trying to strip out non ascii.. or rather convert non ascii

2013-10-26 Thread bruce
hi..

getting some files via curl, and want to convert them from what i'm
guessing to be unicode.

I'd like to convert a string like this::
Alcántar,
Iliana

to::
Alcantar,
Iliana

where I convert the
" á " to " a"

which appears to be a shift of 128, but I'm not sure how to accomplish this..

I've tested using the different decode/encode functions using
utf-8/ascii with no luck.

I've reviewed stack overflow, as well as a few other sites, but
haven't hit the aha moment.

pointers/comments would be welcome.

thanks
-- 
https://mail.python.org/mailman/listinfo/python-list