Re: trying to strip out non ascii.. or rather convert non ascii

Ned Batchelder Wed, 30 Oct 2013 10:12:47 -0700

On 10/30/13 12:08 PM, [email protected] wrote:

On Wednesday, 30 October 2013 13:44:47 UTC+1, Ned Batchelder wrote:

On 10/30/13 4:49 AM, [email protected] wrote:

On Tuesday, 29 October 2013 06:24:50 UTC+1, Steven D'Aprano wrote:

On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:

On 2013-10-28 07:01, [email protected] wrote:

Simply ignoring diacritics won't get you very far.

Right. As an example, these four French words: cote, côte, coté, côté.

Distinct words with distinct meanings, sure.
But when a naïve (naive? ☺) person or one without the easy ability to
enter characters with diacritics searches for "cote", I want to return
possible matches containing any of your 4 examples.  It's slightly
fuzzier if they search for "coté", in which case they may mean "coté" or
they might be unable to figure out how to add a hat and want to
type "côté". Though I'd rather get more results, even if it has some
that only match fuzzily.
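
A minimal sketch of the kind of diacritic-insensitive lookup described above, assuming plain Python and the standard `unicodedata` module. The helper names `fold_diacritics` and `search` are hypothetical and do not come from the thread; this is one possible approach, not necessarily what Tim's system does.

```python
import unicodedata

def fold_diacritics(text):
    """Hypothetical helper: decompose the text and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def search(query, candidates):
    """Hypothetical helper: return candidates equal to the query once
    diacritics and case are ignored on both sides."""
    key = fold_diacritics(query).casefold()
    return [w for w in candidates if fold_diacritics(w).casefold() == key]

words = ["cote", "côte", "coté", "côté"]

# Both an unaccented and an accented query return all four words.
print(search("cote", words))
print(search("coté", words))
```

Folding both the query and the stored words keeps the original spellings intact in the results, so the four words remain distinct in the data even though they match the same query.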

The right solution to that is to treat it no differently from other fuzzy
searches. A good search engine should be tolerant of spelling errors and
alternative spellings for any letter, not just those with diacritics.
Ideally, a good search engine would successfully match all three of
"naïve", "naive" and "niave", and it shouldn't rely on special handling
of diacritics.
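
To illustrate the point about generic fuzzy matching, here is a rough sketch that uses `difflib.get_close_matches` from the standard library as a stand-in for a real search engine. The vocabulary list and the 0.6 cutoff are assumptions chosen for the example; nothing in it treats diacritics specially.

```python
import difflib

# Hypothetical vocabulary, for illustration only.
vocabulary = ["naïve", "naive", "cote"]

def fuzzy_lookup(query, words, cutoff=0.6):
    """Return the words whose similarity ratio to the query is at least cutoff."""
    return difflib.get_close_matches(query, words, n=len(words), cutoff=cutoff)

# The misspelling is close enough to both spellings, with or without the diaeresis.
matches = fuzzy_lookup("niave", vocabulary)
print(matches)
```

Here the accented and unaccented spellings are matched by the same similarity measure that tolerates the transposed letters, which is the behaviour argued for above.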

------
This is nonsense. The purpose of a diacritical mark is to
make a letter a different letter. If a tool is supposed to
match an ô, there is absolutely no reason to match something
else.
jmf



jmf, Tim Chase described his use case, and it seems reasonable to me.

I'm not sure why you would describe it as nonsense.



--Ned.

--------

My comment had nothing to do with Python; it was a
general comment. A diacritical mark just makes a letter
a different letter; an "ï" and an "i" are as
different as an "a" and a "z". A diacritical mark
is more than simple ornamentation.

Yes, we understand that. Tim outlined a need that had to do with users' informal typing. In his case, he needs to deal with that sloppiness. You can't simply insist that users be more precise.

Unicode is a way to represent text, and text gets used in many different ways. Each of us has to acknowledge that our text needs may be different than someone else's. jmf, I'm guessing from your comments over the last few months that you are doing detailed linguistic work with corpora in many languages. That work leads to one style of Unicode use. In your domain, it is "nonsense" to ignore diacriticals.

Other people do different kinds of work with Unicode, and that leads to different needs. In Tim's system, it is important to ignore diacriticals. You might not have a use personally for Tim's system. That doesn't make it nonsense.


--Ned.

From a Unicode perspective:
Unicode.org "knows" these characters are very important; that's
the reason they exist in two forms, precomposed and
decomposed.
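
As a small illustration of the two forms mentioned above, the following sketch uses Python's standard `unicodedata` module to show the same letter "ô" as a single precomposed code point and as a base letter followed by a combining mark. The variable names are chosen for the example only.

```python
import unicodedata

precomposed = "\u00f4"   # LATIN SMALL LETTER O WITH CIRCUMFLEX, one code point
decomposed = "o\u0302"   # "o" followed by COMBINING CIRCUMFLEX ACCENT

# The two strings render the same but are not equal code point by code point.
print(precomposed == decomposed)

# Normalization converts between the two forms.
as_nfc = unicodedata.normalize("NFC", decomposed)   # composes into one code point
as_nfd = unicodedata.normalize("NFD", precomposed)  # decomposes into two code points
print(len(as_nfc), len(as_nfd))
```

Normalizing both sides to the same form before comparing is the usual way to make such strings compare equal.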

From a software perspective:
Luckily for end users, all serious software
treats these characters equally. They
all belong to the BMP. An "Ą" is treated
like an "ê": same memory consumption, same performance,
==> very smooth software.

jmf


--
https://mail.python.org/mailman/listinfo/python-list
