Re: [sqlite] Advice needed for fuzzy search

2009-07-02 Thread Jean-Christophe Deschamps
Simon,

At 15:26 02/07/2009, you wrote:
´¯¯¯
>What we need is a new version of Soundex which is written to deal with 
>Unicode instead of ASCII.
`---

Umm, Soundex already fails often with plain English names.  It would 
need a whole lot of native speakers of all those languages around to 
come up with anything usable worldwide, if that is possible at all 
(which I strongly doubt).

But even with that at hand, it would have a hard time solving "common" 
cases where, for instance, someone has a Greek name, a Danish first 
name and lives in China.

´¯¯¯
>   The best known code along those lines is a Perl function
>called unidecode.  Reading about it may help you decide how to proceed:
`---
Thanks for the pointers, I'll have a look at it.


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Advice needed for fuzzy search

2009-07-02 Thread Simon Slavin

On 2 Jul 2009, at 2:01pm, Jean-Christophe Deschamps wrote:

> I need to deal with codepoints that would expand to several individual
> characters.  Examples are ligatures or fractions.  I've never seen
> ligatures used in French, nor in any European language, when it comes
> to user input.  I believe such ligatures are more a typesetting or
> word processing finesse which is beyond most users' care / knowledge.
>
> But if I ever encounter some, how should I deal with them?  If I leave
> them alone, then for instance ligature 'fi' would not compare to the
> letter sequence 'f' 'i'.  If I expand them, then ligature 'fi' would
> get to 'f' 'i' but if the corresponding char in the second string is
> 'g' then it would count for two errors instead of one.

You /do/ need to expand ligatures, especially since some sources will  
already have them expanded.  You may have to consider a distance of 2  
to be near enough for a match.  I assume ... I hope ... you have  
access to a Unicode library that has functions which can do things  
like expand ligatures.
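
As a rough illustration of what such a library call can look like, here 
is a minimal sketch using Python's standard unicodedata module.  The 
helper name expand_compat is made up for this example, and NFKC 
compatibility normalization is only one possible way of expanding 
ligatures and fractions; it is not necessarily what a given extension 
should use.

```python
import unicodedata


def expand_compat(text: str) -> str:
    """Expand compatibility characters (ligatures, vulgar fractions, ...)
    into their plain sequences using NFKC normalization.

    Hypothetical helper for illustration only."""
    return unicodedata.normalize("NFKC", text)


# U+FB01 is the single-codepoint 'fi' ligature.
ligature = "\ufb01"
print(len(ligature), len(expand_compat(ligature)))  # 1 codepoint becomes 2

# U+00BD (one half) expands to '1', U+2044 FRACTION SLASH, '2'.
print([hex(ord(c)) for c in expand_compat("\u00bd")])
```

Note that the fraction expands with U+2044 (fraction slash) rather than 
the ASCII '/', so a further mapping step may still be wanted.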

I've not come across any good standard way of dealing with this  
problem.  You are at the leading edge of technology!  What we need is  
a new version of Soundex which is written to deal with Unicode instead  
of ASCII.  The best known code along those lines is a Perl function  
called unidecode.  Reading about it may help you decide how to proceed:

http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm

http://interglacial.com/~sburke/tpj/as_html/tpj22.html

Simon.


[sqlite] Advice needed for fuzzy search

2009-07-02 Thread Jean-Christophe Deschamps
Hello group,

I'm writing a fuzzy search extension.  The current code is getting a 
little messy and I'm not completely satisfied with the way it works.  So 
I'm about to rewrite it from scratch on stronger foundations.

The goal is to provide a fuzzy search on _short_ fields like names, 
street address, city or similar typical user input that has to be 
consolidated against an existing database.  I'm in no way aiming at 
searching for a fuzzy substring inside a large text.  The current 
underlying algorithm would be terrible for that (I use a 
Damerau-Levenshtein distance for counting typos).
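
For readers unfamiliar with the metric, here is a minimal sketch in 
Python.  It implements the restricted variant (often called optimal 
string alignment), which counts insertions, deletions, substitutions 
and swaps of two adjacent characters; this is an assumption made for 
illustration, not the code of the extension, and the name osa_distance 
is made up.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Illustrative sketch only; works on Python codepoints."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and first j of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # transposition of two adjacent characters
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("smith", "simth"))  # 1: a single adjacent swap
```

The table costs O(len(a) * len(b)) time and memory, which is acceptable 
for short fields but is the reason it scales badly to large texts.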

I need to deal with codepoints that would expand to several individual 
characters.  Examples are ligatures or fractions.  I've never seen 
ligatures used in French, nor in any European language, when it comes 
to user input.  I believe such ligatures are more a typesetting or word 
processing finesse which is beyond most users' care / knowledge.

But if I ever encounter some, how should I deal with them?  If I leave 
them alone, then for instance ligature 'fi' would not compare equal to 
the letter sequence 'f' 'i'.  If I expand them, then ligature 'fi' would 
become 'f' 'i' but if the corresponding char in the second string is 
'g' then it would count as two errors instead of one.
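
To make the trade-off concrete, here is a small worked example in 
Python.  It uses a plain Levenshtein distance (no transpositions) and 
NFKC normalization from the standard unicodedata module as the 
expansion step; both choices and the function names are assumptions for 
illustration only.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def expand(text: str) -> str:
    # Assumed expansion step: compatibility normalization.
    return unicodedata.normalize("NFKC", text)


lig = "\ufb01"  # single-codepoint 'fi' ligature

# Left alone: the ligature never matches the two-letter sequence.
print(levenshtein(lig, "fi"))          # 2
# Expanded: it matches exactly ...
print(levenshtein(expand(lig), "fi"))  # 0
# ... but one wrong character opposite it now costs two edits.
print(levenshtein(lig, "g"))           # 1
print(levenshtein(expand(lig), "g"))   # 2
```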

Things could get worse in more "exotic" (to me) scripts.

The same questions arise for upper, lower, ... functions.
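
A short Python illustration of why case mapping raises the same length 
issue: full Unicode case mappings can change the number of codepoints. 
The helper name caseless_equal is made up, and str.casefold() is used 
only as an example of full case folding, not as a recommendation for 
the extension.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using full Unicode case folding.

    Illustrative helper only."""
    return a.casefold() == b.casefold()


# German sharp s has no single-codepoint uppercase in the default
# mapping, so upper() turns one codepoint into two.
print("\u00df".upper())  # SS

# Case folding maps it to 'ss', so these two spellings compare equal
# even though their lengths differ.
print(caseless_equal("Stra\u00dfe", "STRASSE"))  # True
```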

Again, the goal is certainly not to duplicate ICU with all its 
complexity but to offer a decent code base to achieve mostly correct 
results in as many languages as possible, keeping in mind that the 
memory footprint and the code overhead should be kept reasonable.


Feel free to give advice on how to deal with those issues.

