John Wilcock writes:
> Justin Mason wrote:
> > John GALLET writes:
> >> Well, thanks for writing it. I think its main weak point for French and
> >> other accented languages is handling the different encodings of the same
> >> accented character, some kind of "synonyms" list. The same letter, say "a
> >> with an accent", can be misspelled as a plain "a", encoded in various
> >> charsets (Latin, UTF-8) as a "normal" à, or HTML-encoded as agrave (I left
> >> the & and ; out). I do not know if it is possible at all; it might
> >> complicate things *a lot*.
> > 
> > The tool can take care of this -- it will replace single characters that
> > mutate with a /./.  It also supports /.?/, /.{0,3}/, /.{0,10}/ and a few
> > other "any" patterns.
> 
> If the number of permutations is small (as would be the case for
> accented letters and the equivalent unaccented ones, or for that matter
> obfuscation with lookalike characters), wouldn't it be better for it to
> replace the character with a [] list of those permutations (i.e. replace
> something that mutates between e and é with [eé], or replace obfuscation
> of i with l and 1 by [il1])?

It would be, but fixing the pattern-discovery algorithm to discover this
in a relatively speedy way is not so easy.  Patches accepted ;)
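
To make the suggestion concrete, here is a minimal sketch in Python of the
per-position step only. It is not the tool's code: the function name
generalize and the max_class cutoff are made up for illustration, and it
assumes the variants have already been aligned to the same length, which is
presumably the slow part.

```python
import re

def generalize(variants, max_class=4):
    """Build a regex from equal-length variants of one phrase.

    Hypothetical helper, not part of the tool discussed in this thread.
    Positions where all variants agree stay literal, positions with a
    few distinct characters become a [] class, and anything wider falls
    back to '.'.
    """
    if len({len(v) for v in variants}) != 1:
        raise ValueError("variants must be pre-aligned to the same length")
    parts = []
    for chars in zip(*variants):
        seen = sorted(set(chars))
        if len(seen) == 1:
            # Every variant has the same character here: keep it literal.
            parts.append(re.escape(seen[0]))
        elif len(seen) <= max_class:
            # Small set of alternatives: emit a character class.
            parts.append("[" + "".join(re.escape(c) for c in seen) + "]")
        else:
            # Too many alternatives: fall back to "any character".
            parts.append(".")
    return "".join(parts)

pattern = generalize(["viagra", "v1agra", "vlagra"])
print(pattern)
```

For those three inputs this prints v[1il]agra, which no longer matches
"vXagra" the way v.agra would. Insertions and deletions (the /.?/ and
/.{0,3}/ cases) are not handled here at all.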
