John Wilcock writes:
> Justin Mason wrote:
> > John GALLET writes:
> > > Well, thanks for writing it. I think its main weak point for French and
> > > other accented languages is handling the different encodings for a same
> > > char with an accent, some kind of "synonyms" list. The same letter, say "a
> > > with an accent", can be misspelled with a plain "a", encoded in various
> > > charsets (latin, utf-8) to a "normal" à, or html encoded agrave (I left &
> > > and ; out). I do not know if it is possible at all, it might complicate
> > > things *a lot*.
> >
> > The tool can take care of this -- it will replace mutating single-characters
> > with a /./. It also supports /.?/, /.{0,3}/, /.{0,10}/ and a few other
> > "any" patterns.
>
> If the number of permutations is small (as would be the case for
> accented letters and the equivalent unaccented ones, or for that matter
> obfuscation with lookalike characters), wouldn't it be better for it to
> replace the character by a [] list of those permutations (i.e. replace
> something that mutates between e and é with [eé] or replace obfuscation
> of i with l and 1 by [il1])?

It would be, but fixing the pattern-discovery algorithm to discover this in a
relatively speedy way is not so easy. Patches accepted ;)
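
To make the [] suggestion concrete, here is a minimal sketch in Python. It is not the tool's actual algorithm: the names `generalize` and `max_class_size` are hypothetical, and it assumes the sample strings are already aligned and of equal length, which sidesteps the hard part (discovering the alignment quickly). Positions that mutate between a few characters become a character class; positions that mutate more widely fall back to /./. It also assumes Python 3.7 or later, where `re.escape` leaves non-ASCII letters such as é untouched.

```python
import re

def generalize(variants, max_class_size=4):
    """Build one regex from equal-length, pre-aligned variants (illustrative only)."""
    if len({len(v) for v in variants}) != 1:
        raise ValueError("variants must be non-empty and of equal length")
    parts = []
    for column in zip(*variants):
        chars = sorted(set(column))
        if len(chars) == 1:
            # Character never mutates: keep it as a literal.
            parts.append(re.escape(chars[0]))
        elif len(chars) <= max_class_size:
            # Few permutations: list them in a [] class.
            parts.append("[" + "".join(re.escape(c) for c in chars) + "]")
        else:
            # Too many permutations: fall back to "any character".
            parts.append(".")
    return "".join(parts)

print(generalize(["café", "cafe"]))                # caf[eé]
print(generalize(["viagra", "v1agra", "vlagra"]))  # v[1il]agra
```

The `max_class_size` cutoff is an arbitrary choice for the sketch; a class stays tighter than /./ and so should match fewer unrelated strings, at the cost of missing variants that were not in the samples.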