At 14:56 29/10/99 -0500, you wrote:

 >Yikes!  I have a hard time believing that your patch_accents program would
 >not start clobbering all sorts of data in db.docdb that it shouldn't.
 >I'm assuming the whole point of this is to strip out the accents from
 >the document excerpts, so that excerpt highlighting works for unaccented
 >search words.

 >If so, why not just strip out the accents on the fly in
 >htsearch/Display.cc, before doing any searches on the excerpt, or
 >better yet, just poke in some entries in the translate table, set in
 >StringMatch::IgnoreCase() (in htlib/StringMatch.cc), to map accented
 >letters to equivalent lower-case unaccented letters?  The letter mapping
 >in String.cc could also be done much more efficiently with a mapping
 >table.

 >The best approach, though, would be to define a new "accent" fuzzy match
 >algorithm, which, when given a word, would search the word database
 >for all accented and unaccented equivalents.  The main engine of this
 >would be very much like the current htfuzzy/Substring.cc algorithm.
 >It would be more work, but you'd have something that would be selectable
 >by the search_algorithm config attribute, and would fit in well with
 >the existing code.

Gilles,

I agree with all of your remarks.

I was also amazed that my patch_accent
did not totally corrupt the db file ;)

It looks like the character codes it modifies are not used as separators
or attributes. Please note that I took care to modify only bytes that were
inside an ASCII string :)

In fact, I wrote this patch only to suit my own purposes.

I made it public because, after searching for "accents french"
on the htdig site, I found a huge number of people trying
to find a solution ....

Don't get me wrong: this patch is not an academic one,
it is a dirty and straightforward one (as I said on my page).

In my view, a *good* patch would be something like a conf file, let's
call it transcode.conf, which would contain character equivalences.
This file would be used by htsearch and htfuzzy.
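
To make the idea concrete, here is a rough sketch of how such a file could
be read into a 256-entry mapping table. Everything in it is an assumption
for illustration only: the file format (two hexadecimal ISO-8859-1 byte
values per line, "from to", with '#' comments) and the names
TranscodeTable, load_transcode and transcode are hypothetical, and none of
it is existing htsearch or htfuzzy code.

```cpp
#include <array>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical transcode.conf format (an assumption, not an ht://Dig format):
// one mapping per line, as two hexadecimal ISO-8859-1 byte values "from to".
// Anything after '#' is a comment.
using TranscodeTable = std::array<unsigned char, 256>;

// Build a 256-entry table; bytes without a mapping translate to themselves.
TranscodeTable load_transcode(std::istream& in) {
    TranscodeTable table;
    for (int i = 0; i < 256; ++i) table[i] = static_cast<unsigned char>(i);

    std::string line;
    while (std::getline(in, line)) {
        // Strip a trailing comment, if any.
        std::string::size_type hash = line.find('#');
        if (hash != std::string::npos) line.erase(hash);

        // Lines that do not hold two valid byte values are silently skipped.
        std::istringstream fields(line);
        unsigned int from = 0, to = 0;
        if (fields >> std::hex >> from >> to && from < 256 && to < 256)
            table[from] = static_cast<unsigned char>(to);
    }
    return table;
}

// Return a copy of text with every byte passed through the table.
std::string transcode(const TranscodeTable& table, const std::string& text) {
    std::string out(text);
    for (char& c : out)
        c = static_cast<char>(table[static_cast<unsigned char>(c)]);
    return out;
}
```

With a file containing the lines "E9 65" and "E8 65", both e-acute and
e-grave would be mapped to a plain "e". The same table could be applied to
the search words and to the excerpt text, so the original data would never
need to be modified.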

Best regards,

Salim


***********************************************
Salim Gasmi <http://www.gasmi.net>
System and network administrator.
SdV Plurimedia <http://www.sdv.fr>

PGP Key: http://www.gasmi.net/pgp.txt
***********************************************
