Re: [htdig] word list in htfuzzy accents verbose mode

Gregory Szeszko Thu, 14 Nov 2002 13:29:25 -0800

Thanks for a quick response.
Yes, it is too bad that the table is hard-coded. I don't know of a quick way of doing this with locale support. But, at the risk of being presumptuous, I would expect that similar functionality could be provided with the help of a configuration parameter. I am sure a syntax could be devised that would let a user override or augment whatever is in the default table. After all, even if locale did provide some way of accomplishing this, a particular user might still want it done in a custom way.
But in any case, I will dig around the source and redo it to my liking.
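
A rough sketch of what such an override syntax could look like, written in Python for brevity. Everything here is hypothetical: the "accented:plain" pair syntax, the function names, and the stand-in default table are illustrative assumptions, not an existing htdig configuration attribute or the contents of the real table.

```python
# Stand-in for a built-in default table (NOT htdig's actual table).
DEFAULT_FOLD = {"é": "e", "ü": "u"}


def parse_overrides(spec):
    """Parse a whitespace-separated list of 'accented:plain' pairs.

    The syntax is an assumption made for this sketch, e.g. "ł:l ü:ue".
    """
    overrides = {}
    for pair in spec.split():
        accented, sep, plain = pair.partition(":")
        # Require exactly one source character and a non-empty replacement.
        if not sep or len(accented) != 1 or not plain:
            raise ValueError(f"bad mapping: {pair!r}")
        overrides[accented] = plain
    return overrides


def effective_fold(default, spec):
    """Return the default table with user entries added or replaced."""
    merged = dict(default)
    merged.update(parse_overrides(spec))
    return merged
```

With these assumptions, effective_fold(DEFAULT_FOLD, "ł:l ü:ue") keeps the default entry for "é", replaces the one for "ü", and adds one for "ł".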

Greg

Gilles Detillieux wrote:

According to Gregory Szeszko:

I set up htdig to index a primarily Polish (ISO-8859-2) web site.  Things 
appear to be working well for the most part.  I can search for 
words/phrases as long as I type in the search keywords with the accented 
characters.  But if I replace the Polish characters with their ASCII 
"equivalents" then the search comes up empty even though my rundig 
script runs the "htfuzzy accents" command.  My understanding of htfuzzy 
accents is that it is supposed to enter into htdig's database words with 
accented characters replaced by the unaccented equivalents.  But it 
would appear that it doesn't happen exactly like this.


To try to debug the problem I ran "htfuzzy -vvv accents".  This spits 
out a long list of word pairs.  Each pair appears to contain an 
"unaccented word" along with the original word.  But after glancing at 
that list it appears to me that not all the original accented words are 
in there.  That is, I know of accented words on the site's pages that 
are not displayed in the list.  I am certain that ALL of the pages are 
dug through, because I specify every single one of them in the 
start_url (to avoid the fact that htdig doesn't follow JavaScript linked 
pages).  So how come I don't see all of the accented words in that list? 
Am I overlooking something?


It doesn't mention this in the documentation (yet), but the accents
algorithm currently only supports the iso-8859-1 (Latin 1) character set.
The conversion from accented to unaccented characters is hard-coded in
the table "MinusculeISOLAT1" in htfuzzy/Accents.cc.  The only way to
configure this for ISO-8859-2 or other character sets right now is to
edit this table for the specific character set you need, and recompile.
If someone can suggest a better way of doing this, using the locale
information, it would be a big help.
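
To illustrate the kind of per-character-set table described above, here is a minimal sketch in Python of a 256-entry byte translation table for the Polish letters in ISO-8859-2. It is not the contents of "MinusculeISOLAT1" and not htfuzzy's actual code; the names POLISH_FOLD, build_table and unaccent are hypothetical, and the choice of unaccented equivalents is an assumption.

```python
# Hypothetical fold table for Polish letters (lowercase forms only;
# uppercase forms are derived below). Not htdig's actual table.
POLISH_FOLD = {
    "ą": "a", "ć": "c", "ę": "e", "ł": "l", "ń": "n",
    "ó": "o", "ś": "s", "ź": "z", "ż": "z",
}


def build_table(fold, encoding="iso-8859-2"):
    """Build a 256-byte table mapping each byte to its unaccented byte."""
    table = bytearray(range(256))  # identity mapping by default
    for accented, plain in fold.items():
        pairs = ((accented, plain), (accented.upper(), plain.upper()))
        for src, dst in pairs:
            # Each of these letters is a single byte in ISO-8859-2.
            table[src.encode(encoding)[0]] = dst.encode("ascii")[0]
    return bytes(table)


def unaccent(word, table):
    """Replace accented bytes in an ISO-8859-2 encoded word."""
    return word.translate(table)


TABLE = build_table(POLISH_FOLD)
print(unaccent("łódź".encode("iso-8859-2"), TABLE))  # b'lodz'
```

A table of this shape is indexed by byte value, so it only works for single-byte character sets, and a different table would be needed for each one.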
