Yes, it is too bad that the table is hard-coded. I don't know of a quick way of doing this with locale support. But, at the risk of being presumptuous, I would expect that similar functionality could be provided with the help of a configuration parameter. I am sure a syntax could be devised that would let a user override or augment whatever is in the default table. After all, even if the locale did provide some way of accomplishing this, a particular user's preference might require that it be done in a custom way.
But in any case, I will dig around the source and redo it to my liking.
Greg
Gilles Detillieux wrote:
According to Gregory Szeszko:

I set up htdig to index a primarily Polish (ISO-8859-2) web site. Things appear to be working well for the most part. I can search for words/phrases as long as I type in the search keywords with the accented characters. But if I replace the Polish characters with their ASCII "equivalents" then the search comes up empty, even though my rundig script runs the "htfuzzy accents" command. My understanding of htfuzzy accents is that it is supposed to enter into htdig's database words with accented characters replaced by the unaccented equivalents. But it would appear that it doesn't happen exactly like this.

To try to debug the problem I ran "htfuzzy -vvv accents". This spits out a long list of word pairs. Each pair appears to contain an "unaccented word" along with the original word. But after glancing at that list it appears to me that not all the original accented words are in there. That is, I know of accented words on the site's pages that are not displayed in the list. I am certain that ALL of the pages are dug through, because I specify every single one of them in the start_url (to avoid the fact that htdig doesn't follow JavaScript-linked pages). So how come I don't see all of the accented words in that list? Am I overlooking something?

It doesn't mention this in the documentation (yet), but the accents algorithm currently only supports the ISO-8859-1 (Latin 1) character set. The conversion from accented to unaccented characters is hard-coded in the table "MinusculeISOLAT1" in htfuzzy/Accents.cc. The only way to configure this for ISO-8859-2 or other character sets right now is to edit this table for the specific character set you need, and recompile. If someone can suggest a better way of doing this, using the locale information, it would be a big help.
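To make the idea of a default table plus user overrides concrete, here is a minimal sketch in Python. It is not htdig code and does not reflect how "MinusculeISOLAT1" is actually laid out in htfuzzy/Accents.cc. The names DEFAULT_FOLD, parse_overrides, build_table and unaccent, and the "accented:plain" override syntax, are all made up for illustration. It assumes each accented letter folds to exactly one plain letter and that every character involved is a single byte in the chosen character set.

```python
# Sketch only: not htdig code. The override syntax is hypothetical.

# Default folding for the Polish letters of ISO-8859-2 (lower case only;
# upper-case entries are derived in build_table).
DEFAULT_FOLD = {
    "ą": "a", "ć": "c", "ę": "e", "ł": "l", "ń": "n",
    "ó": "o", "ś": "s", "ź": "z", "ż": "z",
}


def parse_overrides(spec):
    """Parse a hypothetical config value such as 'ó:u ł:w' into a dict."""
    overrides = {}
    for pair in spec.split():
        accented, _, plain = pair.partition(":")
        if len(accented) != 1 or len(plain) != 1:
            raise ValueError(f"bad override entry: {pair!r}")
        overrides[accented] = plain
    return overrides


def build_table(overrides="", charset="iso-8859-2"):
    """Return a 256-entry byte translation table for the given charset."""
    mapping = dict(DEFAULT_FOLD)
    mapping.update(parse_overrides(overrides))  # user entries win
    # Derive upper-case entries unless the user supplied them explicitly.
    for accented, plain in list(mapping.items()):
        mapping.setdefault(accented.upper(), plain.upper())
    src = "".join(mapping.keys()).encode(charset)
    dst = "".join(mapping.values()).encode(charset)
    return bytes.maketrans(src, dst)


def unaccent(word, table):
    """Fold the accented bytes of an encoded word to their plain forms."""
    return word.translate(table)


if __name__ == "__main__":
    table = build_table()
    print(unaccent("żółć".encode("iso-8859-2"), table))  # b'zolc'
```

Passing an override string such as "ó:u" to build_table replaces the default entry for that letter, which is the kind of per-user customization a locale-based approach alone would not cover.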

