According to Gregory Szeszko:
> I set up htdig to index a primarily Polish (ISO-8859-2) web site. Things
> appear to be working well for the most part. I can search for
> words/phrases as long as I type in the search keywords with the accented
> characters. But if I replace the Polish characters with their ASCII
> "equivalents" then the search comes up empty even though my rundig
> script runs the "htfuzzy accents" command. My understanding of htfuzzy
> accents is that it is supposed to enter into htdig's database words with
> accented characters replaced by the unaccented equivalents. But it
> would appear that it doesn't happen exactly like this.
>
> To try to debug the problem I ran "htfuzzy -vvv accents". This spits
> out a long list of word pairs. Each pair appears to contain an
> "unaccented word" along with the original word. But after glancing at
> that list it appears to me that not all the original accented words are
> in there. That is, I know of accented words on the site's pages that
> are not displayed in the list. I am certain that ALL of the pages are
> digged through, because I specify every single one of them in the
> start_url (to avoid the fact that htdig doesn't follow JavaScript linked
> pages). So how come I don't see all of the accent words in that list?
> Am I overlooking something?

The documentation doesn't mention this (yet), but the accents algorithm
currently supports only the ISO-8859-1 (Latin 1) character set. The
conversion from accented to unaccented characters is hard-coded in the
table "MinusculeISOLAT1" in htfuzzy/Accents.cc. Right now, the only way to
configure this for ISO-8859-2 or other character sets is to edit this table
for the specific character set you need, and recompile.

If someone can suggest a better way of doing this, using the locale
information, it would be a big help.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
htdig-general mailing list
FAQ: http://htdig.sourceforge.net/FAQ.html
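
To illustrate the kind of per-character-set table the reply describes, here is a minimal sketch in Python of a 256-entry folding table for the Polish letters of ISO-8859-2. It is not the table from htfuzzy/Accents.cc, whose exact layout is not shown in this thread; the names POLISH_FOLD, build_table, and unaccent are hypothetical, and the choice to preserve letter case is an assumption made for the example.

```python
# Hypothetical sketch: fold Polish ISO-8859-2 letters to ASCII with a
# 256-entry byte table. Not the actual table from htfuzzy/Accents.cc.

# ISO-8859-2 byte value -> ASCII replacement (case is preserved here,
# which is an assumption for this example).
POLISH_FOLD = {
    0xA1: "A", 0xB1: "a",  # A/a with ogonek
    0xC6: "C", 0xE6: "c",  # C/c with acute
    0xCA: "E", 0xEA: "e",  # E/e with ogonek
    0xA3: "L", 0xB3: "l",  # L/l with stroke
    0xD1: "N", 0xF1: "n",  # N/n with acute
    0xD3: "O", 0xF3: "o",  # O/o with acute
    0xA6: "S", 0xB6: "s",  # S/s with acute
    0xAC: "Z", 0xBC: "z",  # Z/z with acute
    0xAF: "Z", 0xBF: "z",  # Z/z with dot above
}


def build_table(fold):
    """Return a 256-byte table: identity except for the folded entries."""
    table = bytearray(range(256))
    for code, plain in fold.items():
        table[code] = ord(plain)
    return bytes(table)


TABLE = build_table(POLISH_FOLD)


def unaccent(word: bytes) -> bytes:
    """Replace each accented byte of an ISO-8859-2 word via the table."""
    return word.translate(TABLE)


if __name__ == "__main__":
    original = "Łódź".encode("iso-8859-2")
    print(unaccent(original).decode("ascii"))
```

A table of this shape would have to be written separately for each character set, which is the limitation the reply points out; deriving it from locale information remains the open question.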

