I set up htdig to index a primarily Polish (ISO-8859-2) web site. Things appear to be working well for the most part: I can search for words and phrases as long as I type the search keywords with their accented characters. But if I replace the Polish characters with their ASCII "equivalents", the search comes up empty, even though my rundig script runs the "htfuzzy accents" command. My understanding is that htfuzzy accents is supposed to add to htdig's database the words containing accented characters, with those characters replaced by their unaccented equivalents. But it appears that this is not exactly what happens.

To try to debug the problem I ran "htfuzzy -vvv accents". This prints a long list of word pairs, each of which appears to contain an unaccented word along with the original word. But after glancing at that list, it appears that not all of the original accented words are in it. That is, I know of accented words on the site's pages that do not show up in the list. I am certain that ALL of the pages are dug through, because I specify every single one of them in start_url (to work around the fact that htdig doesn't follow JavaScript-linked pages). So why don't I see all of the accented words in that list? Am I overlooking something?
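
To make the expectation above concrete, here is a minimal Python sketch of the kind of (unaccented word, original word) pairs described. It is only an illustration of the idea, not htdig's or htfuzzy's actual code; the function names strip_accents and accent_pairs and the sample words are hypothetical. One assumption worth flagging: the Polish letter "ł" has no Unicode decomposition, so the sketch maps it by hand.

```python
import unicodedata

# Assumption for this sketch: letters with no Unicode decomposition
# (such as the Polish l-with-stroke) are mapped explicitly.
_EXTRA = str.maketrans({"ł": "l", "Ł": "L"})


def strip_accents(word):
    """Return word with its accents removed (hypothetical helper)."""
    # NFD splits e.g. "ó" into "o" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", word.translate(_EXTRA))
    # Drop the combining marks and keep only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_pairs(words):
    """Build (unaccented, original) pairs for words that contain accents."""
    pairs = []
    for word in words:
        plain = strip_accents(word)
        if plain != word:  # words without accents produce no pair
            pairs.append((plain, word))
    return pairs


print(accent_pairs(["dom", "źródło", "żółć"]))
```

Under this picture, every accented word found on the indexed pages would be expected to appear in exactly one such pair.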

Thanks for any help/information.

Greg Szeszko



