According to Marco Scheurer:
> I'm trying to run and configure ht://Dig on Mac OS X, 10.2 (Jaguar). I
> need to index documents in French and English.
>
> I set locale: to fr_FR and htdig complains with:
>
> "Warning: unknown locale!"
>
> However, the testlocale program seems to indicate that everything is
> fine (and this wasn't the case with Mac OS X 10.1.x):
>
> % ./testlocale fr_FR
> ...
> 192 0xC0: À -a-un--gt----
> 193 0xC1: Á -a-un--gt----
> 194 0xC2: Â -a-un--gt----
> ...
> 253 0xFD: ý -al-n--gt----
> 254 0xFE: þ -al-n--gt----
> 255 0xFF: ÿ -al-n--gt----

Have a look at the comments in the program source, available at
http://www.htdig.org/files/contrib/other/testlocale.c

Note:

 * line argument.  If you find one that works, try changing the LC_CTYPE to
 * LC_ALL in the setlocale() call, to make sure it still works that way.
 * If it works both ways, that locale should work with htdig.

Do you get the same results from testlocale using both LC_CTYPE and LC_ALL?
If so, read on...

 * There does also appear to be a few systems on which this program
 * works correctly and identifies accented letters as letters, but htdig
 * still doesn't seem to work with the same locale set in its "locale"
 * attribute.  I don't know what to point the finger at besides bugs in
 * the C library on these systems.

> Can I ignore the htdig warning? I think not, since it looks like no
> accented words have been indexed. However, since testlocale works, I
> would think that there is an easy fix to htdig, but I don't know where
> to look. Any ideas?

Unless you can pin it down to something improper that htdig is doing,
you may be out of luck. I know that after setting LC_ALL, htdig then
sets LC_TIME back to "C" to handle locale-free parsing of HTTP date
headers. It may be that some buggy implementations have problems with
these "split" locales. The LC_TIME setting may be irrelevant in 3.1.x
versions, but I think it may matter in 3.2 betas. In any case, it's a
long shot, but you may want to remove that second setlocale() call in
Configuration::AddParsed() in htlib/Configuration.cc (in 3.1.x, or
wherever it is in 3.2), and see if that helps.

> Alternatively, is there an easy way to filter the files to be indexed?
> For my need, it would be perfectly OK to replace all occurrences of
> accented letters with their non-accented counterparts (é -> e, etc...)
> before indexing.

I believe there's a patch to do that somewhere in the patch archive
(ftp://ftp.ccsf.org/htdig-patches/). It was for an older 3.1.x release,
but I forget which.
IIRC it was kludgy and only worked for ISO-8859-1 characters, but it may
help if nothing else does. (This isn't Robert Marchand's accents patch
for 3.1.5, which is standard in 3.1.6.)

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
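To check whether the LC_ALL / LC_TIME split mentioned above is what trips up the C library, something along these lines might help. It is only a rough sketch, not htdig's code and not testlocale.c: the function names and the SPLIT_LOCALE_DEMO macro are made up for illustration, and it relies on nothing beyond the standard setlocale() and isalpha() calls. It counts how many bytes in 0xC0..0xFF are classified as letters right after setting LC_ALL, and again after LC_TIME is put back to "C".

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/* Count how many byte values in 0xC0..0xFF isalpha() accepts right now. */
static int count_high_alpha(void)
{
    int c, n = 0;

    for (c = 0xC0; c <= 0xFF; c++)
        if (isalpha(c))
            n++;
    return n;
}

/*
 * Set LC_ALL to `name`, record the letter count in *before, then put
 * LC_TIME back to "C" (the "split" setup) and return the count again.
 * Returns -1 if either setlocale() call is rejected by the C library.
 */
static int split_locale_check(const char *name, int *before)
{
    if (setlocale(LC_ALL, name) == NULL)
        return -1;
    *before = count_high_alpha();

    if (setlocale(LC_TIME, "C") == NULL)
        return -1;
    return count_high_alpha();
}

#ifdef SPLIT_LOCALE_DEMO  /* hypothetical macro: build with -DSPLIT_LOCALE_DEMO */
int main(int argc, char **argv)
{
    /* An empty name makes setlocale() take the locale from the environment. */
    const char *name = argc > 1 ? argv[1] : "";
    int before = 0;
    int after = split_locale_check(name, &before);

    if (after < 0) {
        fprintf(stderr, "setlocale() failed for \"%s\"\n", name);
        return 1;
    }
    printf("letters in 0xC0..0xFF: %d with LC_ALL only, %d after LC_TIME=C\n",
           before, after);
    return 0;
}
#endif
```

If the two counts differ for fr_FR, the second setlocale() call is a reasonable suspect. If they match and are non-zero, the split is probably not the problem and the cause lies elsewhere.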

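As for filtering accents before indexing, the idea can be sketched as a plain byte-for-byte filter. This is not the patch from the archive, just an illustration under the assumption that the input is ISO-8859-1 (it would mangle UTF-8). The function name strip_accent and the STRIP_ACCENTS_DEMO macro are hypothetical, and the choice to leave bytes such as 0xC6, 0xD0, 0xDE and 0xDF alone is mine. How the filtered text gets fed to htdig is left open here.

```c
#include <stdio.h>

/*
 * Map one ISO-8859-1 byte to an unaccented ASCII letter.
 * Bytes with no obvious single-letter counterpart are returned unchanged.
 */
static unsigned char strip_accent(unsigned char c)
{
    /* Replacements for bytes 0xC0..0xFF; 0 means "leave the byte alone". */
    static const char map[64] = {
        'A','A','A','A','A','A', 0 ,'C',   /* 0xC0-0xC7 */
        'E','E','E','E','I','I','I','I',   /* 0xC8-0xCF */
         0 ,'N','O','O','O','O','O', 0 ,   /* 0xD0-0xD7 */
        'O','U','U','U','U','Y', 0 , 0 ,   /* 0xD8-0xDF */
        'a','a','a','a','a','a', 0 ,'c',   /* 0xE0-0xE7 */
        'e','e','e','e','i','i','i','i',   /* 0xE8-0xEF */
         0 ,'n','o','o','o','o','o', 0 ,   /* 0xF0-0xF7 */
        'o','u','u','u','u','y', 0 ,'y'    /* 0xF8-0xFF */
    };

    if (c >= 0xC0 && map[c - 0xC0] != 0)
        return (unsigned char)map[c - 0xC0];
    return c;
}

#ifdef STRIP_ACCENTS_DEMO  /* hypothetical macro: build with -DSTRIP_ACCENTS_DEMO */
int main(void)
{
    int c;

    /* Copy stdin to stdout, replacing accented letters along the way. */
    while ((c = getchar()) != EOF)
        putchar(strip_accent((unsigned char)c));
    return 0;
}
#endif
```

Keep in mind that with this approach the index only contains unaccented words, so searches would have to be typed without accents as well.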
