According to Marco Scheurer:
> I'm trying to run and configure ht://Dig on Mac OS X, 10.2 (Jaguar). I 
> need to index documents in French and English.
> 
> I set locale: to fr_FR and htdig complains with:
> 
> "Warning: unknown locale!"
> 
> However, the testlocale program seems to indicate that everything is 
> fine (and this wasn't the case with Mac OS X 10.1.x):
> 
> % ./testlocale fr_FR
> ...
> 192 0xC0:  À  -a-un--gt----
> 193 0xC1:  Á  -a-un--gt----
> 194 0xC2:  Â  -a-un--gt----
> ...
> 253 0xFD:  ý  -al-n--gt----
> 254 0xFE:  þ  -al-n--gt----
> 255 0xFF:  ÿ  -al-n--gt----

Have a look at the comments in the program source, available at
http://www.htdig.org/files/contrib/other/testlocale.c
Note:
 * line argument. If you find one that works, try changing the LC_CTYPE to
 * LC_ALL in the setlocale() call, to make sure it still works that way.
 * If it works both ways, that locale should work with htdig.
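
A minimal sketch of that LC_CTYPE versus LC_ALL comparison, in case it
helps.  The helper name count_high_alpha() is made up for illustration
and is not part of testlocale.c; it only uses the standard setlocale()
and isalpha() calls, and it assumes a single-byte locale such as
ISO-8859-1.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper, not part of testlocale.c: select `name` for the
 * given category and count how many byte values 128..255 isalpha()
 * accepts.  Returns -1 if setlocale() rejects the name. */
int count_high_alpha(int category, const char *name)
{
    int c, n = 0;

    setlocale(LC_ALL, "C");            /* start from a known state */
    if (setlocale(category, name) == NULL)
        return -1;
    for (c = 128; c < 256; c++)        /* values fit in unsigned char */
        if (isalpha(c))
            n++;
    return n;
}
```

Calling it once as count_high_alpha(LC_CTYPE, "fr_FR") and once as
count_high_alpha(LC_ALL, "fr_FR") and comparing the two results should
show whether the locale behaves the same both ways.  A result of -1, or
two different counts, would suggest the locale is not usable as is.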

Do you get the same results from testlocale using both LC_CTYPE and LC_ALL?
If so, read on...
 * There does also appear to be a few systems on which this program
 * works correctly and identifies accented letters as letters, but htdig
 * still doesn't seem to work with the same locale set in its "locale"
 * attribute.  I don't know what to point the finger at besides bugs in
 * the C library on these systems.

> Can I ignore the htdig warning? I think not, since it looks like no 
> accented words have been indexed. However, since testlocale works, I 
> would think that there is an easy fix to htdig, but I don't know where 
> to look. Any ideas?

Unless you can pin it down to something htdig is doing that's improper,
then you may be out of luck.  I know that after setting LC_ALL, htdig
then sets LC_TIME back to "C" to handle locale-free parsing of HTTP
date headers.  It may be that some buggy implementations have problems
with these "split" locales.  The LC_TIME setting may be irrelevant in
3.1.x versions, but I think it may matter in 3.2 betas.  In any case,
it's a long shot, but you may want to remove that second setlocale()
call in Configuration::AddParsed() in htlib/Configuration.cc (in 3.1.x,
or wherever it is in 3.2), and see if that helps.
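
Before patching htdig, you could try to reproduce the "split" locale
sequence in isolation.  The following is only a sketch of the two
setlocale() calls as I described them above, not a copy of the code in
Configuration.cc; the function name alpha_after_split() and its
reset_time flag are invented for the test.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical test helper: set LC_ALL to `name`, optionally reset
 * LC_TIME to "C" (the "split" locale case), then report whether byte
 * `ch` is classified as a letter.  Returns 1 or 0, or -1 if either
 * setlocale() call fails. */
int alpha_after_split(const char *name, int ch, int reset_time)
{
    if (setlocale(LC_ALL, name) == NULL)
        return -1;
    if (reset_time && setlocale(LC_TIME, "C") == NULL)
        return -1;
    return isalpha((unsigned char)ch) != 0;
}
```

If alpha_after_split("fr_FR", 0xE9, 0) and
alpha_after_split("fr_FR", 0xE9, 1) give different answers (0xE9 is
e-acute in ISO-8859-1), that would point at the C library's handling of
split locales.  If they agree, the second setlocale() call is probably
not the culprit.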

> Alternatively, is there an easy way to filter the files to be indexed? 
> For my need, it would be perfectly OK to replace all occurrences of 
> accented letters with their non-accented counterparts (é -> e, etc...) 
> before indexing.

I believe there's a patch to do that somewhere in the patch archive
(ftp://ftp.ccsf.org/htdig-patches/).  It was for an older 3.1.x release,
but I forget which.  IIRC it was kludgy and only worked for ISO-8859-1
characters, but it may help if nothing else does.  (This isn't Robert
Marchand's accents patch for 3.1.5, which is standard in 3.1.6.)
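
If you'd rather pre-filter the files yourself, something along these
lines might do.  This is my own rough sketch, not the patch from the
archive: the table and the function name strip_latin1_accents() are
illustrative, and it assumes the input is ISO-8859-1 (it would mangle
UTF-8).  Ligatures and letters with no obvious base (AE, eth, thorn,
sharp s) are left untouched.

```c
#include <stddef.h>

/* Base letters for ISO-8859-1 bytes 0xC0..0xFF; '.' = leave unchanged */
static const char latin1_base[] =
    "AAAAAA.CEEEEIIII"   /* 0xC0..0xCF */
    ".NOOOOO.OUUUUY.."   /* 0xD0..0xDF */
    "aaaaaa.ceeeeiiii"   /* 0xE0..0xEF */
    ".nooooo.ouuuuy.y";  /* 0xF0..0xFF */

/* Hypothetical helper (not the archived patch): replace accented
 * ISO-8859-1 letters in buf[0..len-1] with their unaccented base
 * letters, in place. */
void strip_latin1_accents(unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] >= 0xC0) {
            char base = latin1_base[buf[i] - 0xC0];
            if (base != '.')
                buf[i] = (unsigned char)base;
        }
    }
}
```

You would run each document's bytes through this before handing the
file to htdig.  Note that searches would then have to be typed without
accents as well.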

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
