According to Bernier, Melanie:
> > I have installed htdig and I have a little problem with German Umlaut. I
> > can search for words with Umlaut without any problem. When I search for
> > say 'C34644' (a file containing Umlaut), the results from htdig comes back
> > with strange characters instead of Umlaut (for example, I get a circle (�)
> > instead of �, or I get a bit � instead of a small �), and it seems to
> > return that kind of results only for word documents. What could be the
> > problem?

The problem is that MS Word doesn't use ISO-8859-1 (Latin 1) encoding
for accented characters. The doc2html.pl script uses catdoc to convert
Word documents to plain text. That works fine for ASCII text, but when
accents are involved, catdoc doesn't automatically map them to the
encoding you want.

catdoc has -s and -d options to specify the source and destination
character sets. I've found that by using

    catdoc -scp1250 -d8859-1 file.doc

I can get accents to come out correctly on one of the few Word documents
I have that contain accents. That document happens to use cp1250 as its
internal character set. Yours may use a different one, so you may need
to experiment to find the right options for your documents. Once you
have figured them out, you can put them into the catdoc command line in
doc2html.pl.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
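
P.S. For the experimenting step, a small shell sketch along these lines
may save some typing. It runs catdoc over one file once per candidate
source charset and prints the first few lines of each result, so you can
see which one renders the umlauts correctly. The function name
try_charsets and the CATDOC override variable are my own inventions for
illustration, not part of htdig or doc2html.pl, and the charset names you
pass in are guesses to be checked against what your catdoc installation
actually provides.

```shell
#!/bin/sh
# Hypothetical helper: try several source charsets on one Word document.
# Usage: try_charsets FILE CHARSET...
# CATDOC may be set to point at a different catdoc binary (an assumption
# added here for convenience; it defaults to plain "catdoc").
try_charsets() {
    doc=$1
    shift
    for cs in "$@"; do
        printf '=== source charset: %s ===\n' "$cs"
        # -s sets the source charset, -d the destination charset,
        # exactly as in the catdoc command quoted above.
        ${CATDOC:-catdoc} -s"$cs" -d8859-1 "$doc" | head -n 5
    done
}
```

For example, `try_charsets file.doc cp1250 cp1252` would show the first
five lines of output for each of the two charsets. The output is in
ISO-8859-1, so look at it in a terminal or pager set to that encoding.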
