According to Christian Damm:
> im the senior system administrator of an austrian webdesign and webhosting
> company called "die webmaster".
>
> five minutes ago i was reading your posting at:
> http://www.geocrawler.com/archives/3/8825/2001/11/0/7143772/
>
> THE QUESTION FROM MELANIE:
> > According to Bernier, Melanie:
> > > I have installed htdig and I have a little problem with German Umlaut. I
> > > can search for words with Umlaut without any problem. When I search for
> > > say 'C34644' (a file containing Umlaut), the results from htdig comes back
> > > with strange characters instead of Umlaut (for example, I get a circle (�)
> > > instead of �, or I get a bit � instead of a small �), and it seems to
> > > return that kind of results only for word documents. What could be the
> > > problem?
>
> YOUR REPLY:
> > The problem is MS Word doesn't use ISO-8859-1 (Latin 1) encoding for
> > characters with accents. The doc2html.pl script uses catdoc to decode the
> > Word documents into plain text, which works fine for ASCII text, but when
> > accents are involved it doesn't automatically map to the encoding you want.
> >
> > With catdoc, you have -s and -d options to specify the source and
> > destination character sets. I've found that by using
> > catdoc -scp1250 -d8859-1 file.doc
> > I can get accents to come out correctly on one of the few
> > Word documents I have that contain accents. This document happens to use
> > cp1250 as its internal character set. You may need to experiment to find
> > the right options for your documents. When you figure out the right
> > options, you can put them into the command line for catdoc in doc2html.pl.
>
> i got exactly the same problem but i dont know how to fix this.....CP1250
> is not working for me.....
>
> i got htdig working fine one one of our companys sun cobalt raq 4 servers
> (cobalt os based on redhat linux 6)
> (pdf parsing works like a dream, english word-doc parsing too) - but i
> cant figure out how to fix this german "umlaut" problem.....
>
> now my question:
> do you know any source on the web where all the encodings are listed? - i
> tried so many values.....none worked......or some hints i can
> use to find the correct encoding?

These encodings are all stored in files that catdoc uses, so the location
of the files will depend on how/where you've installed catdoc. On my
system, I just "ls /usr/local/lib/catdoc", and I get...

8859-1.txt       ascii.specchars  cp437.txt  cp865.txt
8859-2.txt       cp1250.txt       cp850.txt  cp866.txt
8859-3.txt       cp1251.txt       cp852.txt  cp869.txt
8859-4.txt       cp1252.txt       cp855.txt  cp874.txt
8859-5.txt       cp1253.txt       cp857.txt  koi8-r.txt
8859-6.txt       cp1254.txt       cp860.txt  tex.replchars
8859-7.txt       cp1255.txt       cp861.txt  tex.specchars
8859-8.txt       cp1256.txt       cp862.txt  us-ascii.txt
8859-9.txt       cp1257.txt       cp863.txt  x-mac-cyrillic.txt
ascii.replchars  cp1258.txt       cp864.txt

I think all the .txt files in there are encoding definitions. Yours may
be different if you're running a different version of catdoc than I am.
If you can't find the encodings you need, maybe try a different (i.e.
preferably more recent) version of catdoc.

By the way, followups to postings on the htdig mailing lists are best
kept on the mailing lists.

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
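
If trying the encodings one at a time gets tedious, the experiment can be scripted. What follows is only a rough sketch, not something tested against real Word documents: it assumes that each NAME.txt file in the catdoc directory can be passed as -sNAME (the same pattern as cp1250.txt and -scp1250 above), and the function names list_charsets and try_charsets are made up for illustration. The directory is the one from my system, so adjust it to wherever your catdoc keeps its files.

```sh
#!/bin/sh
# Sketch: run catdoc on one Word document once per installed charset
# definition, so the output can be checked by eye for correct umlauts.

# Print the charset names found in a catdoc charset directory, one per
# line. Assumption: every NAME.txt file there defines a charset NAME.
list_charsets() {
    for f in "$1"/*.txt; do
        [ -e "$f" ] || continue      # directory contains no .txt files
        basename "$f" .txt
    done
}

# Convert the document with each charset as the source (-s) and Latin 1
# as the destination (-d), showing only the first few lines of each try.
try_charsets() {
    dir=$1
    doc=$2
    list_charsets "$dir" | while read -r cs; do
        echo "=== source charset: $cs ==="
        catdoc -s"$cs" -d8859-1 "$doc" | head -n 5
    done
}

# Example call (both arguments are placeholders for your own paths):
# try_charsets /usr/local/lib/catdoc file.doc
```

Whichever source charset makes the umlauts come out right is the one to put into the catdoc command line in doc2html.pl. The 5-line limit is arbitrary; raise it if the accented words appear later in the document.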

