According to Christian Damm:
> im the senior system administrator of an austrian webdesign and webhosting
> company called "die webmaster".
> 
> five minutes ago i was reading your posting at:
> http://www.geocrawler.com/archives/3/8825/2001/11/0/7143772/
> 
> 
> THE QUESTION FROM MELANIE:
> According to Bernier, Melanie:
> > I have installed htdig and I have a little problem with German Umlaut. I
> > can search for words with Umlaut without any problem. When I search for
> > say 'C34644' (a file containing Umlaut), the results from htdig comes back
> > with strange characters instead of Umlaut (for example, I get a circle (�)
> > instead of �, or I get a bit � instead of a small �), and it seems to
> > return that kind of results only for word documents. What could be the
> > problem?
> YOUR REPLY:
> The problem is MS Word doesn't use ISO-8859-1 (Latin 1) encoding for 
> characters with accents. The doc2html.pl script uses catdoc to decode the 
> Word documents into plain text, which works fine for ASCII text, but when 
> accents are involved it doesn't automatically map to the encoding you want.
> With catdoc, you have -s and -d options to specify the source and 
> destination character sets. I've found that by using catdoc -scp1250 
> -d8859-1 file.doc I can get accents to come out correctly on one of the few
> Word documents I have that contain accents. This document happens to use 
> cp1250 as its internal character set. You may need to experiment to find 
> the right options for your documents. When you figure out the right 
> options, you can put them into the command line for catdoc in doc2html.pl.
> 
> 
> i got exactly the same problem but i dont know how to fix this.....CP1250 
> is not working for me.....
> i got htdig working fine on one of our company's sun cobalt raq 4 servers 
> (cobalt os based on redhat linux 6)
>   (pdf parsing works like a dream, english word-doc parsing too) - but i 
> cant figure out how to fix this german "umlaut" problem.....
> 
> now my question:
> do you know any source on the web where all the encodings are listed? - i 
> tried so many values.....none worked......or some hints i can
> use to find the correct encoding?

These encodings are all stored in files that catdoc uses, so the location
of the files will depend on how/where you've installed catdoc.  On my
system, I just "ls /usr/local/lib/catdoc", and I get...

8859-1.txt          ascii.specchars     cp437.txt           cp865.txt
8859-2.txt          cp1250.txt          cp850.txt           cp866.txt
8859-3.txt          cp1251.txt          cp852.txt           cp869.txt
8859-4.txt          cp1252.txt          cp855.txt           cp874.txt
8859-5.txt          cp1253.txt          cp857.txt           koi8-r.txt
8859-6.txt          cp1254.txt          cp860.txt           tex.replchars
8859-7.txt          cp1255.txt          cp861.txt           tex.specchars
8859-8.txt          cp1256.txt          cp862.txt           us-ascii.txt
8859-9.txt          cp1257.txt          cp863.txt           x-mac-cyrillic.txt
ascii.replchars     cp1258.txt          cp864.txt

I think all the .txt files in there are encoding definitions.  Yours may
be different if you're running a different version of catdoc than I am.
If you can't find the encodings you need, maybe try a different (i.e.
preferably more recent) version of catdoc.
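
If trying the encodings one by one gets tedious, something like the
following rough sketch might help. It assumes the charset files live in
/usr/local/lib/catdoc as on my system (adjust the path for yours), and the
function names list_charsets and try_charsets are just made up for
illustration. It only relies on the -s and -d options mentioned above.
For German documents written on Windows, cp1252 is the usual Western
European code page, so that one may be worth checking first.

```shell
#!/bin/sh
# Hypothetical helpers, not part of catdoc or htdig.

# Print the charset names found in a catdoc charset directory,
# i.e. the *.txt file names with the extension stripped.
list_charsets() {
    dir="${1:-/usr/local/lib/catdoc}"   # assumed default location
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue         # skip if the glob matched nothing
        basename "$f" .txt
    done
}

# Convert one Word document with every available source charset and
# show the first few lines of each result for visual comparison.
try_charsets() {
    doc="$1"
    dir="${2:-/usr/local/lib/catdoc}"
    for cs in $(list_charsets "$dir"); do
        echo "== source charset: $cs"
        catdoc -s"$cs" -d8859-1 "$doc" | head -n 5
    done
}
```

You would then run "try_charsets file.doc" and look for the block where
the umlauts come out right. The output is ISO-8859-1, so the terminal (or
pager) you view it in has to be set to display Latin 1 for the comparison
to mean anything.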

By the way, followups to postings on the htdig mailing lists are best
kept on the mailing lists.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
