According to Bernier, Melanie:
> > I have installed htdig and I have a little problem with German Umlaut.  I
> > can search for words with Umlaut without any problem.  When I search for
> > say 'C34644' (a file containing Umlaut), the results from htdig comes back
> > with strange characters instead of Umlaut (for example, I get a circle (�)
> > instead of �, or I get a bit � instead of a small �), and it seems to
> > return that kind of results only for word documents.  What could be the
> > problem?

The problem is that MS Word doesn't use ISO-8859-1 (Latin 1) encoding for
characters with accents.  The doc2html.pl script uses catdoc to convert
the Word documents into plain text.  That works fine for ASCII text, but
catdoc doesn't automatically map accented characters to the encoding
you want.

With catdoc, you have -s and -d options to specify the source and
destination character sets.  I've found that by using

   catdoc -scp1250 -d8859-1 file.doc

I can get accents to come out correctly on one of the few Word documents
I have that contain accents.  This document happens to use cp1250 as its
internal character set.  You may need to experiment to find the right
options for your documents.  When you figure out the right options,
you can put them into the command line for catdoc in doc2html.pl.
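
To make the experimenting less tedious, something like the following
shell function might help.  It is only a sketch: the name try_charsets
is made up for illustration, and the charset names you pass to it are
guesses that must match charset files your catdoc installation actually
provides.  Only the -s and -d options are taken from the command above.

```shell
# Sketch: preview one Word document under several candidate source charsets.
# try_charsets is a hypothetical helper; the candidate charset names are
# whatever you pass in and are not checked against your catdoc installation.
try_charsets() {
    doc=$1
    shift
    for cs in "$@"; do
        # Label each preview with the options that produced it.
        printf '== -s%s -d8859-1 ==\n' "$cs"
        # Show only the first few lines so the previews are easy to compare.
        catdoc -s"$cs" -d8859-1 "$doc" | head -n 5
    done
}
```

You would call it as, say, "try_charsets file.doc cp1250 cp1252" and
pick the source charset whose preview shows the accents correctly.  The
output is ISO-8859-1, so view it in a terminal or editor set to Latin 1.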

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
