Re: [htdig] pdf2text, catdoc and French accents

Gilles Detillieux Thu, 09 Jan 2003 10:32:03 -0800

According to [EMAIL PROTECTED]:
> I've installed htDig on a Red Hat 8.0 box and have some problems with
> ISO to Unicode (UTF-8) conversions.
> The website to dig is ISO8859 based as are the documents referred to
> (pdf, doc, xls and ppt).
> The parsing and search engine work fine except for special chars.
> This is due to a Unicode conversion done by my Linux box.
> In fact, for plain html and text-files we can avoid the conversion
> when we turn Unicode conversion off on the Linux box (unicode_stop
> command).  But I can't find a solution for the doc2html (pdf2text)
> or catdoc parsers.
> 
> Does anybody have a hint, clue or solution?


I've never tried this myself, but the first thing I'd attempt would be
to set the LANG environment variable to a non-UTF-8-based locale before
calling htdig.  E.g. add "export LANG=fr_FR" to your rundig script
(but be certain that LC_COLLATE is set to "C" when calling htmerge,
if you're using one of the 3.1.x releases of ht://Dig).  I suspect that
pdftotext may look at LANG or other locale-related environment variables
to determine what it should output.
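
As a rough, untested sketch of what that could look like near the top of
rundig: the locale name fr_FR is an assumption (check "locale -a" to see
what is installed on your box), and merge_with_c_collation is just a
hypothetical helper name, not something that ships with ht://Dig.  Note
that LC_ALL, if it is set, overrides LANG, so the sketch clears it first.

```shell
#!/bin/sh
# Sketch of lines that could go near the top of rundig, before htdig runs.

# LC_ALL, when set, takes precedence over LANG, so clear it first.
unset LC_ALL

# fr_FR is an assumed locale name; pick a non-UTF-8 one from `locale -a`.
LANG=fr_FR
export LANG

# Only relevant to the 3.1.x releases: run htmerge with C collation.
# merge_with_c_collation is a hypothetical helper name.
merge_with_c_collation() {
    LC_COLLATE=C htmerge "$@"
}
```

In rundig you would then call merge_with_c_collation with the same
arguments the script currently passes to htmerge.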

I'm not sure about catdoc, though, as it tends to use its own character
set tables.  (See "man catdoc" and look for the -s and -d options,
which you can set in the catdoc command line in doc2html.pl, once you
know what charsets your Word documents use.)
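
Before editing doc2html.pl, it may be easier to try the options by hand.
Here is a minimal sketch; the charset names cp1252 and 8859-1 are only
my guess at what French Word documents and an ISO8859 site would need,
and doc_to_text is a hypothetical helper name.  Check "man catdoc" for
the charset names your installation actually accepts.

```shell
#!/bin/sh
# Assumed charsets: adjust after checking "man catdoc" and your documents.
SRC_CHARSET=cp1252    # charset the Word documents are assumed to use (-s)
DST_CHARSET=8859-1    # charset the extracted text should be written in (-d)

# Hypothetical helper: convert one Word file to text on stdout.
doc_to_text() {
    catdoc -s "$SRC_CHARSET" -d "$DST_CHARSET" "$1"
}

# Try it on a sample file and look at the accented characters, e.g.:
#   doc_to_text sample.doc | less
```

Once the accents come out right, the same -s and -d values can go into
the catdoc command line in doc2html.pl.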

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


