Title: RE: [htdig] pdf2text, catdoc and French accents

Thanks for your rapid response.

I used htdig v3.1.6 and doc2html v3.0.  The catdoc version was 0.35, and I experimented with the charset command-line parameters (-c8859-1 ... -d ...), though without a positive result.  In the best case I got the letters, but without their accents.  I have been going through the Perl scripts, though I have little Perl experience.  What bothered me most of all is that when I ran pdftotext directly on the same PDF files, the characters were displayed correctly.  Next I found that the problem was general (e.g. it also affected simple text files containing French accents imported from a Windows platform), so I tested the unicode_start and unicode_stop commands, with success.

After applying your proposal (export LANG=fr_FR together with the unicode_stop command) the French accents appeared, but then I got the famous "deleted, no excerpt" message for all my Acrobat documents.  It took me a couple of hours to find the solution.  I reinstalled a Ghost copy of my machine and started all over: tar, ./configure with the CXX- and CPPFLAGS, make, htdigconf and so on, but without any result.  I tried every hint in the FAQ and the mailing list, until a couple of minutes ago I gave full permissions to the script and document dirs, and everything worked fine.

Thanks again for your help and devotion.

Stephan

PS: I'm going to install a more recent version of Vitus' catdoc.  On his site he admits that 0.35 has problems with Unicode.  That and some fine-tuning should give me the expected result for the search engine.



-----Original Message-----
From: Gilles Detillieux [mailto:[EMAIL PROTECTED]]
Sent: jeudi 9 janvier 2003 19:22
To: Bastiaens Stephan - HQZ
Cc: [EMAIL PROTECTED]
Subject: Re: [htdig] pdf2text, catdoc and French accents


According to [EMAIL PROTECTED]:
> I've installed htDig on a Red Hat 8.0 box and have some problems with
> ISO to Unicode (UTF-8) conversions.
> The website to dig is ISO8859 based as are the documents referred to
> (pdf, doc, xls and ppt).
> The parsing and search engine work fine except for special chars.
> This is due to a Unicode conversion done by my Linux box.
> In fact, for plain html and text-files we can avoid the conversion
> when we turn Unicode conversion off on the Linux box (unicode_stop
> command).  But I can't find a solution for the doc2html (pdf2text)
> or catdoc parsers.
>
> Does anybody have a hint, clue or solution ?

I've never tried this myself, but the first thing I'd attempt would be
to set the LANG environment variable to a non-UTF-8-based locale before
calling htdig.  E.g. add "export LANG=fr_FR" to your rundig script
(but be certain that LC_COLLATE is set to "C" when calling htmerge,
if you're using one of the 3.1.x releases of ht://Dig).  I suspect that
pdftotext may look at LANG or other locale-related environment variables
to determine what it should output.
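
As a rough sketch of what those rundig additions might look like (untested against ht://Dig itself): the helper name run_with_c_collation is made up for illustration, and the script assumes that fr_FR is installed as an ISO-8859-1 locale on the machine.  Note that if LC_ALL is set in the environment, it overrides both LANG and LC_COLLATE, so it would have to be unset first.

```shell
#!/bin/sh
# Sketch of locale lines for a rundig-style wrapper script.
# Assumption: fr_FR is installed and is a non-UTF-8 (ISO-8859-1) locale here.
LANG=fr_FR
export LANG

# Hypothetical helper (not part of ht://Dig): run one command with
# LC_COLLATE forced to "C", leaving LANG as set above.
run_with_c_collation() {
    LC_COLLATE=C "$@"
}

# Show what a child process would see; in rundig the command would be
# htmerge with its usual options instead of this sh -c call.
run_with_c_collation sh -c 'echo "LANG=$LANG LC_COLLATE=$LC_COLLATE"'
```

The htdig call itself would simply inherit the exported LANG, while only the htmerge call would be wrapped in the helper.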

I'm not sure about catdoc, though, as it tends to use its own character
set tables.  (See "man catdoc" and look for the -s and -d options,
which you can set in the catdoc command line in doc2html.pl, once you
know what charsets your Word documents use.)

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
