Dear Ladies and Gentlemen,
We have a major problem using ht://Dig with sites mostly hosting
MS Office documents, e.g. a fileserver or a document archive browsable
through the web. We are using ht://Dig 3.1.6 SSL on Solaris 8 / 9 systems.
I think you have noticed the following as well:
- Office document types & versions are changing rapidly.
- Most open-source native binary converters are no longer being
developed, or soon will not be.
As a result, those open-source converters either crash with a segfault
or put a very high load on the servers, because the subprocesses never
return on some unparsable documents. This causes the indexing process
to hang and stop.
Unfortunately, this makes ht://Dig unusable for us if these document types
can't be converted to HTML and thus indexed. I can't keep making test runs
and constantly extending the exclusion list, as that would also mean
crippling ht://Dig.
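One stopgap for the hanging subprocesses described above could be to run
each converter under a hard timeout, so that a stuck or crashing converter
costs one document instead of the whole indexing run. The following is only
a minimal sketch in Python, not part of ht://Dig: the function name
convert_with_timeout, the 60-second default, and the idea of calling it
from an external parser wrapper script are all assumptions for illustration.

```python
import subprocess


def convert_with_timeout(cmd, timeout_s=60):
    """Run a converter command (hypothetical helper, not an ht://Dig API).

    Returns the converter's stdout as bytes, or None if the converter
    hung, crashed, exited with an error, or could not be started.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The converter did not return in time; subprocess.run has
        # already killed the child process before raising.
        return None
    except OSError:
        # The converter binary is missing or not executable.
        return None
    if result.returncode != 0:
        # Non-zero exit, including death by signal (negative return
        # code on POSIX, e.g. after a segfault).
        return None
    return result.stdout
```

A wrapper script might then call convert_with_timeout(["xlhtml", path])
and simply skip the document when it returns None. The exact command line
for each converter is an assumption here and would need to be checked
against that converter's documentation.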
We used the following converters:
- pdf: Xpdf
Xpdf can open most PDF files, but not those from Acrobat 5 to 6.
(PDF Format 1.5/1.6)
- ppt: ppthtml
This tool has not been developed since '98 and can only process
PowerPoint 97/98 files.
- doc: wvware
Fortunately, this works quite well.
- xls: xlhtml
See ppt, but development has been stalled since 04-13-02.
They worked fine until Office 2000 came out.
I know there is doc2html.pl (or whatever it is called), but you have to
tell doc2html.pl which native converter to use, and that brings us back
to the beginning. :)) Personally, I am on the verge of giving up.
What type of converters do you use? What experience have you folks on the
list had? Can you point me to some URLs where I can look for better
converters? I have no problem with buying some <convert-my-world> software,
if it is cheaper than HTML-Transit. :)
Otherwise I will switch to Lucene (http://www.jguru.com/faq/Lucene) instead.
Yours sincerely,
Martin Allert
--
--------------------------------------------------------
arago AG, Institut fuer komplexes Datenmanagement
Am Niddatal 3, 60488 Frankfurt/Main, [EMAIL PROTECTED]
Tel. 069/405680, Fax 069/40568111, http://www.arago.de
--------------------------------------------------------

