According to Frank Guangxin Liu:
> This afternoon, I noticed htdig didn't do anything except run
> parse_doc.pl on a PDF file. The file is about 700k, ~80 pages of
> text. I tried running pdftotext on this file and it took about a
> minute to produce a 6M text file. Both xpdf and acroread can open
> this file almost immediately. I am wondering why it took
> parse_doc.pl the whole afternoon to parse this one file. "top"
> shows it using 90% of CPU. Is there anything we can do to speed
> up "parse_doc.pl"? If any of you want to reproduce this, I can
> send you the PDF file.
> After this file, I kept checking how htdig runs, and it seems
> that parse_doc.pl almost always takes more than an hour to parse
> a PDF file. This really is unacceptable.
>
> By the way, I switched from acroread to parse_doc.pl this weekend
> after reading the FAQ.
parse_doc.pl is an interpreted Perl script, so it's not going to be
especially efficient. However, more than an hour to parse an 80-page
document seems unusually long. I don't have PDFs that large, but on
my system a 2-page PDF gets parsed in under a second. I have a 200
MHz AMD-K6 with 64 MB RAM, running Linux kernel 2.0.36 and Perl 5.004.
How does that compare to what you have? Have you noticed any difference
if you run parse_doc.pl directly on one of these PDFs, instead of running
it from htdig? If you let me know where I could fetch a copy of this PDF,
I'll try it out on my system.
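To see where the time goes, it may help to time the conversion steps
separately. Here is a minimal sketch; the parse_doc.pl path and its
argument list are assumptions, so check your external-parser setup in
htdig's configuration for the actual command line:

```shell
#!/bin/sh
# Sketch: time each stage of the PDF conversion on its own.
# Paths below are placeholders -- adjust to your installation.

timeit() {
    # Print elapsed wall-clock seconds followed by the command.
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    echo "$(( $(date +%s) - start ))s  $*"
}

# Example invocations (uncomment and adjust the paths):
# timeit pdftotext /tmp/big.pdf /tmp/big.txt
# timeit perl /usr/local/htdig/bin/parse_doc.pl /tmp/big.pdf

timeit sleep 1   # demo on a trivial command
```

If pdftotext alone finishes in a minute but the script takes an hour,
the bottleneck is in the Perl post-processing rather than in the
PDF-to-text conversion itself.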
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930