According to Frank Guangxin Liu:
> This afternoon, I noticed htdig didn't do anything except
> run parse_doc.pl on a PDF file. The file is about
> 700 KB, ~80 pages of text. I tried running pdftotext on this
> file and it took about a minute to produce a 6 MB text file.
> Both xpdf and acroread can open this file almost immediately.
> I am wondering why it took parse_doc.pl the whole afternoon
> to parse this one file. "top" shows it using 90% of the CPU.
> Is there anything we can do to speed up parse_doc.pl?
> If any of you want to reproduce this, I can send you
> the PDF file.
> Since then, I have kept checking how htdig runs, and it seems
> to almost always take more than an hour for parse_doc.pl
> to process a PDF file. This really is unacceptable.
>
> By the way, I switched from acroread to parse_doc.pl
> this weekend after reading the FAQ.

parse_doc.pl is an interpreted Perl script, so it's not going to
be super efficient.  However, more than an hour to parse an 80-page
document is unusually long.  I don't have PDFs that large, but
on my system a 2-page PDF gets parsed in under a second.  I have a 200
MHz AMD-K6 with 64 MB RAM, running Linux kernel 2.0.36 and Perl 5.004.
How does that compare to what you have?  Have you noticed any difference
if you run parse_doc.pl directly on one of these PDFs, instead of running
it from htdig?  If you let me know where I could fetch a copy of this PDF,
I'll try it out on my system.
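
In case it helps, here's roughly how I'd time the two steps separately.
The file names and URL below are just placeholders, and the exact arguments
parse_doc.pl expects depend on how your external_parsers attribute is set
up, so check the comments at the top of the script for the real calling
convention; treat this as a sketch:

    # Time the parser script on its own, outside of htdig
    # (placeholder arguments: content file, content type, URL)
    time perl /path/to/parse_doc.pl /tmp/test.pdf application/pdf http://example.com/test.pdf

    # Compare against running pdftotext directly on the same file
    time pdftotext /tmp/test.pdf /tmp/test.txt

If the standalone parse_doc.pl run is also slow, the bottleneck is in the
script or the converter it calls, rather than in htdig itself.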

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
