Re: [htdig] parse_doc.pl slow

Gilles Detillieux Tue, 20 Jul 1999 11:38:30 -0700

According to me:
> According to Frank Guangxin Liu:
> > This afternoon, I noticed htdig didn't do anything except
> > running parse_doc.pl on a pdf file. The file is about
> > 700k, ~80 pages of text. I tried running pdftotext on this
> > file and it took about a minute to produce a 6M text file.
> > Both xpdf and acroread can open this file almost immediately.
> > I am wondering why it took parse_doc.pl the whole afternoon
> > to parse this one file. "top" shows it uses 90% of CPU.
> > Is there anything we can do to speed up "parse_doc.pl"?
> > If any of you want to re-produce this, I can send you
> > the pdf file.
> > After this file, I kept checking how htdig runs, and it seems
> > to me it almost always takes more than an hour to run
> > parse_doc.pl on a PDF file. This really is unacceptable.
> > 
> > By the way, I switched from acroread to parse_doc.pl
> > this weekend after reading the FAQ.
> 
> parse_doc.pl is an interpreted Perl script, so it's not going to
> be super efficient.  However, more than one hour to parse an 80 page
> document seems quite unusually long.  I don't have PDFs that large, but
> on my system a 2 page PDF gets parsed in under a second.  I have a 200
> MHz AMD-K6 with 64 MB RAM, running Linux kernel 2.0.36 and Perl 5.004.
> How does that compare to what you have?  Have you noticed any difference
> if you run parse_doc.pl directly on one of these PDFs, instead of running
> it from htdig?  If you let me know where I could fetch a copy of this PDF,
> I'll try it out on my system.

Frank & I continued this discussion off the list, but for the benefit
of those who are following (and for the archives), I thought I'd post
a summary.

It turns out the problem was caused by some pages in the PDF that
contained tables in landscape orientation.  These badly confused
pdftotext, which put out hundreds of very long lines (~7KB each), and
those lines slowed the Perl script parse_doc.pl to a crawl.  Adding the
-rawdump option to pdftotext (which requires a patch, available at
http://www.htdig.org/files/contrib/parsers/) sped things up considerably
(from 1.5 hours to 22 seconds on my system), but pdftotext still doesn't
produce intelligible text for these landscape pages.  I recommended to
Frank that he notify Derek Noonburg, author of pdftotext and the xpdf
package, to let him know of the problem.  It remains to be seen whether
htdig's parsing of acroread's PostScript output would do a better job of
indexing these particular documents.
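
For anyone who wants to check whether their own PDFs trigger the same
symptom, here is a rough sketch in Python.  It is not part of ht://Dig or
parse_doc.pl; it only scans text that pdftotext has already written and
reports lines longer than a cutoff.  The function name long_line_report
and the 1000-character default cutoff are my own assumptions for
illustration, so adjust them to taste.

```python
def long_line_report(lines, threshold=1000):
    """Return (line_number, length) pairs for lines longer than threshold.

    `lines` is any iterable of strings, such as an open text file.
    `threshold` is an assumed cutoff in characters, not a value taken
    from parse_doc.pl or pdftotext.
    """
    report = []
    for number, line in enumerate(lines, start=1):
        # Ignore the trailing newline so only the visible text is counted.
        length = len(line.rstrip("\n"))
        if length > threshold:
            report.append((number, length))
    return report
```

Passing it an open file handle for the pdftotext output should be
enough.  A report that lists hundreds of multi-kilobyte lines would
suggest the landscape-table problem described above.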

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
