Re: [htdig] parse_doc.pl slow

Frank Guangxin Liu Tue, 20 Jul 1999 12:28:33 -0700



On Tue, 20 Jul 1999, Frank Guangxin Liu wrote:

> On Tue, 20 Jul 1999, Gilles Detillieux wrote:
> 
> > 
> > According to me:
> > > According to Frank Guangxin Liu:
> > > > This afternoon, I noticed htdig wasn't doing anything except
> > > > running parse_doc.pl on a PDF file. The file is about
> > > > 700k, ~80 pages of text. I tried running pdftotext on this
> > > > file and it took about a minute to produce a 6M text file.
> > > > Both xpdf and acroread can open this file almost immediately.
> > > > I am wondering why it took parse_doc.pl the whole afternoon
> > > > to parse this one file. "top" shows it using 90% of the CPU.
> > > > Is there anything we can do to speed up parse_doc.pl?
> > > > If any of you want to reproduce this, I can send you
> > > > the PDF file.
> > > > After this file, I kept checking how htdig runs, and it seems
> > > > to me it almost always takes more than an hour for
> > > > parse_doc.pl to parse a PDF file. This really is unacceptable.
> > > > 
> > > > By the way, I switched from acroread to parse_doc.pl
> > > > this weekend after reading the FAQ.
> > > 
> > > parse_doc.pl is an interpreted Perl script, so it's not going to
> > > be super efficient.  However, more than one hour to parse an 80-page
> > > document seems unusually long.  I don't have PDFs that large, but
> > > on my system a 2 page PDF gets parsed in under a second.  I have a 200
> > > MHz AMD-K6 with 64 MB RAM, running Linux kernel 2.0.36 and Perl 5.004.
> > > How does that compare to what you have?  Have you noticed any difference
> > > if you run parse_doc.pl directly on one of these PDFs, instead of running
> > > it from htdig?  If you let me know where I could fetch a copy of this PDF,
> > > I'll try it out on my system.
> > 
> > Frank & I continued this discussion off the list, but for the benefit
> > of those who are following (and for the archives), I thought I'd post
> > a summary.
> > 
> > It turns out the problem was caused by some pages in the PDF that
> > contained tables in landscape orientation.  These caused major confusion
> > for pdftotext, leading it to put out hundreds of very long lines (~7KB),
> > which slowed the perl script parse_doc.pl to a crawl.  Adding the
> > -rawdump option to pdftotext (which requires a patch, available at
> > http://www.htdig.org/files/contrib/parsers/) sped things up considerably
> > (from 1.5 hrs to 22 sec on my system), but pdftotext still isn't putting
> > out intelligible text for these landscape pages.  I recommended to Frank
> > that he notify Derek Noonburg, author of pdftotext and the xpdf package,
> > to let him know of the problem.  It remains to be seen whether htdig's
> > parsing of acroread's PostScript output would do a better job of indexing
> > these particular documents.
> 
> I just tested and "acroread" can handle those landscape tables
> without a problem. Keywords in those tables can be found by
> htsearch. But if you use pdftotext, anything in those landscape
> tables will get lost and will get into the search db.
Correction: the last sentence above should read "will get lost and
won't get into the search db."

> I will check with Derek Noonburg to see if he can do anything
> to make pdftotext handle those landscape tables gracefully.
> 
> > 
> > -- 
> > Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
> > Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
> > Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
> > Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
> > 
> > ------------------------------------
> > To unsubscribe from the htdig mailing list, send a message to
> > [EMAIL PROTECTED] containing the single word "unsubscribe" in
> > the SUBJECT of the message.
> > 
> 
> 
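
For anyone who wants to check whether their own PDFs trigger the same
problem, the summary above suggests looking for very long lines (~7KB)
in pdftotext's output. The following is only a minimal sketch in Python,
not part of htdig or parse_doc.pl. The function names and the
1000-character threshold are assumptions chosen for illustration, and
the input is assumed to be text already produced by pdftotext.

```python
def find_long_lines(text, limit=1000):
    """Return (line_number, length) for every line longer than `limit`.

    `limit` is an illustrative threshold, not a value taken from the thread.
    """
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


def report(text, limit=1000):
    """Build a one-line summary of how many over-long lines were found."""
    hits = find_long_lines(text, limit)
    if not hits:
        return "no lines longer than %d characters" % limit
    longest = max(length for _, length in hits)
    return "%d long line(s), longest is %d characters" % (len(hits), longest)
```

Reading the converted text file into a string and passing it to
report() gives a quick indication of whether a document is likely to
slow the parser down in the way described.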

