According to Tom Metro:
> Gilles Detillieux <[EMAIL PROTECTED]> writes:
> > According to Geoff Hutchison:
> > > >The question that comes to mind is why is pdf_parser treated specially
> > > >and not implemented via the generalized external parser interface?
...
> I'm confused a bit about the "semi-internal" terminology being used. I 
> haven't read through PDF.cc, but my understanding was that it just 
> called acroread, and then used a built-in PostScript parser to process 
> the output of acroread. So if PostScript parsing is built-in, and thus 
> considered a native format, and Acrobat parsing is accomplished with 
> an external tool, then where does the "semi-internal" terminology come 
> from? To me it seems that acroread is as fully external as any other 
> parser that generates a natively understood format (like plain text). 

Well, you're not the only one who's confused by all this.  That's one of
the reasons I'd favour scrapping the current PDF support and replacing
it with a fully external parser or converter.  The semi-internal
terminology is mine, and I use it to describe the strange animal that
is htdig/PDF.cc.

Originally, htdig had two builtin parsers, as far as I can tell - they
handled HTML and plain text.  Then, to support PDF and PostScript,
two more internal parsers were added (PDF.cc and Postscript.cc).
The PostScript parser was never properly developed, so it sat there in the
code but was disabled.  The PDF.cc code used acroread to convert the
PDF to PostScript, then parsed the result with its own PostScript
parser, which handled only acroread's unique flavour of PostScript.
That's why I call it semi-internal - the parsing of acroread's PS is
internal, but it calls its own external converter to get PS out of a PDF.

Now I'm a bit fuzzy on the history, because all of this happened over
a year ago when I came on the scene, but I believe that external parser
support was added after that, to address the limitations of the current
model and the difficulty in extending it.  Problem is, writing your own
external parser isn't a simple, straightforward task.  A very good attempt
at one was the parse_word_doc.pl script, contributed around 3.1.0b3.
I helped that script evolve into the current parse_doc.pl script,
which handles a number of document types, and parses in a manner that's
reasonably consistent with the internal parsers.  It's far from perfect,
though, and there are still some inconsistencies.

Far easier to implement than a complete external parser is an external
converter.  Many such converters are already available.  parse_doc.pl
uses document-to-text converters to extract text from PDF, PS, Word,
and potentially other document types, then parses that plain text into
external parser records for the title, document head, and individual words.
Why bother parsing plain text at all when there's a good internal
text parser?  Worse still, to take advantage of the many document-to-HTML
converters that are popping up (rtftohtml, mswordview, xlHtml), you'd
need to write your own external HTML parser.  Hence the motivation
for external converters, which are available now as a patch to 3.1.3,
and also in the 3.2.0b1 development source tree.
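
Just to illustrate what parse_doc.pl hands back to htdig: the external
parser interface expects one record per line on stdout, tab-separated,
with a leading letter flagging the record type.  The lines below are
only a rough sketch from memory, not actual parse_doc.pl output, so
check the external_parsers documentation for the exact fields:

        t       Annual Report                   (document title)
        h       This report summarizes ...      (text for the excerpt)
        w       report  2       0               (word, location, heading flags)
        w       summarizes      4       0
        u       http://www.htdig.org/   ht://Dig home page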

What's all this have to do with acroread and PDF.cc?  Well, as I
mentioned above, acroread is an external converter (not a parser),
but it doesn't fit into this new external parser/converter framework.
It has its own builtin parser and its own specialized external parser
interface (pdf_parser).  Given the new framework, this is an unnecessary
and confusing throwback.

Of course, one could now design an external converter for PDFs using
acroread and gs's ps2ascii, but that would be very slow and inefficient.
I'd propose writing an "acrops2text" utility based on Sylvain's PDF.cc
code, to convert acroread's unique variety of PostScript into text, so
that the whole pdf_parser nonsense could be replaced by a fully external
PDF-to-text converter, without losing the functionality of the current
PDF.cc parser, which still handles a limited number of situations
better than xpdf's pdftotext utility.
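
For sites that can live with pdftotext's limitations, though, the
converter itself can be almost trivial.  Here's a rough sketch of the
idea (not code from the distribution; the script name, and the
assumption that a converter takes the same four arguments as an
external parser and writes plain text to stdout, are mine):

        #!/usr/bin/perl
        # pdf2text: hypothetical external converter for htdig.
        # Arguments, as for external parsers: infile, content-type,
        # URL, config file.  Output: the document as plain text on stdout.
        my ($infile, $ctype, $url, $config) = @ARGV;
        # Let xpdf's pdftotext do the work; "-" sends its output to stdout.
        exec("pdftotext", $infile, "-")
            or die "pdf2text: can't run pdftotext: $!\n";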

> Or is the documentation lagging...later you say:
> 
> > We got rid of the PostScript parser, because it never did work, and 
> > now we can get rid of PDF.cc.  
> So does that mean that PostScript isn't a native format and that 
> ht://Dig wouldn't deal with a .ps file, and instead PDF.cc contains a 
> quick-and-dirty parser that only deals with the PostScript 
> specifically generated by acroread?

Yes, exactly.  To parse any arbitrary variety of PostScript, you'd need
an external parser/converter, like gs's ps2ascii, which parse_doc.pl
supports.  Internal PS parsing never got off the ground.

> > I've been giving this whole internal parsers vs. external parsers 
> > issue some thought lately. ...here's what I'd like to see later in 
> > 3.2:
> > 
> > - The whole semi-internal, semi-external pdf_parser support has been 
> > a frequent source of confusion - I'd like to see it go.  It could 
> > now be replaced with an external converter that spits out a 
> > text/plain version of the PDF's contents, using either pdftotext, or 
> > acroread 
> Do you lose meta or structural information by converting to 
> text/plain, or is structural information normally lost when harvesting 
> text from PostScript files anyway?

Yes, that's correct.  Most document types other than HTML currently
end up being dealt with just as plain text, so all structural
information is lost.  The only exception to this is the latest version
of parse_doc.pl, which has a hook in it to extract the title from PDFs
(if one was specified).  No one has ever contributed an external parser
that dealt with hypertext links in PDF or Word documents, for instance.
Though technically feasible, it's not trivial to implement such a beast.
By supporting document to HTML converters through the external converter
mechanism, we can more easily take advantage of the work that's been
done by others in that regard.  mswordview does seem to make proper
HTML <a> tags out of hypertext links in Word documents, and we can hope
that other converters will do likewise, as well as outputting meta tags
and all that other good stuff.
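
Once a document-to-HTML converter is wrapped up in a little script,
hooking it in should just be a matter of a config entry along these
lines (the "type->type" syntax is from memory, and the wrapper names
are made up, so treat this as a sketch and check the external_parsers
documentation):

        external_parsers: application/msword->text/html /usr/local/bin/word2html \
                          application/pdf->text/plain /usr/local/bin/pdf2text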

> > - It's a bit of a pain to maintain multiple internal parsers, and it 
> > leads to a certain amount of duplication of code. ... That leaves 
> > Plaintext.cc and HTML.cc, which isn't bad.  If you think about it, 
> > though, if you SGMLify plain text (at least the <, >, and &) you can 
> > pass it through the HTML parser - that way, you'd only need a single 
> > internal parser to maintain.  That would probably greatly simplify 
> > things internally.
> I'm just a casual observer here, but I'd say that if you were going to 
> standardize on a single parser, your best bet would be XML. You could 
> probably lift an existing XML parser too. It'd give you a certain 
> degree of future proofing. (I see on the todo list "Field-based 
> searching." XML would be a big step in that direction.)

Yes, if someone could add a good, efficient and reliable XML parser to
htdig, that would certainly be the way to go.
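
Either way, the "SGMLify" step I suggested above is trivial.  Just as
an illustration (this isn't code from htdig), a text/plain document
could be wrapped along these lines before being fed to the one internal
parser:

        # Sketch: escape the SGML-special characters in plain text and
        # wrap it so the HTML/XML parser can digest it as markup.
        sub sgmlify {
            my ($text) = @_;
            $text =~ s/&/&amp;/g;   # must be done before < and >
            $text =~ s/</&lt;/g;
            $text =~ s/>/&gt;/g;
            return "<html><head></head><body><pre>$text</pre></body></html>";
        }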

...
> Though of course it would probably make sense to still have an 
> internal text/plain parser - either implemented as a wrapper similar 
> to the above code, or for better efficiency, feed the text directly 
> into the internal text parser and skip the XML parser. (I'm assuming 
> the XML parser would be layered on top of a plain text parser.)

Well, I'd hope that a good XML parser could handle plain text efficiently
enough, given that it wouldn't have any tags to deal with other than
the few "wrapper" tags.  For consistency and ease of maintenance, I'd
really prefer a single internal parser that handled everything as
HTML or a superset of it (e.g. XML).

...
> Of course a radical change like this would push ht://Dig more in the 
> direction of flexibility, than speed, which wouldn't help with the 
> criticisms that it's already not fast enough. (So I read on the 
> net...it worked plenty fast enough for my application.)

Yeah, but I think we'd need to get to the bottom of why exactly htdig
is too slow.  I don't think the current HTML parser is necessarily the
model of efficiency either, so it may be that a well designed XML parser
in its place wouldn't slow things down too much.  I think attention really
needs to be paid to the database back-end, and to minimizing the amount of
copying of huge strings that takes place in the current code.

> > ...leave the actual parsing and word separation to the one builtin 
> > parser, to be assured of consistent treatment of words regardless of 
> > the source document type.
> Exactly.
> 
> 
> Speaking of external_parsers...
> 
> http://www.htdig.org/attrs.html#external_parsers
>   ...
>   description: 
>           This attribute is used to specify a list of
>           content-type/parsers that are to be used to parse
>           documents that cannot be parsed by any of the
>           internal parsers. 
> 
> This might be a good place to list the content-types understood by the 
> internal parsers.

Good idea!

> Can htdig's config parsing handle multiple directives with the same 
> name? (If I recall correctly it only remembers the last one seen.) I 
> was just thinking that it might be cleaner to specify items like this 
> using multiple directives like this:
> 
>   external_parser: text/html /usr/local/bin/htmlparser
>   external_parser: application/ms-word "mswordparser -w"
>   ...
> 
> instead of:
>   external_parsers:
>       text/html /usr/local/bin/htmlparser \
>       application/ms-word "/usr/local/bin/mswordparser -w" 

The problem is you want to be able to override attributes that were
defined previously, in an include file for example.  I'd favour a
different syntax for appending to an already defined attribute, e.g.:

        bad_extensions += .pdf

That way, you decide whether you're appending or redefining.

>           The parser program takes four command-line
>           parameters...:
>           infile content-type URL configuration-file
> 
> Have you considered using variable substitution? I'm not sure if the 
> extra complication is worth it, but I believe htdig already includes 
> code for doing this and it might lessen the need for creating shell 
> wrappers for external parsers that rearrange or discard parameters. i.e.
> 
>   external_parser: application/ms-word "mswordparser -w $infile"

Not a bad idea, but you'd need something that doesn't conflict with
the existing variable substitution mechanism.  E.g.: currently, something
like:
        limit_urls_to:  ${start_url}

is perfectly valid, and is expanded when the attribute is looked up.
Or are you suggesting that we use this very same mechanism for doing
this?  That would mean defining "infile" and other attributes for each
file being parsed, and re-looking up the external_parsers attribute to
get the latest expanded form of the arguments.  That would work too,
but you'd want to make sure the attributes you use for arguments won't
conflict with the existing attribute set.  Otherwise, you'd want to
implement a different mechanism altogether, using either a different
lead-in character than "$", or requiring a backslash before the "$"
in the arguments.
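
For example, with the backslash approach, your example might end up
looking something like this (purely hypothetical syntax, just to show
the idea):

        external_parsers: application/ms-word "mswordparser -w \$infile"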

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
