Gilles Detillieux <[EMAIL PROTECTED]> writes:
> According to Geoff Hutchison:
> > >The question that comes to mind is why is pdf_parser treated specially
> > >and not implemented via the generalized external parser interface?
> >
> > ...having a builtin parser is almost always faster than an
> > external parser.
>
> ...I suspect that Sylvain's PDF.cc code may predate the external
> parser support, or perhaps he had some reservations about using
> external parsers back then (they were buggy until about April, if I
> recall, and still not super efficient).
I'm a bit confused about the "semi-internal" terminology being used. I
haven't read through PDF.cc, but my understanding was that it just
called acroread, and then used a built-in PostScript parser to process
the output of acroread. So if PostScript parsing is built in, and thus
considered a native format, and Acrobat parsing is accomplished with
an external tool, then where does the "semi-internal" terminology come
from? To me it seems that acroread is as fully external as any other
parser that generates a natively understood format (like plain text).
Or is the documentation lagging? Later you say:
> We got rid of the PostScript parser, because it never did work, and
> now we can get rid of PDF.cc.
So does that mean that PostScript isn't a native format and that
ht://Dig wouldn't deal with a .ps file, and instead PDF.cc contains a
quick-and-dirty parser that only deals with the PostScript
specifically generated by acroread?
> I've been giving this whole internal parsers vs. external parsers
> issue some thought lately. ...here's what I'd like to see later in
> 3.2:
>
> - The whole semi-internal, semi-external pdf_parser support has been
> a frequent source of confusion - I'd like to see it go. It could
> now be replaced with an external converter that spits out a
> text/plain version of the PDF's contents, using either pdftotext, or
> acroread
Do you lose meta or structural information by converting to
text/plain, or is structural information normally lost when harvesting
text from PostScript files anyway?
> - It's a bit of a pain to maintain multiple internal parsers, and it
> leads to a certain amount of duplication of code. ... That leaves
> Plaintext.cc and HTML.cc, which isn't bad. If you think about it,
> though, if you SGMLify plain text (at least the <, >, and &) you can
> pass it through the HTML parser - that way, you'd only need a single
> internal parser to maintain. That would probably greatly simplify
> things internally.
I'm just a casual observer here, but I'd say that if you were going to
standardize on a single parser, your best bet would be XML. You could
probably lift an existing XML parser too. It'd give you a certain
degree of future proofing. (I see on the todo list "Field-based
searching." XML would be a big step in that direction.)
Then you'd update your spec. for external parsers so that they'd
generate XML instead of the specialized record format they produce
now. Ideally, this should also allow an external parser author to
create a parser that feeds in just a few bits of meta information as
XML tags, followed by unprocessed plain text. Something like:
<?xml version="1.0" ?>
<!DOCTYPE htdig SYSTEM "http://www.htdig.org/htdig.dtd">
<htdig>
<TITLE>a document title</TITLE>
<KEYWORDS>document keywords</KEYWORDS>
<PLAINTEXT>
....
</PLAINTEXT>
</htdig>
A simple plain-text external parser (just to demonstrate the idea),
written in Perl, might look like:
$file = shift;
open(IN, $file) || die "$file: $!\n";
print <<"EOF";
<?xml version="1.0" ?>
<!DOCTYPE htdig SYSTEM "http://www.htdig.org/htdig.dtd">
<htdig>
<TITLE>$file</TITLE>
<PLAINTEXT>
EOF
while (<IN>) {
    # Quick-and-dirty: strip anything that would close the element
    # early; otherwise the text is passed through unprocessed.
    s|</PLAINTEXT>||ig;
    print;
}
print "\n</PLAINTEXT>\n</htdig>\n";
close(IN);
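For what it's worth, run by hand (the script name is hypothetical)
that would be something like:
    perl plaintext-parser.pl some-document.txt
with the XML written to stdout; the other arguments in the current
calling convention (content-type, URL, configuration file) would
simply be ignored by this sketch.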
Though of course it would probably make sense to still have an
internal text/plain parser - either implemented as a wrapper similar
to the above code, or, for better efficiency, by feeding the text
directly into the internal text parser and skipping the XML parser.
(I'm assuming the XML parser would be layered on top of a plain text
parser.)
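For illustration, the "SGMLify" step quoted above could be as small
as this (the sub name is mine, just a sketch):
sub sgmlify {
    my ($text) = @_;
    # Escape & first, so the entities added below aren't mangled.
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}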
And I believe that for HTML you'd just need a simple wrapper that
prepends the correct DTD spec. (<!DOCTYPE...>, if one isn't already
specified) and then your XML parser would be able to extract
meaningful structural information.
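Such a wrapper might amount to little more than this (the DTD URL
here is invented for the example):
$file = shift;
open(IN, $file) || die "$file: $!\n";
$html = join('', <IN>);
close(IN);
# Prepend a DOCTYPE only if the document doesn't already declare one.
print qq{<!DOCTYPE HTML SYSTEM "http://www.htdig.org/html.dtd">\n}
    unless $html =~ /<!DOCTYPE/i;
print $html;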
Of course a radical change like this would push ht://Dig more in the
direction of flexibility than speed, which wouldn't help with the
criticism that it's already not fast enough. (So I've read on the
net... it worked plenty fast for my application.)
> ...leave the actual parsing and word separation to the one builtin
> parser, to be assured of consistent treatment of words regardless of
> the source document type.
Exactly.
Speaking of external_parsers...
http://www.htdig.org/attrs.html#external_parsers
...
description:
This attribute is used to specify a list of
content-type/parsers that are to be used to parse
documents that cannot be parsed by any of the
internal parsers.
This might be a good place to list the content-types understood by the
internal parsers.
Can htdig's config parsing handle multiple directives with the same
name? (If I recall correctly, it only remembers the last one seen.) I
was just thinking that it might be cleaner to specify items like
these using multiple directives:
external_parser: text/html /usr/local/bin/htmlparser
external_parser: application/ms-word "mswordparser -w"
...
instead of:
external_parsers: text/html /usr/local/bin/htmlparser \
    application/ms-word "/usr/local/bin/mswordparser -w"
The parser program takes four command-line
parameters...:
infile content-type URL configuration-file
Have you considered using variable substitution? I'm not sure if the
extra complication is worth it, but I believe htdig already includes
code for doing this and it might lessen the need for creating shell
wrappers for external parsers that rearrange or discard parameters, e.g.:
external_parser: application/ms-word "mswordparser -w $infile"
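The substitution itself is nearly a one-liner; conceptually something
like this (variable names invented for the example):
# Values htdig would fill in for each document:
%vars = (
    infile       => '/tmp/htdig_doc',
    content_type => 'application/ms-word',
    url          => 'http://www.htdig.org/',
);
$cmd = 'mswordparser -w $infile';   # as read from the config file
$cmd =~ s/\$(\w+)/$vars{$1}/g;
# $cmd is now: mswordparser -w /tmp/htdig_doc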
-Tom
--
Tom Metro
Venture Logic [EMAIL PROTECTED]
Newton, MA, USA