Gilles Detillieux <[EMAIL PROTECTED]> writes:
> According to Geoff Hutchison:
> > >The question that comes to mind is why is pdf_parser treated specially
> > >and not implemented via the generalized external parser interface?
> >
> > ...having a builtin parser is almost always faster than an
> > external parser.
>
> ...I suspect that Sylvain's PDF.cc code may predate the external
> parser support, or perhaps he had some reservations about using
> external parsers back then (they were buggy until about April, if I
> recall, and still not super efficient).
I'm a bit confused about the "semi-internal" terminology being used. I
haven't read through PDF.cc, but my understanding was that it just
called acroread, and then used a built-in PostScript parser to process
the output of acroread. So if PostScript parsing is built in, and thus
considered a native format, and Acrobat parsing is accomplished with
an external tool, then where does the "semi-internal" terminology come
from? To me it seems that acroread is as fully external as any other
parser that generates a natively understood format (like plain text).
Or is the documentation lagging? Later you say:
> We got rid of the PostScript parser, because it never did work, and
> now we can get rid of PDF.cc.
So does that mean that PostScript isn't a native format and that
ht://Dig wouldn't deal with a .ps file, and instead PDF.cc contains a
quick-and-dirty parser that only deals with the PostScript
specifically generated by acroread?
> I've been giving this whole internal parsers vs. external parsers
> issue some thought lately. ...here's what I'd like to see later in
> 3.2:
>
> - The whole semi-internal, semi-external pdf_parser support has been
> a frequent source of confusion - I'd like to see it go. It could
> now be replaced with an external converter that spits out a
> text/plain version of the PDF's contents, using either pdftotext, or
> acroread
Do you lose meta or structural information by converting to
text/plain, or is structural information normally lost when harvesting
text from PostScript files anyway?
> - It's a bit of a pain to maintain multiple internal parsers, and it
> leads to a certain amount of duplication of code. ... That leaves
> Plaintext.cc and HTML.cc, which isn't bad. If you think about it,
> though, if you SGMLify plain text (at least the <, >, and &) you can
> pass it through the HTML parser - that way, you'd only need a single
> internal parser to maintain. That would probably greatly simplify
> things internally.
I'm just a casual observer here, but I'd say that if you were going to
standardize on a single parser, your best bet would be XML. You could
probably lift an existing XML parser too. It'd give you a certain
degree of future proofing. (I see on the todo list "Field-based
searching." XML would be a big step in that direction.)
Then you'd update your spec. for external parsers so that they'd
generate XML instead of the specialized record format they produce
now. Ideally, this should also allow an external parser author to
create a parser that feeds in just a few bits of meta information as
XML tags, followed by unprocessed plain text. Something like:
<?xml version="1.0" ?>
<!DOCTYPE htdig SYSTEM "http://www.htdig.org/htdig.dtd">
<htdig>
<TITLE>a document title</TITLE>
<KEYWORDS>document keywords</KEYWORDS>
<PLAINTEXT>
....
</PLAINTEXT>
</htdig>
A simple plain-text external parser (just to demonstrate the idea),
written in Perl, might look like:
$file = shift;
open(IN, $file) || die "$file: $!\n";
print <<"EOF";
<?xml version="1.0" ?>
<!DOCTYPE htdig SYSTEM "http://www.htdig.org/htdig.dtd">
<htdig>
<TITLE>$file</TITLE>
<PLAINTEXT>
EOF
while (<IN>) {
    # Quick-and-dirty: strip anything that would close the element
    # early; otherwise the text is passed through unprocessed.
    s|</PLAINTEXT>||ig;
    print;
}
print "\n</PLAINTEXT>\n</htdig>\n";
close(IN);
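For what it's worth, run by hand (the script name is hypothetical)
that would be something like:
    perl plaintext-parser.pl some-document.txt
with the XML written to stdout; the other arguments in the current
calling convention (content-type, URL, configuration file) would
simply be ignored by this sketch.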
Though of course it would probably make sense to still have an
internal text/plain parser - either implemented as a wrapper similar
to the above code, or, for better efficiency, by feeding the text
directly into the internal text parser and skipping the XML parser.
(I'm assuming the XML parser would be layered on top of a plain text
parser.)
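For illustration, the "SGMLify" step quoted above could be as small
as this (the sub name is mine, just a sketch):
sub sgmlify {
    my ($text) = @_;
    # Escape & first, so the entities added below aren't mangled.
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}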
And I believe that for HTML you'd just need a simple wrapper that
prepends the correct DTD spec. (<!DOCTYPE...>, if one isn't already
specified) and then your XML parser would be able to extract
meaningful structural information.
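Such a wrapper might amount to little more than this (the DTD URL
here is invented for the example):
$file = shift;
open(IN, $file) || die "$file: $!\n";
$html = join('', <IN>);
close(IN);
# Prepend a DOCTYPE only if the document doesn't already declare one.
print qq{<!DOCTYPE HTML SYSTEM "http://www.htdig.org/html.dtd">\n}
    unless $html =~ /<!DOCTYPE/i;
print $html;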
Of course a radical change like this would push ht://Dig more in the
direction of flexibility than speed, which wouldn't help with the
criticism that it's already not fast enough. (So I've read on the
net... it worked plenty fast for my application.)
> ...leave the actual parsing and word separation to the one builtin
> parser, to be assured of consistent treatment of words regardless of
> the source document type.
Exactly.
Speaking of external_parsers...
http://www.htdig.org/attrs.html#external_parsers
...
description:
This attribute is used to specify a list of
content-type/parsers that are to be used to parse
documents that cannot be parsed by any of the
internal parsers.
This might be a good place to list the content-types understood by the
internal parsers.
Can htdig's config parsing handle multiple directives with the same
name? (If I recall correctly, it only remembers the last one seen.) I
was just thinking that it might be cleaner to specify items like
these using multiple directives:
external_parser: text/html /usr/local/bin/htmlparser
external_parser: application/ms-word "mswordparser -w"
...
instead of:
external_parsers: text/html /usr/local/bin/htmlparser \
    application/ms-word "/usr/local/bin/mswordparser -w"
The parser program takes four command-line
parameters...:
infile content-type URL configuration-file
Have you considered using variable substitution? I'm not sure if the
extra complication is worth it, but I believe htdig already includes
code for doing this and it might lessen the need for creating shell
wrappers for external parsers that rearrange or discard parameters, e.g.:
external_parser: application/ms-word "mswordparser -w $infile"
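The substitution itself is nearly a one-liner; conceptually something
like this (variable names invented for the example):
# Values htdig would fill in for each document:
%vars = (
    infile       => '/tmp/htdig_doc',
    content_type => 'application/ms-word',
    url          => 'http://www.htdig.org/',
);
$cmd = 'mswordparser -w $infile';   # as read from the config file
$cmd =~ s/\$(\w+)/$vars{$1}/g;
# $cmd is now: mswordparser -w /tmp/htdig_doc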
-Tom
--
Tom Metro
Venture Logic [EMAIL PROTECTED]
Newton, MA, USA