At 6:30 PM -0600 11/20/99, Gilles Detillieux wrote:
>Stderr is best used when the program would otherwise have a normal output
>stream on stdout, in which you don't want to bury error messages if the
>output stream is piped or redirected.
This is getting a bit off-topic for this thread, but I guess a good
convention would be that critical or fatal errors always go to
stderr and everything else (debugging messages especially) goes
to stdout.
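Just to make that convention concrete, here is a rough sketch in Python
(not ht://Dig code; the names stream_for and report are made up for
illustration). It only shows the routing rule: fatal errors go to stderr,
everything else goes to stdout.

```python
import sys

def stream_for(fatal):
    # Looked up at call time, so any redirection of sys.stdout or
    # sys.stderr done by the caller is honored.
    return sys.stderr if fatal else sys.stdout

def report(message, fatal=False):
    """Write one message to the stream chosen by the convention above."""
    stream = stream_for(fatal)
    print(message, file=stream)
    return stream

if __name__ == "__main__":
    report("debug: parsed 3 words")                    # goes to stdout
    report("fatal: cannot open database", fatal=True)  # goes to stderr
```

With something like this, redirecting stdout to a file captures the
debugging chatter while fatal errors still show up on the terminal.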
>of the PDF's contents, using either pdftotext, or acroread -toPostScript
>piped through an Acrobat PostScript to text converter based on Sylvain's
>code. That way, those who prefer acroread to xpdf aren't left out in the
>cold.
This sounds good.
>though, if you SGMLify plain text (at least the <, >, and &) you can pass
>it through the HTML parser - that way, you'd only need a single internal
>parser to maintain. That would probably greatly simplify things internally.
>
>All other parsing can be done externally, or better yet, externally convert
>any document type you want to text/html or text/plain, and leave the actual
>parsing and word separation to the one builtin parser, to be assured of
>consistent treatment of words regardless of the source document type.
This is a good point again. I'd still be interested in seeing
performance stats, but I'd guess that the vast majority of the pages
people are indexing are text/plain or text/html (and, increasingly,
XML as well).
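For reference, the "SGMLify" step quoted above amounts to escaping three
characters before handing plain text to the HTML parser. A minimal sketch
in Python, assuming nothing beyond the <, >, and & mentioned there (the
function name sgmlify is hypothetical, not an existing ht://Dig routine):

```python
def sgmlify(text):
    """Escape plain text so an HTML parser treats it as character data."""
    # '&' must be replaced first; otherwise the '&' in the entities
    # produced for '<' and '>' would be escaped a second time.
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))

if __name__ == "__main__":
    print(sgmlify("if a < b && c > d"))
```

The order of the replacements is the only subtle part. Python's standard
html.escape(text, quote=False) covers the same three characters.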
In addition, this could allow for more sophisticated external
parsers. Andrew pointed out that with an ExternalTransport class, you
might want to directly pass a document from the transport handler to
a parser, skipping as much code in between as possible.
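As a very rough illustration of the "convert externally, parse with one
builtin parser" idea, here is a Python sketch. Everything in it is
hypothetical: the CONVERTERS registry, the register decorator, and the
crude tag stripping all stand in for whatever the real design would use,
and real converters would run external programs rather than Python
functions.

```python
import re

# Hypothetical registry: content type -> converter function that returns
# a (new_content_type, text) pair.
CONVERTERS = {}

def register(content_type):
    def wrap(func):
        CONVERTERS[content_type] = func
        return func
    return wrap

def builtin_words(content_type, text):
    """The single word splitter shared by every document type."""
    if content_type == "text/html":
        # Crude tag removal, for illustration only (entities are not decoded).
        text = re.sub(r"<[^>]*>", " ", text)
    return re.findall(r"[a-z0-9]+", text.lower())

def index_words(content_type, data):
    # Keep converting until we reach a type the builtin parser understands.
    seen = set()
    while content_type not in ("text/plain", "text/html"):
        if content_type in seen or content_type not in CONVERTERS:
            raise ValueError("no converter for " + content_type)
        seen.add(content_type)
        content_type, data = CONVERTERS[content_type](data)
    return builtin_words(content_type, data)
```

Because every type ends up in builtin_words, word separation is the same
no matter where the document came from, which is the point made above.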
-Geoff