Re: [htdig3-dev] feedback on ht://Dig documentation

Geoff Hutchison Wed, 24 Nov 1999 12:46:16 -0800
At 6:30 PM -0600 11/20/99, Gilles Detillieux wrote:
>Stderr is best used when the program would otherwise have a normal output
>stream on stdout, in which you don't want to bury error messages if the
>output stream is piped or redirected.

This is getting a bit off-topic (for this thread), but I guess a good 
precedent would be that critical or fatal errors should clearly go to 
stderr and everything else (debugging messages especially) should go 
to stdout.
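
As a rough illustration of that convention (not ht://Dig code; the function name `report` and its parameters are made up for this sketch), a program could route messages by severity like this:

```python
import sys


def report(message, fatal=False, out=None, err=None):
    """Write fatal errors to stderr and everything else to stdout.

    `out` and `err` are hypothetical hooks so the streams can be
    swapped out (for example in tests); they default to the real
    stdout and stderr.
    """
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    stream = err if fatal else out
    print(message, file=stream)
    return stream


if __name__ == "__main__":
    report("debug: opened config file")          # goes to stdout
    report("fatal: cannot open database", True)  # goes to stderr
```

With this split, redirecting stdout to a file or pipe still leaves the fatal errors visible on the terminal.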

>of the PDF's contents, using either pdftotext, or acroread -toPostScript
>piped through an Acrobat PostScript to text converter based on Sylvain's
>code.  That way, those who prefer acroread to xpdf aren't left out in the
>cold.

This sounds good.

>though, if you SGMLify plain text (at least the <, >, and &) you can pass
>it through the HTML parser - that way, you'd only need a single internal
>parser to maintain.  That would probably greatly simplify things internally.
>
>All other parsing can be done externally, or better yet, externally convert
>any document type you want to text/html or text/plain, and leave the actual
>parsing and word separation to the one builtin parser, to be assured of
>consistent treatment of words regardless of the source document type.

This is a good point again. I'd still be interested in seeing 
performance stats, but I'd guess that the vast majority of the pages 
people are indexing are text/plain or text/html (with XML slowly 
catching up).
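
A minimal sketch of the "SGMLify plain text" idea quoted above, assuming that escaping `<`, `>`, and `&` is all the HTML parser needs. The function name `sgmlify` and the `<pre>` wrapper are assumptions made for illustration, not anything taken from the ht://Dig sources:

```python
import html


def sgmlify(text):
    """Escape plain text so it can be handed to an HTML parser.

    Only &, < and > are escaped (quote=False leaves quotation marks
    alone). The <pre> wrapper is an assumption of this sketch, used
    to keep the original line breaks.
    """
    return "<pre>" + html.escape(text, quote=False) + "</pre>"


if __name__ == "__main__":
    print(sgmlify("if a < b && b > c"))
```

The text/plain document then reaches the single built-in parser as text/html, so word separation is handled in one place.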

In addition, this could allow for more sophisticated external 
parsers. Andrew pointed out that with an ExternalTransport class, you 
might want to directly pass a document from the transport handler to 
a parser, skipping as much code in between as possible.

-Geoff

