Re: [jira] Resolved: (TIKA-67) Add an auto-detecting Parser implementation

Jukka Zitting Mon, 15 Oct 2007 15:37:04 -0700

Hi,

On 10/16/07, Keith R. Bennett <[EMAIL PROTECTED]> wrote:
> 1) Now that we have this AutoDetectParser, do you want (me) to remove the
> ParseUtils method that do similar/same things?


That would probably be good, as we shouldn't have too many ways to do
the same thing. In general I would prefer to avoid static methods in
order to keep the codebase modular.

We could even do something like this:

    public class ConvenienceParser extends ParserDecorator {
        ...
        public String parse(InputStream stream) throws ... {
            return parse(stream, new Metadata());
        }
        public String parse(InputStream stream, Metadata metadata) throws ... {
            StringWriter writer = new StringWriter();
            parse(stream, new WriteOutContentHandler(writer), metadata);
            return writer.toString();
        }
        public String parse(File file) throws ... {
            Metadata metadata = new Metadata();
            metadata.set("filename", file.getName());
            InputStream stream = new FileInputStream(file);
            try {
                return parse(stream, metadata);
            } finally {
                stream.close();
            }
        }
        ... // etc.
    }

> 2) Also (by the way), we no longer use RereadableInputStream.  Did you want
> to remove that?

We may still have a use for RereadableInputStream later on. See also
http://www.nabble.com/RereadableInputStream-tf4609782.html for related
discussion on getting similar functionality included in commons-io.

> 3) Could we use a constant for "filename"?  And should we rename it to
> something general enough to cover URL's in addition to file names (such as
> "resourceName")?

Definitely. Probably.

> 4) It looks like it unconditionally reads and inspects the header to
> determine the MIME type, even if the CONTENT_TYPE is provided.  Did you want
> to allow overriding that behavior?   (other than by setting a different
> config object)?

The Metadata.CONTENT_TYPE hint is just a hint. For example the HTTP
ContentType header is quite often incorrect, so a MIME magic match (if
available) is probably more accurate.

> 5) The parse() method sometimes wraps the passed stream in a
> BufferedInputStream.  The caller (presumably) closes its stream after
> parse() returns, but the BufferedInputStream never gets closed, right?

As you noticed, that won't be a problem as the only resource in
BufferedInputStream that needs explicit closing is the underlying
stream.

BR,

Jukka Zitting

Re: [jira] Resolved: (TIKA-67) Add an auto-detecting Parser implementation

Reply via email to