Jukka -

The AutoDetectParser looks good.  I expect it will simplify things quite a
bit.  Some questions for you:

1) Now that we have this AutoDetectParser, do you want (me) to remove the
ParseUtils methods that do the same or similar things?

2) Also (by the way), we no longer use RereadableInputStream.  Do you want
to remove it?

3) Could we use a constant for "filename"?  And should we rename it to
something general enough to cover URLs in addition to file names (such as
"resourceName")?

4) It looks like the AutoDetectParser unconditionally reads and inspects
the header to determine the MIME type, even if the CONTENT_TYPE is
provided.  Do you want to allow overriding that behavior (other than by
setting a different config object)?

5) The parse() method sometimes wraps the passed stream in a
BufferedInputStream.  The caller (presumably) closes its own stream after
parse() returns, but the BufferedInputStream itself never gets closed,
right?  That does not look like a problem, though, because
BufferedInputStream.close() does nothing beyond dropping its buffer and
closing the wrapped stream.  From BufferedInputStream.java:

public void close() throws IOException {
    byte[] buffer;
    while ( (buffer = buf) != null) {
        if (bufUpdater.compareAndSet(this, buffer, null)) {
            InputStream input = in;
            in = null;
            if (input != null)
                input.close();
            return;
        }
        // Else retry in case a new buf was CASed in fill()
    }
}
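
To double-check point 5, here is a minimal sketch using only java.io.  The
CloseDemo and TrackingStream names are made up for this illustration and
are not part of Tika.  It closes only the BufferedInputStream wrapper and
reports whether the close reached the wrapped stream:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class CloseDemo {

    /** Hypothetical helper: remembers whether close() was called on it. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Closes only the wrapper; returns true if the inner stream saw close(). */
    static boolean wrapperCloseReachesInner() {
        TrackingStream inner = new TrackingStream(new byte[] {1, 2, 3});
        try {
            InputStream wrapper = new BufferedInputStream(inner);
            wrapper.read();   // read through the wrapper, as a parser would
            wrapper.close();  // forwards close() to the wrapped stream
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return inner.closed;
    }

    public static void main(String[] args) {
        System.out.println(wrapperCloseReachesInner()); // prints "true"
    }
}
```

Since the wrapper holds no resource of its own besides the buffer, a caller
that closes the original stream should leave nothing else to release.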

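As for point 4, one possible shape for the override is sketched below.
This is only an assumption about how it could work, not how the
AutoDetectParser is written: TypeHintDemo, sniff(), resolveType() and the
trustHint flag are all hypothetical names, and the magic-byte check knows
nothing but PDF.

```java
import java.nio.charset.StandardCharsets;

class TypeHintDemo {

    /** Hypothetical stand-in for header inspection; it only recognizes PDF. */
    static String sniff(byte[] header) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (header.length < magic.length) {
            return "application/octet-stream";
        }
        for (int i = 0; i < magic.length; i++) {
            if (header[i] != magic[i]) {
                return "application/octet-stream";
            }
        }
        return "application/pdf";
    }

    /**
     * Returns the caller's content type hint when one is given and
     * trustHint is set; otherwise falls back to inspecting the header.
     */
    static String resolveType(String hint, byte[] header, boolean trustHint) {
        if (trustHint && hint != null && !hint.isEmpty()) {
            return hint;
        }
        return sniff(header);
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(resolveType("text/plain", pdf, true));  // text/plain
        System.out.println(resolveType("text/plain", pdf, false)); // application/pdf
        System.out.println(resolveType(null, pdf, true));          // application/pdf
    }
}
```

With the flag off, the behavior matches what the parser appears to do today
(always sniff); with it on, a supplied CONTENT_TYPE would win.
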

- Keith



JIRA [EMAIL PROTECTED] wrote:
> 
> 
>      [
> https://issues.apache.org/jira/browse/TIKA-67?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
> 
> Jukka Zitting resolved TIKA-67.
> -------------------------------
> 
>     Resolution: Fixed
> 
> Patch committed in revision 584921.
> 
>> Add an auto-detecting Parser implementation
>> -------------------------------------------
>>
>>                 Key: TIKA-67
>>                 URL: https://issues.apache.org/jira/browse/TIKA-67
>>             Project: Tika
>>          Issue Type: New Feature
>>          Components: general
>>            Reporter: Jukka Zitting
>>            Assignee: Jukka Zitting
>>             Fix For: 0.1-incubator
>>
>>         Attachments: TIKA-67.patch
>>
>>
>> We should have an AutoDetectParser class that uses the MIME framework to
>> automatically detect the type of the document being parsed, and that
>> dispatches the parsing task to the parser class configured for the
>> detected MIME type.
>> The class would work like this:
>>     InputStream stream = ...;
>>     ContentHandler handler = ...;
>>     Metadata metadata = new Metadata();
>>     metadata.set(Metadata.CONTENT_TYPE, ...); // optional content type hint
>>     metadata.set("filename", ...); // optional file name hint
>>     AutoDetectParser parser = new AutoDetectParser();
>>     parser.setConfig(...); // optional TikaConfig configuration
>>     parser.parse(stream, handler, metadata);
> 
