[ https://issues.apache.org/jira/browse/TIKA-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856653#action_12856653 ]

Jukka Zitting commented on TIKA-153:
------------------------------------

I have an idea on how to implement this...

The current Tika APIs are already pretty good, and I'd hate to complicate the 
clean Parser interface with extra methods for different kinds of inputs. 
Instead I'm thinking of adding a TikaInputStream utility class that extends 
InputStream with methods that allow accessing the input document as a File.

The TikaInputStream class would have at least the following constructors:

    public TikaInputStream(InputStream stream) { ... }
    public TikaInputStream(File file) { ... }

In addition to the standard InputStream methods, it would provide at least the 
following:

    public File getFile() { ... }

If the TikaInputStream instance was created from a normal InputStream, then the 
getFile() method would automatically copy the stream into a temporary file 
that'll get removed when the stream is closed.

The Tika facade would always pass TikaInputStreams to the underlying parsers, 
and we'd recommend that downstream projects also use this class when directly 
accessing the Parser API, but doing so would not be necessary. Instead, the 
TikaInputStream class would have a static method like the following that our 
parsers could use to access the extra functionality:

    public static TikaInputStream getTikaInputStream(InputStream stream) {
        if (stream instanceof TikaInputStream) {
            return (TikaInputStream) stream;
        } else {
            return new TikaInputStream(stream);
        }
    }
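
To make the proposal a bit more concrete, here is a rough sketch of how the 
pieces above might fit together. It is only an illustration, not a finished 
design: extending FilterInputStream, the "tika-" temporary file prefix, the 
"temporary" flag and the use of java.nio.file.Files for the copy are all 
assumptions of this sketch. It also assumes that getFile() copies only the 
bytes that have not yet been read from the wrapped stream.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Illustrative sketch of the proposed class; details are assumptions.
class TikaInputStream extends FilterInputStream {

    // Backing file; null until one is supplied or getFile() spools the stream.
    private File file;

    // True only when this class created the file and must delete it on close.
    private boolean temporary;

    public TikaInputStream(InputStream stream) {
        super(stream);
    }

    public TikaInputStream(File file) throws IOException {
        super(new FileInputStream(file));
        this.file = file;
    }

    // Reuses an existing TikaInputStream, or wraps a plain stream in a new one.
    public static TikaInputStream getTikaInputStream(InputStream stream) {
        if (stream instanceof TikaInputStream) {
            return (TikaInputStream) stream;
        } else {
            return new TikaInputStream(stream);
        }
    }

    // Returns the document as a file, spooling the stream to disk if needed.
    public File getFile() throws IOException {
        if (file == null) {
            File tmp = File.createTempFile("tika-", ".tmp");
            try {
                // Copies whatever has not yet been read from the wrapped stream.
                Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                tmp.delete();
                throw e;
            }
            file = tmp;
            temporary = true;
            // Keep serving reads, now from the temporary copy.
            InputStream original = in;
            in = new FileInputStream(tmp);
            original.close();
        }
        return file;
    }

    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            // Only files created by getFile() are removed, never caller files.
            if (temporary) {
                file.delete();
            }
        }
    }
}
```

With something like this, a parser that needs random access could call 
TikaInputStream.getTikaInputStream(stream).getFile() and would get the 
caller's original file when one was passed in, avoiding the extra copy.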


> Allow passing of files or memory buffers to parsers
> ---------------------------------------------------
>
>                 Key: TIKA-153
>                 URL: https://issues.apache.org/jira/browse/TIKA-153
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> Some of our parsers need to be able to go back and forth within a source 
> document, so need either a file or (for smaller documents) an in-memory 
> buffer that contains the full document. Currently we use temporary files for 
> such cases, which in some cases means doing an extra copy of a file before it 
> gets parsed. We should come up with some way for clients to pass in a file or 
> a memory buffer if one is available.


        
