[
https://issues.apache.org/jira/browse/TIKA-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856653#action_12856653
]
Jukka Zitting commented on TIKA-153:
------------------------------------
I have an idea on how to implement this...
The current Tika APIs are already pretty good, and I'd hate to complicate the
clean Parser interface with extra methods for different kinds of inputs.
Instead I'm thinking of adding a TikaInputStream utility class that extends
InputStream with methods that allow accessing the input document as a File.
The TikaInputStream class would have at least the following constructors:
public TikaInputStream(InputStream stream) { ... }
public TikaInputStream(File file) { ... }
And would in addition to the standard InputStream methods provide at least the
following:
public File getFile() { ... }
If the TikaInputStream instance was created from a normal InputStream, then the
getFile() method would automatically copy the stream into a temporary file
that'll get removed when the stream is closed.
The Tika facade would always pass TikaInputStreams to the underlying parsers,
and we'd recommend that downstream projects also use this class when accessing
the Parser API directly, but doing so would not be required. Instead, the
TikaInputStream class would have a static method like the following, which our
parsers could use to access the extra functionality:
public static TikaInputStream getTikaInputStream(InputStream stream) {
    if (stream instanceof TikaInputStream) {
        return (TikaInputStream) stream;
    } else {
        return new TikaInputStream(stream);
    }
}
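
To make the idea above more concrete, here is a rough sketch of how such a
class might fit together. It is only an illustration under a few assumptions
that are not part of the proposal: it extends FilterInputStream, the File
constructor declares IOException, the temporary file uses a made-up "tika-"
prefix, and the class is left package-private so the snippet compiles in any
file. The actual implementation may well look different.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Illustrative sketch only, not the actual Tika implementation.
class TikaInputStream extends FilterInputStream {

    // Backing file: either supplied by the caller or spooled by getFile().
    private File file;

    // True only if this class created the file and must delete it on close.
    private boolean temporary;

    public TikaInputStream(InputStream stream) {
        super(stream);
    }

    // Assumption: opening the file may fail, so the constructor throws.
    public TikaInputStream(File file) throws IOException {
        super(new FileInputStream(file));
        this.file = file;
    }

    // Returns the document as a File. For a plain stream, the bytes not yet
    // read are copied into a temporary file, and further reads are served
    // from that file so the stream position is preserved.
    public File getFile() throws IOException {
        if (file == null) {
            File tmp = File.createTempFile("tika-", ".tmp"); // hypothetical prefix
            try {
                Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                tmp.delete(); // do not leave a partial copy behind
                throw e;
            }
            in.close();
            in = new FileInputStream(tmp);
            file = tmp;
            temporary = true;
        }
        return file;
    }

    // Closes the stream and removes the temporary file, if one was created.
    // A file passed in by the caller is never deleted.
    @Override
    public void close() throws IOException {
        try {
            super.close();
        } finally {
            if (temporary) {
                file.delete();
            }
        }
    }

    // Reuses an existing TikaInputStream instead of wrapping it again.
    public static TikaInputStream getTikaInputStream(InputStream stream) {
        if (stream instanceof TikaInputStream) {
            return (TikaInputStream) stream;
        } else {
            return new TikaInputStream(stream);
        }
    }
}
```

A parser that needs random access could then call
TikaInputStream.getTikaInputStream(stream).getFile() and would only pay for
the extra copy when the client did not already pass in a file.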
> Allow passing of files or memory buffers to parsers
> ---------------------------------------------------
>
> Key: TIKA-153
> URL: https://issues.apache.org/jira/browse/TIKA-153
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Priority: Minor
>
> Some of our parsers need to be able to go back and forth within a source
> document, so need either a file or (for smaller documents) an in-memory
> buffer that contains the full document. Currently we use temporary files for
> such cases, which in some cases means doing an extra copy of a file before it
> gets parsed. We should come up with some way for clients to pass in a file or
> a memory buffer if one is available.