[
https://issues.apache.org/jira/browse/TIKA-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372063#comment-16372063
]
Nick Burch commented on TIKA-2585:
----------------------------------
I can't immediately see a common / well known class/interface we could accept -
Spring has {{InputStreamSource}} in its core, but that's potentially a huge
dependency to suck in for just one class. Various other libraries define their
own {{InputStreamFactory}}, but I can't seem to find any of those in tiny
libraries / the JVM core itself.
Before we have to create our own class/interface for passing this to
{{TikaInputStream}}, does anyone know of a good one we can re-use?
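
For illustration only, here is a minimal sketch of what such an interface might look like if we did end up defining our own. The name {{InputStreamFactory}} is borrowed from the generic name mentioned above, and the single {{getInputStream()}} method, the {{readRepeatedly}} helper and the wrapper class are all assumptions made for this example. None of it is existing Tika API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class InputStreamFactorySketch {

    /** Hypothetical interface: each call returns a fresh stream over the same content. */
    @FunctionalInterface
    public interface InputStreamFactory {
        InputStream getInputStream() throws IOException;
    }

    /** Reads a stream to the end and returns the number of bytes it held. */
    public static long countBytes(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    /**
     * Reads the full content several times, asking the factory for a new
     * stream on each pass instead of spooling the first stream to a temp file.
     */
    public static long readRepeatedly(InputStreamFactory factory, int passes) throws IOException {
        long total = 0;
        for (int i = 0; i < passes; i++) {
            try (InputStream in = factory.getInputStream()) {
                total += countBytes(in);
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello tika".getBytes(StandardCharsets.UTF_8);
        // The caller knows how to cheaply re-open its own source.
        InputStreamFactory factory = () -> new ByteArrayInputStream(data);
        System.out.println(readRepeatedly(factory, 3));
    }
}
```

A single-method interface like this can be supplied as a lambda, so a caller that can cheaply re-open its source (a file, a database blob, an HTTP resource) would not need to implement a full class.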
> TikaInputStream support for resetting via a factory of InputStreams
> -------------------------------------------------------------------
>
> Key: TIKA-2585
> URL: https://issues.apache.org/jira/browse/TIKA-2585
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.0, 1.17
> Reporter: Nick Burch
> Priority: Major
>
> As raised in the 2.0 breaking changes thread, currently the only way that
> Tika has of handling the need to fully read an InputStream multiple times is
> to use {{TikaInputStream.getFile()}} which will spool to a temp file if not
> already file-based. (Reading a few KB is handled via buffering and
> mark/reset, but that doesn't scale to reading huge files in full.)
> In some cases, grabbing a fresh {{InputStream}} is actually cheaper than Tika
> spooling to a temp file, but a caller has no way of expressing that.
> So, before we make too much extra use of re-processing the whole input
> several times (eg for the augmenting-parsers and fallback-parsers), we should
> provide a way for callers to instead supply new {{InputStream}} instances on
> demand.