[jira] [Updated] (TIKA-2585) TikaInputStream support for resetting via a factory of InputStreams

Nick Burch (JIRA) Wed, 21 Feb 2018 13:44:51 -0800

     [ 
https://issues.apache.org/jira/browse/TIKA-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Nick Burch updated TIKA-2585:
-----------------------------
    Description: 
As raised in the 2.0 breaking changes thread, the only way Tika currently has 
of fully reading an InputStream multiple times is to use 
{{TikaInputStream.getFile()}}, which spools to a temp file if the stream is not 
already file-based. (Reading a few KB is handled via buffering and mark/reset, 
but that doesn't scale to whole, very large files.)

In some cases, grabbing a fresh {{InputStream}} is actually cheaper than having 
Tika spool to a temp file, but a caller currently has no way to express that.

So, before we make much more use of re-processing the whole input several 
times (e.g. for the augmenting-parsers and fallback-parsers), we should provide 
a way for callers to instead supply new {{InputStream}} instances on demand.

  was:
As raised in the 2.0 breaking changes thread, currently the only way that Tika 
has of handling the need to fully read an InputStream multiple times is to use 
`TikaInputStream.getFile()` which will spool to a temp file if not already 
file-based. (Reading a few kb is handled via buffering and mark/reset, but that 
doesn't scale for huge full files)

In some cases, grabbing a fresh `InputStream` is actually cheaper than Tika 
spooling to a temp file, but we've no way of a caller expressing that

So, before we make too much extra use of re-processing the whole input several 
times (eg for the augmenting-parsers and fallback-parsers), we should provide a 
way for callers to instead supply new InputStream instances on demand


> TikaInputStream support for resetting via a factory of InputStreams
> -------------------------------------------------------------------
>
>                 Key: TIKA-2585
>                 URL: https://issues.apache.org/jira/browse/TIKA-2585
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.0, 1.17
>            Reporter: Nick Burch
>            Priority: Major
>
> As raised in the 2.0 breaking changes thread, the only way Tika currently 
> has of fully reading an InputStream multiple times is to use 
> {{TikaInputStream.getFile()}}, which spools to a temp file if the stream is 
> not already file-based. (Reading a few KB is handled via buffering and 
> mark/reset, but that doesn't scale to whole, very large files.)
> In some cases, grabbing a fresh {{InputStream}} is actually cheaper than 
> having Tika spool to a temp file, but a caller currently has no way to 
> express that.
> So, before we make much more use of re-processing the whole input several 
> times (e.g. for the augmenting-parsers and fallback-parsers), we should 
> provide a way for callers to instead supply new {{InputStream}} instances on 
> demand.
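
To make the request concrete, here is a rough sketch of what "resetting via a factory of InputStreams" could look like. It is only an illustration: {{InputStreamFactory}}, {{ReopenableSource}}, {{reopen()}} and {{drain()}} are hypothetical names invented for this example, not existing Tika API, and the sketch deliberately avoids any Tika classes so it runs on a plain JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReopenSketch {

    /** Hypothetical callback: returns a brand-new stream over the same content on each call. */
    @FunctionalInterface
    interface InputStreamFactory {
        InputStream getInputStream() throws IOException;
    }

    /** Hypothetical holder that "resets" by asking the factory for a fresh stream. */
    static final class ReopenableSource implements AutoCloseable {
        private final InputStreamFactory factory;
        private InputStream current;

        ReopenableSource(InputStreamFactory factory) {
            this.factory = factory;
        }

        /** Closes any stream handed out earlier and opens a new one at offset 0. */
        InputStream reopen() throws IOException {
            close();
            current = factory.getInputStream();
            return current;
        }

        @Override
        public void close() throws IOException {
            if (current != null) {
                current.close();
                current = null;
            }
        }
    }

    /** Reads the stream to the end and returns the number of bytes seen. */
    static long drain(InputStream in) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "example document".getBytes(StandardCharsets.UTF_8);
        // The caller decides how a fresh stream is obtained; nothing is spooled to disk.
        try (ReopenableSource source = new ReopenableSource(() -> new ByteArrayInputStream(data))) {
            long firstPass = drain(source.reopen());   // e.g. a primary parser
            long secondPass = drain(source.reopen());  // e.g. a fallback parser re-reading everything
            System.out.println(firstPass + " " + secondPass);
        }
    }
}
```

A caller whose data lives somewhere cheap to re-open (a local path, an object store, a database blob) would pass a lambda that opens it again, for instance {{() -> Files.newInputStream(path)}}, instead of the in-memory array used here. Whether the stream wrapper should fall back to temp-file spooling when no factory is supplied is left open in this sketch.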



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
