[ 
https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080195#comment-17080195
 ] 

Luís Filipe Nassif commented on TIKA-2849:
------------------------------------------

Hi [~boris-petrov],

A number of Tika parsers need a java.io.File because their underlying 
dependencies require one. Looking at the current sources, I found that File 
is needed by the parsers for rar, 7z, pst, mp4, jpg, tif, webp, sqlite, and 
maybe others. Currently there is no way to know in advance whether a parser 
will spool the stream to disk.

However, my organization has a project with a hard requirement to run a 
search tool on computers and cellphones with very limited resources in the 
field. We would rather receive an 
IOException("File size larger than max spool limit") from parsers than wait 
too long in dangerous places, or exhaust the device's resources and crash 
the app.

[~tallison], what do you think of a new TikaInputStream constructor that 
takes the spool limit, or a setMaxSpoolSize() method to set this limit? If 
the limit is reached, TikaInputStream should throw the IOException above.
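
To make the proposal concrete, here is a rough sketch of the behaviour in 
plain Java. It is not Tika code: the class name BoundedSpool and the method 
spool() are hypothetical, and a real change would live inside 
TikaInputStream. It only shows the idea of copying a stream to a temporary 
file and failing as soon as the limit is exceeded.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tika: illustrates a size-limited spool.
public class BoundedSpool {

    /**
     * Copies the stream to a temporary file, throwing an IOException as soon
     * as more than maxBytes have been read. The partial file is deleted on
     * failure so that an oversized input does not leave data on disk.
     */
    public static Path spool(InputStream in, long maxBytes) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".tmp");
        boolean ok = false;
        try (OutputStream out = Files.newOutputStream(tmp)) {
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
                if (total > maxBytes) {
                    throw new IOException("File size larger than max spool limit");
                }
                out.write(buf, 0, n);
            }
            ok = true;
            return tmp;
        } finally {
            // The output stream is already closed when this block runs.
            if (!ok) {
                Files.deleteIfExists(tmp);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A 10-byte stream fits under a 100-byte limit.
        Path p = spool(new ByteArrayInputStream(new byte[10]), 100);
        System.out.println("spooled " + Files.size(p) + " bytes");
        Files.delete(p);

        // A 200-byte stream exceeds the limit and is rejected.
        try {
            spool(new ByteArrayInputStream(new byte[200]), 100);
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The check runs before each write, so at most maxBytes are ever written to 
disk, and a stream of exactly maxBytes is still accepted.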

If approved, I can code that; it is simple.

> TikaInputStream copies the input stream locally
> -----------------------------------------------
>
>                 Key: TIKA-2849
>                 URL: https://issues.apache.org/jira/browse/TIKA-2849
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.20
>            Reporter: Boris Petrov
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.21
>
>
> When doing "tika.detect(stream, name)" and the stream is a "TikaInputStream", 
> execution gets to "TikaInputStream#getPath" which does a "Files.copy(in, 
> path, REPLACE_EXISTING);" which is very, very bad. This input stream could 
> be, as in our case, an input stream from a network file which is tens or 
> hundreds of gigabytes large. Copying it locally is a huge waste of resources 
> to say the least. Why does it do that and can I make it not do it? Or is this 
> something that has to be fixed in Tika?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
