Jukka and All -

I've been thinking about how our Parser interface takes an InputStream
rather than a resource identifier (URL, File, String).

So that the original resource is read only once, we have the
RereadableInputStream.  However, because it duplicates the data in memory
or on disk, it presents the following potential problems:

1) We are implementing the chunking of data using the SAX events.  This
allows us to break up a document into smaller parts.  However, there is no
such chunking with regard to the RereadableInputStream; it reads and stores
the entire document.

2) Users need to be much more aware of their system's resources at all
points in time during which Tika may be in use.  This would require
anticipating available disk storage, what other processes are running, etc.

3) In some environments, saving to disk is not practical due to performance
or security concerns.

4) We introduce the risk of bringing down the JVM if the maximum memory is
exceeded, and possibly worse if the disk runs out of free space.

5) The parser implementations themselves may store data and use large
amounts of memory, so we may not have as much memory or disk available as we
may think.

* * *

For casual uses, this will probably not be a problem.  However, many users
will need Tika to be robust and efficient even under high loads.

So I raise the question -- should we think about supporting multiple reads
of a resource, at least as an option?  Many users will work only with static
resources such as files, and not be concerned about the data changing
between reads.  This would require changing the Parser interface, probably
to take a URL rather than an InputStream.
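
To make the idea concrete, here is a rough sketch of a parser that takes a
URL and simply reopens the resource for each pass instead of buffering it.
The names ResourceParser and TwoPassParser are hypothetical and are not
existing Tika types; the "parsing" is a stand-in that only reports the first
byte and the total length.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

/** Hypothetical URL-based parser interface; not an existing Tika type. */
interface ResourceParser {
    String parse(URL resource) throws IOException;
}

/** Illustrative two-pass parser that reopens the resource for each pass. */
class TwoPassParser implements ResourceParser {
    public String parse(URL resource) throws IOException {
        int first;
        // Pass 1: open the resource and look only at the first byte.
        try (InputStream in = resource.openStream()) {
            first = in.read();
        }
        long length = 0;
        // Pass 2: open a fresh stream and read everything, keeping
        // only a small fixed-size buffer in memory.
        try (InputStream in = resource.openStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                length += n;
            }
        }
        return "firstByte=" + first + ", length=" + length;
    }
}
```

Nothing is copied to memory or disk here, but the approach assumes the
resource does not change between the two opens, and it pays the cost of
opening it twice.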

Maybe this is not necessary though -- do we know to what extent parsers need
to make multiple passes?  And will they ever need the first pass to read
more than just a small header?  If not, then BufferedInputStream's mark
and reset would work fine, and we would not need to store the bytes read
ourselves, using RereadableInputStream or otherwise.  I have no knowledge of
the parser implementations, so I thought RereadableInputStream would cover
the worst case.  However, I'm now seeing that it presents problems of its
own.
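
For the small-header case, a minimal sketch of the mark/reset approach
follows.  The class HeaderPeek and its peek method are made up for
illustration; only the java.io calls are real.  It assumes the first pass
never reads more than the limit passed to mark().

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class HeaderPeek {
    /**
     * Reads up to limit bytes from the start of the stream, then rewinds
     * so that the next reader sees the stream from its first byte again.
     */
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);                 // remember the current position
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) {
                break;                  // stream is shorter than limit
            }
            total += n;
        }
        in.reset();                     // rewind to the marked position
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "%PDF-1.4 rest of document".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(doc));
        byte[] header = peek(in, 8);    // first pass: header only
        System.out.println(new String(header, StandardCharsets.US_ASCII));
        System.out.println((char) in.read()); // second pass starts at byte 0
    }
}
```

Only the header bytes are held in the stream's buffer, so memory use is
bounded by the limit rather than by the document size.  If a parser needs a
full first pass, this does not help.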

- Keith

-- 
View this message in context: 
http://www.nabble.com/Parser-Interface%2C-RereadableInputStream-tf4616886.html#a13185507
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
