Re: Extensible content type detection

Jukka Zitting Tue, 27 Jan 2009 07:01:09 -0800

Hi,

On Mon, Jan 19, 2009 at 11:17 PM, Jukka Zitting <[email protected]> wrote:
> On Mon, Jan 19, 2009 at 10:24 PM, Niall Pemberton
> <[email protected]> wrote:
>> But your API says "...the detector must only read up to a limited
>> number of bytes from the stream to avoid potentially unbounded memory
>> use for the buffer of a marked a stream."
>
> Limited but not fixed. I'd like to leave it up to the detector
> implementation to determine how many bytes it actually needs, and only
> set a very high upper limit in the calling application.


Actually, now that I've played a bit with the new code in
org.apache.tika.detect, I think it makes the most sense if we avoid
setting any limit at the application level.

Instead we could change the Detector API contract so that the passed
InputStream must support the mark feature and that the Detector
implementation should use the mark() and reset() methods inside
detect() to restore the stream to its original state before
returning.

This would put each Detector implementation directly in control of how
much data they need and how much memory they are willing to consume.
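
To make the idea concrete, here is a rough sketch of what a detector
following such a contract could look like. The SimpleDetector
interface, the PdfMagicDetector class and the returned type strings are
assumptions made up for this example; they are not the actual
org.apache.tika.detect API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical, simplified interface for illustration; not the actual Tika API.
interface SimpleDetector {
    String detect(InputStream stream) throws IOException;
}

// Illustrative detector that recognizes the "%PDF" magic prefix.
class PdfMagicDetector implements SimpleDetector {
    private static final byte[] MAGIC = {'%', 'P', 'D', 'F'};

    @Override
    public String detect(InputStream stream) throws IOException {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        // The detector, not the caller, decides how many bytes may be buffered.
        stream.mark(MAGIC.length);
        try {
            byte[] prefix = new byte[MAGIC.length];
            int n = 0;
            while (n < prefix.length) {
                int r = stream.read(prefix, n, prefix.length - n);
                if (r == -1) {
                    break; // stream ended before a full prefix was read
                }
                n += r;
            }
            boolean match = n == MAGIC.length && Arrays.equals(prefix, MAGIC);
            return match ? "application/pdf" : "application/octet-stream";
        } finally {
            // Restore the stream to its original position before returning.
            stream.reset();
        }
    }
}
```

The read limit passed to mark() is chosen by the detector itself, so a
detector that needs only four bytes never forces a large buffer on the
caller.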

This would also simplify clients, as they wouldn't need to worry about
maximum buffer sizes etc. All they need to do is ensure that the
stream passed to the detector supports the mark feature. That can be
easily done like this:

    if (!stream.markSupported()) {
        stream = new BufferedInputStream(stream);
    }

Putting the Detector in control of the mark/reset calls also makes it
easier to implement a composite detector that may do multiple passes
over the same data using different type-specific detectors.
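
A composite detector along these lines might look roughly like the
sketch below. Again, SimpleDetector, PrefixDetector, CompositeDetector
and the fallback type string are hypothetical names used only for
illustration, and the "first non-default answer wins" policy is just
one possible choice.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified interface, repeated so this example stands alone.
interface SimpleDetector {
    String detect(InputStream stream) throws IOException;
}

// Type-specific detector: matches a fixed magic prefix, then resets the stream.
class PrefixDetector implements SimpleDetector {
    static final String UNKNOWN = "application/octet-stream";
    private final byte[] magic;
    private final String type;

    PrefixDetector(String magic, String type) {
        this.magic = magic.getBytes(StandardCharsets.US_ASCII);
        this.type = type;
    }

    @Override
    public String detect(InputStream stream) throws IOException {
        stream.mark(magic.length); // each detector picks its own limit
        try {
            byte[] prefix = new byte[magic.length];
            int n = 0;
            while (n < prefix.length) {
                int r = stream.read(prefix, n, prefix.length - n);
                if (r == -1) {
                    break; // stream ended early
                }
                n += r;
            }
            boolean match = n == magic.length && Arrays.equals(prefix, magic);
            return match ? type : UNKNOWN;
        } finally {
            stream.reset(); // leave the stream where we found it
        }
    }
}

// Runs several detectors over the same stream, one pass per detector.
class CompositeDetector implements SimpleDetector {
    private final List<SimpleDetector> detectors;

    CompositeDetector(List<SimpleDetector> detectors) {
        this.detectors = detectors;
    }

    @Override
    public String detect(InputStream stream) throws IOException {
        for (SimpleDetector detector : detectors) {
            // Each child restores the stream, so the next one starts at the
            // same position without any extra bookkeeping here.
            String result = detector.detect(stream);
            if (!PrefixDetector.UNKNOWN.equals(result)) {
                return result;
            }
        }
        return PrefixDetector.UNKNOWN;
    }
}
```

Because every child detector resets the stream itself, the composite
needs no mark/reset calls of its own and does not have to know how
many bytes each child reads.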

BR,

Jukka Zitting
