Re: Extensible content type detection

Jukka Zitting Mon, 19 Jan 2009 14:17:40 -0800

Hi,

On Mon, Jan 19, 2009 at 10:24 PM, Niall Pemberton
<[email protected]> wrote:
> But your API says "...the detector must only read up to a limited
> number of bytes from the stream to avoid potentially unbounded memory
> use for the buffer of a marked stream."


Limited but not fixed. I'd like to leave it up to the detector
implementation to determine how many bytes it actually needs, and only
set a very high upper limit in the calling application.

> - with a ByteBuffer the detector would be able to discover how
> many bytes it's allowed to read - otherwise how are you going to prevent
> it going past the limit - throw an exception?

Yes, an exception seems reasonable, but only at something like 1M
bytes down the line. No reasonable detector should ever need to look
so deep into the stream. And if one does, then we can just as well
consider the detection to have failed and treat the stream as
application/octet-stream.
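
To make the idea concrete, here is a rough sketch of how the calling
application could enforce such an upper limit. All names in it
(Detector, LimitedLookaheadStream, LookaheadLimitException,
BoundedDetection) are hypothetical and only illustrate the approach;
this is not the actual Tika API. The caller marks the stream, lets the
detector read as much as it likes through a counting wrapper, and
treats a limit violation as a failed detection.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical detector contract: read what you need, return a media type. */
interface Detector {
    String detect(InputStream in) throws IOException;
}

/** Hypothetical exception signalling that a detector read past the limit. */
class LookaheadLimitException extends IOException {
    LookaheadLimitException(long limit) {
        super("Detector tried to read more than " + limit + " bytes");
    }
}

/** Counts consumed bytes and fails once the configured limit is reached. */
class LimitedLookaheadStream extends FilterInputStream {
    private final long limit;
    private long consumed = 0;

    LimitedLookaheadStream(InputStream in, long limit) {
        super(in);
        this.limit = limit;
    }

    @Override
    public int read() throws IOException {
        if (consumed >= limit) throw new LookaheadLimitException(limit);
        int b = super.read();
        if (b != -1) consumed++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        if (consumed >= limit) throw new LookaheadLimitException(limit);
        // Never hand out more bytes than the remaining budget.
        int n = super.read(buf, off, (int) Math.min(len, limit - consumed));
        if (n > 0) consumed += n;
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(Math.max(0, Math.min(n, limit - consumed)));
        consumed += skipped;
        return skipped;
    }

    // The caller owns the mark on the underlying stream, so hide it here.
    @Override public boolean markSupported() { return false; }
    @Override public void mark(int readlimit) { }
    @Override public void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }
}

class BoundedDetection {
    static final String OCTET_STREAM = "application/octet-stream";

    /** Runs the detector with a lookahead budget and always rewinds the stream. */
    static String detect(InputStream in, Detector detector, int limit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        try {
            try {
                return detector.detect(new LimitedLookaheadStream(in, limit));
            } catch (LookaheadLimitException e) {
                // Detector wanted too much: consider the detection failed.
                return OCTET_STREAM;
            } finally {
                in.reset();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a BufferedInputStream underneath, the buffer only grows as far as
the detector actually reads, so a high limit costs nothing for
detectors that need just a few bytes.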

The problem with fixing the limit somewhere lower, like the 1k we have
now, is that it may prevent some document types from being detected. The most
notable examples are the OLE2 file formats that we currently can only
autodetect as the artificial generic application/x-tika-msoffice type.
More exact type detection will likely need more than 1k bytes, but
it's hard to say exactly how many more. And I'd rather not spend lots
of memory for buffering when most of the time only a few bytes are
needed.
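
As an illustration of the OLE2 case, a check along these lines only
needs the 8-byte compound document signature, which is why it can
report nothing better than the generic type. The class and method
names are made up for this example and are not taken from the Tika
code. Telling Word from Excel or PowerPoint would mean following the
header into the directory entries, which can sit well past the first
1k of the file.

```java
import java.util.Arrays;

class Ole2Magic {
    // Signature at the start of every OLE2 compound document.
    private static final byte[] SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns the generic OLE2 type if the prefix starts with the signature. */
    static String detect(byte[] prefix) {
        if (prefix.length >= SIGNATURE.length
                && Arrays.equals(Arrays.copyOf(prefix, SIGNATURE.length), SIGNATURE)) {
            return "application/x-tika-msoffice";
        }
        return "application/octet-stream";
    }
}
```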

> Potentially as well it could be much more efficient.

I don't think efficiency is that big a concern, as we're in any case
dealing with bytes that are being buffered in memory.

BR,

Jukka Zitting
