Hi,

On Mon, Jan 19, 2009 at 10:24 PM, Niall Pemberton <[email protected]> wrote:
> But your API says "...the detector must only read up to a limited
> number of bytes from the stream to avoid potentially unbounded memory
> use for the buffer of a marked a stream."

Limited but not fixed. I'd like to leave it up to the detector
implementation to determine how many bytes it actually needs, and only
set a very high upper limit in the calling application.

> - with a ByteBuffer the detector would be able to discover how
> many its allowed to read - otherwise how are you going to prevent
> it going past the limit - throw an exception?

Yes, an exception seems reasonable, but only at something like 1M
bytes down the line. No reasonable detector should ever need to look
so deep into the stream. And if one does, then we can just as well
consider the detection to have failed and treat the stream as
application/octet-stream.

The problem with fixing the limit somewhere lower, like the 1k we have
now, is that it may prevent some document types from being detected.
The most notable examples are the OLE2 file formats that we currently
can only autodetect as the artificial generic
application/x-tika-msoffice type. More exact type detection will
likely need more than 1k bytes, but it's hard to say exactly how many
more. And I'd rather not spend lots of memory on buffering when most
of the time only a few bytes are needed.

> Potentially as well it could be much more efficient.

I don't think efficiency is that big a concern, as we're in any case
dealing with bytes that are being buffered in memory.

BR,

Jukka Zitting
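
For illustration, the approach discussed above (a high upper limit set by the caller, an exception when a detector reads past it, and a fallback to application/octet-stream) could be sketched roughly as below. This is only a sketch under assumptions: the names BoundedDetection, Detector, LimitedInputStream and LimitExceededException are hypothetical and are not Tika's API, and the 1M value simply mirrors the figure mentioned in the message. The OLE2 check only looks at the 8-byte container signature, so it yields the generic type, not a more exact one.

```java
import java.io.IOException;
import java.io.InputStream;

public class BoundedDetection {

    /** Upper limit chosen by the calling application (assumed: 1M bytes). */
    static final int MAX_LOOKAHEAD = 1024 * 1024;

    static final String FALLBACK_TYPE = "application/octet-stream";

    /** Hypothetical detector: reads as many bytes as it needs, returns a type or null. */
    interface Detector {
        String detect(InputStream in) throws IOException;
    }

    /** Thrown when a detector tries to read beyond the look-ahead limit. */
    static class LimitExceededException extends IOException {
        LimitExceededException(String message) {
            super(message);
        }
    }

    /** Wrapper that refuses any read attempt once 'limit' bytes have been consumed. */
    static class LimitedInputStream extends InputStream {
        private final InputStream in;
        private long remaining;

        LimitedInputStream(InputStream in, long limit) {
            this.in = in;
            this.remaining = limit;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) {
                throw new LimitExceededException("detection look-ahead limit exceeded");
            }
            int b = in.read();
            if (b != -1) {
                remaining--;
            }
            return b;
        }
    }

    /** First 8 bytes of an OLE2 compound document. */
    static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    /** Example detector that needs only 8 bytes, however high the limit is. */
    static final Detector OLE2 = in -> {
        for (int expected : OLE2_MAGIC) {
            if (in.read() != expected) {
                return null;
            }
        }
        return "application/x-tika-msoffice";
    };

    /**
     * Marks the stream, lets the detector read up to 'limit' bytes, and
     * resets the stream afterwards. Exceeding the limit counts as a failed
     * detection rather than as an error for the caller.
     */
    static String detect(InputStream stream, Detector detector, int limit)
            throws IOException {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        stream.mark(limit);
        try {
            String type = detector.detect(new LimitedInputStream(stream, limit));
            return type != null ? type : FALLBACK_TYPE;
        } catch (LimitExceededException e) {
            return FALLBACK_TYPE;
        } finally {
            stream.reset();
        }
    }
}
```

A caller would wrap the raw stream in a java.io.BufferedInputStream and call BoundedDetection.detect(stream, BoundedDetection.OLE2, BoundedDetection.MAX_LOOKAHEAD). BufferedInputStream grows its internal buffer only as bytes are actually read while the mark is active, so a high limit should not by itself cost memory when a detector needs just a few bytes.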
