On Wed, 23 Dec 2020, Peter Kronenberg wrote:
Best is to wrap as a TikaInputStream, detect using all the detectors
via DefaultDetector, then parse after that.
But sometimes the detect will read the whole file, right? For example,
for Word. So is it then making 2 passes?
Nope, we stash the
>> In my use case, we will not have any filename or metadata. It will
>> just be a stream. But you're right in that we will want to parse it.
>> So it sounds like the best way to do it is to do the detect on the
>> first few bytes, which will at least give you an idea of what it is,
>> but no
On Wed, 23 Dec 2020, Peter Kronenberg wrote:
But yet, if I understand correctly, using a TikaInputStream *will* spool
the entire stream to disk so it can read everything, right? If I
re-read the stream to parse, is it making 2 passes?
TikaInputStream has logic in it to dump the stream to a temp
On Tue, 22 Dec 2020, Peter Kronenberg wrote:
Oh, so reading the stream doesn't read the whole file?
Not for Detect, no. The assumption is that Detect is normally followed by
Parse, so you won't want the stream consumed. Instead we do a mark/reset to
check only the first few KB.
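
To illustrate the mark/reset idea described above, here is a minimal sketch using only java.io. It is not Tika's actual code: the class name PeekThenRead, the PEEK_LIMIT value, and the toy sniff() check are all hypothetical, chosen just to show how a caller can look at the head of a stream and still read it from the start afterwards. It assumes Java 11 or later for readNBytes.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class PeekThenRead {
    // Hypothetical limit for illustration; not a value taken from Tika.
    static final int PEEK_LIMIT = 8 * 1024;

    /** Reads up to `limit` bytes from the head of the stream, then rewinds it. */
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);                  // remember the current position
        try {
            return in.readNBytes(limit); // read at most `limit` bytes
        } finally {
            in.reset();                  // rewind so the next reader starts at byte 0
        }
    }

    /** Toy magic-byte check standing in for a real detector. */
    static String sniff(byte[] head) {
        if (head.length >= 4
                && head[0] == '%' && head[1] == 'P'
                && head[2] == 'D' && head[3] == 'F') {
            return "application/pdf";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.7 example body".getBytes(StandardCharsets.US_ASCII);
        // BufferedInputStream adds mark/reset support to any underlying stream.
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data))) {
            String type = sniff(peek(in, PEEK_LIMIT));
            byte[] all = in.readAllBytes(); // the "parse" step still sees every byte
            System.out.println(type + ", bytes available to parse: " + all.length);
        }
    }
}
```

The point of the sketch is only that the detect step leaves the stream where it found it, so the later read is not a second pass over data that was thrown away.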
I know for Office