Opening and Closing Document Input Streams

kbennett Thu, 20 Sep 2007 10:22:02 -0700

Given that input documents can be specified by URL, it seems logical to me
that a caller would pass Tika a URL, get the parsed content back, and not
have to manage any of the streams required to process the document.
In other words, Tika would open and close the stream itself.


Currently, the parser factory takes the URL, opens a stream from it, and
passes it to the newly created Parser object.  However, as far as I know,
the stream is never closed unless the caller calls Parser.getInputStream()
and closes it explicitly.

For my use of Tika, I am creating a Java component that will continuously
read URLs as input, and output the parsed text read from those URLs.
Ideally, a single entry point in Tika would be great, where we do something
like this:

String fulltext = Tika.getFullText(documentUrl, tikaConfigUrl);

... or perhaps to be more performant, we would create a Tika 'thing' with
the config URL and reuse that for each document:

TikaThing tikaThing = new TikaThing(tikaConfigUrl);
String fulltext = tikaThing.getFullText(documentUrl);

Another reason to open and close the stream ourselves is that (I am
assuming) any parser will read the entire resource from beginning to
end, so returning the stream would have little value.  However, I'm not
suggesting that we eliminate that functionality.

To sum up, I propose that when the Parser class receives a URL, it opens and
closes the stream itself.  When it receives a stream, it does NOT close the
stream itself.
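
To make the ownership rule concrete, here is a minimal sketch of what I have
in mind.  It is not Tika code: the class name StreamOwnershipSketch and both
getFullText() overloads are hypothetical, and the "parsing" is only a
stand-in that decodes the bytes as UTF-8.  The point is just who closes what.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names exist in Tika.
class StreamOwnershipSketch {

    // URL variant: we open the stream, so we are responsible for closing it.
    static String getFullText(URL documentUrl) throws IOException {
        try (InputStream in = documentUrl.openStream()) {
            return getFullText(in);
        } // the stream is closed here, even if parsing throws
    }

    // Stream variant: the caller opened the stream, so the caller closes it.
    // This method reads to the end but never calls close().
    static String getFullText(InputStream in) throws IOException {
        // Placeholder for real parsing: read everything and decode as UTF-8.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With something like this, a caller that only has a URL never touches a
stream, while a caller that passes in its own stream keeps full control over
when it is closed.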

What do you think?

- Keith

-- 
View this message in context: 
http://www.nabble.com/Opening-and-Closing-Document-Input-Streams-tf4488928.html#a12801853
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
