Since input documents can be specified by URL, it seems logical to me that a caller should be able to pass Tika a URL and get back the parsed content, without having to manage any of the streams needed to process the document. In other words, Tika would open and close the stream itself.
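To make the idea concrete, here is a minimal sketch of the ownership rule I have in mind: when handed a URL, the code opens and closes the stream itself; when handed a stream, it leaves closing to the caller. The class and method names are hypothetical and are not existing Tika API, and the "parsing" is just a placeholder that reads the bytes as UTF-8 text.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Tika.
class StreamOwnershipSketch {

    // URL case: we opened the stream, so we are responsible for closing it.
    static String getFullText(URL documentUrl) throws IOException {
        try (InputStream in = documentUrl.openStream()) {
            return getFullText(in);
        }
    }

    // Stream case: the caller opened the stream, so the caller closes it.
    // The reader is deliberately not closed, because closing it would
    // also close the caller's stream.
    static String getFullText(InputStream in) throws IOException {
        // Placeholder for real parsing: decode the whole stream as UTF-8.
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            text.append(buffer, 0, n);
        }
        return text.toString();
    }
}
```

With this split, a caller that only has a URL never sees a stream at all, while a caller that supplies its own stream keeps full control over its lifetime.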
Currently, the parser factory takes the URL, opens a stream from it, and passes it to the newly created Parser object. However, as far as I know, the stream is never closed unless the caller calls Parser.getInputStream() and closes it explicitly.

For my use of Tika, I am creating a Java component that will continuously read URLs as input and output the parsed text read from those URLs. Ideally, Tika would offer a single entry point, where we do something like this:

    String fulltext = Tika.getFullText(documentUrl, tikaConfigUrl);

... or perhaps, to be more performant, we would create a Tika 'thing' with the config URL and reuse it for each document:

    TikaThing tikaThing = new TikaThing(tikaConfigUrl);
    String fulltext = tikaThing.getFullText(documentUrl);

Another reason to open and close the stream ourselves is that (I am assuming) any parser will read the entire resource from beginning to end, so returning the stream would have little value. However, I'm not suggesting that we eliminate that functionality.

To sum up, I propose that when the Parser class receives a URL, it opens and closes the stream itself. When it receives a stream, it does NOT close the stream itself.

What do you think?

- Keith

--
View this message in context: http://www.nabble.com/Opening-and-Closing-Document-Input-Streams-tf4488928.html#a12801853
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
