On Thu, 28 Aug 2014, ruby wrote:
Since the files contain over 5GB of data, the content string here will end up
holding too much data in memory. I want to avoid this and read a chunk at a
time.

You'll probably need your own custom ContentHandler, one which detects when there's too much data and then flushes it / starts a new file / etc.

There's an example of how to do this in the tika-example module, look at
parseToPlainTextChunks in ContentHandlerExample:
https://svn.apache.org/repos/asf/tika/trunk/tika-example/src/main/java/org/apache/tika/example/ContentHandlerExample.java

Basically though, you'll want to extend DefaultHandler (org.xml.sax.helpers.DefaultHandler, which takes care of most of the basics for you), then write your own logic to handle outputting / flushing / chunking as per your needs.
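
As a rough illustration of that idea (not the code from ContentHandlerExample), here is a minimal sketch of a chunking handler. It assumes you extend SAX's org.xml.sax.helpers.DefaultHandler; the class name ChunkingHandler, the maxChunkChars limit and the sink callback are all made up for this example, and the sink is where you would write each chunk to a file or otherwise dispose of it.

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers text and hands it to a sink in fixed-size chunks,
// so the whole document text is never held in memory at once.
class ChunkingHandler extends DefaultHandler {
    private final int maxChunkChars;          // assumed limit, in characters, per chunk
    private final Consumer<String> sink;      // assumed callback, e.g. writes a chunk to disk
    private final StringBuilder current = new StringBuilder();

    ChunkingHandler(int maxChunkChars, Consumer<String> sink) {
        if (maxChunkChars <= 0) {
            throw new IllegalArgumentException("maxChunkChars must be positive");
        }
        this.maxChunkChars = maxChunkChars;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        int offset = start;
        int remaining = length;
        while (remaining > 0) {
            // Copy only as much as still fits in the current chunk.
            int room = maxChunkChars - current.length();
            int n = Math.min(room, remaining);
            current.append(ch, offset, n);
            offset += n;
            remaining -= n;
            if (current.length() >= maxChunkChars) {
                flush();
            }
        }
    }

    @Override
    public void endDocument() {
        // Emit whatever is left over as a final, shorter chunk.
        flush();
    }

    private void flush() {
        if (current.length() > 0) {
            sink.accept(current.toString());
            current.setLength(0);
        }
    }
}
```

You would then pass an instance as the ContentHandler argument to your parser's parse(stream, handler, metadata, context) call. If you only want the body text, wrapping it as new BodyContentHandler(handler) should work too. Chunks here are cut at a fixed character count, so a chunk can end mid-word; adjust the flushing rule if that matters for you.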

Nick
