On Thu, 28 Aug 2014, ruby wrote:
> Since the files contain over 5GB of data, the content string here will
> end up holding too much data in memory. I want to avoid this and read a
> chunk at a time.
You'll probably need your own custom ContentHandler, which detects when
there's too much data, and flushes it / starts a new file / etc.
There's an example of how to do this in the tika-examples package, look at
parseToPlainTextChunks from ContentHandlerExample:
https://svn.apache.org/repos/asf/tika/trunk/tika-example/src/main/java/org/apache/tika/example/ContentHandlerExample.java
Basically though, you'll want to extend a base handler such as Tika's
ContentHandlerDecorator or SAX's DefaultHandler (which takes care of most
of the basics for you), then write your own logic to handle outputting /
flushing / chunking as per your needs.
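
As a rough illustration of the idea (a minimal sketch, not the code from
ContentHandlerExample), a chunking handler could look something like the
following. It extends the JDK's org.xml.sax.helpers.DefaultHandler so it
compiles without Tika on the classpath; the class name ChunkingHandler, the
maxChars limit and the sink callback are all hypothetical names made up for
this example.

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers text events and hands them to a sink in
// fixed-size chunks, so the whole document is never held in memory at once.
class ChunkingHandler extends DefaultHandler {
    private final int maxChars;              // assumed chunk size limit, in chars
    private final Consumer<String> sink;     // assumed callback, e.g. writes a file
    private final StringBuilder buffer = new StringBuilder();

    ChunkingHandler(int maxChars, Consumer<String> sink) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        this.maxChars = maxChars;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        // Emit full chunks as soon as enough text has accumulated.
        while (buffer.length() >= maxChars) {
            sink.accept(buffer.substring(0, maxChars));
            buffer.delete(0, maxChars);
        }
    }

    @Override
    public void endDocument() {
        // Flush whatever is left once parsing finishes.
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            buffer.setLength(0);
        }
    }
}
```

With Tika you would then pass an instance as the ContentHandler argument of
parser.parse(stream, handler, metadata, context), and have the sink write
each chunk to disk (or wherever) instead of collecting it in a String.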
Nick