Tika 1.5, Java 7, Desktop Java application using Tika for content
parsing (to feed into a machine learning application)

Problem:
How to avoid Out of Memory errors during Tika parsing.
One user reported OutOfMemoryErrors (Java heap space) on some files
while pre-processing documents.
The application essentially parses each file (of varying sizes and
content-type) using Tika and then generates a data dictionary entry for
each document.
The parsing uses a fairly simple Tika method.
StringBuffer bodytext = new StringBuffer();
StringBuffer metatext = new StringBuffer();
int writelimit = -1;  // -1 disables the write limit
BodyContentHandler content = new BodyContentHandler(writelimit);
parser.parse(stream, content, metadata, context);  // parser: the Tika Parser in use

The body content is placed into bodytext and metadata into metatext.
This is returned to the dataset dictionary routine.

What appears to be happening is some files are throwing
OutOfMemoryErrors during Tika parsing. This fails the entire system and
results in an unresponsive application.
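
One thing that might at least keep a single bad file from taking down the whole run is to wrap each parse call in a guard that turns an OutOfMemoryError into a per-file failure. The sketch below is only an illustration: ParseGuard, Result and runGuarded are hypothetical names, not part of Tika, and the Callable stands in for the actual parse call. Catching OutOfMemoryError is not guaranteed to leave the JVM healthy (other threads may already have failed), so running the parse in a separate JVM process would be the more robust form of isolation.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: runs one unit of work and reports failure instead of propagating it. */
class ParseGuard {

    /** Outcome of one guarded call: either a value or the Throwable that stopped it. */
    static final class Result<T> {
        final T value;
        final Throwable failure;

        Result(T value, Throwable failure) {
            this.value = value;
            this.failure = failure;
        }

        boolean ok() {
            return failure == null;
        }
    }

    /**
     * Runs the task. If it throws OutOfMemoryError or any Exception, the
     * error is captured in the Result so the caller can skip that file.
     * Objects allocated inside the task become unreachable once it has
     * unwound, so the heap can usually be reclaimed afterwards.
     */
    static <T> Result<T> runGuarded(Callable<T> task) {
        try {
            return new Result<T>(task.call(), null);
        } catch (OutOfMemoryError e) {
            return new Result<T>(null, e);
        } catch (Exception e) {
            return new Result<T>(null, e);
        }
    }
}
```

In the application, the Callable would wrap the parse of one file, and a Result with ok() == false would mark that document as skipped in the data dictionary rather than freezing the UI.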

I understand that new BodyContentHandler(writelimit) can limit the
number of chars via the writelimit but simply cutting off the output at
a fixed amount will not reasonably work for this application--as content
might be missed.
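
An alternative to truncating might be to stream the extracted text to disk as it arrives instead of accumulating it in the heap. Tika hands text to an org.xml.sax.ContentHandler, so a minimal sketch of that idea, using only JDK classes, could look like the following. SpoolingTextHandler and openSpool are hypothetical names made up for this illustration. This would only help if the heap is being consumed by the accumulated output; it does nothing for parsers that load the whole source document into memory themselves.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: forwards SAX character events to a Writer instead of buffering them. */
class SpoolingTextHandler extends DefaultHandler {
    private final Writer out;
    private long charsWritten = 0;

    SpoolingTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            // Write straight through; nothing is retained on the heap here.
            out.write(ch, start, length);
            charsWritten += length;
        } catch (IOException e) {
            throw new SAXException("Could not spool extracted text", e);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        characters(ch, start, length);
    }

    /** Total number of chars written so far. */
    long getCharsWritten() {
        return charsWritten;
    }

    /** Opens a buffered UTF-8 writer on the given file; the caller must close it. */
    static Writer openSpool(File file) throws IOException {
        return new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));
    }
}
```

With Tika itself, this custom class may not even be needed: BodyContentHandler also has a constructor that takes a java.io.Writer, so passing it a writer opened on a temp file (for example one created with File.createTempFile) should send the body text to disk with no write limit involved.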

Has anyone else needed to handle this issue in a simple desktop Java
application? I thought of various fixes including using
Runtime.maxMemory, freeMemory, etc. to indirectly detect low memory
situations before parsing but that has not worked well. Also a Java
OutOfMemoryError essentially freezes the system and limits any recovery
ability so this is not nice for the users. I also thought of perhaps
some type of object-to-disk caching but the one implementation that I
saw was for J2EE applications and I am not sure how it could be
integrated into Tika. I also thought of processing files in chunks but
the BodyContentHandler does not seem to handle chunking (with offsets)
right now. NOTE: I have already tweaked the Java heap at runtime via
-Xmx (max heap) and -Xms (initial heap) but some files exceed the
physical RAM in the system.
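
To make the chunking idea concrete: if the extracted text were available through a Reader (for instance, a file the text had been written to), the downstream dictionary code could consume it in fixed-size pieces with offsets, so only one chunk is in memory at a time. The sketch below assumes exactly that. ChunkedTextReader, ChunkConsumer and forEachChunk are hypothetical names, not Tika API, and chunk boundaries fall at arbitrary char positions, so a word or a surrogate pair may be split across two chunks.

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical helper: feeds text from a Reader to a callback in fixed-size chunks. */
class ChunkedTextReader {

    /** Hypothetical callback receiving one chunk and its char offset in the stream. */
    interface ChunkConsumer {
        void accept(String chunk, long offset);
    }

    /**
     * Reads the reader to the end, calling the consumer once per chunk of
     * at most chunkSize chars. Returns the total number of chars read.
     */
    static long forEachChunk(Reader reader, int chunkSize, ChunkConsumer consumer)
            throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        char[] buffer = new char[chunkSize];
        long offset = 0;   // offset of the chunk currently being filled
        int filled = 0;    // chars currently in the buffer
        int n;
        while ((n = reader.read(buffer, filled, chunkSize - filled)) != -1) {
            filled += n;
            if (filled == chunkSize) {
                consumer.accept(new String(buffer, 0, filled), offset);
                offset += filled;
                filled = 0;
            }
        }
        if (filled > 0) {
            // Final, possibly shorter, chunk.
            consumer.accept(new String(buffer, 0, filled), offset);
            offset += filled;
        }
        return offset;
    }
}
```

The memory ceiling for the text is then roughly one chunk, independent of document size, at the cost of the dictionary routine having to work incrementally.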

Any ideas?

Shannon

-- 
-----------------------------------------------------------------------8
Shannon Brown
sbr...@abbacan.com

"[Courage is] when you know you're licked
before you begin but you begin anyway and
you see it through no matter what. You
rarely win, but sometimes you do."

Atticus Finch in
To Kill a Mockingbird by Harper Lee

