Tika 1.5, Java 7, Desktop Java application using Tika for content parsing (to feed into a machine learning application)
Problem: How to avoid Out of Memory errors during Tika parsing.

One user reported Out of Memory errors (Java heap) on some files when attempting to pre-process documents. The application essentially parses each file (of varying size and content type) using Tika and then generates a data dictionary entry for each document. The parsing uses a fairly simple Tika method:

    StringBuffer bodytext = new StringBuffer();
    StringBuffer metatext = new StringBuffer();
    int writelimit = -1;
    BodyContentHandler content = new BodyContentHandler(writelimit);
    {Parser}.parse(stream, content, metadata, context);

The body content is placed into bodytext and the metadata into metatext. Both are returned to the dataset dictionary routine.

What appears to be happening is that some files throw OutOfMemoryErrors during Tika parsing. This fails the entire system and results in an unresponsive application.

I understand that new BodyContentHandler(writelimit) can limit the number of chars via the writelimit, but simply cutting off the output at a fixed amount will not reasonably work for this application, because content might be missed.

Has anyone else needed to handle this issue in a simple desktop Java application?

I thought of various fixes:

- Using Runtime.maxMemory, freeMemory, etc. to indirectly detect low-memory situations before parsing, but that has not worked well. Also, a Java OutOfMemoryError essentially freezes the system and limits any ability to recover, so this is not nice for the users.
- Some type of object-to-disk caching, but the one implementation that I saw was for J2EE applications and I am not sure how it could be integrated into Tika.
- Processing files in chunks, but the BodyContentHandler does not seem to handle chunking (with offsets) right now.

NOTE: I have already tweaked the Java heap at runtime via -Xmx (max heap) and -Xms (initial heap), but some files exceed the physical RAM in the system.

Any ideas?
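To make the object-to-disk idea above more concrete, here is a minimal sketch of a SAX handler that writes character events straight to a file instead of accumulating them in a StringBuffer. The class name SpoolingTextHandler and the spool helper are hypothetical, and the demo drives the handler with the JDK's own SAX parser so that it runs without Tika on the classpath. It is an assumption, not something tested here, that the same handler could be passed as the ContentHandler argument of a Tika parse call. I believe BodyContentHandler also has a constructor that takes a Writer, which might make a custom class unnecessary, but that should be checked against the 1.5 Javadoc.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler (not part of Tika): streams SAX character events
 * to a Writer so the extracted text never sits in the heap as one string.
 */
public class SpoolingTextHandler extends DefaultHandler {
    private final Writer out;
    private long charsWritten = 0;

    public SpoolingTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            // Write each chunk as it arrives; nothing is buffered beyond the Writer.
            out.write(ch, start, length);
            charsWritten += length;
        } catch (IOException e) {
            throw new SAXException("Could not spool text to disk", e);
        }
    }

    public long getCharsWritten() {
        return charsWritten;
    }

    /**
     * Demo helper using the JDK SAX parser as a stand-in event source.
     * Returns the number of characters written to the target file.
     */
    public static long spool(InputStream xml, Path target) throws Exception {
        try (Writer w = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            SpoolingTextHandler handler = new SpoolingTextHandler(w);
            SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
            return handler.getCharsWritten();
        }
    }

    public static void main(String[] args) throws Exception {
        Path target = Files.createTempFile("bodytext", ".txt");
        byte[] doc = "<html><body><p>Hello</p><p>World</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        long n = spool(new ByteArrayInputStream(doc), target);
        System.out.println(n + " chars spooled to " + target);
    }
}
```

This only removes the output buffer from the heap. If a parser builds its own large in-memory model of the input file, the OutOfMemoryError would still occur inside the parser, and the dictionary routine would have to read the text back from the file rather than from bodytext.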
Shannon

-- 
Shannon Brown
sbr...@abbacan.com

"[Courage is] when you know you're licked before you begin but you begin anyway and you see it through no matter what. You rarely win, but sometimes you do."
Atticus Finch in To Kill a Mockingbird by Harper Lee