JCas method like getSofaDataString that doesn't copy the chars from the 
StringHeap
----------------------------------------------------------------------------------

                 Key: UIMA-483
                 URL: https://issues.apache.org/jira/browse/UIMA-483
             Project: UIMA
          Issue Type: Improvement
    Affects Versions: 2.1
            Reporter: Greg Holmberg


I process large documents: the String I pass to JCas.setSofaDataString may be 
as large as 100 MB (50,000,000 chars).  This is causing the JVM to run out of 
memory when we have many concurrent AnalysisEngines running.

I traced JCas.getSofaDataString(), and it eventually calls 
StringHeap.getStringForCode(), which does a "new String" from its private 
char[] (which copies the chars).
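
To illustrate why "new String" from a char[] is a copy, here is a minimal 
standalone sketch.  It is not the StringHeap code; the class and method names 
(StringCopyDemo, fromHeap) are made up for the example.  It only shows that 
the String(char[], int, int) constructor does not share the array it is given.

```java
public class StringCopyDemo {
    // Hypothetical stand-in for what a heap lookup might do:
    // build a String from a slice of a shared char[].
    static String fromHeap(char[] heap, int offset, int length) {
        // The String constructor copies the chars; the result does not
        // share storage with the array passed in.
        return new String(heap, offset, length);
    }

    public static void main(String[] args) {
        char[] heap = "document text".toCharArray();
        String s = fromHeap(heap, 0, heap.length);

        // Changing the array afterwards does not change the String,
        // which shows the String holds its own copy.
        heap[0] = 'D';
        System.out.println(s);                // document text
        System.out.println(new String(heap)); // Document text
    }
}
```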

This would happen for each annotator.  We have five, so the original 100 MB 
plus five copies comes to 600 MB.  Multiply by 10 concurrent AnalysisEngines, 
and that's 6,000 MB.
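
The arithmetic, as a small sketch.  It assumes 2 bytes per char, decimal 
megabytes, and that every annotator's copy is alive at the same time as the 
original; the class and method names are made up for the example.

```java
public class MemoryEstimate {
    // Bytes needed to hold the given number of chars (2 bytes per char).
    static long bytesForChars(long chars) {
        return chars * Character.BYTES;
    }

    // One original plus one copy per annotator, for each concurrent engine.
    static long totalBytes(long chars, int annotators, int engines) {
        return bytesForChars(chars) * (1 + annotators) * engines;
    }

    public static void main(String[] args) {
        long mb = 1_000_000L; // decimal megabyte
        System.out.println(bytesForChars(50_000_000L) / mb);      // 100
        System.out.println(totalBytes(50_000_000L, 5, 1) / mb);   // 600
        System.out.println(totalBytes(50_000_000L, 5, 10) / mb);  // 6000
    }
}
```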

Perhaps there could be a variation on getSofaDataString that returns one of the 
other classes (besides String) that implements CharSequence.  A CharBuffer 
perhaps, or even a new class that implements the CharSequence interface but is 
read-only (just four methods).  Or even just return a char[], or a char[] plus 
begin/end offsets into the StringHeap.
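
A rough sketch of the two CharSequence options, to make the idea concrete.  
Nothing here is existing UIMA API: HeapTextView and CharArrayView are 
hypothetical names, and the char[] stands in for the StringHeap's private 
array.  CharBuffer.wrap and asReadOnlyBuffer are standard java.nio calls.

```java
import java.nio.CharBuffer;

public class HeapTextView {
    // Hypothetical read-only view over the slice [start, end) of a shared
    // char[].  No chars are copied until toString() is called.
    static final class CharArrayView implements CharSequence {
        private final char[] heap;
        private final int start;
        private final int end;

        CharArrayView(char[] heap, int start, int end) {
            if (start < 0 || end > heap.length || start > end) {
                throw new IndexOutOfBoundsException();
            }
            this.heap = heap;
            this.start = start;
            this.end = end;
        }

        @Override public int length() { return end - start; }

        @Override public char charAt(int index) {
            if (index < 0 || index >= length()) {
                throw new IndexOutOfBoundsException();
            }
            return heap[start + index];
        }

        @Override public CharSequence subSequence(int from, int to) {
            if (from < 0 || to > length() || from > to) {
                throw new IndexOutOfBoundsException();
            }
            // Another view on the same array, still no copy.
            return new CharArrayView(heap, start + from, start + to);
        }

        // The only place a copy is made, and only if the caller asks.
        @Override public String toString() {
            return new String(heap, start, end - start);
        }
    }

    // The CharBuffer alternative: wrap the slice and make it read-only.
    static CharSequence asReadOnlyBuffer(char[] heap, int offset, int length) {
        return CharBuffer.wrap(heap, offset, length).asReadOnlyBuffer();
    }

    public static void main(String[] args) {
        char[] heap = "xxHello, UIMAyy".toCharArray();
        CharSequence view = new CharArrayView(heap, 2, 13);
        CharSequence buf = asReadOnlyBuffer(heap, 2, 11);
        System.out.println(view.length() + " " + view.charAt(0)); // 11 H
        System.out.println(buf.subSequence(7, 11));               // UIMA
    }
}
```

Either way the annotator sees the text through the CharSequence methods, and 
a copy is made only if it explicitly calls toString().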

If nothing else, perhaps the document text should be treated differently from 
all the little strings in the StringHeap, and be stored separately, so calls to 
getSofaDataString() simply return a reference to the existing String object, 
without copying.
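
In sketch form, storing the text separately could look something like this.  
SofaTextHolder and its methods are hypothetical names for illustration, not 
how the CAS is actually structured.

```java
public class SofaTextHolder {
    // The document text is kept as the String that was passed in,
    // rather than being copied into a shared char[] heap.
    private String documentText;

    public void setDocumentText(String text) {
        this.documentText = text;
    }

    // Returns the same String object every time; no chars are copied.
    public String getDocumentText() {
        return documentText;
    }

    public static void main(String[] args) {
        SofaTextHolder holder = new SofaTextHolder();
        String text = new String("a very large document");
        holder.setDocumentText(text);
        // Both calls return the identical object that was stored.
        System.out.println(holder.getDocumentText() == text); // true
    }
}
```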

I'm open to possibilities; I just need the copying to end.




