[
https://issues.apache.org/jira/browse/UIMA-483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marshall Schor updated UIMA-483:
--------------------------------
Affects Version/s: 2.2 (was: 2.1)
Changed the affected version to 2.2: based on the comments, this is not
something we are likely to fix for the 2.2 release. In addition to Thilo's
comments, Eddie suggests that removing the special support in the String Heap
for storing data as one big char array (something done originally to support
the C++ side of the story) might make the Java <-> C++ connection (done via
JNI) slower in some cases, but that there are better ways to ameliorate this,
including figuring out and supporting some kind of "delta CAS" approach that
sends only the changes. This might also allow us to move away from the JNI
approach to C++ interoperability in favor of a more robust one: using sockets +
serialization to run the C++ code in a separate address space, isolated from
Java.
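
To make the "delta CAS" idea concrete, here is a minimal sketch of change
tracking over an int heap. It is only an illustration under stated
assumptions: DeltaHeap, takeDelta and applyDelta are hypothetical names, not
UIMA APIs, and a real CAS delta would also have to cover strings, added
feature structures, and indexes.

```java
import java.util.BitSet;

/** Hypothetical int heap that remembers which cells changed since the last delta. */
final class DeltaHeap {
    private final int[] cells;
    private final BitSet dirty = new BitSet();

    DeltaHeap(int size) {
        cells = new int[size];
    }

    int get(int index) {
        return cells[index];
    }

    /** Writes a value and marks the cell dirty only if the value actually changed. */
    void set(int index, int value) {
        if (cells[index] != value) {
            cells[index] = value;
            dirty.set(index);
        }
    }

    /** Returns the changed cells as flattened (index, value) pairs and resets tracking. */
    int[] takeDelta() {
        int[] delta = new int[2 * dirty.cardinality()];
        int k = 0;
        for (int i = dirty.nextSetBit(0); i >= 0; i = dirty.nextSetBit(i + 1)) {
            delta[k++] = i;
            delta[k++] = cells[i];
        }
        dirty.clear();
        return delta;
    }

    /** Applies (index, value) pairs produced by takeDelta() on another heap of the same size. */
    void applyDelta(int[] delta) {
        for (int k = 0; k < delta.length; k += 2) {
            cells[delta[k]] = delta[k + 1];
        }
    }

    public static void main(String[] args) {
        DeltaHeap javaSide = new DeltaHeap(1000);
        DeltaHeap remoteSide = new DeltaHeap(1000);

        javaSide.set(3, 42);
        javaSide.set(700, 7);

        int[] delta = javaSide.takeDelta();   // 4 ints instead of 1000
        remoteSide.applyDelta(delta);

        System.out.println(delta.length);     // 4
        System.out.println(remoteSide.get(3) + " " + remoteSide.get(700)); // 42 7
    }
}
```

With this shape, only the (index, value) pairs would have to cross the JNI or
socket boundary, rather than the whole heap.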
> JCas method like getSofaDataString that doesn't copy the chars from the
> StringHeap
> ----------------------------------------------------------------------------------
>
> Key: UIMA-483
> URL: https://issues.apache.org/jira/browse/UIMA-483
> Project: UIMA
> Issue Type: Improvement
> Components: Core Java Framework
> Affects Versions: 2.2
> Reporter: Greg Holmberg
>
> I process large documents--the String I pass to JCas.setSofaDataString may be
> as large as 100 MB (50,000,000 chars). This is causing the JVM to run out of
> memory when we have many concurrent AnalysisEngines running.
> I traced JCas.getSofaDataString(), and it eventually calls
> StringHeap.getStringForCode(), which does a "new String" from its private
> char[] (which does a copy).
> This would happen for each annotator. We have five, so now the 100 MB has
> become 600 MB (the original plus five copies). Multiply by 10 concurrent
> AnalysisEngines, and that's 6,000 MB.
> Perhaps there could be a variation on getSofaDataString that returns one of
> the other classes (besides String) that implement CharSequence: a
> CharBuffer perhaps, or even a new class that implements the CharSequence
> interface but is read-only (just four methods). Or it could simply return a
> char[], or a char[] plus begin/end offsets into the StringHeap.
> If nothing else, perhaps the document text should be treated specially from
> all the little strings in the StringHeap, and be stored separately, so calls
> to getSofaDataString() simply return a reference to an existing String
> object, without copying.
> I'm open to possibilities; I just need the copying to end.
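
As a rough illustration of the read-only CharSequence suggested in the
report, the sketch below wraps a region of an existing char[] without copying
it. CharArrayView is a hypothetical class, not part of UIMA, and the sample
array merely stands in for the StringHeap's private char[].

```java
/** Hypothetical read-only view over a region of a shared char[]; no chars are copied. */
final class CharArrayView implements CharSequence {
    private final char[] data;
    private final int start; // inclusive offset into data
    private final int end;   // exclusive offset into data

    CharArrayView(char[] data, int start, int end) {
        if (start < 0 || end > data.length || start > end) {
            throw new IndexOutOfBoundsException("bad range " + start + ".." + end);
        }
        this.data = data;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return data[start + index];
    }

    /** Returns another view onto the same array; still no copy. */
    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException("bad range " + from + ".." + to);
        }
        return new CharArrayView(data, start + from, start + to);
    }

    /** The only method that copies: it materializes the region as a String. */
    @Override
    public String toString() {
        return new String(data, start, end - start);
    }

    public static void main(String[] args) {
        char[] heap = "xxHello, UIMAyy".toCharArray(); // stand-in for the shared heap
        CharSequence view = new CharArrayView(heap, 2, 13);

        System.out.println(view.length());            // 11
        System.out.println(view.subSequence(7, 11));  // UIMA

        heap[2] = 'J'; // the view shares the array, so the change is visible
        System.out.println(view.charAt(0));           // J
    }
}
```

The JDK alternative mentioned in the report would be
CharBuffer.wrap(heap, 2, 11).asReadOnlyBuffer(), which also shares the array.
Either way the view is only as stable as the underlying array: if the heap is
modified or reallocated, callers holding a view see the change or stale data.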