Marshall Schor wrote:
Thilo Goetz wrote:
Marshall Schor wrote:
Thilo Goetz (JIRA) wrote:
Some applications may break if they require == between instances of
the same JCas object. Others, of course, won't care. So it's good
for this to be configurable.
Any annotator that works with this assumption is broken IMO.
Why would anybody make such an assumption?
One use case: with JCas it is possible to add fields to the cover class
(you could add a HashMap field, for instance); this is described
in the documentation for JCas. Those field values are preserved
across iterations only if the same JCas instance is kept.
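
To make the use case concrete, here is a minimal sketch in plain Java,
not UIMA code. TokenCover, View and the useCache flag are hypothetical
stand-ins for a JCas cover class, the JCas view, and the proposed
configuration switch. It only illustrates that both == and a user-added
field depend on getting the same Java object back for the same feature
structure.

```java
import java.util.HashMap;
import java.util.Map;

class CoverCacheSketch {
    // Hypothetical stand-in for a JCas cover class with a user-added field.
    static final class TokenCover {
        final int address; // identifies the underlying feature structure
        final Map<String, String> notes = new HashMap<>(); // lives only in this Java object
        TokenCover(int address) { this.address = address; }
    }

    // Hypothetical view that hands out cover objects, with or without a cache.
    static final class View {
        private final boolean useCache;
        private final Map<Integer, TokenCover> cache = new HashMap<>();
        View(boolean useCache) { this.useCache = useCache; }

        TokenCover get(int address) {
            if (!useCache) {
                return new TokenCover(address); // fresh cover object on every access
            }
            return cache.computeIfAbsent(address, a -> new TokenCover(a));
        }
    }

    // True if two accesses to the same address return the identical object.
    static boolean sameInstance(boolean useCache) {
        View view = new View(useCache);
        return view.get(42) == view.get(42);
    }

    // True if a value stored in the user-added field is visible on the next access.
    static boolean noteSurvives(boolean useCache) {
        View view = new View(useCache);
        view.get(42).notes.put("seen", "yes");
        return view.get(42).notes.containsKey("seen");
    }

    public static void main(String[] args) {
        System.out.println("cache on:  same=" + sameInstance(true) + ", note=" + noteSurvives(true));
        System.out.println("cache off: same=" + sameInstance(false) + ", note=" + noteSurvives(false));
    }
}
```

With the cache off, each access builds a new cover object, so anything
stored in the added field is lost unless the annotator keeps its own
reference to the object.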
-Marshall
I have nothing to add to what I said earlier. The JCas cache is an
"optimization" (or not), and it shouldn't be relied upon for program
correctness - nor should our documentation suggest that it should.
Now that the JCas cache is configurable, annotators that rely on the
JCas cache for correctness will no longer work in UIMA instances
configured not to use the cache.
I would like to see benchmarks that show a clear *performance*
advantage from using the JCas cache. So far all the benchmarks
I've run show that turning off the JCas cache not only reduces
the memory overhead significantly, it's also faster. YMMV.
--Thilo
I don't see anything in our documentation that encourages this.
To the contrary, we say that we don't guarantee object identity
for feature structures, and that equals() should be used to
compare them.
It might be good, also, to put in "soft references" for this - which
will be reclaimed if memory gets low. But this might end up doubling
the size of the storage used for this (to hold the soft reference)...
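
A rough sketch of the soft-reference idea, again in plain Java rather
than UIMA code. SoftCoverCache and its get() method are hypothetical
names. Each cached object gets its own java.lang.ref.SoftReference
wrapper, which is the extra per-entry storage mentioned above.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

class SoftCoverCache<T> {
    // One SoftReference object per entry: this is the added storage cost.
    private final Map<Integer, SoftReference<T>> entries = new HashMap<>();

    // Returns the cached object for the address, recreating it if the
    // collector has cleared the soft reference (or it was never cached).
    T get(int address, IntFunction<T> factory) {
        SoftReference<T> ref = entries.get(address);
        T cover = (ref == null) ? null : ref.get();
        if (cover == null) {
            cover = factory.apply(address);
            entries.put(address, new SoftReference<>(cover));
        }
        return cover;
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        SoftCoverCache<StringBuilder> cache = new SoftCoverCache<>();
        StringBuilder first = cache.get(7, a -> new StringBuilder("fs@" + a));
        // 'first' is strongly reachable here, so the entry cannot have been cleared.
        System.out.println(first == cache.get(7, a -> new StringBuilder("fs@" + a)));
    }
}
```

Note that a cleared entry loses whatever state was stored in the object,
so this keeps the memory benefit but does not help code that relies on
identity. The sketch also never removes cleared entries from the map; a
real version would probably drain a ReferenceQueue for that.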
-Marshall
Use of the JCas cache should be configurable
--------------------------------------------
Key: UIMA-1068
URL: https://issues.apache.org/jira/browse/UIMA-1068
Project: UIMA
Issue Type: Improvement
Components: Core Java Framework
Affects Versions: 2.2.2
Reporter: Thilo Goetz
Assignee: Thilo Goetz
Fix For: 2.3
The JCas caches all CAS objects that are accessed through it. This
means that JCas objects that are no longer used can't be garbage
collected. If only part of the processing chain uses the JCas, or
the caching is redundant for some other reason, this produces a
severe memory overhead.
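
The retention effect can be sketched in plain Java (not UIMA code;
Cover and retainedAfterPass are hypothetical names). The loop touches
each object once and never needs it again, yet with the cache enabled
every one of them stays strongly reachable.

```java
import java.util.ArrayList;
import java.util.List;

class RetentionSketch {
    // Hypothetical stand-in for a JCas cover object.
    static final class Cover {
        final int address;
        Cover(int address) { this.address = address; }
    }

    // Creates 'count' cover objects, uses each once, and reports how many
    // are still strongly reachable from the cache after the pass.
    static int retainedAfterPass(int count, boolean useCache) {
        List<Cover> cache = new ArrayList<>();
        for (int address = 0; address < count; address++) {
            Cover cover = new Cover(address);
            if (useCache) {
                cache.add(cover); // keeps the object alive for the cache's lifetime
            }
            // Without the cache, 'cover' becomes garbage at the end of this iteration.
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println("retained with cache:    " + retainedAfterPass(100_000, true));
        System.out.println("retained without cache: " + retainedAfterPass(100_000, false));
    }
}
```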
I ran the same experiment I ran for UIMA-1067: doubled the size of
Moby Dick and ran the POS tagger from the sandbox. I used the
improved version from UIMA-1067 as base case and simply commented
out the line that adds JCas objects to the cache. This reduced the
required heap size from 115MB to 105MB. It also improved the
performance from around 10s for the base case to consistently under
9s for the version without any caching. I looked at the tagger
source code, and saw that it keeps its own list of tokens around.
So the savings are just the caching data structure.
There may be cases where the JCas cache is a performance win, though
I'd be curious to see the benchmarks. So we should not just turn it
off, but make it configurable.