Thilo Goetz wrote:
Marshall Schor (JIRA) wrote:
Space/Time tradeoffs in the CAS
-------------------------------
Key: UIMA-1089
URL: https://issues.apache.org/jira/browse/UIMA-1089
Project: UIMA
Issue Type: Improvement
Components: Core Java Framework
Affects Versions: 2.2.2
Reporter: Marshall Schor
Priority: Minor
Investigate / implement optimizations that trade user-controllable
time (running the optimizations) for space. One such optimization
could be: sharing strings. To do the sharing requires additional
computation and (temporary) storage to detect the sharing
opportunities, but results in space savings. For instance, a common
annotation might assign short strings like "noun" to a
"part-of-speech" feature. If you are processing a large document,
there may be a large number of these string-valued features, with
values picked from a small pool of allowable values. The CAS's string
storage could be optimized to share the string references in this
case, at the cost of temporarily creating a hash table of the unique
strings and using it to identify sharing opportunities. A new API
call for this optimization would confine its time/space overhead to
just those users and times where it makes sense.
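To make the idea concrete, here is a minimal sketch of the sharing pass
in Java. It is not the CAS API and does not model the CAS's internal
string storage: the class StringSharingSketch and the method
shareStrings are hypothetical names, and a plain String[] stands in for
the stored feature values. It only illustrates the trade described
above: a temporary hash table is built, equal strings are pointed at
one shared instance, and the table is discarded afterwards.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of UIMA.
class StringSharingSketch {

    /**
     * Replaces equal strings in the array with one shared instance, in place.
     * Returns the number of distinct strings found.
     */
    static int shareStrings(String[] values) {
        // Temporary table of unique strings; it lives only for this pass,
        // which is the extra time and (temporary) space being traded.
        Map<String, String> canonical = new HashMap<>();
        for (int i = 0; i < values.length; i++) {
            String s = values[i];
            if (s == null) {
                continue; // unset feature values are left alone
            }
            // putIfAbsent returns the previously stored instance, or null
            // if this is the first time the value has been seen.
            String shared = canonical.putIfAbsent(s, s);
            if (shared != null) {
                values[i] = shared; // point at the shared instance
            }
        }
        return canonical.size();
    }

    public static void main(String[] args) {
        // new String(...) forces distinct objects, as separately created
        // feature values would be.
        String[] pos = {
            new String("noun"), new String("verb"),
            new String("noun"), new String("noun")
        };
        int unique = shareStrings(pos);
        System.out.println(unique + " unique strings, pos[0] == pos[2]: "
                + (pos[0] == pos[2]));
    }
}
```

Because the pass is an explicit call, nobody pays for the hash table
unless they ask for it, which is the point of exposing it as a separate
API call rather than doing it automatically.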
An alternative would be to automatically figure this out for some
selected kinds of optimizations, but I'm not sure that could be done
without impacting finely-tuned systems negatively.
Marshall,
I'm not sure what you're doing here. Why don't you just
start discussion threads on the mailing list? Why do these
things need to be in Jira?
I thought the reason to put these in Jira was to "track" them so they
don't get lost. It seemed like a good idea to me. The discussion can
take place as Jira comments, and later can be easily located. I don't
have a strong preference, though.
-Marshall