I was mistaken about Java in one detail: for things like Integer(17), there are two ways to create one: new Integer(17) or Integer.valueOf(17). The first call always creates a fresh object that is not == to any other Integer object, while the second call reuses an existing Integer object for 17 (if one exists). The Javadocs encourage users to switch to Integer.valueOf(xxx) for efficiency.
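
To make the difference concrete, here is a minimal sketch. The class and method names are made up for illustration, and the warning suppression is there only because new Integer(int) is deprecated in recent Java versions:

```java
public class IntegerIdentityDemo {

    // new Integer(...) always allocates, so two calls never yield the same object.
    @SuppressWarnings({"deprecation", "removal"})
    static boolean freshInstancesAreDistinct() {
        Integer a = new Integer(17);
        Integer b = new Integer(17);
        return a != b; // reference comparison, not value comparison
    }

    // Integer.valueOf(...) returns a cached object for small values such as 17
    // (the Javadoc guarantees caching for -128..127).
    static boolean valueOfSharesInstance() {
        Integer a = Integer.valueOf(17);
        Integer b = Integer.valueOf(17);
        return a == b; // same cached object
    }

    public static void main(String[] args) {
        System.out.println("new Integer(17) instances distinct: " + freshInstancesAreDistinct());
        System.out.println("Integer.valueOf(17) instances shared: " + valueOfSharesInstance());
    }
}
```

Code that compares such values with == sees different results depending on which creation path was used, which is the same kind of dependency a UIMA user could have on distinct 0-length arrays.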

I'm now slightly leaning against making this change for UIMA, because of the edge cases where a user could have depended on object inequality for 0-length arrays and lists.

Users could "manually" achieve the same result by using the shared instance values and (for XMI serialization) marking any features that contain these values as "multi-references-allowed", so that deserialization would share them. This could become a suggested "best practice" for those who use 0-length arrays and empty lists.

Not doing this would make two Jiras a "won't fix":
https://issues.apache.org/jira/browse/UIMA-5564
https://issues.apache.org/jira/browse/UIMA-5566

What do others think?

-Marshall

On 9/13/2017 8:22 AM, Marshall Schor wrote:
> I posted a Jira for a proposed change in how 0-length UIMA arrays and lists
> are managed. These are immutable objects, and (theoretically) one instance
> (per CAS) could be shared.
>
> In the current implementation, this is managed explicitly by the user - they
> can use a bunch of new APIs to get shared instances.
>
> I'm thinking a better way is to make this automatically the case, and remove
> the new bunch of APIs (a smaller API set is always a good thing, for
> essentially the same functionality, IMHO). The implementation would change
> so that the calls that create "new" 0-length arrays/lists would instead of
> creating a new one, only do that if none already existed, and if one already
> did, it would return that one.
>
> This follows Java's general direction for immutable objects, like Strings and
> Integer values, which can be shared.
>
> For cases where people wanted/needed a CAS value "marker" that was tiny, but
> unique (like you get with Java's new Object()), we would keep "new TOP(aCas)"
> as something that generated unique instances. What do others think?
>
> I've seen large-scale implementations of UIMA pipelines with lots of defaulted
> 0-length arrays in them; this has the potential to improve space/time
> performance a reasonable amount for these.
>
> -Marshall
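
For reference, the automatic sharing proposed in the quoted message could look roughly like the sketch below. SketchCas and EmptyArray are hypothetical stand-ins, not UIMA classes, and keying the shared instances by a type-name string is an assumption made only for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class SharedEmptySketch {

    /** Hypothetical stand-in for an immutable 0-length array value; not a UIMA class. */
    static final class EmptyArray {
        final String typeName;

        EmptyArray(String typeName) {
            this.typeName = typeName;
        }
    }

    /** Hypothetical stand-in for a CAS; keeps at most one empty instance per type. */
    static final class SketchCas {
        private final Map<String, EmptyArray> emptyByType = new HashMap<>();

        /** Creates the empty instance only if none exists yet; otherwise returns the existing one. */
        EmptyArray emptyArray(String typeName) {
            return emptyByType.computeIfAbsent(typeName, EmptyArray::new);
        }
    }

    public static void main(String[] args) {
        SketchCas cas1 = new SketchCas();
        SketchCas cas2 = new SketchCas();

        // Within one CAS, repeated requests return the same object.
        System.out.println(cas1.emptyArray("IntegerArray") == cas1.emptyArray("IntegerArray"));

        // Sharing is per CAS, so two CASes hold different objects.
        System.out.println(cas1.emptyArray("IntegerArray") == cas2.emptyArray("IntegerArray"));
    }
}
```

Any user code that relied on two empty arrays created in the same CAS being != would change behavior under this scheme, which is the edge case behind leaning against the change.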
