[jira] [Commented] (UIMA-5106) uv3 constant "id" for FSs (Proposed new Feature for uv3)
[ https://issues.apache.org/jira/browse/UIMA-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15677858#comment-15677858 ]

Richard Eckart de Castilho commented on UIMA-5106:

In WebAnno, we're using the FS address for this - it remains stable as long as we use SerialFormat.SERIALIZED_TSI or SerialFormat.SERIALIZED.

> uv3 constant "id" for FSs (Proposed new Feature for uv3)
>
> Key: UIMA-5106
> URL: https://issues.apache.org/jira/browse/UIMA-5106
> Project: UIMA
> Issue Type: New Feature
> Components: Core Java Framework
> Reporter: Marshall Schor
> Priority: Minor
>
> Add constant ID for FSs. This would be an incrementing, long value. It would be constant through serialization/deserialization cycles. There would be a lazily created map from longs to FSs (via weak links) to allow direct access from the ID to the FS. Lazy intent is to not have a cost for this (space/time) other than the cost for 1 long / FS, if it is not used.
> We could make this feature optional, as well, to avoid the 8 bytes per FS overhead, but in V3, I think that's not a good tradeoff (space savings vs complexity).
> Issues:
> * Current design allows parallelism of services, with returned results "stacked" into receiving CAS; would need to change (some of) the IDs coming back.
> CAS would need to have the high-water-mark value as part of serializations.
> Backwards compatibility:
> * loading V2 CASs: generate new IDs upon loading.
> * serializing to V2: (for connecting to V2 services): drop the IDs.
> This is a proposed new V3 feature; comments appreciated.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
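For readers who want something concrete, the lazily created map described in the quoted issue could look roughly like the sketch below. This is only an illustration under stated assumptions, not UIMA code: `IdentifiedFs`, `FsIdRegistry`, and `enableLookup` are hypothetical names, and the sketch assumes the caller can enumerate the FSs that are still reachable (for example by walking the indexes) when lookup is first requested.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a feature structure carrying a constant long ID. */
final class IdentifiedFs {
    final long id;
    final String label;

    IdentifiedFs(long id, String label) {
        this.id = id;
        this.label = label;
    }
}

/** Hypothetical per-CAS registry: IDs are always assigned, the lookup map is built lazily. */
final class FsIdRegistry {
    private long highWaterMark = 0;                       // last ID handed out
    private Map<Long, WeakReference<IdentifiedFs>> byId;  // stays null until lookup is enabled

    /** Assigns the next ID; costs only one long per FS while lookup is unused. */
    IdentifiedFs create(String label) {
        IdentifiedFs fs = new IdentifiedFs(++highWaterMark, label);
        if (byId != null) {
            byId.put(fs.id, new WeakReference<>(fs));
        }
        return fs;
    }

    /** Builds the map on first use from the FSs the caller can still reach. */
    void enableLookup(Iterable<IdentifiedFs> reachable) {
        if (byId == null) {
            byId = new HashMap<>();
            for (IdentifiedFs fs : reachable) {
                byId.put(fs.id, new WeakReference<>(fs));
            }
        }
    }

    /** Returns the FS for an ID, or null if unknown, collected, or lookup is not enabled. */
    IdentifiedFs lookup(long id) {
        if (byId == null) {
            return null;
        }
        WeakReference<IdentifiedFs> ref = byId.get(id);
        if (ref == null) {
            return null;
        }
        IdentifiedFs fs = ref.get();
        if (fs == null) {
            byId.remove(id); // the weak link was cleared; drop the stale entry
        }
        return fs;
    }

    boolean lookupEnabled() {
        return byId != null;
    }

    /** The value that would have to be written out with each serialization. */
    long highWaterMark() {
        return highWaterMark;
    }
}
```

The weak references mean the map never keeps an FS alive by itself, which matches the "via weak links" wording in the issue; the high-water mark is exposed because the issue notes it would have to be part of the serialized form.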
[jira] [Commented] (UIMA-5106) uv3 constant "id" for FSs (Proposed new Feature for uv3)
[ https://issues.apache.org/jira/browse/UIMA-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674095#comment-15674095 ]

Daniel Gruhl commented on UIMA-5106:

In systems with persistent analytics (that is, where CASes are stored long term and incrementally annotated, often by humans), it is very helpful to have a stable UUID for a feature structure. For example, there may be a document in a CAS that is under analysis. Being able to refer to a span of that sofa and send it to a human for review or adjudication is very helpful. It also allows the use of CASes to hold "entity information", that is, frames of knowledge, or to represent higher-level concepts (e.g., a web site CAS can be pointed to by all its page CASes).

This was critical in large persistent UIMA systems such as WebFountain, and it would be nice to see it make its way into the standard.
[jira] [Commented] (UIMA-5106) uv3 constant "id" for FSs (Proposed new Feature for uv3)
[ https://issues.apache.org/jira/browse/UIMA-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605866#comment-15605866 ]

Marshall Schor commented on UIMA-5106:

# re: getting a blessed id feature that all FSs would inherit: I wasn't thinking of users adding a new feature on TOP or Annotation - you are correct in observing that this would be restricted to some user-defined type which wasn't one of the built-in "feature-final" types. A key idea is that only selected types would have this - the ones you wanted it for.
# re: users wanting their own stable IDs - they can manage this themselves. True, but the dev list has had requests for UIMA to help here. There were 2 kinds of help wanted:
#* a) assigning unique ids, and
#* b) having a way to go from those IDs to the associated Feature Structures (a map).

As we move into more distributed environments, having some principled way to have a hierarchical naming that results in guaranteed unique names seems useful; this context is where the OID idea came up. But you may be right - this may not be of very much interest (yet) to the wider community. (Community - if this is wrong, please speak up :-) ).

> uv3 constant "id" for FSs (Proposed new Feature for uv3)
>
> Key: UIMA-5106
> URL: https://issues.apache.org/jira/browse/UIMA-5106
> Project: UIMA
> Issue Type: New Feature
> Components: Core Java Framework
> Reporter: Marshall Schor
> Priority: Minor
> Fix For: 3.0.0SDKexp
>
> Add constant ID for FSs. This would be an incrementing, long value. It would be constant through serialization/deserialization cycles. There would be a lazily created map from longs to FSs (via weak links) to allow direct access from the ID to the FS. Lazy intent is to not have a cost for this (space/time) other than the cost for 1 long / FS, if it is not used.
> We could make this feature optional, as well, to avoid the 8 bytes per FS overhead, but in V3, I think that's not a good tradeoff (space savings vs complexity).
> Issues:
> * Current design allows parallelism of services, with returned results "stacked" into receiving CAS; would need to change (some of) the IDs coming back.
> CAS would need to have the high-water-mark value as part of serializations.
> Backwards compatibility:
> * loading V2 CASs: generate new IDs upon loading.
> * serializing to V2: (for connecting to V2 services): drop the IDs.
> This is a proposed new V3 feature; comments appreciated.
[jira] [Commented] (UIMA-5106) uv3 constant "id" for FSs (Proposed new Feature for uv3)
[ https://issues.apache.org/jira/browse/UIMA-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605577#comment-15605577 ]

Richard Eckart de Castilho commented on UIMA-5106:

Hm, ok. So then I am not sure if this feature here is needed. If users want to assign their own stable IDs, they can create a feature for that and are good. But users cannot create new features on high-level types (e.g. on TOP or Annotation) - is it that you wanted to introduce a "blessed" external ID feature that all FSes would inherit?
[jira] [Commented] (UIMA-5106) uv3 constant "id" for FSs (Proposed new Feature for uv3)
[ https://issues.apache.org/jira/browse/UIMA-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605501#comment-15605501 ]

Marshall Schor commented on UIMA-5106:

V3 isn't planning to remove the existing IDs, which work as you mention above, so you could continue to use them. It also currently assigns values and adjusts values coming back from parallel executing remote services so they remain unique. None of this is planned to change, except it is somewhat likely the actual IDs will be changed from incrementing-by-1 to incrementing-exactly-like-v2-increments-them :-) .
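To make the "adjusts values coming back from parallel executing remote services" remark concrete, here is a minimal sketch of one way such an adjustment could work. It is an assumption for illustration, not the algorithm UIMA actually uses: `ReplyMerger` is a hypothetical class, and the sketch assumes every service numbers its new FSs starting just above the high-water mark the CAS had when it was sent.

```java
/** Hypothetical merger that keeps IDs unique when several replies are stacked into one CAS. */
final class ReplyMerger {
    private final int sentMark;   // high-water mark when the CAS was sent out
    private int currentMark;      // high-water mark after the replies merged so far

    ReplyMerger(int sentMark) {
        this.sentMark = sentMark;
        this.currentMark = sentMark;
    }

    /** Shifts the IDs of one reply's new FSs past everything merged before it. */
    int[] merge(int[] replyIds) {
        int offset = currentMark - sentMark; // 0 for the first reply
        int[] adjusted = new int[replyIds.length];
        int max = currentMark;
        for (int i = 0; i < replyIds.length; i++) {
            if (replyIds[i] <= sentMark) {
                throw new IllegalArgumentException("not a new FS id: " + replyIds[i]);
            }
            adjusted[i] = replyIds[i] + offset;
            max = Math.max(max, adjusted[i]);
        }
        currentMark = max;
        return adjusted;
    }

    int currentMark() {
        return currentMark;
    }
}
```

With a CAS sent at mark 10, a first reply with IDs 11, 12, 13 is kept as is, and a second reply with IDs 11, 12 becomes 14, 15. The sketch only renumbers the FSs themselves; any copy of an ID that a service stored elsewhere would not be touched.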
[jira] [Commented] (UIMA-5106) uv3 constant "id" for FSs (Proposed new Feature for uv3)
[ https://issues.apache.org/jira/browse/UIMA-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605494#comment-15605494 ]

Marshall Schor commented on UIMA-5106:

Another user says we should look at / consider using OIDs. https://en.wikipedia.org/wiki/Object_identifier

The general use case is for future UIMA uses where Feature Structures are generated and stored in a potentially widely distributed manner. This could solve this problem:
* a client sends a CAS to 2 services (for parallel processing), which both process it and return it
* the client (first thought) would adjust the unique IDs for one of the returned CAS's new feature structures. This is actually done today for the internal IDs. But, on reflection, we might imagine that the purpose of having the unique ID was to put that value into other features as well. There is no reasonable way to find all those uses and re-adjust them as well, I think.

Using OIDs solves this, because they don't need adjusting. It could be implemented along these lines:
* normal OIDs for new FSs would be, for instance ".1", ".2", ...
* OIDs for new FSs produced at a service from a client would have OIDs of .8.1, .8.2, ... for one service, and .9.1, .9.2 etc, for another.
* These OIDs would never need adjusting.
* The prefix (.8 .9, in the above example) could be generated by the client, and sent along with the CAS to each remote service call

Combining this with the facility to only have these things attached to the subset of Feature Structures users want unique ids for (using the reserved feature name, which we might call uimaOID), this feels like a good direction to consider, especially for farther-in-the-future use cases.
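The prefix scheme in the comment above could be sketched as follows. This is a minimal illustration, not a proposed UIMA API: `OidAllocator` and `forService` are hypothetical names, and the sketch assumes the client reserves one of its own OIDs as the prefix it hands to each service.

```java
/** Hypothetical allocator for dotted, hierarchical IDs such as ".1" or ".8.2". */
final class OidAllocator {
    private final String prefix; // "" at the client, e.g. ".8" inside a service
    private int next = 1;

    OidAllocator(String prefix) {
        this.prefix = prefix;
    }

    /** Returns the next OID under this allocator's prefix. */
    String nextOid() {
        return prefix + "." + next++;
    }

    /** Reserves one OID of this allocator and uses it as the prefix for a remote service. */
    OidAllocator forService() {
        return new OidAllocator(nextOid());
    }
}
```

A client that has already handed out ".1" and ".2" would give its two services the prefixes ".3" and ".4"; the services then produce ".3.1", ".3.2", ... and ".4.1", ... independently. Because a prefix is itself a reserved OID, nothing produced under it can collide with the client's own values, so no adjusting is needed when the replies come back.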
[jira] [Commented] (UIMA-5106) uv3 constant "id" for FSs (Proposed new Feature for uv3)
[ https://issues.apache.org/jira/browse/UIMA-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605216#comment-15605216 ]

Richard Eckart de Castilho commented on UIMA-5106:

Being used to UIMA managing IDs, I would prefer that UIMA continues to manage the IDs and that it automatically and always assigns them - just like a database auto-increment primary key. I would also prefer if they continue to be kept separate from the feature space.

IMHO using an int ID should be sufficient. Int worked so far... long would be nice... but not really important.

IMHO all feature structures should have the ID, even the FSArray.
[jira] [Commented] (UIMA-5106) uv3 constant "id" for FSs (Proposed new Feature for uv3)
[ https://issues.apache.org/jira/browse/UIMA-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605203#comment-15605203 ]

Marshall Schor commented on UIMA-5106:

Thinking harder about this, I'd like to close this Jira as won't-do-it-this-way, and open a new one that changes the goal slightly to support a user-specified unique ID feature which could selectively be added to selected Feature Structure (FS) declarations.

The main difference is this allows users to specify which FS Types they want this additional ID on. This allows other FSs to remain more light-weight. Some consequences:
* The built-in FSArray would not have this ID (it doesn't have fields).
* No space cost in FSs of this when not being used
* No space/time cost for doing the special indexing by id for FSs the user is not interested in (for example, the little FSs that make up the list cells in the various FSLists).

2 approaches come to mind:
# having a "reserved" feature name. The user would declare this feature with range "long" on any FS where they wanted the unique ID
# letting users designate one or more features of type long to be a unique-id, using an API call.

The 2nd approach has some difficulties with type merging - the "application" consuming someone else's aggregate+typesystem may not know the other's assumptions about unique-id. So I think the "reserved name" approach would be best.

Possible feature name: uimaBuiltInUID or uimaUID (UIMA Unique ID). Other thoughts welcome.
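As a rough illustration of the "reserved name" approach, the sketch below assigns and indexes an ID only for types that declare the reserved feature. Everything here is hypothetical: `ReservedUidSketch` is not a UIMA class, types and FSs are modelled as plain strings, and `uimaUID` is simply one of the candidate names floated in the comment above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical model: only types declaring the reserved feature get a unique ID and an index entry. */
final class ReservedUidSketch {
    /** Candidate name from the discussion; not an actual UIMA constant. */
    static final String RESERVED_FEATURE = "uimaUID";

    private final Map<String, Set<String>> featuresByType = new HashMap<>();
    private final Map<Long, String> fsByUid = new HashMap<>();
    private long nextUid = 1;

    void declareType(String typeName, Set<String> featureNames) {
        featuresByType.put(typeName, featureNames);
    }

    /** True if the type opted in by declaring the reserved feature. */
    boolean wantsUid(String typeName) {
        Set<String> features = featuresByType.get(typeName);
        return features != null && features.contains(RESERVED_FEATURE);
    }

    /** Returns the assigned unique ID, or 0 when the type did not opt in (no ID, no index entry). */
    long create(String typeName, String description) {
        if (!wantsUid(typeName)) {
            return 0L;
        }
        long uid = nextUid++;
        fsByUid.put(uid, typeName + ":" + description);
        return uid;
    }

    String lookup(long uid) {
        return fsByUid.get(uid);
    }

    int indexedCount() {
        return fsByUid.size();
    }
}
```

Small helper FSs such as list cells never reach the index in this model, which is the space/time saving the consequences list describes. Because opting in is part of the type declaration itself, a merged type system carries the information along, which is the advantage over designating the feature through an API call.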
[jira] [Commented] (UIMA-5106) uv3 constant "id" for FSs (Proposed new Feature for uv3)
[ https://issues.apache.org/jira/browse/UIMA-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15603636#comment-15603636 ]

Marshall Schor commented on UIMA-5106:

I was planning to use the existing _id() for this; making it a long turns out to cascade into a whole lot of work. I think this is designing ahead of need - so I plan to keep this as an int as it is now, until we see some real use requirement.
[jira] [Commented] (UIMA-5106) uv3 constant "id" for FSs (Proposed new Feature for uv3)
[ https://issues.apache.org/jira/browse/UIMA-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15498418#comment-15498418 ]

Richard Eckart de Castilho commented on UIMA-5106:

I understand, thanks.

Regarding the preservation of IDs across serialization: this is very useful. Maybe it should not always be mandatory. A user may intentionally want to "garbage collect" the ID space. E.g. right now with v2, I use a variant of SERIALIZED if I want to preserve IDs and COMPRESSED_FILTERED if I want to garbage-collect IDs (and FSes). I could imagine that with v3, the preservation of IDs could become a parameter to some serialization/deserialization formats.
[jira] [Commented] (UIMA-5106) uv3 constant "id" for FSs (Proposed new Feature for uv3)
[ https://issues.apache.org/jira/browse/UIMA-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15498033#comment-15498033 ]

Marshall Schor commented on UIMA-5106:

Only because the low level CAS address in v2 is not guaranteed to remain preserved across various serializations/deserializations. This would be "elevating" this previously "internal use" value (that many users made use of, in spite of its non-guarantees of stability), to a more official and stable status.
[jira] [Commented] (UIMA-5106) uv3 constant "id" for FSs (Proposed new Feature for uv3)
[ https://issues.apache.org/jira/browse/UIMA-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15497557#comment-15497557 ]

Richard Eckart de Castilho commented on UIMA-5106:

I thought the ID property (it is not a feature as in Feature Structure) resembles the LowLevelCas address in v2, so I'm not exactly sure why this is considered to be a new feature.