[jira] [Comment Edited] (UIMA-5662) uv3 support CAS deserialization subsequent low level access

Richard Eckart de Castilho (JIRA) Fri, 08 Dec 2017 10:31:54 -0800

    [ 
https://issues.apache.org/jira/browse/UIMA-5662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283983#comment-16283983
 ]


Richard Eckart de Castilho edited comment on UIMA-5662 at 12/8/17 6:30 PM:
---------------------------------------------------------------------------

I'm re-reading the proposal. Reading it the second time, it seems as if you do 
not really plan on adding a new built-in type to the type system, but rather on 
having one or more maps in parallel to the CAS in which the user could track 
IDs. This seems like the approach taken in the XMI deserializer/serializer, 
where (I believe) ID information can be recorded during deserialization and 
re-used during serialization.

I'm still not very convinced though. E.g. in the case of the CAS/annotation 
editor, I'd not only have to keep the CAS around, but also the map. It seems 
like when adding new FSes to the CAS, I'd have to manually figure out the next 
ID.

True, the approach would allow supporting multiple maps. I could imagine it to 
be an interesting approach under some conditions/circumstances:

* lookups FS -> ID are fast
* lookups ID -> FS are fast (i.e. a uni-directional map would not be sufficient)
* the maps are stored directly inside the CAS so that the client-code doesn't 
have to juggle them around manually
* the client code can set up an ID assignment strategy for each map for cases 
where a new FS is created and added to the CAS
* one such strategy should allow explicitly adding an ID->FS mapping e.g. in 
reader components where IDs may be obtained from the file format being read. 
I.e. the access to the maps should not be limited to UIMA framework code.
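
To make the conditions above a bit more concrete, here is a minimal sketch of 
what such a map could look like. Everything in it is hypothetical: the class 
name FsIdMap, its methods, and the "next ID is one beyond the largest ID seen" 
assignment strategy are assumptions for illustration and not part of the UIMA 
API. A type parameter stands in for FeatureStructure so that the snippet 
compiles without UIMA on the classpath.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical bidirectional FS <-> ID map; not an existing UIMA class. */
public class FsIdMap<T> {
    // FS -> ID, keyed by object identity so two equal-looking FSes stay distinct.
    private final Map<T, Integer> idByFs = new IdentityHashMap<>();
    // ID -> FS, the reverse direction a uni-directional map would lack.
    private final Map<Integer, T> fsById = new HashMap<>();
    // Assumed strategy: hand out the ID one beyond the largest ID seen so far.
    private int nextId;

    public FsIdMap(int firstId) {
        this.nextId = firstId;
    }

    /** Explicitly registers a mapping, e.g. an ID a reader found in its input. */
    public void put(int id, T fs) {
        if (fsById.containsKey(id) || idByFs.containsKey(fs)) {
            throw new IllegalArgumentException("ID or FS already mapped: " + id);
        }
        fsById.put(id, fs);
        idByFs.put(fs, id);
        if (id >= nextId) {
            nextId = id + 1;
        }
    }

    /** Assigns the next free ID to a new FS, or returns the ID it already has. */
    public int add(T fs) {
        Integer existing = idByFs.get(fs);
        if (existing != null) {
            return existing;
        }
        int id = nextId;
        put(id, fs);
        return id;
    }

    /** Returns the FS for an ID, or null if the ID is unknown. */
    public T getFs(int id) {
        return fsById.get(id);
    }

    /** Returns the ID for an FS, or null if the FS is not tracked. */
    public Integer getId(T fs) {
        return idByFs.get(fs);
    }
}
```

Both lookups are plain hash lookups, and client code never has to work out the 
next ID by hand. Note that this sketch holds strong references, so every 
tracked FS stays reachable for as long as the map does.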

And some open questions:
* Would there be some way of controlling the XMI element IDs, maybe using a 
specially-named map, and thus removing the need for the current 
XmiSerializationSharedData?
* Is a general provision for transporting out-of-type-system information 
being introduced, and how would e.g. XMI deal with that?

There could be some risks:
* If the maps are stored like FSes during serialization (e.g. in XMI), then 
this could cause problems with existing code that reads/writes XMI.

For the time being, at least for me, a single hard-coded map would be 
sufficient. It could be used to transport ID information from formats that 
support it to other formats that support it, and it would be OK if this map 
were only effective in specific cases, e.g. not when one CAS is merged into 
another one. Potentially, that single map could even be used with XMI as a 
(partial?) replacement for the XmiSerializationSharedData.


> uv3 support CAS deserialization subsequent low level access
> -----------------------------------------------------------
>
>                 Key: UIMA-5662
>                 URL: https://issues.apache.org/jira/browse/UIMA-5662
>             Project: UIMA
>          Issue Type: Improvement
>          Components: Core Java Framework
>    Affects Versions: 3.0.0SDK-beta
>            Reporter: Marshall Schor
>            Assignee: Marshall Schor
>            Priority: Minor
>             Fix For: 3.0.0SDK
>
>
> Some users depend on 1) constant v2-ids for FSs being preserved across 
> deserialization and serialization, and 2) low level CAS API access to these.
> V3 normally doesn't maintain tables linking ids to FSs, as these (unless weak 
> refs are used) prevent GC of unreachable FSs.
> Based on a mode, set by -Duima.deserialize_perserve_ids, and also 
> controllable by new config option per deserialize call, alter the 
> deserialization for those deserializers which know about v2 ids, to put these 
> into the map used for low-level CAS access, using the actual v2 ids, and 
> change the v3 next available id for future new FSs to be 1 beyond the end.
> The -Duima.deserialize-preserve_ids global setting is needed to handle the 
> use case of some annotators using low-level APIs, when part of a pipeline is 
> "remoted". 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
