[jira] [Commented] (UIMA-5662) uv3 support CAS deserialization subsequent low level access

Richard Eckart de Castilho (JIRA) Sun, 17 Dec 2017 11:46:04 -0800

    [ 
https://issues.apache.org/jira/browse/UIMA-5662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294292#comment-16294292
 ]


Richard Eckart de Castilho commented on UIMA-5662:
--------------------------------------------------

Because of the way I am currently used to dealing with the state of affairs in 
UIMA v2, I may be biased towards a specific mode, namely:

* that all FSes always have an address
* that for specific serialization formats these addresses are stable and there 
is no garbage collection during save/load
* that for other serialization formats, the addresses are not stable, but there 
is garbage collection during save/load

So currently, instead of using an API to control when GC should happen, I use 
one or the other serialization method. Mind that this happens in a web 
application (multiple users, concurrent access). Using the current approach, I 
can choose between stable addresses and GC on a per-CAS-instance basis even when 
working with multiple CASes simultaneously in a single thread (e.g. when doing 
a diff across CASes). For example, when a user opens a document, the 
server-side processing of the web request first loads the CAS from disk 
(CasCompleteSerializer, stable IDs), then stores it again into a byte array 
(Binary format 6, GC), then loads it again from the byte array into a new CAS 
with a potentially updated type system (Binary format 6, lenient loading), then 
saves it again to disk (CasCompleteSerializer, stable IDs). While the user 
continues to work on the document, load/save always uses 
CasCompleteSerializer, which avoids GC.

If the UIMA API provides more control over the FS<->ID mapping and if more file 
formats support stable IDs, it would probably no longer be necessary to use 
different formats to achieve this effect. Instead, I would probably try to do 
the following:

* store the data only in a single format (preferably compressed form 6 with 
lenient loading, assuming that it eventually supports stable IDs)
* when a user opens a document, load it without FS<->ID mapping to allow for 
garbage collection; save it again
* when a user continues to work on a document, load/save it with FS<->ID 
mapping enabled
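
To make the two load modes above concrete, here is a toy model in plain Java. 
It is only a sketch under my own assumptions and does not use any UIMA API: 
`IdModeSketch`, `IdMode` and `load` are hypothetical names, a stored CAS is 
reduced to a map from ID to FS label, and "GC" is modelled as dropping 
unreachable entries and renumbering the survivors.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Toy model of per-load ID handling; hypothetical, not part of UIMA. */
final class IdModeSketch {

    /** Mode chosen per load call, i.e. per CAS instance. */
    enum IdMode { PRESERVE_IDS, ALLOW_GC }

    /**
     * @param stored    ID -> FS label as found in the serialized form
     * @param reachable IDs of FSes that are still reachable (e.g. indexed)
     * @param mode      whether to keep the stored IDs or to allow GC
     * @return the ID -> FS label table after loading
     */
    static Map<Integer, String> load(Map<Integer, String> stored,
                                     Set<Integer> reachable,
                                     IdMode mode) {
        Map<Integer, String> loaded = new LinkedHashMap<>();
        if (mode == IdMode.PRESERVE_IDS) {
            // Keep every FS under its original ID, reachable or not.
            loaded.putAll(stored);
            return loaded;
        }
        // ALLOW_GC: drop unreachable FSes and hand out fresh IDs.
        int nextId = 1;
        for (Map.Entry<Integer, String> entry : stored.entrySet()) {
            if (reachable.contains(entry.getKey())) {
                loaded.put(nextId++, entry.getValue());
            }
        }
        return loaded;
    }
}
```

In this model, opening a document would call `load` with `ALLOW_GC` once and 
save the result, and every later load/save while the user keeps editing would 
use `PRESERVE_IDS`, so that IDs held by the editor stay valid.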

If the FS<->ID mapping could make use of weak references, I would probably make 
use of that: once an FS is no longer reachable, the editor has no use for it 
anymore.
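
What I have in mind with weak references could look roughly like the 
following. Again this is a plain-Java sketch and not UIMA code: `Fs` and 
`WeakFsIdMap` are hypothetical stand-ins, and how UIMA v3 would actually 
implement such a table is not something I am claiming here.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a feature structure; not a UIMA class. */
final class Fs {
    final String label;

    Fs(String label) {
        this.label = label;
    }
}

/** Hypothetical ID -> FS table that does not keep FSes alive by itself. */
final class WeakFsIdMap {
    private final Map<Integer, WeakReference<Fs>> byId = new HashMap<>();
    private int nextId = 1;

    /** Registers an FS under a fresh ID and returns that ID. */
    int register(Fs fs) {
        int id = nextId++;
        byId.put(id, new WeakReference<>(fs));
        return id;
    }

    /** Returns the FS for an ID, or null if unknown or already collected. */
    Fs lookup(int id) {
        WeakReference<Fs> ref = byId.get(id);
        if (ref == null) {
            return null;
        }
        Fs fs = ref.get();
        if (fs == null) {
            // The FS was garbage collected; drop the stale table entry.
            byId.remove(id);
        }
        return fs;
    }
}
```

As long as the editor holds a normal reference to an FS, `lookup` keeps 
returning it under the same ID. Once nothing else references the FS, the JVM 
is free to collect it, and a later `lookup` returns null.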

> uv3 support CAS deserialization subsequent low level access
> -----------------------------------------------------------
>
>                 Key: UIMA-5662
>                 URL: https://issues.apache.org/jira/browse/UIMA-5662
>             Project: UIMA
>          Issue Type: Improvement
>          Components: Core Java Framework
>    Affects Versions: 3.0.0SDK-beta
>            Reporter: Marshall Schor
>            Assignee: Marshall Schor
>            Priority: Minor
>             Fix For: 3.0.0SDK
>
>
> Some users depend on 1) constant v2-ids for FSs being preserved across 
> deserialization and serialization, and 2) low-level CAS API access to these.
> V3 normally doesn't maintain tables linking ids to FSs, as these (unless weak 
> refs are used) prevent GC of unreachable FSs.
> Based on a mode, set by -Duima.deserialize_perserve_ids, and also 
> controllable by new config option per deserialize call, alter the 
> deserialization for those deserializers which know about v2 ids, to put these 
> into the map used for low-level CAS access, using the actual v2 ids, and 
> change the v3 next available id for future new FSs to be 1 beyond the end.
> The -Duima.deserialize-preserve_ids global setting is needed to handle the 
> use case of some annotators using low-level APIs, when part of a pipeline is 
> "remoted". 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
