On 23.07.2015, at 19:17, Joern Kottmann <[email protected]> wrote:

>> If this is the scenario, another option would be to have the serialized CASes
>> stored along with a reference to their type system, and have some new
>> deserialization capability be able to locate the referred-to type system
>> along with the CAS to be read in.  Would that "solve" this issue, or are
>> there other aspects?

https://issues.apache.org/jira/browse/UIMA-2127 ;)

But having the TS stored alongside the CAS also is nice - see below.

> It would probably solve it, but it is not a simple solution either. That
> would mean that the type system gets switched frequently and has to be
> looked up all the time.

For DKPro Core, I have implemented a BinaryCasWriter that stores the type
system in the same file as the binary serialized CAS. It is not always the best
solution because it adds a fixed overhead to every file, but it is very
convenient. Optionally, the type system can be stored externally in a separate
file to avoid this overhead. Whether and how this type system can be used
depends on which of the six kinds of binary serialization is being used. See
[1] for an overview of these formats and their properties.
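
As a rough illustration of the two storage options described above, the
following sketch writes a small container in which the type system is
optionally embedded in front of the CAS payload. The class name, the "DEMO"
magic value and the byte layout are invented for this example and are not the
actual DKPro Core file format; the type system and the CAS are treated as
opaque byte arrays.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical container layout (NOT the real DKPro Core format):
//   int magic | boolean tsEmbedded | [int tsLength | ts bytes] | int casLength | cas bytes
final class EmbeddedTypeSystemSketch {
    static final int MAGIC = 0x44454D4F; // ASCII "DEMO", made up for this sketch

    /** Writes the CAS bytes; embeds the type system only if it is non-null. */
    static byte[] write(byte[] casBytes, byte[] typeSystemBytes) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(MAGIC);
            boolean embedded = typeSystemBytes != null;
            out.writeBoolean(embedded);
            if (embedded) {
                out.writeInt(typeSystemBytes.length);
                out.write(typeSystemBytes);
            }
            out.writeInt(casBytes.length);
            out.write(casBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Returns { typeSystemBytes or null, casBytes }. */
    static byte[][] read(byte[] file) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
            if (in.readInt() != MAGIC) {
                throw new IllegalArgumentException("Not a demo container");
            }
            byte[] ts = null;
            if (in.readBoolean()) {
                ts = new byte[in.readInt()];
                in.readFully(ts);
            }
            byte[] cas = new byte[in.readInt()];
            in.readFully(cas);
            return new byte[][] { ts, cas };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this layout, embedding costs the size of the type system plus a length
field per file. That per-file overhead is what disappears when null is passed
and the type system is kept in a separate file instead.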

In the BinaryCasReader, depending on the type of serialization, either:
- there is a failure if the pipeline CAS type system is not compatible with the
persisted CAS;
- the type system in the pipeline CAS is reinitialized from the persisted CAS;
- the data from the persisted CAS is loaded leniently, dropping all FSes whose
types are not defined in the pipeline CAS type system.
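
The three behaviours above can be sketched with a simplified model in which a
type system is just a set of type names and each feature structure is
represented only by the name of its type. The names ReaderModeSketch, Mode and
Loaded are hypothetical and do not correspond to the BinaryCasReader API, and
"compatible" is reduced here to "the pipeline defines every persisted type".

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified model, not the DKPro Core implementation.
final class ReaderModeSketch {
    enum Mode { STRICT, REINITIALIZE, LENIENT }

    /** Type names the pipeline CAS ends up with, plus the FSes that were loaded. */
    static final class Loaded {
        final Set<String> pipelineTypes;
        final List<String> featureStructures;

        Loaded(Set<String> pipelineTypes, List<String> featureStructures) {
            this.pipelineTypes = new LinkedHashSet<>(pipelineTypes);
            this.featureStructures = new ArrayList<>(featureStructures);
        }
    }

    static Loaded load(Mode mode, Set<String> pipelineTypes,
                       Set<String> persistedTypes, List<String> persistedFs) {
        switch (mode) {
        case STRICT:
            // Fail if the pipeline type system cannot represent the persisted data.
            if (!pipelineTypes.containsAll(persistedTypes)) {
                throw new IllegalStateException("Incompatible type systems");
            }
            return new Loaded(pipelineTypes, persistedFs);
        case REINITIALIZE:
            // The persisted type system replaces the one in the pipeline CAS.
            return new Loaded(persistedTypes, persistedFs);
        case LENIENT:
            // Keep the pipeline type system and drop FSes of unknown types.
            List<String> kept = new ArrayList<>();
            for (String fsType : persistedFs) {
                if (pipelineTypes.contains(fsType)) {
                    kept.add(fsType);
                }
            }
            return new Loaded(pipelineTypes, kept);
        default:
            throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```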

Furthermore, the BinaryCasReader auto-detects the binary format and loads it,
be it the Java serialization-based format, one of the binary formats that
Marshall recently created, or our extended format that also embeds the
type system in the file.
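
One common way to implement this kind of auto-detection is to peek at the
first bytes of the stream and then rewind it. The sketch below does that with
mark/reset. Only the Java serialization magic is taken from the JDK
(ObjectStreamConstants.STREAM_MAGIC); the "UIMA" and "DKPro1" markers are
assumptions made for illustration and should be checked against the actual
implementations, and the class and enum names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.ObjectStreamConstants;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class FormatSniffSketch {
    enum Format { JAVA_SERIALIZED, UIMA_BINARY, EMBEDDED_TYPE_SYSTEM, UNKNOWN }

    // 0xAC 0xED: the header written by java.io.ObjectOutputStream.
    private static final byte[] JAVA_MAGIC = {
        (byte) (ObjectStreamConstants.STREAM_MAGIC >> 8),
        (byte) ObjectStreamConstants.STREAM_MAGIC };
    // Assumed markers, for illustration only.
    private static final byte[] UIMA_MAGIC = "UIMA".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EMBEDDED_MAGIC = "DKPro1".getBytes(StandardCharsets.US_ASCII);

    /** Peeks at the header and rewinds, so the stream can be handed to a reader. */
    static Format detect(BufferedInputStream in) {
        byte[] head = new byte[8];
        int n = 0;
        try {
            in.mark(head.length);
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break; // stream is shorter than the peek window
                }
                n += r;
            }
            in.reset();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (startsWith(head, n, JAVA_MAGIC)) return Format.JAVA_SERIALIZED;
        if (startsWith(head, n, EMBEDDED_MAGIC)) return Format.EMBEDDED_TYPE_SYSTEM;
        if (startsWith(head, n, UIMA_MAGIC)) return Format.UIMA_BINARY;
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] head, int available, byte[] magic) {
        if (available < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because the stream is reset after peeking, the same stream can then be passed
unchanged to whichever deserializer matches the detected format.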

Keep in mind that, depending on the use-case, a different kind of serialization
may be appropriate.

For me, this covers in particular the following use-cases:

- fast (de)serialization of the entire CAS
- compact binary format (some formats more, some less)
- stable FS addresses (in some formats)
- restoring the pipeline CAS type system from file (i.e. CAS can be initialized 
with an empty type system on creation and TS is set by reader - in some formats)
- lenient loading of data allowing for different TSes on disk and in pipeline 
(in some formats)

Would such an approach cover (some of) your use-cases?

Cheers,

-- Richard

[1] http://www.dkpro.org/dkpro-core/releases/1.7.0/apidocs/index.html?de/tudarmstadt/ukp/dkpro/core/io/bincas/BinaryCasWriter.html
