On 23.07.2015, at 19:17, Joern Kottmann <[email protected]> wrote:

>> If this is the scenario, another option would be to have the serialized CASes
>> stored along with a reference to their type system, and have some new
>> deserialization capability be able to locate the referred-to type system
>> along with the CAS to be read in. Would that "solve" this issue, or are there
>> other aspects?
https://issues.apache.org/jira/browse/UIMA-2127 ;)

But having the TS stored alongside the CAS is also nice - see below.

> It would probably solve it, but it is not a simple solution either. That
> would mean that the Type System gets switched frequently and has to be
> looked up all the time.

For DKPro Core, I have implemented a BinaryCasWriter that stores the type system in the same file as the binary serialized CAS. It is not always the best solution because it adds a fixed overhead to every file, but it is very convenient. Optionally, the type system can be stored externally in a separate file to avoid this overhead.

Whether and how this type system can be used depends on which of the six kinds of binary serialization is being used. See [1] for an overview of these formats and their properties.

In the BinaryCasReader, depending on the type of serialization, one of the following happens:

- there is a failure if the pipeline CAS type system is not compatible with the persisted CAS;
- the type system in the pipeline CAS is reinitialized from the persisted CAS;
- the data from the persisted CAS is loaded leniently, dropping all FSes whose types are not defined in the pipeline CAS type system.

Furthermore, the BinaryCasReader auto-detects the binary format and loads it, be it the Java serialization-based format, one of the binary formats that Marshall recently created, or our extended format that also embeds the type system in the file.

Mind that, depending on the use-case, a different kind of serialization may be appropriate. For me, this covers in particular the following use-cases:

- fast (de)serialization of the entire CAS
- compact binary format (some more, some less)
- stable FS addresses (in some formats)
- restoring the pipeline CAS type system from file (i.e. the CAS can be initialized with an empty type system on creation and the TS is set by the reader - in some formats)
- lenient loading of data, allowing for different TSes on disk and in the pipeline (in some formats)

Would such an approach cover (some of) your use-cases?

Cheers,

-- Richard

[1] http://www.dkpro.org/dkpro-core/releases/1.7.0/apidocs/index.html?de/tudarmstadt/ukp/dkpro/core/io/bincas/BinaryCasWriter.html
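
To make the three reader behaviours described above a bit more concrete, here is a minimal sketch in plain Java. It deliberately does not use the UIMA or DKPro Core APIs and is not how BinaryCasReader is implemented: the names LoadPolicySketch, LoadPolicy, Loaded and load are hypothetical, a type system is modelled simply as a set of type names, each FS is represented only by its type name, and "compatible" is simplified to "the pipeline type system contains every persisted type".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LoadPolicySketch {

    // Hypothetical names for the three behaviours; not part of any real API.
    enum LoadPolicy { STRICT, REINIT_FROM_FILE, LENIENT }

    // Outcome of a load: the type system the pipeline CAS ends up with
    // and the type names of the FSes that were actually loaded.
    static final class Loaded {
        final Set<String> typeSystem;
        final List<String> fsTypes;

        Loaded(Set<String> typeSystem, List<String> fsTypes) {
            this.typeSystem = typeSystem;
            this.fsTypes = fsTypes;
        }
    }

    static Loaded load(LoadPolicy policy, Set<String> pipelineTypes,
            Set<String> persistedTypes, List<String> persistedFs) {
        switch (policy) {
        case STRICT:
            // Assumption: "compatible" means the pipeline knows every persisted type.
            if (!pipelineTypes.containsAll(persistedTypes)) {
                throw new IllegalStateException(
                        "Pipeline type system is not compatible with persisted CAS");
            }
            return new Loaded(pipelineTypes, new ArrayList<>(persistedFs));
        case REINIT_FROM_FILE:
            // The pipeline type system is replaced by the one stored with the data.
            return new Loaded(new LinkedHashSet<>(persistedTypes),
                    new ArrayList<>(persistedFs));
        case LENIENT:
            // Keep the pipeline type system; drop FSes whose type it does not define.
            List<String> kept = new ArrayList<>();
            for (String fsType : persistedFs) {
                if (pipelineTypes.contains(fsType)) {
                    kept.add(fsType);
                }
            }
            return new Loaded(pipelineTypes, kept);
        default:
            throw new IllegalArgumentException("Unknown policy: " + policy);
        }
    }

    public static void main(String[] args) {
        Set<String> pipeline = new LinkedHashSet<>(Arrays.asList("Token", "Sentence"));
        Set<String> persisted = new LinkedHashSet<>(Arrays.asList("Token", "Sentence", "Lemma"));
        List<String> fses = Arrays.asList("Token", "Lemma", "Sentence");

        Loaded lenient = load(LoadPolicy.LENIENT, pipeline, persisted, fses);
        System.out.println("lenient keeps: " + lenient.fsTypes);

        Loaded reinit = load(LoadPolicy.REINIT_FROM_FILE, pipeline, persisted, fses);
        System.out.println("reinit type system: " + reinit.typeSystem);
    }
}
```

The point of the sketch is only the trade-off: the strict variant protects against silent data loss, the reinitializing variant needs no type system up front, and the lenient variant tolerates type systems that drift apart between disk and pipeline.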
