On Fri, Jul 24, 2015 at 12:46 PM, Richard Eckart de Castilho <[email protected]> wrote:
> On 23.07.2015, at 19:17, Joern Kottmann <[email protected]> wrote:
>
> >> If this is the scenario, another option would be to have the serialized CASes
> >> stored along with a reference to their type system, and have some new
> >> deserialization capability be able to locate the referred-to type system along
> >> with the CAS to be read in. Would that "solve" this issue, or are there other
> >> aspects?
>
> https://issues.apache.org/jira/browse/UIMA-2127 ;)
>
> But having the TS stored alongside the CAS also is nice - see below.
>
> > It would probably solve it, but it is not a simple solution either. That
> > would mean that the Type System get switched frequently and have be
> > looked up all the time.
>
> For DKPro Core, I have implemented a BinaryCasWriter that stores the type
> system in the same file as the binary serialized CAS. It is not always the
> best solution because it adds a fixed overhead to every file, but it is
> very convenient. Optionally, the type system can be stored externally in a
> separate file to avoid this overhead. If and how this typesystem can be
> used depends on which of the six kinds of binary serialization is being
> used. See [1] for an overview over these formats and their properties.

We have a few hundred million documents in the system, so storing the type
system with each document would be wasteful. It takes up storage and it has
to be parsed for each CAS.

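
To make the "reference instead of embedded copy" idea concrete, here is a minimal sketch in plain Java. It assumes that each document stores only a short id derived from the type system descriptor text, and that parsed type systems are cached so each distinct descriptor is parsed once rather than once per CAS. `TypeSystemCache`, `idOf`, `register` and `resolve` are hypothetical names for illustration; none of this is UIMA or DKPro Core API, and the actual parser is passed in as a function.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical helper (not UIMA API): maps type system ids to parsed type systems. */
final class TypeSystemCache<T> {
    private final Map<String, String> descriptors = new ConcurrentHashMap<>();
    private final Map<String, T> parsed = new ConcurrentHashMap<>();
    private final AtomicInteger parseCount = new AtomicInteger();
    private final Function<String, T> parser;

    /** The parser turns descriptor text into whatever type system object the caller uses. */
    TypeSystemCache(Function<String, T> parser) {
        this.parser = parser;
    }

    /** Derives a stable id from the descriptor text (SHA-256, lower-case hex). */
    static String idOf(String descriptorXml) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(descriptorXml.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Stores the descriptor once and returns the id to keep with each document. */
    String register(String descriptorXml) {
        String id = idOf(descriptorXml);
        descriptors.putIfAbsent(id, descriptorXml);
        return id;
    }

    /** Returns the parsed type system for an id, parsing it only on first use. */
    T resolve(String id) {
        return parsed.computeIfAbsent(id, key -> {
            String xml = descriptors.get(key);
            if (xml == null) {
                throw new IllegalArgumentException("Unknown type system id: " + key);
            }
            parseCount.incrementAndGet();
            return parser.apply(xml);
        });
    }

    /** Number of times the parser has actually been invoked. */
    int parseCount() {
        return parseCount.get();
    }
}
```

With this layout the per-document cost is one fixed-size id, and the lookup cost is a map access after the first parse. The price is the extra indirection: a reader needs access to the registry in addition to the CAS file.
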
> In the BinaryCasReader, depending on the type of serialization, either:
> - there is a failure if the pipeline CAS typesystem is not compatible with
>   the persisted CAS;
> - the type system in the pipeline CAS is reinitialized from the persisted
>   CAS;
> - the data from the persisted CAS is loaded leniently, dropping all FSes
>   that are not defined in the pipeline CAS typesystem
>
> Furthermore, the BinaryCasReader auto-detects the binary format and loads
> it, be it the Java serialization-based format or one of the binary formats
> that Marschall recently created, or our extended format that also embeds
> the typesystem in the file.
>
> Mind that depending on the use-case a different kind of serialization may
> be appropriate.
>
> For me, this covers in particular the following use-cases:
>
> - fast (de)serialization of the entire CAS
> - compact binary format (some more some less)
> - stable FS addresses (in some formats)
> - restoring the pipeline CAS type system from file (i.e. CAS can be
>   initialized with an empty type system on creation and TS is set by reader -
>   in some formats)
> - lenient loading of data allowing for different TSes on disk and in
>   pipeline (in some formats)
>
> Would such an approach cover (some of your) use-cases?

With the current design the best option is probably to store a type system
id with the document. It would be nice to avoid that additional complexity.

I think I have mainly two cases I can't really deal with:
- A CAS contains FSes of many types. I know a few of those types and would
  like to only work with them. Not interested at all in the FSes with other
  types.
- A CAS contains FSes of many types. I just want to deal with them as if
  they have a certain super-type. That could be FeatureStructure or
  AnnotationFS.

The CASes above have been produced by many different AAEs with similar, but
slightly different type systems.

Jörn
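
For illustration, the two cases listed above can be written down against a toy model. This is only a sketch of the desired behaviour, not of how UIMA implements it: `LenientView` and `Fs` are hypothetical stand-ins, the type hierarchy is assumed to be a plain child-to-parent map without cycles, and `my.Token`-style names in any usage are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only; Fs and LenientView are hypothetical and not UIMA classes. */
final class LenientView {

    /** Stand-in for a feature structure: a type name plus an opaque label. */
    static final class Fs {
        final String type;
        final String label;

        Fs(String type, String label) {
            this.type = type;
            this.label = label;
        }
    }

    /** Case 1: keep only FSes whose type is known, silently dropping the rest. */
    static List<Fs> keepKnown(List<Fs> all, Set<String> knownTypes) {
        List<Fs> kept = new ArrayList<>();
        for (Fs fs : all) {
            if (knownTypes.contains(fs.type)) {
                kept.add(fs);
            }
        }
        return kept;
    }

    /** Walks the child-to-parent map upwards; assumes the hierarchy has no cycles. */
    static boolean isA(Map<String, String> parentOf, String type, String ancestor) {
        for (String t = type; t != null; t = parentOf.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    /** Case 2: select every FS that is, directly or indirectly, of the given super-type. */
    static List<Fs> selectAs(List<Fs> all, Map<String, String> parentOf, String superType) {
        List<Fs> selected = new ArrayList<>();
        for (Fs fs : all) {
            if (isA(parentOf, fs.type, superType)) {
                selected.add(fs);
            }
        }
        return selected;
    }
}
```

In this model an FS whose type is missing from the parent map simply matches nothing but its own type name, which mirrors the "not interested in the other types" case: unknown types are skipped rather than causing a failure.
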
