Re: Ideas for UIMA v3

Joern Kottmann Fri, 24 Jul 2015 05:10:14 -0700

On Fri, Jul 24, 2015 at 12:46 PM, Richard Eckart de Castilho <[email protected]> wrote:


> On 23.07.2015, at 19:17, Joern Kottmann <[email protected]> wrote:
>
> >> If this is the scenario, another option would be to have the serialized CASes
> >> stored along with a reference to their type system, and have some new
> >> deserialization capability be able to locate the referred-to type system along
> >> with the CAS to be read in.  Would that "solve" this issue, or are there other
> >> aspects?
>
> https://issues.apache.org/jira/browse/UIMA-2127 ;)
>
> But having the TS stored alongside the CAS also is nice - see below.
>
> > It would probably solve it, but it is not a simple solution either. That
> > would mean that the type system gets switched frequently and has to be
> > looked up all the time.
>
> For DKPro Core, I have implemented a BinaryCasWriter that stores the type
> system in the same file as the binary serialized CAS. It is not always the
> best solution because it adds a fixed overhead to every file, but it is
> very convenient. Optionally, the type system can be stored externally in a
> separate file to avoid this overhead. Whether and how this type system can be
> used depends on which of the six kinds of binary serialization is being
> used. See [1] for an overview of these formats and their properties.
>
>
We have a few hundred million documents in the system, so storing the TS with
each document would be wasteful: it takes up storage, and it has to be parsed
for each CAS.



> In the BinaryCasReader, depending on the type of serialization, either:
> - there is a failure if the pipeline CAS typesystem is not compatible with
> the persisted CAS;
> - the type system in the pipeline CAS is reinitialized from the persisted
> CAS;
> - the data from the persisted CAS is loaded leniently, dropping all FSes
> that are not defined in the pipeline CAS typesystem
>
> Furthermore, the BinaryCasReader auto-detects the binary format and loads
> it, be it the Java serialization-based format, one of the binary formats
> that Marshall recently created, or our extended format that also embeds
> the type system in the file.
>
> Mind that depending on the use-case a different kind of serialization may
> be appropriate.
>
> For me, this covers in particular the following use-cases:
>
> - fast (de)serialization of the entire CAS
> - compact binary format (some more some less)
> - stable FS addresses (in some formats)
> - restoring the pipeline CAS type system from file (i.e. CAS can be
> initialized with an empty type system on creation and TS is set by reader -
> in some formats)
> - lenient loading of data allowing for different TSes on disk and in
> pipeline (in some formats)
>
> Would such an approach cover (some of your) use-cases?
>


With the current design, the best option is probably to store a type system
ID with each document. It would be nice to avoid that additional complexity.
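
As a rough sketch of what the reading side of such a type system ID could
look like, assuming each distinct type system is registered once under an ID:
the reader resolves the ID through a cache, so a type system is parsed once
per ID instead of once per CAS. TypeSystemCache and its loader function are
hypothetical names, and the String value merely stands in for a parsed type
system; none of this is UIMA API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical cache from a type system ID (stored with each document) to a
 * parsed type system. T stands in for whatever parsed representation is used.
 */
final class TypeSystemCache<T> {
    private final Map<String, T> parsed = new ConcurrentHashMap<>();
    private final Function<String, T> loader;

    /** The loader is assumed to locate and parse the type system for an ID. */
    TypeSystemCache(Function<String, T> loader) {
        this.loader = loader;
    }

    /** Returns the type system for the ID, invoking the loader only on first use. */
    T get(String typeSystemId) {
        return parsed.computeIfAbsent(typeSystemId, loader);
    }

    /** Number of distinct type systems loaded so far. */
    int size() {
        return parsed.size();
    }
}

final class TypeSystemCacheDemo {
    public static void main(String[] args) {
        int[] parses = {0};
        TypeSystemCache<String> cache = new TypeSystemCache<>(id -> {
            parses[0]++;                        // count how often we "parse"
            return "parsed type system " + id;  // stand-in for real parsing
        });

        // Three documents that reference two distinct type system IDs.
        String[] idsStoredWithDocuments = {"ts-a", "ts-b", "ts-a"};
        for (String id : idsStoredWithDocuments) {
            cache.get(id);
        }
        System.out.println("documents: " + idsStoredWithDocuments.length
                + ", parses: " + parses[0]);
    }
}
```

This keeps the per-document cost down to a short ID, but the ID registry and
the lookup are exactly the additional complexity mentioned above.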

There are mainly two cases I can't really deal with:
- A CAS contains FSes of many types. I know a few of those types and would
like to work only with them. I am not interested at all in the FSes of other
types.
- A CAS contains FSes of many types. I just want to deal with them as if
they all had a certain super-type, for example FeatureStructure or
AnnotationFS.

The CASes above have been produced by many different AAEs with similar, but
slightly different type systems.
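
To make the two cases concrete, here is a minimal plain-Java model of what I
would like to be able to express. It is only a sketch: Fs and FsViews are
hypothetical stand-ins and not the UIMA FeatureStructure API, the org.example
type names are made up, and the type hierarchy is assumed to be available as
a simple acyclic map from type name to parent type name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Plain-Java stand-in for a feature structure (not the UIMA interface). */
final class Fs {
    final String type;
    final int begin;
    final int end;

    Fs(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

final class FsViews {
    /** Case 1: keep only FSes whose type name is known; all others are skipped. */
    static List<Fs> onlyKnown(List<Fs> all, Set<String> knownTypes) {
        List<Fs> out = new ArrayList<>();
        for (Fs fs : all) {
            if (knownTypes.contains(fs.type)) {
                out.add(fs);
            }
        }
        return out;
    }

    /**
     * Case 2: keep FSes whose type is, or inherits from, superType.
     * parentOf maps a type name to its direct parent and is assumed acyclic.
     */
    static List<Fs> asSuperType(List<Fs> all, Map<String, String> parentOf,
                                String superType) {
        List<Fs> out = new ArrayList<>();
        for (Fs fs : all) {
            // Walk up the hierarchy until we hit superType or run out of parents.
            for (String t = fs.type; t != null; t = parentOf.get(t)) {
                if (t.equals(superType)) {
                    out.add(fs);
                    break;
                }
            }
        }
        return out;
    }
}

final class FsViewsDemo {
    public static void main(String[] args) {
        List<Fs> all = Arrays.asList(
                new Fs("org.example.Token", 0, 4),
                new Fs("org.example.DocumentMeta", 0, 0),
                new Fs("org.other.Sentence", 0, 20));

        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("org.example.Token", "uima.tcas.Annotation");
        parentOf.put("org.other.Sentence", "uima.tcas.Annotation");
        parentOf.put("org.example.DocumentMeta", "uima.cas.TOP");
        parentOf.put("uima.tcas.Annotation", "uima.cas.AnnotationBase");
        parentOf.put("uima.cas.AnnotationBase", "uima.cas.TOP");

        Set<String> known = new HashSet<>(Arrays.asList("org.example.Token"));
        System.out.println("known only: " + FsViews.onlyKnown(all, known).size());
        System.out.println("as annotations: "
                + FsViews.asSuperType(all, parentOf, "uima.tcas.Annotation").size());
    }
}
```

In both cases the consumer never needs the full definition of the types it
does not care about, only their names and their place in the hierarchy.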

Jörn
