Hi,

On 25. Oct 2019, at 17:53, Marshall Schor <m...@schor.com> wrote:
> 
> One other useful source for examples: the test cases for UIMA, e.g. search
> the uimaj-core project's *.java files for "getSofaDataStream".

Ok, let me elaborate :)

One can use setSofaDataURI(url) to tell the CAS that the sofa data is actually
external.
One can then use getSofaDataStream() to resolve the URL and retrieve the data
as a stream.
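
To make the second step concrete, here is a minimal sketch of turning such a
stream into a String. It uses only the JDK: the in-memory stream in main() is a
stand-in for what cas.getSofaDataStream() would return, and the class name
SofaStreams, the helper readAll() and the choice of UTF-8 are my assumptions,
not part of UIMA.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of UIMA.
class SofaStreams {

    /** Reads the whole stream into a String and closes the stream afterwards. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        try (InputStream stream = in) {
            // InputStream.readAllBytes() requires Java 9 or later.
            return new String(stream.readAllBytes(), charset);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for cas.getSofaDataStream(); a real CAS would resolve the sofa URI.
        InputStream in = new ByteArrayInputStream(
                "hello sofa".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in, StandardCharsets.UTF_8));
    }
}
```

The charset has to be supplied by the caller because the sofa URI and MIME type
alone do not tell us how the external file is encoded.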

So let's assume I have a CAS containing annotations on a text and the text is 
in an external file:

  CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null, null);
  cas.setSofaDataURI("file:/path/to/my/file", "text/plain");

This works nicely when I use getSofaDataStream() to retrieve the data.

But I can't use the "normal" methods like getDocumentText() or
getCoveredText() at all.

Also, I cannot call setSofaDataString(urlContent, "text/plain"): it throws an
exception because a sofaURI is already set. This is a major inconvenience.

The ClearTK guys came up with an approach that tries to make this a bit more 
convenient:

* they introduce a well-known view named "UriView" and set the sofaDataURI in
  that view.
* then they use a special reader which looks up the URI in that view, resolves
  it and drops the content into the sofaDataString of the default view
  ("_InitialView").

That way they get the benefit of the externally stored sofa as well as the
ability to use the usual methods to access the text.

When I looked at setSofaDataURI(), I naively expected that the URI would be
resolved the first time I try to access the sofa data (e.g. via
getDocumentText()), but that doesn't happen.

Then I expected that I could just call getSofaDataStream(), manually drop the
contents into setSofaDataString(), and that this data string would be
"transient", i.e. not saved into XMI because a sofaDataURI is already set...
but that expectation was not fulfilled either.

Could it be useful to introduce some place where we can transiently drop data
obtained from the sofaDataURI, such that methods like getDocumentText() and
getCoveredText() do something useful, but also such that the data is not
included when serializing the CAS to whatever format?
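
To illustrate what I mean by "transient", here is a small sketch in plain Java.
It is only an analogy and not a proposal for the actual UIMA API: the class
ExternalSofa, its fields and the roundTrip() helper are made up for this
example, and Java object serialization stands in for XMI or any other CAS
format. The URI is persisted, while the resolved text lives only in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical model of a sofa with external data; not a UIMA class.
class ExternalSofa implements Serializable {
    private static final long serialVersionUID = 1L;

    final String uri;       // persisted: where the data lives
    final String mimeType;  // persisted: what kind of data it is

    // Not persisted: Java serialization skips transient fields.
    private transient String cachedText;

    ExternalSofa(String uri, String mimeType) {
        this.uri = uri;
        this.mimeType = mimeType;
    }

    /** Drops resolved content into the in-memory cache. */
    void setCachedText(String text) {
        this.cachedText = text;
    }

    /** Returns the cached content, or null if nothing has been resolved yet. */
    String getCachedText() {
        return cachedText;
    }

    /** Serializes and deserializes the object to show what survives. */
    static ExternalSofa roundTrip(ExternalSofa sofa)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(sofa);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (ExternalSofa) in.readObject();
        }
    }
}
```

After a round trip the URI and MIME type are still there, but the cached text
is gone and would have to be resolved from the URI again. That is the behavior
I would have liked from the CAS.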

Cheers,

-- Richard
