Hi,
On 25. Oct 2019, at 17:53, Marshall Schor <[email protected]> wrote:
>
> One other useful source for examples: the test cases for UIMA, e.g. search
> the uimaj-core project's *.java files for "getSofaDataStream".
Ok, let me elaborate :)
One can use setSofaDataURI(url) to tell the CAS that the sofa data is actually
external.
One can then use getSofaDataStream() to resolve the URL and retrieve the data
as a stream.
So let's assume I have a CAS containing annotations on a text and the text is
in an external file:
CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null,
null);
cas.setSofaDataURI("file:/path/to/my/file", "text/plain");
This works nicely when I use getSofaDataStream() to retrieve the data.
But I can't use the "normal" methods like getDocumentText() or getCoveredText()
at all.
Also, I cannot call setSofaDataString(urlContent, "text/plain"): it throws an
exception because there is already a sofaURI set. This is a major inconvenience.
The ClearTK guys came up with an approach that tries to make this a bit more
convenient:
* they introduce a well-known view named "UriView" and set the sofaDataURI in
  that view.
* then they use a special reader which looks up the URI in that view, resolves
  it and drops the content into the sofaDataString of the default view
  ("_InitialView").
That way they get the benefit of the externally stored sofa as well as the
ability to use the usual methods to access the text.
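
For illustration, the pattern described above could be sketched roughly as
follows. This is not ClearTK's actual code. The class and method names
(UriViewSketch, attachUri, loadTextIntoDefaultView) are made up for this
example, and it assumes the external file is UTF-8 text and that Java 9+ is
available (for InputStream.readAllBytes()):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.uima.cas.CAS;

// Hypothetical helper, only meant to illustrate the "UriView" pattern.
class UriViewSketch {
    static final String URI_VIEW = "UriView";

    // Step 1: record the external location in a dedicated view.
    static void attachUri(CAS cas, String uri) {
        CAS uriView = cas.createView(URI_VIEW);
        uriView.setSofaDataURI(uri, "text/plain");
    }

    // Step 2 (what the "special reader" does): resolve the URI and copy the
    // content into the sofa of the view passed in, e.g. the default view.
    static void loadTextIntoDefaultView(CAS cas) throws IOException {
        CAS uriView = cas.getView(URI_VIEW);
        try (InputStream in = uriView.getSofaDataStream()) {
            // Assumption: the external data is UTF-8 encoded text.
            String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            cas.setDocumentText(text);
        }
    }
}
```

After both steps, getDocumentText() and getCoveredText() work on the default
view, while the URI stays available in "UriView". Note that the copied text
then lives in the default view's sofaDataString, so it is serialized together
with the CAS like any other sofa string.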
When I looked at setSofaDataURI(), I naively expected that the URI would be
resolved the first time I try to access the sofa data (e.g. via
getDocumentText()) - but that doesn't happen.
Then I expected that I could just call getSofaDataStream() and manually drop
the contents into setSofaDataString(), and that this data string would be
"transient", i.e. not saved into XMI because we already have a sofaDataURI
set... but that expectation was also not fulfilled.
Could it be useful to introduce some place where we can transiently drop data
obtained from the sofaDataURI such that methods like getDocumentText() and
getCoveredText() do something useful, but also such that the data is not
included when serializing the CAS to whatever format?
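
To make the idea a bit more concrete, here is a toy model of the behaviour I
have in mind. It uses only the JDK and is not UIMA code: the class
ExternalSofaSketch, its loader parameter and its serialize() method are all
hypothetical stand-ins, and the real CAS and XMI serializer work differently.

```java
import java.net.URI;
import java.util.function.Function;

// Hypothetical model of a sofa with external data and a transient text cache.
class ExternalSofaSketch {
    private final URI dataUri;
    private final String mimeType;
    private final Function<URI, String> loader; // resolves the URI to text
    private String transientText;               // cached, never serialized

    ExternalSofaSketch(URI dataUri, String mimeType, Function<URI, String> loader) {
        this.dataUri = dataUri;
        this.mimeType = mimeType;
        this.loader = loader;
    }

    // Resolves the URI on first access and keeps the result in memory only.
    String getDocumentText() {
        if (transientText == null) {
            transientText = loader.apply(dataUri);
        }
        return transientText;
    }

    String getCoveredText(int begin, int end) {
        return getDocumentText().substring(begin, end);
    }

    // Stand-in for serialization: only the reference is written, not the text.
    String serialize() {
        return "sofaURI=" + dataUri + ";mimeType=" + mimeType;
    }
}
```

The point of the sketch is only the split between what is kept in memory (the
resolved text) and what is written out (the URI and the MIME type).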
Cheers,
-- Richard