Hi,

Here's what I vaguely remember were the driving use cases for the sofa as a URI.

1.  The main use case was for applications where the data was so large that it
would be unreasonable to read it all in and save it as a string.

2.  The prohibition on changing a sofa spec (without resetting the CAS) exists
because such a change could invalidate the results, as in this (imagined)
scenario:

    a) User creates a CAS with some sofa data,
    b) User runs annotators, which create annotations that "point into" the sofa
data,
    c) User changes the sofa spec to different data, but all the annotations
are still pointing at "offsets" in the original data.
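
To make the scenario concrete, here is a minimal sketch in plain Java. It uses
no UIMA classes; the class name StaleOffsetsDemo and the helper coveredText are
made up for illustration. It treats an annotation as nothing more than a
begin/end pair and shows what the same offsets cover once the text is swapped:

```java
final class StaleOffsetsDemo {
    /** Hypothetical helper: the text an annotation's offsets cover in a given sofa string. */
    static String coveredText(String sofa, int begin, int end) {
        return sofa.substring(begin, end);
    }

    public static void main(String[] args) {
        // a) sofa data is set, b) an "annotator" records offsets 0..4
        String original = "UIMA stores annotations as offsets.";
        int begin = 0;
        int end = 4;
        System.out.println(coveredText(original, begin, end)); // prints "UIMA"

        // c) the sofa data is replaced, but the offsets are not recomputed
        String replaced = "Offsets now point elsewhere.";
        System.out.println(coveredText(replaced, begin, end)); // prints "Offs"
    }
}
```

The offsets are still "valid" in the sense that nothing throws, which is what
makes this kind of change easy to miss.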

You can change the sofa data setting, but only after resetting the CAS. 

    Did you have a use case for wanting to change the sofa data without
resetting the CAS?


It sounds like you have another interesting use case:

    a) want to convert the sofa data uri -> a string and have the normal
getDocumentText etc. work, but
    b) have the serialization serialize the sofaURI, and not the data that's
present there.

This might be a nice convenience.

I can see a couple of issues:
  a) it might need to have a good strategy for handling very large data.  E.g.,
the convert method might need to include a max string size spec.
  b) Since the serialization would serialize the annotations, but not the data
(it would only serialize the URI), the data at that URI could easily change,
making the annotation results meaningless.  Perhaps some "fingerprinting"
(computing a checksum of the data, and serializing that, so that a change can
be detected) would be a reasonable protection.
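
A rough sketch of both ideas, again in plain Java and not anything that exists
in UIMA today. The class SofaFingerprintSketch and the methods readBounded and
sha256Hex are hypothetical names, and SHA-256 is just one reasonable choice of
checksum. The first method refuses to read more than a caller-supplied limit;
the second produces a fingerprint that could be stored next to the URI:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SofaFingerprintSketch {
    /** Reads the whole stream, but fails once more than maxBytes have been seen. */
    static byte[] readBounded(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            // Use long arithmetic so the size check cannot overflow.
            if ((long) out.size() + n > maxBytes) {
                throw new IOException("sofa data exceeds " + maxBytes + " bytes");
            }
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns the SHA-256 digest of the data as a lowercase hex string. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // At serialization time: read the external data and remember its fingerprint.
        byte[] before = readBounded(
                new ByteArrayInputStream("some sofa text".getBytes(StandardCharsets.UTF_8)), 1024);
        String saved = sha256Hex(before);

        // At reload time: the data behind the URI has changed in the meantime.
        byte[] after = "some changed text".getBytes(StandardCharsets.UTF_8);
        System.out.println("data unchanged: " + saved.equals(sha256Hex(after)));
    }
}
```

A mismatch only signals that the data changed; what to do then (fail, warn, or
drop the annotations) would still need to be decided.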

Maybe do a new feature-request issue?

-Marshall

I imagine the JavaDoc for this method would say something like: has the
potential to exceed your memory, at run time, due to the potential size of the
data...


On 10/25/2019 12:59 PM, Richard Eckart de Castilho wrote:
> Hi,
>
> On 25. Oct 2019, at 17:53, Marshall Schor <m...@schor.com> wrote:
>> One other useful source for examples:  The test cases for UIMA, e.g. search
>> the uimaj-core project's *.java files for "getSofaDataStream".
> Ok, let me elaborate :)
>
> One can use setSofaDataURI(url) to tell the CAS that the sofa data is
> actually external.
> One can then use getSofaDataStream() to resolve the URL and retrieve the data
> as a stream.
>
> So let's assume I have a CAS containing annotations on a text and the text is 
> in an external file:
>
>   CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null, null);
>   cas.setSofaDataURI("file:/path/to/my/file", "text/plain");
>
> Works nice when I use getSofaDataStream() to retrieve the data. 
>
> But I can't use the "normal" methods like getDocumentText() or 
> getCoveredText() at all.
>
> Also, I cannot call setSofaDataString(urlContent, "text/plain") - it throws
> an exception because there is already a sofaURI set. This is a major
> inconvenience.
>
> The ClearTK guys came up with an approach that tries to make this a bit more 
> convenient:
>
> * they introduce a well-known view named "UriView" and set the sofaDataURI in
>   that view.
> * then they use a special reader which looks up the URI in that view,
>   resolves it and drops the content into the sofaDataString of the
>   "_defaultView".
>
> That way they get the benefit of the externally stored sofa as well as the
> ability to use the usual methods to access the text.
>
> When I looked at setSofaDataURI(), I naively expected that it would be
> resolved the first time I try to access the sofa data (e.g. via
> getDocumentText()) - but that doesn't happen.
>
> Then I expected that I would just call getSofaDataStream() and manually drop
> the contents into setSofaDataString() and that this data string would be
> "transient", i.e. not saved into XMI because we already have a setSofaDataURI
> set... but that expectation was also not fulfilled.
>
> Could it be useful to introduce some place where we can transiently drop data
> obtained from the sofaDataURI such that methods like getDocumentText() and
> getCoveredText() do something useful but also such that the data is not
> included when serializing the CAS to whatever format?
>
> Cheers,
>
> -- Richard
