Besides very large documents and remote data, another major motivation was
non-text data, such as audio or video.
Eddie

On Fri, Oct 25, 2019 at 1:33 PM Marshall Schor <[email protected]> wrote:

> Hi,
>
> Here's what I vaguely remember were the driving use cases for the sofa as a
> URI.
>
> 1.  The main use case was for applications where the data was so large, it
> would
> be unreasonable to read it all in and save as a string.
>
> 2.  The prohibition on changing a sofa spec (without resetting the CAS)
> exists because a change could let users invalidate the results, in this
> (imagined) scenario:
>
>     a) User creates cas with some sofa data,
>     b) User runs annotators, which create annotations that "point into"
> the sofa
> data
>     c) User changes the sofa spec, to different data, but now all the
> annotations still are pointing into "offsets" in the original data.
>
> You can change the sofa data setting, but only after resetting the CAS.
>
>     Did you have a use case for wanting to change the sofa data without
> resetting the CAS?
>
>
> It sounds like you have another interesting use case:
>
>     a) want to convert the sofa data uri -> a string and have the normal
> getDocumentText etc. work, but
>     b) have the serialization serialize the sofaURI, and not the data
> that's
> present there.
>
> This might be a nice convenience.
>
> I can see a couple of issues:
>   a) it might need to have a good strategy for handling very large data.
> E.g.,
> the convert method might need to include a max string size spec.
>   b) Since the serialization would serialize the annotations, but not the
> data
> (it would only serialize the URI), the data at that URI could easily
> change,
> making the annotation results meaningless.  Perhaps some "fingerprinting"
> (developing a checksum of the data, and serializing that to be able to
> signal if
> that did happen) would be a reasonable protection.
>
> Maybe do a new feature-request issue?
>
> -Marshall
>
> I imagine the JavaDoc for this method would be saying something like: has the
> potential to exceed your memory, at run time, due to the potential size of
> the
> data...
>
>
> On 10/25/2019 12:59 PM, Richard Eckart de Castilho wrote:
> > Hi,
> >
> > On 25. Oct 2019, at 17:53, Marshall Schor <[email protected]> wrote:
> >> One other useful source of examples: the test cases for UIMA, e.g.
> >> search the uimaj-core project's *.java files for "getSofaDataStream".
> > Ok, let me elaborate :)
> >
> > One can use setSofaDataURI(url) to tell the CAS that the sofa data is
> actually external.
> > One can then use getSofaDataStream() to resolve the URL and retrieve the
> > data as a stream.
> >
> > So let's assume I have a CAS containing annotations on a text and the
> text is in an external file:
> >
> >   CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null, null);
> >   cas.setSofaDataURI("file:/path/to/my/file", "text/plain");
> >
> > Works nicely when I use getSofaDataStream() to retrieve the data.
> >
> > But I can't use the "normal" methods like getDocumentText() or
> getCoveredText() at all.
> >
> > Also, I cannot call setSofaDataString(urlContent, "text/plain") - it
> throws an exception
> > because there is already a sofaURI set. This is a major inconvenience.
> >
> > The ClearTK guys came up with an approach that tries to make this a bit
> more convenient:
> >
> > * they introduce a well-known view named "UriView" and set the
> sofaDataURI in that view.
> > * then they use a special reader which looks up the URI in that view,
> resolves it and
> >   drops the content into the sofaDataString of the "_defaultView".
> >
> > That way they get the benefit of the externally stored sofa as well as
> the ability to use
> > the usual methods to access the text.
> >
> > When I looked at setSofaDataURI(), I naively expected that it would be
> resolved the first
> > time I try to access the sofa data (e.g. via getDocumentText()) - but
> that doesn't happen.
> >
> > Then I expected that I would just call getSofaDataStream() and manually
> drop the contents
> > into setSofaDataString() and that this data string would be "transient",
> i.e. not saved
> > into XMI because we already have a sofaDataURI set... but that
> > expectation was also not fulfilled.
> >
> > Could it be useful to introduce some place where we can transiently drop
> data obtained
> > from the sofaDataURI such that methods like getDocumentText() and
> getCoveredText() do
> > something useful but also such that the data is not included when
> serializing the CAS to
> > whatever format?
> >
> > Cheers,
> >
> > -- Richard
>
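
The two safeguards Marshall mentions above (a maximum size when converting the sofa data to a string, and a checksum "fingerprint" to detect that the data behind the URI has changed) could look roughly like the sketch below. It uses only the Java standard library. The class name SofaDataSketch, the helper names readBounded and fingerprint, and the choice of SHA-256 and UTF-8 are assumptions made for illustration and are not part of UIMA. In a real pipeline the InputStream would presumably be the one returned by getSofaDataStream(); here an in-memory stream stands in for it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; none of these names exist in UIMA.
class SofaDataSketch {

    /** Reads the whole stream, but fails once more than maxBytes have been seen. */
    public static byte[] readBounded(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            if (out.size() + n > maxBytes) {
                throw new IOException("sofa data exceeds limit of " + maxBytes + " bytes");
            }
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns the hex-encoded SHA-256 digest of the data (assumed fingerprint scheme). */
    public static String fingerprint(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the stream that cas.getSofaDataStream() would return.
        InputStream in = new ByteArrayInputStream(
                "Some external sofa text".getBytes(StandardCharsets.UTF_8));

        byte[] data = readBounded(in, 1024 * 1024);              // cap at 1 MiB
        String text = new String(data, StandardCharsets.UTF_8);  // assumes UTF-8 content

        System.out.println(text);
        System.out.println(fingerprint(data));
    }
}
```

A fingerprint computed this way could be stored next to the sofa URI when the CAS is serialized and compared again after loading, so that annotations pointing into changed data can at least be flagged.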
