Besides very large documents and remote data, another major motivation was non-text data, such as audio or video.

Eddie

On Fri, Oct 25, 2019 at 1:33 PM Marshall Schor <[email protected]> wrote:

> Hi,
>
> Here's what I vaguely remember were the driving use cases for the sofa as a
> URI.
>
> 1. The main use case was for applications where the data was so large, it
>    would be unreasonable to read it all in and save as a string.
>
> 2. The prohibition on changing a sofa spec (without resetting the CAS) was
>    that it has the potential for users to invalidate the results, in this
>    (imagined) scenario:
>
>    a) User creates cas with some sofa data,
>    b) User runs annotators, which create annotations that "point into" the
>       sofa data
>    c) User changes the sofa spec, to different data, but now all the
>       annotations still are pointing into "offsets" in the original data.
>
> You can change the sofa data setting, but only after resetting the CAS.
>
> Did you have a use case for wanting to change the sofa data without
> resetting the CAS?
>
> It sounds like you have another interesting use case:
>
>    a) want to convert the sofa data uri -> a string and have the normal
>       getDocumentText etc. work, but
>    b) have the serialization serialize the sofaURI, and not the data that's
>       present there.
>
> This might be a nice convenience.
>
> I can see a couple of issues:
>
>    a) it might need to have a good strategy for handling very large data.
>       E.g., the convert method might need to include a max string size spec.
>    b) Since the serialization would serialize the annotations, but not the
>       data (it would only serialize the URI), the data at that URI could
>       easily change, making the annotation results meaningless. Perhaps some
>       "fingerprinting" (developing a checksum of the data, and serializing
>       that to be able to signal if that did happen) would be a reasonable
>       protection.
>
> Maybe do a new feature-request issue?
>
> -Marshall
>
> I imagine the JavaDoc for this method would be saying something like: has the
> potential to exceed your memory, at run time, due to the potential size of
> the data...
>
> On 10/25/2019 12:59 PM, Richard Eckart de Castilho wrote:
> > Hi,
> >
> > On 25. Oct 2019, at 17:53, Marshall Schor <[email protected]> wrote:
> >> One other useful source for examples: The test cases for UIMA, e.g.
> >> search the uimaj-core projects *.java files for "getSofaDataStream".
> >
> > Ok, let me elaborate :)
> >
> > One can use setSofaDataURI(url) to tell the CAS that the sofa data is
> > actually external.
> > One can then use getSofaDataStream() to resolve the URL and retrieve the
> > data as a stream.
> >
> > So let's assume I have a CAS containing annotations on a text and the
> > text is in an external file:
> >
> >   CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null, null);
> >   cas.setSofaDataURI("file:/path/to/my/file", "text/plain");
> >
> > Works nice when I use getSofaDataStream() to retrieve the data.
> >
> > But I can't use the "normal" methods like getDocumentText() or
> > getCoveredText() at all.
> >
> > Also, I cannot call setSofaDataString(urlContent, "text/plain") - it
> > throws an exception because there is already a sofaURI set. This is a
> > major inconvenience.
> >
> > The ClearTK guys came up with an approach that tries to make this a bit
> > more convenient:
> >
> > * they introduce a well-known view named "UriView" and set the
> >   sofaDataURI in that view.
> > * then they use a special reader which looks up the URI in that view,
> >   resolves it and drops the content into the sofaDataString of the
> >   "_defaultView".
> >
> > That way they get the benefit of the externally stored sofa as well as
> > the ability to use the usual methods to access the text.
> >
> > When I looked at setSofaDataURI(), I naively expected that it would be
> > resolved the first time I try to access the sofa data (e.g. via
> > getDocumentText()) - but that doesn't happen.
> >
> > Then I expected that I would just call getSofaDataStream() and manually
> > drop the contents into setSofaDataString() and that this data string
> > would be "transient", i.e. not saved into XMI because we already have a
> > setSofaDataURI set... but that expectation was also not fulfilled.
> >
> > Could it be useful to introduce some place where we can transiently drop
> > data obtained from the sofaDataURI such that methods like
> > getDocumentText() and getCoveredText() do something useful but also such
> > that the data is not included when serializing the CAS to whatever
> > format?
> >
> > Cheers,
> >
> > -- Richard
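
To make the two safeguards Marshall mentions a bit more concrete (a max size on the conversion, and a "fingerprint" of the external data), here is a rough sketch in plain Java. It is not UIMA API and not a proposal for the actual implementation: the class name SofaFingerprint and the methods readBounded and sha256Hex are made up for illustration, SHA-256 is just one possible checksum, and the stream is assumed to be whatever getSofaDataStream() would return.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of UIMA: bounded read plus checksum of sofa data.
public class SofaFingerprint {

    /** Reads the whole stream, failing once more than maxBytes have been seen. */
    public static byte[] readBounded(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            if (out.size() + n > maxBytes) {
                throw new IOException("sofa data exceeds limit of " + maxBytes + " bytes");
            }
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns the SHA-256 digest of the data as lowercase hex. */
    public static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the stream obtained from the sofa URI (assumption).
        InputStream external = new ByteArrayInputStream(
                "some sofa text".getBytes(StandardCharsets.UTF_8));

        byte[] data = readBounded(external, 1024);
        String stored = sha256Hex(data); // would be serialized next to the sofaURI
        String text = new String(data, StandardCharsets.UTF_8);

        // Later: the data behind the URI is read again and compared.
        byte[] reread = "other sofa text".getBytes(StandardCharsets.UTF_8);
        boolean stale = !stored.equals(sha256Hex(reread));
        System.out.println(text.length() + " chars read, stale = " + stale);
    }
}
```

The limit here is in bytes rather than characters, since the size is only known before decoding; a "max string size" as Marshall describes it would need a decision on which of the two is meant.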
