Re: Use of CASes with sofaURI?
Besides very large documents and remote data, another major motivation was for non-text data, such as audio or video.

Eddie

On Fri, Oct 25, 2019 at 1:33 PM Marshall Schor wrote:
> Here's what I vaguely remember were the driving use cases for the sofa as a URI.
>
> 1. The main use case was for applications where the data was so large that it
> would be unreasonable to read it all in and save it as a string.
Re: Use of CASes with sofaURI?
Hi,

Here's what I vaguely remember were the driving use cases for the sofa as a URI.

1. The main use case was for applications where the data was so large that it would be unreasonable to read it all in and save it as a string.

2. The prohibition on changing a sofa spec (without resetting the CAS) exists because such a change has the potential to invalidate the results, in this (imagined) scenario:

   a) User creates a CAS with some sofa data,
   b) User runs annotators, which create annotations that "point into" the sofa data,
   c) User changes the sofa spec to different data, but now all the annotations are still pointing at offsets in the original data.

You can change the sofa data setting, but only after resetting the CAS.

Did you have a use case for wanting to change the sofa data without resetting the CAS?

It sounds like you have another interesting use case:

   a) want to convert the sofa data URI -> a string and have the normal getDocumentText etc. work, but
   b) have the serialization serialize the sofaURI, and not the data that's present there.

This might be a nice convenience.

I can see a couple of issues:

   a) It might need a good strategy for handling very large data. E.g., the convert method might need to include a max string size spec.
   b) Since the serialization would serialize the annotations, but not the data (it would only serialize the URI), the data at that URI could easily change, making the annotation results meaningless. Perhaps some "fingerprinting" (developing a checksum of the data, and serializing that to be able to signal if that did happen) would be a reasonable protection.

Maybe do a new feature-request issue?

-Marshall

Imagine the JavaDoc for this method would be saying something like: has the potential to exceed your memory, at run time, due to the potential size of the data...

On 10/25/2019 12:59 PM, Richard Eckart de Castilho wrote:
> Could it be useful to introduce some place where we can transiently drop data
> obtained from the sofaDataURI such that methods like getDocumentText() and
> getCoveredText() do something useful but also such that the data is not
> included when serializing the CAS to whatever format?
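To make the two issues raised above a bit more concrete, here is a minimal sketch in plain Java, with no UIMA dependency. It assumes the sofa data is available as an InputStream (for example, whatever getSofaDataStream() returns). The class name SofaDataGuard, the method names, and the choice of SHA-256 as the checksum are all assumptions made for illustration; nothing like this exists in UIMA as far as this thread indicates.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of UIMA.
final class SofaDataGuard {

    /** Reads the whole stream, but fails once more than maxBytes have arrived. */
    static byte[] readBounded(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            if (out.size() + n > maxBytes) {
                throw new IOException("sofa data exceeds limit of " + maxBytes + " bytes");
            }
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Hex-encoded SHA-256 digest of the data, usable as a fingerprint. */
    static String fingerprint(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** True if the data still matches a fingerprint recorded earlier. */
    static boolean matches(byte[] data, String storedFingerprint) {
        return fingerprint(data).equals(storedFingerprint);
    }
}
```

The idea would be to store the fingerprint next to the serialized URI and to call matches(...) after re-reading the data, so that a changed document can at least be detected before the annotation offsets are trusted.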
Re: Use of CASes with sofaURI?
Hi,

On 25. Oct 2019, at 17:53, Marshall Schor wrote:
>
> One other useful source for examples: the test cases for UIMA, e.g. search
> the uimaj-core project's *.java files for "getSofaDataStream".

Ok, let me elaborate :)

One can use setSofaDataURI(url) to tell the CAS that the sofa data is actually external.
One can then use getSofaDataStream() to resolve the URL and retrieve the data as a stream.

So let's assume I have a CAS containing annotations on a text and the text is in an external file:

    CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null, null);
    cas.setSofaDataURI("file:/path/to/my/file", "text/plain");

This works nicely when I use getSofaDataStream() to retrieve the data.

But I can't use the "normal" methods like getDocumentText() or getCoveredText() at all.

Also, I cannot call setSofaDataString(urlContent, "text/plain") - it throws an exception because there is already a sofaURI set. This is a major inconvenience.

The ClearTK guys came up with an approach that tries to make this a bit more convenient:

* they introduce a well-known view named "UriView" and set the sofaDataURI in that view.
* then they use a special reader which looks up the URI in that view, resolves it and
  drops the content into the sofaDataString of the "_defaultView".

That way they get the benefit of the externally stored sofa as well as the ability to use the usual methods to access the text.

When I looked at setSofaDataURI(), I naively expected that the URI would be resolved the first time I try to access the sofa data (e.g. via getDocumentText()) - but that doesn't happen.

Then I expected that I could just call getSofaDataStream() and manually drop the contents into setSofaDataString(), and that this data string would be "transient", i.e. not saved into XMI because we already have a sofaDataURI set... but that expectation was also not fulfilled.

Could it be useful to introduce some place where we can transiently drop data obtained from the sofaDataURI such that methods like getDocumentText() and getCoveredText() do something useful but also such that the data is not included when serializing the CAS to whatever format?

Cheers,

-- Richard
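The "transient" behaviour asked about above can be pictured with a small toy model in plain Java. This is deliberately not UIMA code and does not describe how the CAS works today: the class TransientSofa, its loader parameter, and the one-line "serialization" format are all hypothetical, invented only to show the intended contract (text is resolved lazily on first access, and only the URI is written out).

```java
import java.util.function.Supplier;

// Toy model of the proposal; hypothetical, not a UIMA class.
final class TransientSofa {
    private final String uri;
    private final Supplier<String> loader; // resolves the URI to text (assumed to be supplied by the caller)
    private String cachedText;             // transient: held in memory only, never serialized

    TransientSofa(String uri, Supplier<String> loader) {
        this.uri = uri;
        this.loader = loader;
    }

    /** Resolves the external data on first access and caches it. */
    String getDocumentText() {
        if (cachedText == null) {
            cachedText = loader.get();
        }
        return cachedText;
    }

    /** Offsets index into the lazily loaded text. */
    String getCoveredText(int begin, int end) {
        return getDocumentText().substring(begin, end);
    }

    /** Writes out the reference only; the cached text is left out on purpose. */
    String serialize() {
        return "sofaURI=" + uri;
    }
}
```

With a loader that returns "Hello world", getCoveredText(0, 5) yields "Hello" while serialize() still contains only the URI.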
Re: Use of CASes with sofaURI?
One other useful source for examples: the test cases for UIMA, e.g. search the uimaj-core project's *.java files for "getSofaDataStream".

-Marshall

On 10/24/2019 6:11 PM, Richard Eckart de Castilho wrote:
> does somebody have an example of how to work with CASes where the sofa
> data is not set using setDocumentText() but rather using setSofaDataURI(...)?
Re: Use of CASes with sofaURI?
Hi,

Not my area of expertise, but the docs

http://uima.apache.org/d/uimaj-current/tutorials_and_users_guides.html#ugr.tug.aas.accessing_sofa_data

say that if you're using a URI, then you use cas.getSofaDataURI(), which returns a string representation of the URI. To get the data, the docs say you need to set up some standard Java I/O.

There's also a special CAS method, getSofaDataStream, which returns an input stream and works with both local and remote data.

-Marshall

On 10/24/2019 6:11 PM, Richard Eckart de Castilho wrote:
> does somebody have an example of how to work with CASes where the sofa
> data is not set using setDocumentText() but rather using setSofaDataURI(...)?
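For the "standard Java I/O" route mentioned above, a minimal sketch might look like the following. It takes the URI as a string (such as the value returned by cas.getSofaDataURI()) and reads the content with JDK classes only. The class name SofaUriReader is made up for this example, and decoding as UTF-8 is an assumption; the actual encoding of the data has to be known by the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of UIMA.
final class SofaUriReader {

    /** Resolves the URI string and returns its content decoded as UTF-8 text. */
    static String readText(String sofaUri) throws IOException {
        try (InputStream in = URI.create(sofaUri).toURL().openStream()) {
            // readAllBytes() requires Java 9 or later and loads everything into memory.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

Note that this reads the entire resource into memory, which is exactly what the URI mechanism was meant to avoid for very large data, so it only suits documents of modest size.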
Use of CASes with sofaURI?
Hi there,

does somebody have an example of how to work with CASes where the sofa data is not set using setDocumentText() but rather using setSofaDataURI(...)?

It looks like the CAS text is then not accessible via the usual means:

    CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null, null);
    cas.setSofaDataURI("https://www.apache.org/licenses/LICENSE-2.0.txt", "text/plain");
    CasIOUtils.save(cas, System.out, SerialFormat.XMI);
    System.out.println(cas.getDocumentText());   // -> prints "null"
    System.out.println(cas.getSofaDataString()); // -> prints "null"

Apparently, one needs to call getSofaDataStream() - but even after calling that, getDocumentAnnotation().getCoveredText() returns null.

So how is one expected to work with CASes that are using this data URI concept?

Cheers,

-- Richard