[jira] Commented: (UIMA-1502) Using getSofaDataStream instead of getDocumentText

2009-08-28 Thread Jérôme Rocheteau (JIRA)

[ 
https://issues.apache.org/jira/browse/UIMA-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748778#action_12748778
 ] 

Jérôme Rocheteau commented on UIMA-1502:


OK to close it with Won't Fix

 Using getSofaDataStream instead of getDocumentText
 --

                Key: UIMA-1502
                URL: https://issues.apache.org/jira/browse/UIMA-1502
            Project: UIMA
         Issue Type: Improvement
         Components: Sandbox-WhitespaceTokenizer
           Reporter: Jérôme Rocheteau
           Priority: Minor
        Attachments: wst.patch

  Original Estimate: 0.17h
 Remaining Estimate: 0.17h

 I would like to know whether it would be better to get the CAS text content by 
 calling the getSofaDataStream method of the CAS class instead of the 
 getDocumentText method.
 CAS sofas can be set by calling the setSofaDataString method 
 (aka setDocumentText), the setSofaDataArray method, or the setSofaDataURI 
 method. However, the getDocumentText method (aka getSofaDataString) only 
 provides the content of CASes whose sofas were set by the first method, 
 whereas the getSofaDataStream method retrieves the content whichever setter 
 was called. A method able to get a String from an InputStream is then 
 needed.
 Am I wrong in thinking it's an Improvement?
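
 The "String from an InputStream" helper mentioned above could look roughly 
 like the sketch below. It is only an illustration: the class and method 
 names (StreamText, readAll) are hypothetical and are not taken from UIMA or 
 from wst.patch, and a ByteArrayInputStream stands in for the stream that 
 getSofaDataStream would return. The charset is passed explicitly so the 
 result does not depend on the platform default, and characters are copied 
 in blocks so line terminators are left untouched.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of UIMA or of wst.patch.
final class StreamText {
    private StreamText() {}

    /** Reads the whole stream and decodes it with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        // try-with-resources closes the reader (and the underlying stream).
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n); // copy chars as-is, newlines included
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a sofa data stream; in an annotator the stream would
        // come from the CAS instead (assumption, not shown here).
        String original = "J\u00e9r\u00f4me\r\nUIMA";
        InputStream in =
            new ByteArrayInputStream(original.getBytes(StandardCharsets.UTF_8));
        String decoded = readAll(in, StandardCharsets.UTF_8);
        System.out.println("round trip preserved text: " + original.equals(decoded));
    }
}
```

 For a sofa set with setSofaDataArray or setSofaDataURI, the right charset 
 depends on how the data was produced, so UTF-8 here is an assumption.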

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (UIMA-1502) Using getSofaDataStream instead of getDocumentText

2009-08-26 Thread Marshall Schor (JIRA)

[ 
https://issues.apache.org/jira/browse/UIMA-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748153#action_12748153
 ] 

Marshall Schor commented on UIMA-1502:
--

Jérôme, ok to close this with Won't Fix?




[jira] Commented: (UIMA-1502) Using getSofaDataStream instead of getDocumentText

2009-08-19 Thread Marshall Schor (JIRA)

[ 
https://issues.apache.org/jira/browse/UIMA-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745170#action_12745170
 ] 

Marshall Schor commented on UIMA-1502:
--

This approach has trade-offs.  On the plus side, it provides a more uniform 
method of handling various kinds of input.  On the minus side, for the very 
common case where the subject of analysis is the text in the CAS's sofa, it 
would introduce considerable inefficiencies.

When the CAS has had some text set into it as the subject of analysis, it is 
stored as a Java String.  If you now ask the CAS for an input stream using 
getSofaDataStream, it has to convert the string into an array of bytes, 
using the UTF-8 character encoding.  This byte array is then wrapped in an 
input stream.  Your patch then reads this stream (mistakenly using the 
platform default character encoding; it should be UTF-8), allocating several 
new buffers along the way.  It could also change the way newlines are 
encoded.

Since the string is already available, this seems like quite a poor approach.

I'm also not convinced that there is a real use-case which needs this kind of 
unification.
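
The round trip described above can be reproduced with plain JDK classes. The 
sketch below is an illustration under stated assumptions, not the code in 
wst.patch: SofaRoundTrip, asUtf8Stream and readLines are hypothetical names, 
ISO-8859-1 stands in for a platform default that is not UTF-8, and the 
line-by-line read is one assumed way in which newline encoding could change.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use or reproduce any UIMA code.
final class SofaRoundTrip {
    private SofaRoundTrip() {}

    /** Mimics the conversion described above: String -> UTF-8 bytes -> stream. */
    static InputStream asUtf8Stream(String sofaText) {
        return new ByteArrayInputStream(sofaText.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Reads the stream line by line. readLine() drops the line terminator,
     * so "\r\n" and "\n" both come back as "\n" in the result.
     */
    static String readLines(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "na\u00efve\r\ntext";

        // Decoding UTF-8 bytes with another charset garbles non-ASCII characters.
        String wrong = readLines(asUtf8Stream(original), StandardCharsets.ISO_8859_1);
        // Even with the right charset, the line-based read rewrites the newlines.
        String right = readLines(asUtf8Stream(original), StandardCharsets.UTF_8);

        System.out.println("wrong charset equals original: " + original.equals(wrong));
        System.out.println("right charset equals original: " + original.equals(right));
    }
}
```

Neither result equals the original string, which matters for an annotator 
because annotation offsets are computed against the sofa text as stored in 
the CAS.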
