Re: [jira] [Updated] (UIMA-2101) CasToInlineXml adds whitespace

Richard Eckart de Castilho Mon, 28 Mar 2011 22:50:13 -0700

Hello Marshall,

In a previous comment on the Jira issue, I stated similar concerns. 
However, I have to admit that Steven has a point: the class can greatly 
facilitate getting things done if you employ a reasonably simple type system 
and are sure that you do not have overlapping annotations. Steven's use case 
seems to be to import XML data, process it, and export it again.


For somebody familiar with XML, all of the listed points should be acceptable - 
only the truncation of feature values longer than 64 characters seems a bit 
arbitrary.

As for the DKPro Core XmlWriterInline - I should document the 
"inaccuracies" and possibly include some sanity checks that log warnings if a 
CAS contains overlapping annotations or complex feature structures used 
as feature values - just to be sure that novice users are aware that strange 
things may be happening.
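
A sanity check for overlapping annotations could look roughly like the sketch 
below. It is only an illustration under stated assumptions: it works on plain 
begin/end offsets rather than on a CAS, and the names OverlapCheck, Span and 
hasCrossingSpans are hypothetical and not part of UIMA or DKPro Core. Two spans 
"cross" when they overlap without one containing the other, which is the case 
that cannot be written as properly nested XML.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of UIMA or DKPro Core.
class OverlapCheck {

    /** Stand-in for an annotation: only the character offsets matter here. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /**
     * Returns true if any two spans overlap without one containing the other,
     * i.e. if the spans cannot be rendered as properly nested XML elements.
     */
    static boolean hasCrossingSpans(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        // Order by begin ascending, then end descending, so that an enclosing
        // span is always visited before the spans it contains.
        sorted.sort((a, b) -> a.begin != b.begin
                ? Integer.compare(a.begin, b.begin)
                : Integer.compare(b.end, a.end));

        // Stack of currently "open" spans; the top is the innermost one.
        Deque<Span> open = new ArrayDeque<>();
        for (Span s : sorted) {
            // Close every span that ends at or before the start of s.
            while (!open.isEmpty() && open.peek().end <= s.begin) {
                open.pop();
            }
            // s starts inside the innermost open span, so it must also end inside it.
            if (!open.isEmpty() && s.end > open.peek().end) {
                return true;
            }
            open.push(s);
        }
        return false;
    }

    public static void main(String[] args) {
        List<Span> spans = Arrays.asList(new Span(0, 5), new Span(3, 8));
        if (hasCrossingSpans(spans)) {
            System.err.println("WARNING: overlapping annotations; inline XML will be inaccurate");
        }
    }
}
```

In a real writer component the spans would come from the annotation index of 
the CAS, and the warning would go through the component's logger instead of 
System.err.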

Cheers,

Richard

Am 29.03.2011 um 02:46 schrieb Marshall Schor:

> Just to be sure it's well known:
> 
> The Javadoc for this class indicates that this code only does an "approximate"
> representation of things.
> 
> In particular, it says:
> 
> * Generates an *approximate* inline XML representation of a CAS.
> * Annotation types are represented as XML tags, features are represented as
> attributes.
> * 
> * Features whose values are FeatureStructures are not represented.
> * Feature values which are strings longer than 64 characters are truncated.
> * Feature values which are arrays of primitives are represented by
> * strings that look like [ xxx, xxx ]
> *
> * The Subject of analysis is presumed to be a text string.
> *
> * Some characters in the document's Subject-of-analysis
> * are replaced by blanks, because the characters aren't valid in xml 
> documents.
> *
> * It doesn't work for annotations which are overlapping, because these cannot
> * be properly represented as properly - nested XML.
> 
> Because of these "inaccuracies" are you sure you want to be using this class 
> for
> your projects?
> 
> -Marshall
> 
> On 3/28/2011 8:34 PM, Richard Eckart de Castilho (JIRA) wrote:
>>     [ 
>> https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>>  ]
>> 
>> Richard Eckart de Castilho updated UIMA-2101:
>> ---------------------------------------------
>> 
>>    Attachment: UIMA-2101-eckart-20110329.patch
>> 
>> In addition to being able to disable formatting - as motivated by Steven - I 
>> would like to be able to access the SAX events generated from the CAS, so I 
>> can use a custom transformer in the DKPro Core component XmlWriterInline.
>> 
>> Added a patch to address the issue. Patch is against SVN trunk rev 1085925 
>> of the uimaj-core module.
>> 
>> - Added new method CasToInlineXml.generateXML(CAS, FSMatchConstraint, 
>> ContentHandler) which allows the user to use a custom transformer or other 
>> SAX event handler.
>> - Added new property outputFormatted controlling whether generated XML 
>> strings are formatted or not. This property does not affect the new 
>> generateXML(...) method (see above). Per default the property is set to 
>> true, resembling the state without the patch.
>> - Added rudimentary test case to check if (not) formatting works. Code 
>> borrows from XmiCasDeserializerTest.
>> - Auto-formatted using the UIMA Eclipse code profile, which added a few braces.
>> 
>> 
>>> CasToInlineXml adds whitespace
>>> ------------------------------
>>> 
>>>                Key: UIMA-2101
>>>                URL: https://issues.apache.org/jira/browse/UIMA-2101
>>>            Project: UIMA
>>>         Issue Type: Bug
>>>   Affects Versions: 2.3.1SDK
>>>           Reporter: Steven Bethard
>>>        Attachments: UIMA-2101-eckart-20110329.patch
>>> 
>>> 
>>> CasToInlineXml adds indentation between adjacent XML elements. E.g. for a 
>>> single character document with a single annotation covering that one 
>>> character, it will write:
>>> {noformat}
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <Document>
>>>    <uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" 
>>> language="x-unspecified">
>>>        <uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> 
>>> </uima.tcas.Annotation>
>>>    </uima.tcas.DocumentAnnotation>
>>> </Document>
>>> {noformat}
>>> I think it should instead write everything in a single line, that is:
>>> {noformat}
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <Document><uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" 
>>> language="x-unspecified"><uima.tcas.Annotation sofa="Sofa" begin="0" 
>>> end="1"> </uima.tcas.Annotation></uima.tcas.DocumentAnnotation></Document>
>>> {noformat}
>>> I believe this could be fixed by replacing the line:
>>> {noformat}
>>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream);
>>> {noformat}
>>> with the line:
>>> {noformat}
>>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream, false);
>>> {noformat}
>>> I think it's a bug that CasToInlineXml is changing the character offsets, 
>>> but I would also be happy if there was an alternate constructor or a method 
>>> on CasToInlineXml that allowed disabling the formatting.
>> --
>> This message is automatically generated by JIRA.
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>> 
>> 

Richard Eckart de Castilho

-- 
------------------------------------------------------------------- 
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab 
FB 20 Computer Science Department      
Technische Universität Darmstadt 
Hochschulstr. 10, D-64289 Darmstadt, Germany 
phone +49 (6151) 16-7477, fax -5455, room S2/02/E225
[email protected] 
www.ukp.tu-darmstadt.de 
-------------------------------------------------------------------
