Re: [jira] [Updated] (UIMA-2101) CasToInlineXml adds whitespace

Marshall Schor Tue, 29 Mar 2011 06:48:38 -0700


On 3/29/2011 1:49 AM, Richard Eckart de Castilho wrote:
> Hello Marshall,
>
> in a previous comment on the Jira issue, I have stated similar concerns. 
> However, I have to admit that Steven has a point in that the class can 
> greatly facilitate getting things done if you employ a reasonably simple 
> type-system and are sure that you do not have overlapping annotations. 
> Steven's use-case seems to be to import XML data, process it and export it 
> again.
>
> For somebody familiar with XML, all of the listed points should be acceptable 
> - only that feature values longer than 64 chars are truncated seems a bit 
> arbitrary.
>
> As for the DKPro Core XmlWriterInline - I should document the 
> "inaccuracies" and possibly include some sanity checks that log warnings if a 
> CAS contains overlapping annotations and complex feature structures being 
> used as features - just to be sure that novice users are aware that strange 
> things may be happening.


Good idea :-)  -Marshall
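
A rough sketch of the sanity check suggested above, for illustration only. It works on plain [begin, end) offset pairs rather than on a real CAS, and the class and method names (OverlapCheckSketch, hasCrossingSpans) are made up for this example; they are not part of UIMA or DKPro Core. Spans that nest or are identical are fine for inline XML; only spans that cross are reported.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical helper, not part of UIMA or DKPro Core.
class OverlapCheckSketch {

    /** Returns true if any two [begin, end) spans overlap without one nesting in the other. */
    static boolean hasCrossingSpans(int[][] spans) {
        int[][] sorted = spans.clone();
        // Outer spans first: ascending begin, then descending end.
        Arrays.sort(sorted, (a, b) ->
                a[0] != b[0] ? Integer.compare(a[0], b[0]) : Integer.compare(b[1], a[1]));

        // Ends of the spans that are still "open" at the current position.
        Deque<Integer> openEnds = new ArrayDeque<>();
        for (int[] span : sorted) {
            // Close every open span that ends at or before this span begins.
            while (!openEnds.isEmpty() && openEnds.peek() <= span[0]) {
                openEnds.pop();
            }
            // The innermost open span ends before this one does: the two cross.
            if (!openEnds.isEmpty() && span[1] > openEnds.peek()) {
                return true;
            }
            openEnds.push(span[1]);
        }
        return false;
    }

    public static void main(String[] args) {
        int[][] spans = { {0, 5}, {3, 8} };
        if (hasCrossingSpans(spans)) {
            System.err.println("WARNING: overlapping annotations cannot be written as nested XML");
        }
    }
}
```

A writer component could run such a check once per CAS and log a warning instead of failing, which matches the "make novice users aware" intent.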
> Cheers,
>
> Richard
>
> Am 29.03.2011 um 02:46 schrieb Marshall Schor:
>
>> Just to be sure it's well known:
>>
>> The Javadoc for this class indicates that this code only does an
>> "approximate" representation of things.
>>
>> In particular, it says:
>>
>> * Generates an *approximate* inline XML representation of a CAS.
>> * Annotation types are represented as XML tags, features are represented as
>> * attributes.
>> * 
>> * Features whose values are FeatureStructures are not represented.
>> * Feature values which are strings longer than 64 characters are truncated.
>> * Feature values which are arrays of primitives are represented by
>> * strings that look like [ xxx, xxx ]
>> *
>> * The Subject of analysis is presumed to be a text string.
>> *
>> * Some characters in the document's Subject-of-analysis
>> * are replaced by blanks, because the characters aren't valid in xml
>> * documents.
>> *
>> * It doesn't work for annotations which are overlapping, because these cannot
>> * be properly represented as properly - nested XML.
>>
>> Because of these "inaccuracies", are you sure you want to be using this class
>> for your projects?
>>
>> -Marshall
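
To make two of the limitations quoted from the Javadoc concrete, here is a minimal sketch of what "truncate after 64 characters" and "replace invalid characters by blanks" can mean. It is not the CasToInlineXml source: the class and method names are invented for this example, and whether the real class appends a marker after truncating is not stated in the thread. The validity test follows the Char production of XML 1.0.

```java
// Hypothetical illustration, not the actual CasToInlineXml code.
class InlineXmlLimitsSketch {

    // Limit taken from the Javadoc quoted in the thread.
    static final int MAX_ATTR_LENGTH = 64;

    /** Cuts a feature value down to at most 64 characters. */
    static String truncate(String value) {
        return value.length() > MAX_ATTR_LENGTH ? value.substring(0, MAX_ATTR_LENGTH) : value;
    }

    /** Char production of XML 1.0: tab, LF, CR and the three allowed ranges. */
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces invalid characters by blanks while keeping the string length unchanged. */
    static String blankInvalidChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int units = Character.charCount(cp);
            if (isValidXml10Char(cp)) {
                sb.appendCodePoint(cp);
            } else {
                // One blank per UTF-16 unit, so annotation offsets stay valid.
                for (int k = 0; k < units; k++) {
                    sb.append(' ');
                }
            }
            i += units;
        }
        return sb.toString();
    }
}
```

Replacing rather than deleting the invalid characters keeps begin/end offsets aligned with the text, but the output is then no longer an exact copy of the subject of analysis.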
>>
>> On 3/28/2011 8:34 PM, Richard Eckart de Castilho (JIRA) wrote:
>>>     [ 
>>> https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>>>  ]
>>>
>>> Richard Eckart de Castilho updated UIMA-2101:
>>> ---------------------------------------------
>>>
>>>    Attachment: UIMA-2101-eckart-20110329.patch
>>>
>>> In addition to being able to disable formatting - as motivated by Steven - 
>>> I would like to be able to access the SAX events generated from the CAS, so 
>>> I can use a custom transformer in the DKPro Core component XmlWriterInline.
>>>
>>> Added a patch to address the issue. Patch is against SVN trunk rev 1085925 
>>> of the uimaj-core module.
>>>
>>> - Added new method CasToInlineXml.generateXML(CAS, FSMatchConstraint, 
>>> ContentHandler) which allows the user to use a custom transformer or other 
>>> SAX event handler.
>>> - Added new property outputFormatted controlling whether generated XML 
>>> strings are formatted or not. This property does not affect the new 
>>> generateXML(...) method (see above). Per default the property is set to 
>>> true, resembling the state without the patch.
>>> - Added rudimentary test case to check if (not) formatting works. Code 
>>> borrows from XmiCasDeserializerTest.
>>> - Auto-formatted using the UIMA Eclipse code profile, which added a few braces.
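
The idea of feeding SAX events to a handler of one's own choosing can be sketched with JDK classes alone. The example below does not call the patched CasToInlineXml.generateXML(CAS, FSMatchConstraint, ContentHandler); it fires a few SAX events by hand at a javax.xml.transform TransformerHandler with indentation switched off. The class name, the shortened element name "Annotation" and the choice of the JDK serializer instead of UIMA's XMLSerializer are all assumptions made for illustration.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration; element names are shortened stand-ins.
class SaxInlineSketch {

    /** Serializes a one-character document with one annotation, without formatting. */
    static String serializeUnformatted() {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            // Output properties are set before the result is attached.
            handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            // Events that a CAS-to-SAX generator would normally produce.
            handler.startDocument();
            handler.startElement("", "Document", "Document", new AttributesImpl());
            AttributesImpl attrs = new AttributesImpl();
            attrs.addAttribute("", "begin", "begin", "CDATA", "0");
            attrs.addAttribute("", "end", "end", "CDATA", "1");
            handler.startElement("", "Annotation", "Annotation", attrs);
            char[] text = "x".toCharArray();
            handler.characters(text, 0, text.length);
            handler.endElement("", "Annotation", "Annotation");
            handler.endElement("", "Document", "Document");
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XML serialization failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializeUnformatted());
    }
}
```

Because the handler receives only the events it is given, no whitespace is added between adjacent elements, so the character offsets of the text are preserved.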
>>>
>>>
>>>> CasToInlineXml adds whitespace
>>>> ------------------------------
>>>>
>>>>                Key: UIMA-2101
>>>>                URL: https://issues.apache.org/jira/browse/UIMA-2101
>>>>            Project: UIMA
>>>>         Issue Type: Bug
>>>>   Affects Versions: 2.3.1SDK
>>>>           Reporter: Steven Bethard
>>>>        Attachments: UIMA-2101-eckart-20110329.patch
>>>>
>>>>
>>>> CasToInlineXml adds indentation between adjacent XML elements. E.g. for a 
>>>> single character document with a single annotation covering that one 
>>>> character, it will write:
>>>> {noformat}
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <Document>
>>>>    <uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified">
>>>>        <uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation>
>>>>    </uima.tcas.DocumentAnnotation>
>>>> </Document>
>>>> {noformat}
>>>> I think it should instead write everything in a single line, that is:
>>>> {noformat}
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <Document><uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1" language="x-unspecified"><uima.tcas.Annotation sofa="Sofa" begin="0" end="1"> </uima.tcas.Annotation></uima.tcas.DocumentAnnotation></Document>
>>>> {noformat}
>>>> I believe this could be fixed by replacing the line:
>>>> {noformat}
>>>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream);
>>>> {noformat}
>>>> with the line:
>>>> {noformat}
>>>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream, false);
>>>> {noformat}
>>>> I think it's a bug that CasToInlineXml is changing the character offsets, 
>>>> but I would also be happy if there was an alternate constructor or a 
>>>> method on CasToInlineXml that allowed disabling the formatting.
>>> --
>>> This message is automatically generated by JIRA.
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>
>>>
> Richard Eckart de Castilho
>
