On 3/29/2011 1:49 AM, Richard Eckart de Castilho wrote:
> Hello Marshall,
>
> in a previous comment to the Jira issue, I have stated similar concerns.
> However, I have to admit that Steven has a point in that the class can
> greatly facilitate getting things done if you employ a reasonably simple
> type-system and are sure that you do not have overlapping annotations.
> Steven's use-case seems to be to import XML data, process it and export it
> again.
>
> For somebody familiar with XML, all of the listed points should be acceptable
> - only that feature values longer than 64 chars are truncated seems a bit
> arbitrary.
>
> As for the DKPro Core XmlWriterInline - I should document the
> "inaccuracies" and possibly include some sanity checks that log warnings if a
> CAS contains overlapping annotations and complex feature structures being
> used as features - just to be sure that novice users are aware that strange
> things may be happening.
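A sanity check of the kind described above could look roughly like the sketch below. It is plain Java and deliberately does not use the UIMA API: the class name OverlapCheck, the method hasCrossingSpans and the {begin, end} int pairs are all hypothetical stand-ins for annotation offsets. It only flags spans that cross each other (one starts inside another and ends outside it), since those are the ones that cannot be written as properly nested XML.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of UIMA or DKPro Core.
final class OverlapCheck {

  /**
   * Returns true if any two spans cross, i.e. overlap without one
   * containing the other. Each span is an int[]{begin, end}.
   */
  static boolean hasCrossingSpans(List<int[]> spans) {
    List<int[]> sorted = new ArrayList<>(spans);
    // Outer spans first: begin ascending, then end descending.
    sorted.sort((a, b) -> a[0] != b[0]
        ? Integer.compare(a[0], b[0])
        : Integer.compare(b[1], a[1]));

    Deque<int[]> open = new ArrayDeque<>();
    for (int[] span : sorted) {
      // Close every span that ends at or before the start of this one.
      while (!open.isEmpty() && open.peek()[1] <= span[0]) {
        open.pop();
      }
      // The innermost open span must fully contain the new one.
      if (!open.isEmpty() && open.peek()[1] < span[1]) {
        return true;
      }
      open.push(span);
    }
    return false;
  }
}
```

A writer component could call such a check before serializing and log a warning when it returns true, rather than failing outright.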
Good idea :-) -Marshall
> Cheers,
>
> Richard
>
> On 29.03.2011 at 02:46, Marshall Schor wrote:
>
>> Just to be sure it's well known:
>>
>> The Javadoc for this class indicates that this code only does an
>> "approximate" representation of things.
>>
>> In particular, it says:
>>
>> * Generates an *approximate* inline XML representation of a CAS.
>> * Annotation types are represented as XML tags, features are represented as
>> attributes.
>> *
>> * Features whose values are FeatureStructures are not represented.
>> * Feature values which are strings longer than 64 characters are truncated.
>> * Feature values which are arrays of primitives are represented by
>> * strings that look like [ xxx, xxx ]
>> *
>> * The Subject of analysis is presumed to be a text string.
>> *
>> * Some characters in the document's Subject-of-analysis
>> * are replaced by blanks, because the characters aren't valid in xml
>> documents.
>> *
>> * It doesn't work for annotations which are overlapping, because these cannot
>> * be properly represented as properly - nested XML.
>>
>> Because of these "inaccuracies" are you sure you want to be using this class
>> for your projects?
>>
>> -Marshall
>>
>> On 3/28/2011 8:34 PM, Richard Eckart de Castilho (JIRA) wrote:
>>> [
>>> https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>>> ]
>>>
>>> Richard Eckart de Castilho updated UIMA-2101:
>>> ---------------------------------------------
>>>
>>> Attachment: UIMA-2101-eckart-20110329.patch
>>>
>>> In addition to being able to disable formatting - as motivated by Steven -
>>> I would like to be able to access the SAX events generated from the CAS, so
>>> I can use a custom transformer in the DKPro Core component XmlWriterInline.
>>>
>>> Added a patch to address the issue. Patch is against SVN trunk rev 1085925
>>> of the uimaj-core module.
>>>
>>> - Added new method CasToInlineXml.generateXML(CAS, FSMatchConstraint,
>>> ContentHandler) which allows the user to use a custom transformer or other
>>> SAX event handler.
>>> - Added new property outputFormatted controlling whether generated XML
>>> strings are formatted or not. This property does not affect the new
>>> generateXML(...) method (see above). Per default the property is set to
>>> true, resembling the state without the patch.
>>> - Added rudimentary test case to check if (not) formatting works. Code
>>> borrows from XmiCasDeserializerTest.
>>> - Auto-formatted using the UIMA Eclipse Code profile, which added a few
>>> braces.
>>>
>>>
>>>> CasToInlineXml adds whitespace
>>>> ------------------------------
>>>>
>>>> Key: UIMA-2101
>>>> URL: https://issues.apache.org/jira/browse/UIMA-2101
>>>> Project: UIMA
>>>> Issue Type: Bug
>>>> Affects Versions: 2.3.1SDK
>>>> Reporter: Steven Bethard
>>>> Attachments: UIMA-2101-eckart-20110329.patch
>>>>
>>>>
>>>> CasToInlineXml adds indentation between adjacent XML elements. E.g. for a
>>>> single character document with a single annotation covering that one
>>>> character, it will write:
>>>> {noformat}
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <Document>
>>>> <uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1"
>>>> language="x-unspecified">
>>>> <uima.tcas.Annotation sofa="Sofa" begin="0" end="1">
>>>> </uima.tcas.Annotation>
>>>> </uima.tcas.DocumentAnnotation>
>>>> </Document>
>>>> {noformat}
>>>> I think it should instead write everything in a single line, that is:
>>>> {noformat}
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <Document><uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1"
>>>> language="x-unspecified"><uima.tcas.Annotation sofa="Sofa" begin="0"
>>>> end="1"> </uima.tcas.Annotation></uima.tcas.DocumentAnnotation></Document>
>>>> {noformat}
>>>> I believe this could be fixed by replacing the line:
>>>> {noformat}
>>>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream);
>>>> {noformat}
>>>> with the line:
>>>> {noformat}
>>>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream, false);
>>>> {noformat}
>>>> I think it's a bug that CasToInlineXml is changing the character offsets,
>>>> but I would also be happy if there was an alternate constructor or a
>>>> method on CasToInlineXml that allowed disabling the formatting.
>>> --
>>> This message is automatically generated by JIRA.
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>
>>>
> Richard Eckart de Castilho
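
For reference, the effect discussed in the issue (SAX events written out with and without formatting) can be reproduced with nothing but the JDK. The sketch below is only an illustration under stated assumptions: it does not call CasToInlineXml or the UIMA XMLSerializer, the class InlineXmlSketch and the element name Annotation are made up, and the SAX events are emitted by hand where the patched generateXML(...) method would presumably supply them to the ContentHandler.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration, not UIMA code.
final class InlineXmlSketch {

  /** Emits a one-character document with one annotation as SAX events. */
  static void emit(ContentHandler handler) throws SAXException {
    handler.startDocument();
    handler.startElement("", "Document", "Document", new AttributesImpl());
    AttributesImpl attrs = new AttributesImpl();
    attrs.addAttribute("", "begin", "begin", "CDATA", "0");
    attrs.addAttribute("", "end", "end", "CDATA", "1");
    handler.startElement("", "Annotation", "Annotation", attrs);
    handler.characters(new char[] {'x'}, 0, 1);
    handler.endElement("", "Annotation", "Annotation");
    handler.endElement("", "Document", "Document");
    handler.endDocument();
  }

  /** Serializes the events; indent=false keeps character offsets intact. */
  static String serialize(boolean indent) {
    try {
      SAXTransformerFactory factory =
          (SAXTransformerFactory) TransformerFactory.newInstance();
      TransformerHandler handler = factory.newTransformerHandler();
      handler.getTransformer()
          .setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
      handler.getTransformer()
          .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      handler.setResult(new StreamResult(out));
      emit(handler);
      return out.toString();
    } catch (Exception e) {
      throw new IllegalStateException("XML serialization failed", e);
    }
  }
}
```

With indent set to false, no whitespace is added between adjacent elements, so the text between the tags stays at the offsets recorded in begin and end. The exact layout produced with indent set to true depends on the JDK's built-in serializer.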
