Hello Marshall,

in a previous comment on the Jira issue, I stated similar concerns. However, I have to admit that Steven has a point: the class can greatly facilitate getting things done if you employ a reasonably simple type system and are sure that you do not have overlapping annotations. Steven's use case seems to be to import XML data, process it, and export it again.
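To make the "no overlapping annotations" condition concrete: inline XML can only express spans that are disjoint or properly nested, so the problematic case is two spans that share text while neither contains the other. The following is a minimal sketch of such a check in plain Java. It deliberately does not use the UIMA API, and the class `OverlapCheck`, its methods, and the `{begin, end}` array layout are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of UIMA or DKPro Core.
final class OverlapCheck {

    /**
     * Returns true if the half-open spans [b1, e1) and [b2, e2) cross,
     * i.e. they share text but neither span contains the other.
     */
    public static boolean crosses(int b1, int e1, int b2, int e2) {
        boolean intersect = b1 < e2 && b2 < e1;
        boolean firstContainsSecond = b1 <= b2 && e2 <= e1;
        boolean secondContainsFirst = b2 <= b1 && e1 <= e2;
        return intersect && !firstContainsSecond && !secondContainsFirst;
    }

    /**
     * Compares every pair of spans (each given as {begin, end}) and
     * returns one warning message per crossing pair.
     */
    public static List<String> findCrossings(int[][] spans) {
        List<String> warnings = new ArrayList<>();
        for (int i = 0; i < spans.length; i++) {
            for (int j = i + 1; j < spans.length; j++) {
                if (crosses(spans[i][0], spans[i][1], spans[j][0], spans[j][1])) {
                    warnings.add("Spans [" + spans[i][0] + "," + spans[i][1]
                            + ") and [" + spans[j][0] + "," + spans[j][1]
                            + ") cross and cannot be nested as XML elements");
                }
            }
        }
        return warnings;
    }
}
```

In a real component the begin/end offsets would presumably come from the annotations in the CAS, and the messages would go to a logger. The pairwise comparison is quadratic, which seems acceptable for a sanity check but is an assumption about typical document sizes.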
For somebody familiar with XML, all of the listed points should be acceptable; only the truncation of feature values longer than 64 characters seems a bit arbitrary.

As for the DKPro Core XmlWriterInline: I would need to document the "inaccuracies" and possibly include some sanity checks that log warnings if a CAS contains overlapping annotations and complex feature structures used as features, just to be sure that novice users are aware that strange things may be happening.

Cheers,

Richard

On 29.03.2011 at 02:46, Marshall Schor wrote:

> Just to be sure it's well known:
>
> The Javadoc for this class indicates that this code only does an "approximate"
> representation of things.
>
> In particular, it says:
>
> * Generates an *approximate* inline XML representation of a CAS.
> * Annotation types are represented as XML tags, features are represented as
> attributes.
> *
> * Features whose values are FeatureStructures are not represented.
> * Feature values which are strings longer than 64 characters are truncated.
> * Feature values which are arrays of primitives are represented by
> * strings that look like [ xxx, xxx ]
> *
> * The Subject of analysis is presumed to be a text string.
> *
> * Some characters in the document's Subject-of-analysis
> * are replaced by blanks, because the characters aren't valid in xml
> documents.
> *
> * It doesn't work for annotations which are overlapping, because these cannot
> * be properly represented as properly - nested XML.
>
> Because of these "inaccuracies" are you sure you want to be using this class
> for your projects?
>
> -Marshall
>
> On 3/28/2011 8:34 PM, Richard Eckart de Castilho (JIRA) wrote:
>> [
>> https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> ]
>>
>> Richard Eckart de Castilho updated UIMA-2101:
>> ---------------------------------------------
>>
>> Attachment: UIMA-2101-eckart-20110329.patch
>>
>> In addition to being able to disable formatting - as motivated by Steven - I
>> would like to be able to access the SAX events generated from the CAS, so I
>> can use a custom transformer in the DKPro Core component XmlWriterInline.
>>
>> Added a patch to address the issue. Patch is against SVN trunk rev 1085925
>> of the uimaj-core module.
>>
>> - Added new method CasToInlineXml.generateXML(CAS, FSMatchConstraint,
>> ContentHandler) which allows the user to use a custom transformer or other
>> SAX event handler.
>> - Added new property outputFormatted controlling whether generated XML
>> strings are formatted or not. This property does not affect the new
>> generateXML(...) method (see above). Per default the property is set to
>> true, resembling the state without the patch.
>> - Added rudimentary test case to check if (not) formatting works. Code
>> borrows from XmiCasDeserializerTest.
>> - Auto-formatted using UIMA Eclipse Code profile added a few braces.
>>
>>
>>> CasToInlineXml adds whitespace
>>> ------------------------------
>>>
>>> Key: UIMA-2101
>>> URL: https://issues.apache.org/jira/browse/UIMA-2101
>>> Project: UIMA
>>> Issue Type: Bug
>>> Affects Versions: 2.3.1SDK
>>> Reporter: Steven Bethard
>>> Attachments: UIMA-2101-eckart-20110329.patch
>>>
>>>
>>> CasToInlineXml adds indentation between adjacent XML elements. E.g.
>>> for a single character document with a single annotation covering that one
>>> character, it will write:
>>> {noformat}
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <Document>
>>> <uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1"
>>> language="x-unspecified">
>>> <uima.tcas.Annotation sofa="Sofa" begin="0" end="1">
>>> </uima.tcas.Annotation>
>>> </uima.tcas.DocumentAnnotation>
>>> </Document>
>>> {noformat}
>>> I think it should instead write everything in a single line, that is:
>>> {noformat}
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <Document><uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1"
>>> language="x-unspecified"><uima.tcas.Annotation sofa="Sofa" begin="0"
>>> end="1"> </uima.tcas.Annotation></uima.tcas.DocumentAnnotation></Document>
>>> {noformat}
>>> I believe this could be fixed by replacing the line:
>>> {noformat}
>>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream);
>>> {noformat}
>>> with the line:
>>> {noformat}
>>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream, false);
>>> {noformat}
>>> I think it's a bug that CasToInlineXml is changing the character offsets,
>>> but I would also be happy if there was an alternate constructor or a method
>>> on CasToInlineXml that allowed disabling the formatting.
>> --
>> This message is automatically generated by JIRA.
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>>

Richard Eckart de Castilho

--
-------------------------------------------------------------------
Richard Eckart de Castilho
Technical Lead
Ubiquitous Knowledge Processing Lab
FB 20 Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone +49 (6151) 16-7477, fax -5455, room S2/02/E225
[email protected]
www.ukp.tu-darmstadt.de
-------------------------------------------------------------------
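
P.S. The effect requested in the quoted issue, serializing SAX events without added indentation, can be illustrated with JDK classes alone. This is only a sketch under stated assumptions: it does not call CasToInlineXml or the patched generateXML(...) method, the class name `UnformattedSaxDemo` and the element name `Annotation` are made up for illustration, and the SAX events are emitted by hand instead of being generated from a CAS. A `TransformerHandler` is a `ContentHandler`, so it is the kind of object a method accepting a ContentHandler could be given.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class; uses only the JDK, not UIMA.
final class UnformattedSaxDemo {

    /** Wraps the text in two inline elements and serializes without indentation. */
    public static String serialize(String text) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();

            // Disable pretty-printing so no whitespace is added between elements.
            Transformer transformer = handler.getTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "no");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            AttributesImpl noAttributes = new AttributesImpl();
            AttributesImpl span = new AttributesImpl();
            span.addAttribute("", "begin", "begin", "CDATA", "0");
            span.addAttribute("", "end", "end", "CDATA", String.valueOf(text.length()));

            // Hand-written SAX events standing in for those derived from a CAS.
            handler.startDocument();
            handler.startElement("", "Document", "Document", noAttributes);
            handler.startElement("", "Annotation", "Annotation", span);
            handler.characters(text.toCharArray(), 0, text.length());
            handler.endElement("", "Annotation", "Annotation");
            handler.endElement("", "Document", "Document");
            handler.endDocument();
            return out.toString();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not serialize SAX events", e);
        }
    }
}
```

Because the serializer adds no whitespace of its own, the character data between the tags is exactly the text that was passed in, which is the property that matters when character offsets have to survive a round trip.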
