Just to be sure it's well known: the Javadoc for this class says that it generates only an "approximate" representation of a CAS.
In particular, it says:

 * Generates an *approximate* inline XML representation of a CAS.
 * Annotation types are represented as XML tags, features are represented as attributes.
 *
 * Features whose values are FeatureStructures are not represented.
 * Feature values which are strings longer than 64 characters are truncated.
 * Feature values which are arrays of primitives are represented by
 * strings that look like [ xxx, xxx ]
 *
 * The Subject of analysis is presumed to be a text string.
 *
 * Some characters in the document's Subject-of-analysis
 * are replaced by blanks, because the characters aren't valid in xml documents.
 *
 * It doesn't work for annotations which are overlapping, because these cannot
 * be properly represented as properly - nested XML.

Because of these "inaccuracies" are you sure you want to be using this class for your projects?

-Marshall

On 3/28/2011 8:34 PM, Richard Eckart de Castilho (JIRA) wrote:
> [
> https://issues.apache.org/jira/browse/UIMA-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Richard Eckart de Castilho updated UIMA-2101:
> ---------------------------------------------
>
> Attachment: UIMA-2101-eckart-20110329.patch
>
> In addition to being able to disable formatting - as motivated by Steven - I
> would like to be able to access the SAX events generated from the CAS, so I
> can use a custom transformer in the DKPro Core component XmlWriterInline.
>
> Added a patch to address the issue. Patch is against SVN trunk rev 1085925 of
> the uimaj-core module.
>
> - Added new method CasToInlineXml.generateXML(CAS, FSMatchConstraint,
> ContentHandler) which allows the user to use a custom transformer or other
> SAX event handler.
> - Added new property outputFormatted controlling whether generated XML
> strings are formatted or not. This property does not affect the new
> generateXML(...) method (see above). Per default the property is set to true,
> resembling the state without the patch.
> - Added rudimentary test case to check if (not) formatting works. Code
> borrows from XmiCasDeserializerTest.
> - Auto-formatted using UIMA Eclipse Code profile added a few braces.
>
>
>> CasToInlineXml adds whitespace
>> ------------------------------
>>
>> Key: UIMA-2101
>> URL: https://issues.apache.org/jira/browse/UIMA-2101
>> Project: UIMA
>> Issue Type: Bug
>> Affects Versions: 2.3.1SDK
>> Reporter: Steven Bethard
>> Attachments: UIMA-2101-eckart-20110329.patch
>>
>>
>> CasToInlineXml adds indentation between adjacent XML elements. E.g. for a
>> single character document with a single annotation covering that one
>> character, it will write:
>> {noformat}
>> <?xml version="1.0" encoding="UTF-8"?>
>> <Document>
>> <uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1"
>> language="x-unspecified">
>> <uima.tcas.Annotation sofa="Sofa" begin="0" end="1">
>> </uima.tcas.Annotation>
>> </uima.tcas.DocumentAnnotation>
>> </Document>
>> {noformat}
>> I think it should instead write everything in a single line, that is:
>> {noformat}
>> <?xml version="1.0" encoding="UTF-8"?>
>> <Document><uima.tcas.DocumentAnnotation sofa="Sofa" begin="0" end="1"
>> language="x-unspecified"><uima.tcas.Annotation sofa="Sofa" begin="0"
>> end="1"> </uima.tcas.Annotation></uima.tcas.DocumentAnnotation></Document>
>> {noformat}
>> I believe this could be fixed by replacing the line:
>> {noformat}
>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream);
>> {noformat}
>> with the line:
>> {noformat}
>> XMLSerializer sax2xml = new XMLSerializer(byteArrayOutputStream, false);
>> {noformat}
>> I think it's a bug that CasToInlineXml is changing the character offsets,
>> but I would also be happy if there was an alternate constructor or a method
>> on CasToInlineXml that allowed disabling the formatting.
> --
> This message is automatically generated by JIRA.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
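
For anyone who wants to see the formatting effect from the quoted issue in isolation, here is a minimal sketch that uses only the JDK. It does not touch UIMA at all: the class name InlineXmlSketch and the method serialize are hypothetical, the SAX events are emitted by hand as a stand-in for whatever a CAS would produce, and the JDK's TransformerHandler is swapped in for UIMA's XMLSerializer. The point is only that the same SAX events can be sent to a caller-supplied ContentHandler, and that the handler's indent setting decides whether extra whitespace ends up between the tags.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration class; not part of UIMA.
class InlineXmlSketch {

  // Sends a fixed sequence of SAX events (one annotation covering the
  // one-character document "x") to a JDK TransformerHandler and returns
  // the serialized XML. The indent flag toggles OutputKeys.INDENT.
  static String serialize(boolean indent) {
    try {
      SAXTransformerFactory factory =
          (SAXTransformerFactory) TransformerFactory.newInstance();
      TransformerHandler handler = factory.newTransformerHandler();
      handler.getTransformer()
          .setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
      handler.getTransformer()
          .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

      StringWriter out = new StringWriter();
      handler.setResult(new StreamResult(out));

      // Attributes of the single annotation element.
      AttributesImpl attrs = new AttributesImpl();
      attrs.addAttribute("", "begin", "begin", "CDATA", "0");
      attrs.addAttribute("", "end", "end", "CDATA", "1");

      handler.startDocument();
      handler.startElement("", "Document", "Document", new AttributesImpl());
      handler.startElement("", "Annotation", "Annotation", attrs);
      handler.characters("x".toCharArray(), 0, 1);
      handler.endElement("", "Annotation", "Annotation");
      handler.endElement("", "Document", "Document");
      handler.endDocument();
      return out.toString();
    } catch (Exception e) {
      throw new IllegalStateException("serialization failed", e);
    }
  }

  public static void main(String[] args) {
    System.out.println("unformatted: " + serialize(false));
    System.out.println("formatted:   " + serialize(true));
  }
}
```

With indent set to false the tags come out adjacent to each other, so the only text in the output is the document text itself. With indent set to true the serializer is free to insert line breaks and indentation, and the exact layout depends on the JDK version, which is the kind of drift in character offsets the issue complains about.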
