[ https://issues.apache.org/jira/browse/TIKA-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342031#comment-14342031 ]
Tyler Palsulich commented on TIKA-579:
--------------------------------------

+1. DC tags should be put into the Metadata. This is still a problem with 1.8-SNAPSHOT.

> DcXMLParser: DC metadata text in extracted body
> -----------------------------------------------
>
>                 Key: TIKA-579
>                 URL: https://issues.apache.org/jira/browse/TIKA-579
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.8
>         Environment: N/A
>            Reporter: Scott Severtson
>
> The DcXMLParser correctly extracts Dublin Core metadata text into the
> Metadata object, but the metadata text is also included in the extracted "body".
> Sample XML document:
> ---
> <?xml version="1.0" encoding="UTF-8"?>
> <a xmlns:dc="http://purl.org/dc/elements/1.1/">
>   <dc:title>This is the title</dc:title>
>   <dc:creator>Scott Severtson</dc:creator>
>   <dc:subject>This is the subject</dc:subject>
>   <b>This is the body text.</b>
> </a>
> ---
> Sample code:
> ---
> URL xmlDocument = ...
> TikaConfig tikaConfig = new TikaConfig();
> ParseUtils.getStringContent(xmlDocument, tikaConfig, "application/xml");
> ---
> Actual output:
> ---
> This is the title
> Scott Severtson
> This is the subject
> This is the body text.
> ---
> Expected output:
> ---
> This is the body text.
> ---
> The output is the same when using ParseUtils *and* when using DcXMLParser
> directly with a ContentHandler. The ContentHandler receives a single text
> node containing the concatenated metadata and body text, so there is no
> opportunity to work around this issue externally. We would expect DcXMLParser
> to remove DC nodes from the body prior to extracting the body text, to be
> more consistent with other Tika parsers.
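
The expected behaviour described in the issue can be illustrated with a minimal sketch that uses only the JDK's SAX API. This is not Tika's DcXMLParser and not a proposed patch. The class names DcSplitSketch and SplitHandler are hypothetical, and a plain map stands in for Tika's Metadata object. It only shows the idea of routing text inside Dublin Core elements to metadata while keeping it out of the body.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DcSplitSketch {
    // Dublin Core elements namespace, as declared in the sample document.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Hypothetical handler: text inside DC elements goes to 'metadata',
    // all other text goes to 'body'.
    static class SplitHandler extends DefaultHandler {
        final Map<String, String> metadata = new LinkedHashMap<>();
        final StringBuilder body = new StringBuilder();
        private final StringBuilder dcText = new StringBuilder();
        private int dcDepth = 0; // > 0 while inside a DC element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (dcDepth > 0 || DC_NS.equals(uri)) {
                if (dcDepth == 0) {
                    dcText.setLength(0); // entering a new top-level DC element
                }
                dcDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // When the outermost DC element closes, record its text as metadata.
            if (dcDepth > 0 && --dcDepth == 0) {
                metadata.put(localName, dcText.toString().trim());
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            (dcDepth > 0 ? dcText : body).append(ch, start, length);
        }
    }

    static SplitHandler parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so 'uri' and 'localName' are populated
        SplitHandler handler = new SplitHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<a xmlns:dc=\"http://purl.org/dc/elements/1.1/\">\n"
                + "  <dc:title>This is the title</dc:title>\n"
                + "  <dc:creator>Scott Severtson</dc:creator>\n"
                + "  <dc:subject>This is the subject</dc:subject>\n"
                + "  <b>This is the body text.</b>\n"
                + "</a>\n";
        SplitHandler h = parse(xml);
        System.out.println(h.metadata);
        System.out.println(h.body.toString().trim());
    }
}
```

Run against the sample document from the issue, the trimmed body contains only the text of the <b> element, and the three DC values end up in the map keyed by local name. A fix inside Tika would presumably have to do the equivalent before text reaches the caller's ContentHandler, since, as noted above, the handler currently sees one merged text node.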