DcXMLParser: DC metadata text in extracted body
-----------------------------------------------

                 Key: TIKA-579
                 URL: https://issues.apache.org/jira/browse/TIKA-579
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.8
         Environment: N/A
            Reporter: Scott Severtson


The DcXMLParser correctly extracts Dublin Core metadata text into the Metadata 
object, but the metadata text is included in the extracted "body". 

Sample XML document:
---
<?xml version="1.0" encoding="UTF-8"?>
<a xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:title>This is the title</dc:title>
        <dc:creator>Scott Severtson</dc:creator>
        <dc:subject>This is the subject</dc:subject>
        <b>This is the body text.</b>
</a>
---

Sample code:
---
URL xmlDocument = ...
TikaConfig tikaConfig = new TikaConfig();
// Extract the body text as a String and print it.
String body = ParseUtils.getStringContent(xmlDocument, tikaConfig, "application/xml");
System.out.println(body);
---

Actual output:
---
        This is the title
        Scott Severtson
        This is the subject
        This is the body text.
---

Expected output:
---
        This is the body text.
---

The output is the same when using ParseUtils *and* when using DcXMLParser 
directly with a ContentHandler. The ContentHandler receives a single text node 
containing the concatenated metadata and body text, so there is no opportunity 
to work around this issue externally. We would expect DcXMLParser to remove DC 
nodes from the body before extracting the body text, for consistency with 
other Tika parsers.
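
To illustrate the expected behaviour, here is a minimal sketch that uses only 
the JDK's SAX API, not Tika. It is not a proposed patch and does not reflect 
how DcXMLParser is implemented; the class name DcBodyTextSketch and the method 
extractBody are hypothetical. It collects character data but skips anything 
nested inside an element in the Dublin Core namespace.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): gathers body text while
// ignoring text inside elements in the Dublin Core namespace.
class DcBodyTextSketch extends DefaultHandler {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    private final StringBuilder body = new StringBuilder();
    private int dcDepth = 0; // greater than 0 while inside a DC element

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Enter (or go deeper into) a DC subtree.
        if (dcDepth > 0 || DC_NS.equals(uri)) {
            dcDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (dcDepth > 0) {
            dcDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text that is outside every DC element.
        if (dcDepth == 0) {
            body.append(ch, start, length);
        }
    }

    // Parses the XML string and returns the text found outside DC elements.
    static String extractBody(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed to see namespace URIs
            XMLReader reader = factory.newSAXParser().getXMLReader();
            DcBodyTextSketch handler = new DcBodyTextSketch();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.body.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Applied to the sample document above, DcBodyTextSketch.extractBody(xml).trim() 
yields only "This is the body text."; the title, creator and subject text is 
dropped from the body.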
