[ 
https://issues.apache.org/jira/browse/TIKA-579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-579:
---------------------------------
    Affects Version/s:     (was: 0.8)
                       1.8

> DcXMLParser: DC metadata text in extracted body
> -----------------------------------------------
>
>                 Key: TIKA-579
>                 URL: https://issues.apache.org/jira/browse/TIKA-579
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.8
>         Environment: N/A
>            Reporter: Scott Severtson
>
> The DcXMLParser correctly extracts Dublin Core metadata text into the 
> Metadata object, but the metadata text is included in the extracted "body". 
> Sample XML document:
> ---
> <?xml version="1.0" encoding="UTF-8"?>
> <a xmlns:dc="http://purl.org/dc/elements/1.1/">
>       <dc:title>This is the title</dc:title>
>       <dc:creator>Scott Severtson</dc:creator>
>       <dc:subject>This is the subject</dc:subject>
>       <b>This is the body text.</b>
> </a>
> ---
> Sample code:
> ---
> URL xmlDocument = ...
> TikaConfig tikaConfig = new TikaConfig();
> ParseUtils.getStringContent(xmlDocument, tikaConfig, "application/xml");
> ---
> Actual output:
> ---
>       This is the title
>       Scott Severtson
>       This is the subject
>       This is the body text.
> ---
> Expected output:
> ---
>       This is the body text.
> ---
> The output is consistent when using ParseUtils *and* when using DcXMLParser 
> directly with a ContentHandler. The ContentHandler receives a single text 
> node containing concatenated metadata and body text, so there is no 
> opportunity to externally work around this issue. We would expect DcXMLParser 
> to remove DC nodes from the body prior to extracting the body text, to be 
> more consistent with other Tika parsers.
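
For illustration, the expected behavior described above (dropping DC nodes
before collecting body text) can be sketched with the plain JDK SAX API. This
is not DcXMLParser's code and does not use any Tika classes; the class name
DcBodyTextSketch and the method extractBody are hypothetical, and the sketch
assumes it is enough to skip every element in the Dublin Core namespace
together with its descendants.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collect body text while skipping Dublin Core elements.
public class DcBodyTextSketch {

    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    static String extractBody(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Namespace awareness is required so startElement receives the DC URI.
        factory.setNamespaceAware(true);

        final StringBuilder body = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            // Greater than zero while inside a DC element or one of its children.
            private int dcDepth = 0;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (dcDepth > 0 || DC_NS.equals(uri)) {
                    dcDepth++;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (dcDepth > 0) {
                    dcDepth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Keep text only when outside every DC element.
                if (dcDepth == 0) {
                    body.append(ch, start, length);
                }
            }
        };

        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return body.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<a xmlns:dc=\"http://purl.org/dc/elements/1.1/\">\n"
            + "      <dc:title>This is the title</dc:title>\n"
            + "      <dc:creator>Scott Severtson</dc:creator>\n"
            + "      <dc:subject>This is the subject</dc:subject>\n"
            + "      <b>This is the body text.</b>\n"
            + "</a>\n";
        System.out.println(extractBody(xml)); // prints: This is the body text.
    }
}
```

The depth counter rather than a boolean flag is a deliberate choice: it keeps
text suppressed for elements nested inside a DC element as well.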



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
