John Haynes created TIKA-2057:
---------------------------------
Summary: Extract PDF DocInfo fields into separate metadata fields
Key: TIKA-2057
URL: https://issues.apache.org/jira/browse/TIKA-2057
Project: Tika
Issue Type: Improvement
Components: metadata
Affects Versions: 1.13
Reporter: John Haynes
Priority: Minor
Hi,
I have a PDF in which the title is set twice, once as Dublin Core metadata:
{code}<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">
Consumer credit cards - conditions of use
</rdf:li>
</rdf:Alt>
</dc:title>{code}
and again, with slightly different wording, in the PDF DocInfo section: {code}
/Title(Consumer Credit Card - Conditions of Use){code}
When I use Tika to convert the PDF to HTML:
{code}java -jar tika-app-1.13.jar int_Consumer_Conditions_of_use.pdf{code}
it outputs this metadata:
{code}<meta name="dc:title" content="Consumer credit cards - conditions of use"/>{code}
and this <title> tag:
{code}<title>Consumer credit cards - conditions of use</title>{code}
Both values come from the Dublin Core title, so the DocInfo title is no longer
accessible.
Would it be possible for Tika to carry the PDF DocInfo fields through the
conversion as separate metadata fields, e.g.:
{code}<meta name="docinfo:title" content="Consumer Credit Card - Conditions of Use"/>{code}
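
To illustrate the request, here is a minimal sketch of the output I have in mind. It is plain Java and does not use Tika's API: the class name, the `merge` helper, and the `docinfo:` prefix are all hypothetical, and the two input maps simply stand in for whatever the parser has already read from the XMP packet and the DocInfo section.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, NOT Tika's API: keeps DocInfo values next to the
// XMP-derived ones instead of letting one overwrite the other.
class DocInfoMetadataSketch {

    // Assumed prefix, taken from the example in this request.
    static final String DOCINFO_PREFIX = "docinfo:";

    // Returns the XMP entries unchanged, followed by every DocInfo entry
    // under a prefixed, lower-cased key (e.g. "Title" -> "docinfo:title").
    static Map<String, String> merge(Map<String, String> xmp,
                                     Map<String, String> docInfo) {
        Map<String, String> out = new LinkedHashMap<>(xmp);
        for (Map.Entry<String, String> e : docInfo.entrySet()) {
            String key = DOCINFO_PREFIX + e.getKey().toLowerCase(Locale.ROOT);
            out.put(key, e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> xmp = new LinkedHashMap<>();
        xmp.put("dc:title", "Consumer credit cards - conditions of use");

        Map<String, String> docInfo = new LinkedHashMap<>();
        docInfo.put("Title", "Consumer Credit Card - Conditions of Use");

        // Prints one <meta> element per field; values are not HTML-escaped
        // here because the sample titles contain no special characters.
        merge(xmp, docInfo).forEach((k, v) ->
                System.out.println("<meta name=\"" + k + "\" content=\"" + v + "\"/>"));
    }
}
```

With the two titles from my PDF, this would emit both the existing `dc:title` element and an additional `docinfo:title` element, so neither value is lost.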