John Haynes created TIKA-2057:
---------------------------------

             Summary: Extract PDF DocInfo fields into separate metadata fields
                 Key: TIKA-2057
                 URL: https://issues.apache.org/jira/browse/TIKA-2057
             Project: Tika
          Issue Type: Improvement
          Components: metadata
    Affects Versions: 1.13
            Reporter: John Haynes
            Priority: Minor


Hi,

I have a PDF in which the title is set twice, once as Dublin Core metadata:
{code}<dc:title>
  <rdf:Alt>
    <rdf:li xml:lang="x-default">
      Consumer credit cards - conditions of use
    </rdf:li>
  </rdf:Alt>
</dc:title>{code}

and again in the PDF DocInfo section: {code}
/Title(Consumer Credit Card - Conditions of Use){code}

When I use Tika to convert the PDF to HTML:
{code}java -jar tika-app-1.13.jar int_Consumer_Conditions_of_use.pdf{code}

it outputs this metadata:
{code}<meta name="dc:title" content="Consumer credit cards - conditions of use"/>{code}

and this <title> tag:
{code}<title>Consumer credit cards - conditions of use</title>{code}

Both are taken from the Dublin Core value, so the DocInfo title is no longer
accessible in the output.

Could Tika be adapted to carry the PDF DocInfo fields through the conversion
under a separate metadata prefix? For example:
{code}
<meta name="docinfo:title" content="Consumer Credit Card - Conditions of Use"/>{code}
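
To make the request more concrete, here is a minimal sketch of the mapping I have in mind. It does not use any Tika or PDFBox API: `DocInfoPrefixer` and `prefix` are hypothetical names, and it assumes the DocInfo entries have already been read into a map keyed by their PDF names (Title, Author, and so on). Lowercasing the key and the `docinfo:` prefix are just the convention suggested above, not anything Tika currently defines.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika or PDFBox.
final class DocInfoPrefixer {
    // Assumed prefix, taken from the example in this issue.
    static final String PREFIX = "docinfo:";

    // Copies each DocInfo entry into a new map under "docinfo:<lowercased key>",
    // so it cannot collide with (or be overwritten by) keys such as "dc:title".
    static Map<String, String> prefix(Map<String, String> docInfo) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : docInfo.entrySet()) {
            if (e.getValue() == null) {
                continue; // skip entries with no value
            }
            out.put(PREFIX + e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        return out;
    }
}
```

Calling `DocInfoPrefixer.prefix(...)` on a map containing `Title` would yield a `docinfo:title` entry that could be added alongside the existing `dc:title` entry, leaving both values available.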






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
