[ https://issues.apache.org/jira/browse/TIKA-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-2559.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0
                   1.18

Thank you!

> Expose language metadata from PDF documents
> -------------------------------------------
>
>                 Key: TIKA-2559
>                 URL: https://issues.apache.org/jira/browse/TIKA-2559
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.0
>            Reporter: Matt Sheppard
>            Priority: Major
>             Fix For: 1.18, 2.0.0
>
>         Attachments: acrobat-xi-pdf-accessibility-overview.pdf
>
>
> Tika does not currently return the language from a PDF's metadata (for an
> example PDF I'm seeking permission to share with you, and perhaps for all PDFs).
> It would be useful to me (and I imagine others) if it could do so.
> ----
> The example PDF I have does get a language when processed with exiftool...
> {noformat}
> $ exiftool -X /tmp/my-example.pdf |grep -i lang
>  <PDF:Language>en-US</PDF:Language>
> {noformat}
> whereas it does not with Tika.
>
> I looked briefly into the PDF parsing code, and it appears that the language
> value in question is available within PDFBox's document catalog, so I can
> pass it through with a change such as...
> {code:java}
> diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
> index b2a15cab6..66b1c9343 100644
> --- a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
> +++ b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
> @@ -224,7 +224,10 @@ public class PDFParser extends AbstractParser implements Initializable {
>              metadata.set(AccessPermissions.CAN_PRINT_DEGRADED,
>                      Boolean.toString(ap.canPrintDegraded()));
> -
> +            if (document.getDocumentCatalog().getLanguage() != null) {
> +                metadata.set(Metadata.CONTENT_LANGUAGE, document.getDocumentCatalog().getLanguage());
> +            }
> +
>              //now go for the XMP
>              Document dom = loadDOM(document.getDocumentCatalog().getMetadata(), metadata, context);
> diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
> index 93966e4f2..7b7ba14fe 100644
> --- a/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
> +++ b/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
> @@ -1310,6 +1310,14 @@ public class PDFParserTest extends TikaTest {
>          assertContains("Tika - Content", content);
>      }
> +    @Test
> +    public void testMissingLanguage() throws Exception {
> +        Metadata metadata = getXML("my-example.pdf").metadata;
> +        System.out.println(metadata);
> +        assertEquals("application/pdf", metadata.get(Metadata.CONTENT_TYPE));
> +        assertEquals("en-US", metadata.get(Metadata.CONTENT_LANGUAGE));
> +    }
> +
>      @Test
>      public void testConfiguringMoreParams() throws Exception {
>          try (InputStream configIs = getClass().getResourceAsStream("/org/apache/tika/parser/pdf/tika-inline-config.xml")) {
> {code}
>
> It's my first time looking at this code, so that change may be a bit naive,
> but hopefully shows what I'm getting at.
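
For illustration only, the null-guarded pass-through proposed in the patch above can be sketched without Tika or PDFBox on the classpath. Everything named here is a hypothetical stand-in: the `Catalog` interface takes the place of PDFBox's document catalog, a plain `Map` takes the place of Tika's `Metadata`, and the `"Content-Language"` key is assumed to correspond to `Metadata.CONTENT_LANGUAGE`. The sketch reads the language once into a local variable rather than calling the getter twice, which is a small variation on the proposed patch and not a description of what was committed.

```java
import java.util.HashMap;
import java.util.Map;

public class LanguagePassThroughSketch {

    // Assumed metadata key; stands in for Metadata.CONTENT_LANGUAGE.
    static final String CONTENT_LANGUAGE = "Content-Language";

    // Hypothetical stand-in for the part of the document catalog
    // that exposes the document-level language.
    interface Catalog {
        String getLanguage();
    }

    // Copies the catalog language into the metadata map only when it is present,
    // so documents without a language entry leave the key unset.
    static void copyLanguage(Catalog catalog, Map<String, String> metadata) {
        String language = catalog.getLanguage(); // read once, reuse below
        if (language != null) {
            metadata.put(CONTENT_LANGUAGE, language);
        }
    }

    public static void main(String[] args) {
        Map<String, String> withLanguage = new HashMap<>();
        copyLanguage(() -> "en-US", withLanguage);
        System.out.println(withLanguage);

        Map<String, String> withoutLanguage = new HashMap<>();
        copyLanguage(() -> null, withoutLanguage);
        System.out.println(withoutLanguage);
    }
}
```

The guard matters for the second case: a PDF with no language in its catalog produces no metadata entry at all, rather than an entry with a null value.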

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)