Peter Hoogendijk created TIKA-4442:
--------------------------------------
Summary: PDFParser does not list all metadata extracted by PDFBox
Key: TIKA-4442
URL: https://issues.apache.org/jira/browse/TIKA-4442
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 3.2.0
Environment: * Docker container based on python:3-slim
* Debian 12.11
* Python 3.13.5
* openjdk 17.0.15 2025-04-15
* tika-server-standard-3.2.0.jar
* pdfbox-app-3.0.5.jar
* PyPDF2 3.0.1
Reporter: Peter Hoogendijk
While using Apache Tika to extract metadata from PDF files, I found the
following XMP metadata entries to be missing:
* dc:identifier
* dc:language
* dc:publisher
* dc:relation
* dc:source
* dc:type
Both PyPDF2 (Python) and PDFBox (the library behind Tika's PDFParser) do
report these XMP metadata entries, so I expected Apache Tika to extract them
as well.
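
One way to reproduce the comparison is sketched below. It is a minimal illustration, not part of the original report: the helper names (missing_keys, fetch_tika_metadata) are hypothetical, the server address assumes tika-server on its default port 9998, and it assumes Tika would report these entries under the same key names as the XMP properties.

```python
import json
import urllib.request

# Dublin Core entries named in this report as missing from Tika's output.
EXPECTED_DC_KEYS = [
    "dc:identifier",
    "dc:language",
    "dc:publisher",
    "dc:relation",
    "dc:source",
    "dc:type",
]


def missing_keys(metadata, expected=EXPECTED_DC_KEYS):
    """Return the expected keys that are absent from a metadata mapping."""
    return [key for key in expected if key not in metadata]


def fetch_tika_metadata(pdf_path, server_url="http://localhost:9998"):
    """PUT a file to tika-server's /meta endpoint and return the parsed JSON.

    Assumption: a tika-server instance is listening at server_url.
    """
    with open(pdf_path, "rb") as handle:
        request = urllib.request.Request(
            f"{server_url}/meta",
            data=handle.read(),
            headers={"Accept": "application/json"},
            method="PUT",
        )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, missing_keys(fetch_tika_metadata("sample.pdf")) would list the entries Tika did not return for that file ("sample.pdf" is a placeholder path).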