[
https://issues.apache.org/jira/browse/TIKA-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17986100#comment-17986100
]
Hudson commented on TIKA-4442:
------------------------------
SUCCESS: Integrated in Jenkins build Tika » tika-branch_3x-jdk11 #2095 (See
[https://ci-builds.apache.org/job/Tika/job/tika-branch_3x-jdk11/2095/])
TIKA-4442: add test (tilman:
[https://github.com/apache/tika/commit/d74d50edb2ef5f07339b5f8ebde2f26865cc14ae])
* (add)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/xmp/TIKA-4442.xmp
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/CustomTikaXMPTest.java
> PDFParser does not list all metadata extracted by PDFBox
> --------------------------------------------------------
>
> Key: TIKA-4442
> URL: https://issues.apache.org/jira/browse/TIKA-4442
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 3.2.0
> Environment: * Docker container based on python:3-slim
> * Debian 12.11
> * Python 3.13.5
> * openjdk 17.0.15 2025-04-15
> * tika-server-standard-3.2.0.jar
> * pdfbox-app-3.0.5.jar
> * PyPDF 5.6.1
> Reporter: Peter Hoogendijk
> Priority: Major
> Labels: xmp
> Attachments: lorem-ipsum.pdf, lorem-ipsum.xml
>
>
> While using Apache Tika to extract metadata from PDF files, I found the
> following XMP metadata entries to be missing:
> * dc:identifier
> * dc:language
> * dc:publisher
> * dc:relation
> * dc:source
> * dc:type
> Python (PyPDF2) and PDFBox (as used by Tika's PDFParser) do show these XMP
> metadata entries, so I expected Apache Tika to also extract these entries.
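
One way to check which of the six reported properties a given XMP packet actually carries, independently of Tika or PDFBox, is to read the packet with a plain XML parser. The following is a minimal sketch using only the Python standard library. The sample packet, the function names and the REPORTED list are assumptions made for illustration; the packet is not the attached lorem-ipsum.xml, and this is not how Tika or PDFBox read XMP.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The six Dublin Core properties the report lists as missing.
REPORTED = ["identifier", "language", "publisher", "relation", "source", "type"]

# Hypothetical sample packet for illustration only.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:identifier>urn:example:1</dc:identifier>
   <dc:language><rdf:Bag><rdf:li>en</rdf:li></rdf:Bag></dc:language>
   <dc:publisher><rdf:Bag><rdf:li>Example Press</rdf:li></rdf:Bag></dc:publisher>
   <dc:source>example-source</dc:source>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def dublin_core_values(xmp_text):
    """Return {property name: [values]} for dc:* elements in an XMP packet."""
    root = ET.fromstring(xmp_text)
    found = {}
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        for child in desc:
            if not child.tag.startswith(f"{{{DC_NS}}}"):
                continue  # not a Dublin Core property
            name = child.tag.split("}", 1)[1]
            # Array-valued properties (Bag/Seq/Alt) keep their values in rdf:li.
            items = [li.text or "" for li in child.iter(f"{{{RDF_NS}}}li")]
            # Simple properties carry the value as direct element text.
            if not items and child.text and child.text.strip():
                items = [child.text.strip()]
            found.setdefault(name, []).extend(items)
    return found


def missing_reported(xmp_text):
    """List the reported properties that do not appear in the packet."""
    found = dublin_core_values(xmp_text)
    return [name for name in REPORTED if name not in found]
```

Calling missing_reported() on a packet extracted from a PDF gives the subset of the six properties that the packet itself lacks; any property that is present there but absent from Tika's output would fall under this issue.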
--
This message was sent by Atlassian Jira
(v8.20.10#820010)