[
https://issues.apache.org/jira/browse/TIKA-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18002673#comment-18002673
]
Peter Hoogendijk commented on TIKA-4449:
----------------------------------------
As long as I can determine the original "raw" metadata, in my case for
"xmp-dc:subject", I'll be happy. Right now, using
SNAPSHOT/tika-server-standard-3.2.2-20250705.215057-26.jar, I still have to
fall back to removing the merged metadata entries to be able to extract the
original "xmp-dc:subject". I'll test each new snapshot to see whether it makes
this easier, checking the response of the /meta endpoint with my
"lorem-ipsum.pdf" test file.
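
A minimal sketch of such a check, assuming a tika-server instance listening on
its default port 9998 and returning JSON from /meta when asked via the Accept
header. The helper names (fetch_metadata, values_for) are hypothetical, and
whether a given snapshot actually emits "xmp-dc:subject" is exactly what the
check is meant to find out.

```python
import json
import urllib.request

# Assumption: tika-server running locally on its default port.
TIKA_META_URL = "http://localhost:9998/meta"


def fetch_metadata(pdf_path, url=TIKA_META_URL):
    """PUT the file to the /meta endpoint and return the metadata as a dict."""
    with open(pdf_path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Accept": "application/json", "Content-Type": "application/pdf"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def values_for(metadata, key):
    """Return the values stored under key as a list of strings.

    The JSON response may hold a single string or a list of strings
    per key, so both shapes are normalized to a list here.
    """
    value = metadata.get(key)
    if value is None:
        return []
    if isinstance(value, list):
        return [str(v) for v in value]
    return [str(value)]
```

With a server running, values_for(fetch_metadata("lorem-ipsum.pdf"),
"xmp-dc:subject") returns the values reported under that key, or an empty list
if the key is absent.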
> Improve xmp metadata key precision for PDFs
> -------------------------------------------
>
> Key: TIKA-4449
> URL: https://issues.apache.org/jira/browse/TIKA-4449
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> PDFs (and other file formats) may have conflicting information within them
> about, for example, the "title" field or the "author" field.
> Tika's parsers typically pick one source over another and normalize the keys
> to Dublin Core or other standards.
> [~peterhoogendijk] and other users (likely?) want to be able to identify
> whether a given piece of information comes from the XMP or the docinfo. This
> is follow-on work from TIKA-4444. The proposal is to add new metadata keys to
> specify when Dublin Core information comes directly from XMP.