[
https://issues.apache.org/jira/browse/TIKA-4466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-4466:
------------------------------
Attachment: image-2025-08-15-10-35-10-476.png
> OPFParser: Only the last dc:identifier is parsed, while multiple are valid.
> ---------------------------------------------------------------------------
>
> Key: TIKA-4466
> URL: https://issues.apache.org/jira/browse/TIKA-4466
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 3.2.2
> Reporter: Grigorii Ioffe
> Priority: Major
> Attachments: image-2025-08-15-10-35-10-476.png
>
>
> I have an ePub file whose metadata is stored in an OPF file with multiple
> dc:identifier fields, but when parsing it OPFParser extracts only the last
> one.
> For example, if an OPF file inside the ePub contains these dc:identifier entries:
> {code:java}
> <dc:identifier>isbn:9780765350381</dc:identifier>
> <dc:identifier>mobi-asin:JD4PTHPBGIAQYZUBFUU3VFPVEUKY7S3U</dc:identifier>
> <dc:identifier>amazon:0765350386</dc:identifier>
> <dc:identifier>goodreads:243272</dc:identifier>
> <dc:identifier>calibre:55</dc:identifier>
> <dc:identifier>uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595</dc:identifier>
> <dc:identifier id="uuid_id">uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595</dc:identifier>
> {code}
> only uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595 will be in the parsed metadata.
> According to the Dublin Core spec this is a valid situation, as identifier
> is marked as repeatable:
> [https://www.w3.org/TR/epub-33/#sec-opf-dcidentifier]
> My investigation showed that the field is created with PropertyType.SIMPLE
> here:
> `org.apache.tika.metadata/DublinCore.class:60`
> as a result,
> `org.apache.tika.metadata/Property.class:272`
> returns false, and therefore each entry overwrites the previously stored
> value instead of being added to an array.
>
> Also, this is not the only field with an incorrect type definition. It looks
> like title, language, description and some other fields are also defined
> incorrectly (or at least parsed incorrectly in OPFParser and DcXMLParser).
>
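A minimal sketch of the overwrite-versus-append behaviour described above. It uses only the Java standard library: MetadataSketch and its set/add/getValues methods are hypothetical stand-ins written for illustration, not Tika's Metadata or Property classes, and the assumption is simply that a single-valued field replaces its value on each write while a multi-valued field accumulates values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a metadata store; this is NOT Tika's Metadata class.
class MetadataSketch {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Single-valued behaviour: each call replaces whatever was stored before.
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    // Multi-valued behaviour: each call appends to the list stored under the name.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Returns the stored values, or an empty list if the name is unknown.
    List<String> getValues(String name) {
        return values.getOrDefault(name, new ArrayList<>());
    }

    public static void main(String[] args) {
        String[] identifiers = {
            "isbn:9780765350381",
            "amazon:0765350386",
            "uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595"
        };
        MetadataSketch singleValued = new MetadataSketch();
        MetadataSketch multiValued = new MetadataSketch();
        for (String id : identifiers) {
            singleValued.set("dc:identifier", id); // keeps only the latest value
            multiValued.add("dc:identifier", id);  // keeps every value
        }
        System.out.println(singleValued.getValues("dc:identifier"));
        System.out.println(multiValued.getValues("dc:identifier"));
    }
}
```

With the single-valued path only the last identifier survives, which matches the symptom in the report; the multi-valued path keeps all of them in document order.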
--
This message was sent by Atlassian Jira
(v8.20.10#820010)