Grigorii Ioffe created TIKA-4466:
------------------------------------
Summary: OPFParser extracts DublinCore fields partially
Key: TIKA-4466
URL: https://issues.apache.org/jira/browse/TIKA-4466
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 3.2.2
Reporter: Grigorii Ioffe
I have an ePub file whose metadata is stored in an OPF file with multiple
dc:identifier fields. When parsing it, OPFParser extracts only the last one.
For example, if an OPF file inside an ePub contains these dc:identifier entries:
{code:xml}
<dc:identifier>isbn:9780765350381</dc:identifier>
<dc:identifier>mobi-asin:JD4PTHPBGIAQYZUBFUU3VFPVEUKY7S3U</dc:identifier>
<dc:identifier>amazon:0765350386</dc:identifier>
<dc:identifier>goodreads:243272</dc:identifier>
<dc:identifier>calibre:55</dc:identifier>
<dc:identifier>uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595</dc:identifier>
<dc:identifier id="uuid_id">uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595</dc:identifier>
{code}
only uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595 ends up in the parsed metadata.
According to the EPUB 3.3 spec's description of dc:identifier, this is a valid
situation because the element is repeatable:
[https://www.w3.org/TR/epub-33/#sec-opf-dcidentifier]
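To make the expected result concrete, here is a minimal sketch that reads every dc:identifier value from an OPF snippet using only the JDK's DOM parser. It does not use Tika at all; the class name `OpfIdentifierSketch` and the trimmed sample document are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: lists all dc:identifier values in document order.
class OpfIdentifierSketch {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    static List<String> identifiers(String opfXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Namespace awareness is needed to match elements by the Dublin Core namespace URI.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(opfXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagNameNS(DC_NS, "identifier");
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse OPF sample", e);
        }
    }

    public static void main(String[] args) {
        // Shortened version of the entries from this report.
        String opf = "<package xmlns=\"http://www.idpf.org/2007/opf\">"
                + "<metadata xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<dc:identifier>isbn:9780765350381</dc:identifier>"
                + "<dc:identifier>calibre:55</dc:identifier>"
                + "<dc:identifier id=\"uuid_id\">"
                + "uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595</dc:identifier>"
                + "</metadata></package>";
        for (String id : identifiers(opf)) {
            System.out.println(id);
        }
    }
}
```

I would expect the parsed Tika metadata to hold the same list of values rather than only the last one.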
My investigation showed that the field is created with PropertyType.SIMPLE here:
`org.apache.tika.metadata/DublinCore.class:60`
As a result,
`org.apache.tika.metadata/Property.class:272`
returns false, so each entry overwrites the previously stored value instead of
being appended to an array.
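The effect can be illustrated with a small stand-alone model. This is only a sketch of the behaviour as I understand it, not Tika's actual code: `MultiValueSketch` and its `record` method are hypothetical names, and the `multiValued` flag stands in for the check that returns false for a SIMPLE property.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a metadata store; it only models overwrite vs. append.
class MultiValueSketch {
    private final Map<String, List<String>> values = new HashMap<>();

    // multiValued = false models a SIMPLE property, true models a repeatable one.
    void record(String name, String value, boolean multiValued) {
        List<String> current = values.computeIfAbsent(name, k -> new ArrayList<>());
        if (!multiValued) {
            current.clear(); // a single-valued field keeps only the newest entry
        }
        current.add(value);
    }

    List<String> get(String name) {
        return values.getOrDefault(name, Collections.emptyList());
    }

    public static void main(String[] args) {
        String[] ids = {"isbn:9780765350381", "calibre:55",
                "uuid:7dcb83b5-7364-4e29-9e5c-1d7b966a3595"};

        MultiValueSketch simple = new MultiValueSketch();
        MultiValueSketch repeatable = new MultiValueSketch();
        for (String id : ids) {
            simple.record("dc:identifier", id, false);
            repeatable.record("dc:identifier", id, true);
        }
        System.out.println("simple:     " + simple.get("dc:identifier"));
        System.out.println("repeatable: " + repeatable.get("dc:identifier"));
    }
}
```

With the single-valued setting only the uuid entry survives, which matches what I see in the parsed metadata. With the repeatable setting all entries are kept.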
Also, this is not the only field with an incorrect type definition. It looks
like title, language, description and some other fields are also defined
incorrectly (or at least parsed incorrectly in OPFParser and DcXMLParser).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)