Tim Allison created PDFBOX-2855:
-----------------------------------
Summary: Allow some flexibility for divergences from the standard
on Seq vs Bag in DomXMPParser
Key: PDFBOX-2855
URL: https://issues.apache.org/jira/browse/PDFBOX-2855
Project: PDFBox
Issue Type: Improvement
Reporter: Tim Allison
While working on TIKA-1285 (migrate to PDFBox 2.0.0), [~rpialum] noticed that
the DomXmpParser was failing on some XMP with:
{noformat}
org.apache.xmpbox.xml.XmpParsingException: Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
    at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:449)
    at org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:338)
    at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:305)
    at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:234)
    at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:198)
    at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:202)
{noformat}
One file that triggers this is available on TIKA-1252 [here|https://issues.apache.org/jira/secure/attachment/12632592/Sample%20%28Acrobat%204.x%29.pdf].
The raw XMP for that file includes:
{noformat}
<dc:creator>
  <rdf:Bag>
    <rdf:li>Single Author</rdf:li>
  </rdf:Bag>
</dc:creator>
{noformat}
On TIKA-1252, I confirmed that this is against the spec, which expects an rdf:Seq for dc:creator
[link|https://issues.apache.org/jira/browse/TIKA-1252?focusedCommentId=13919846&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13919846]
and [[email protected]] confirmed that this was what Acrobat was
generating
[link|https://issues.apache.org/jira/browse/TIKA-1252?focusedCommentId=13919858&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13919858].
So, would it be easy enough to allow for some divergence from the standard?
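To make the request concrete, here is a minimal sketch of the kind of leniency I have in mind, using only the JDK DOM API. The names LenientArrayCheck, findArrayType and accepts are hypothetical and are not part of xmpbox; a real change would presumably live somewhere around DomXmpParser.manageArray and may well look different. The assumption is that, when parsing is not strict, a Bag is tolerated where a Seq is expected (and vice versa), while any other mismatch is still rejected.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not an xmpbox class.
class LenientArrayCheck {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /**
     * Returns the local name ("Seq", "Bag" or "Alt") of the first RDF
     * container element directly under the given property, or null if
     * the property or the container is missing.
     */
    static String findArrayType(String xml, String propNs, String propName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagNameNS(propNs, propName);
            if (props.getLength() == 0) {
                return null;
            }
            // Skip whitespace text nodes and return the first rdf:* child element.
            for (Node n = props.item(0).getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE && RDF_NS.equals(n.getNamespaceURI())) {
                    return n.getLocalName();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMP fragment", e);
        }
    }

    /**
     * Strict mode: the found type must equal the expected type.
     * Lenient mode (assumed policy): Seq and Bag are interchangeable.
     */
    static boolean accepts(String expected, String found, boolean strict) {
        if (expected.equals(found)) {
            return true;
        }
        if (strict) {
            return false;
        }
        return ("Seq".equals(expected) && "Bag".equals(found))
                || ("Bag".equals(expected) && "Seq".equals(found));
    }
}
```

With the fragment above wrapped in an rdf:Description that declares the rdf and dc namespaces, findArrayType reports "Bag" for dc:creator; accepts("Seq", "Bag", true) rejects it, while accepts("Seq", "Bag", false) lets it through.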
Code to reproduce the issue in a Tika setup:
{noformat}
@Test
public void oneOffMetadataTest() throws Exception {
    // try-with-resources so the document is closed even when parse() throws
    try (PDDocument doc = PDDocument.load(
            this.getClass().getResourceAsStream("/test-documents/sampleAcrobat_4_x.pdf"))) {
        DomXmpParser p = new DomXmpParser();
        // strict parsing is switched off here
        p.setStrictParsing(false);
        p.parse(doc.getDocumentCatalog().getMetadata().exportXMPMetadata());
    }
}
{noformat}