[ https://issues.apache.org/jira/browse/PDFBOX-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617227#comment-14617227 ]

Tim Allison commented on PDFBOX-2855:
-------------------------------------

Oh...Ok.  I guess we'll have to keep our own copy of JempBox or roll our own?

Will investigate to see how much of an issue this is before heading in that 
direction.

Should I resolve this as "won't fix"? 

Thank you!

> Allow some flexibility for divergences from the standard on Seq vs Bag in 
> DomXMPParser
> --------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2855
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2855
>             Project: PDFBox
>          Issue Type: Improvement
>            Reporter: Tim Allison
>
> While working on TIKA-1285 (migrate to PDFBox 2.0.0), [~rpialum] noticed that 
> the DomXmpParser was failing on some XMP with:
> {noformat}
> org.apache.xmpbox.xml.XmpParsingException: Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
>       at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:449)
>       at org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:338)
>       at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:305)
>       at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:234)
>       at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:198)
>       at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:202)
> {noformat}
> One file that triggers this is available on TIKA-1252 
> [here|https://issues.apache.org/jira/secure/attachment/12632592/Sample%20%28Acrobat%204.x%29.pdf].
> The raw xmp for that file includes: 
> {noformat}
>          <dc:creator>
>             <rdf:Bag>
>                <rdf:li>Single Author</rdf:li>
>             </rdf:Bag>
>          </dc:creator>
> {noformat}
> On TIKA-1252, I confirmed that this is against the spec 
> [link|https://issues.apache.org/jira/browse/TIKA-1252?focusedCommentId=13919846&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13919846]
>  and [[email protected]] confirmed that this was what Acrobat was 
> generating 
> [link|https://issues.apache.org/jira/browse/TIKA-1252?focusedCommentId=13919858&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13919858].
> So, would it be easy enough to allow for some divergence from the standard?
> Code to reproduce issue in tika setup:
> {noformat}
>     @Test
>     public void oneOffMetadataTest() throws Exception {
>         PDDocument doc = PDDocument.load(
>                 this.getClass().getResourceAsStream("/test-documents/sampleAcrobat_4_x.pdf"));
>         DomXmpParser p = new DomXmpParser();
>         p.setStrictParsing(false);
>         p.parse(doc.getDocumentCatalog().getMetadata().exportXMPMetadata());
>     }
> {noformat}
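
For illustration only, here is a minimal sketch of the kind of tolerance the issue asks for: accept rdf:Bag where the spec calls for rdf:Seq under dc:creator, unless strict mode is requested. It uses only the JDK DOM API. The class name LenientCreatorReader, the method readCreators, and the strict flag are hypothetical and are not part of xmpbox or DomXmpParser; this does not show how xmpbox handles array types internally.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of xmpbox.
final class LenientCreatorReader {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Collects the rdf:li values under every dc:creator element.
    // strict == true: only rdf:Seq is accepted.
    // strict == false: rdf:Bag is tolerated as well.
    static List<String> readCreators(String xml, boolean strict) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the *NS lookups below
            doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }

        List<String> values = new ArrayList<>();
        NodeList creators = doc.getElementsByTagNameNS(DC_NS, "creator");
        for (int i = 0; i < creators.getLength(); i++) {
            for (Node n = creators.item(i).getFirstChild(); n != null; n = n.getNextSibling()) {
                // Skip whitespace text nodes and anything outside the RDF namespace.
                if (n.getNodeType() != Node.ELEMENT_NODE || !RDF_NS.equals(n.getNamespaceURI())) {
                    continue;
                }
                String type = n.getLocalName();
                boolean accepted = "Seq".equals(type) || (!strict && "Bag".equals(type));
                if (!accepted) {
                    throw new IllegalArgumentException(
                            "Invalid array type, expecting Seq and found " + type);
                }
                NodeList items = ((Element) n).getElementsByTagNameNS(RDF_NS, "li");
                for (int j = 0; j < items.getLength(); j++) {
                    values.add(items.item(j).getTextContent().trim());
                }
            }
        }
        return values;
    }
}
```

Called with strict set to false on the Bag-based snippet quoted above (wrapped in an element that declares the rdf and dc namespaces), the sketch returns the single creator value; with strict set to true it rejects the Bag.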


