[ https://issues.apache.org/jira/browse/PDFBOX-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617209#comment-14617209 ]

Tilman Hausherr commented on PDFBOX-2855:
-----------------------------------------

I doubt that this will be possible. xmpbox isn't "lenient"; much of it is
driven by tables, as here in DublinCoreSchema.java:
{code}
    @PropertyType(type = Types.Text, card = Cardinality.Seq)
    public static final String CREATOR = "creator";
{code}
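
To illustrate what "done with tables" means, here is a minimal standalone sketch. It does not use xmpbox itself: the annotation, the enum, the `DublinCoreLike` class and the `accepts` helper are all hypothetical stand-ins, and the lenient rule (tolerate a Bag where a Seq is declared) is only an assumption about what the requested flexibility could look like, not how xmpbox works or would implement it.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashMap;
import java.util.Map;

public class CardinalityTableSketch {

    // Hypothetical stand-ins for the xmpbox Cardinality enum and @PropertyType annotation.
    enum Cardinality { Simple, Bag, Seq, Alt }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface PropertyType {
        Cardinality card() default Cardinality.Simple;
    }

    // Hypothetical schema class that declares its properties as annotated constants.
    static class DublinCoreLike {
        @PropertyType(card = Cardinality.Seq)
        public static final String CREATOR = "creator";

        @PropertyType(card = Cardinality.Bag)
        public static final String SUBJECT = "subject";
    }

    // Build a "property name -> declared cardinality" table by reflecting over the constants.
    static Map<String, Cardinality> buildTable(Class<?> schema) {
        Map<String, Cardinality> table = new HashMap<>();
        for (Field f : schema.getDeclaredFields()) {
            PropertyType pt = f.getAnnotation(PropertyType.class);
            if (pt == null || !Modifier.isStatic(f.getModifiers()) || f.getType() != String.class) {
                continue;
            }
            try {
                table.put((String) f.get(null), pt.card());
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read " + f.getName(), e);
            }
        }
        return table;
    }

    // Strict mode requires an exact match. The lenient rule is an assumption:
    // it only tolerates a Bag where the table declares a Seq.
    static boolean accepts(Cardinality declared, Cardinality found, boolean lenient) {
        if (declared == found) {
            return true;
        }
        return lenient && declared == Cardinality.Seq && found == Cardinality.Bag;
    }

    public static void main(String[] args) {
        Map<String, Cardinality> table = buildTable(DublinCoreLike.class);
        Cardinality declared = table.get("creator");
        System.out.println("strict: " + accepts(declared, Cardinality.Bag, false));
        System.out.println("lenient: " + accepts(declared, Cardinality.Bag, true));
    }
}
```

Running `main` prints `strict: false` and then `lenient: true`. The point of the sketch is that the expected array type comes from the annotation on the constant, so any flexibility would have to be added where the parser compares the declared type with the one found in the XMP.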

> Allow some flexibility for divergences from the standard on Seq vs Bag in DomXmpParser
> --------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2855
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2855
>             Project: PDFBox
>          Issue Type: Improvement
>            Reporter: Tim Allison
>
> While working on TIKA-1285 (migrate to PDFBox 2.0.0), [~rpialum] noticed that 
> the DomXmpParser was failing on some XMP with:
> {noformat}
> org.apache.xmpbox.xml.XmpParsingException: Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
>       at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:449)
>       at org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:338)
>       at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:305)
>       at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:234)
>       at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:198)
>       at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:202)
> {noformat}
> One file that triggers this is available on TIKA-1252 [here|https://issues.apache.org/jira/secure/attachment/12632592/Sample%20%28Acrobat%204.x%29.pdf].
> The raw XMP for that file includes:
> {noformat}
>          <dc:creator>
>             <rdf:Bag>
>                <rdf:li>Single Author</rdf:li>
>             </rdf:Bag>
>          </dc:creator>
> {noformat}
> On TIKA-1252, I confirmed that this is against the spec 
> [link|https://issues.apache.org/jira/browse/TIKA-1252?focusedCommentId=13919846&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13919846]
>  and [[email protected]] confirmed that this was what Acrobat was 
> generating 
> [link|https://issues.apache.org/jira/browse/TIKA-1252?focusedCommentId=13919858&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13919858].
> So, would it be easy enough to allow for some divergence from the standard?
> Code to reproduce the issue in a Tika setup:
> {noformat}
>     @Test
>     public void oneOffMetadataTest() throws Exception {
>         PDDocument doc = PDDocument.load(
>                 this.getClass().getResourceAsStream("/test-documents/sampleAcrobat_4_x.pdf"));
>         DomXmpParser p = new DomXmpParser();
>         p.setStrictParsing(false);
>         p.parse(doc.getDocumentCatalog().getMetadata().exportXMPMetadata());
>     }
> {noformat}


