[
https://issues.apache.org/jira/browse/PDFBOX-5976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048589#comment-18048589
]
Tilman Hausherr commented on PDFBOX-5976:
-----------------------------------------
I've reverted the change made for this issue (but kept the test) because it is
no longer needed. The real cause of the problem, namely that the namespace
attributes in rdf:RDF were ignored, is fixed in PDFBOX-6099. The change here
wasn't wrong, just not optimal. My understanding of the code is better now.
Many changes have been made in the last few weeks, and the parser now has very
high test coverage. Make sure that you have a lot of tests on your side and run
them regularly before releases.
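
To illustrate the scoping rule involved: in XML, a prefix declared on rdf:RDF
is in scope for every nested rdf:Description. The following is a minimal
sketch using only the JDK's namespace-aware DOM API, not the XmpBox code
itself; the class name NamespaceScopeDemo and its helper firstDescription are
hypothetical and exist only for this illustration. The XML string has the same
shape as the first packet in the report quoted below.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; it relies on the JDK DOM API only, not on XmpBox.
class NamespaceScopeDemo {

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/";

    // All prefixes are declared on the root element rdf:RDF only.
    static final String XMP =
        "<rdf:RDF xmlns:pdf=\"http://ns.adobe.com/pdf/1.3/\""
        + " xmlns:pdfaid=\"http://www.aiim.org/pdfa/ns/id/\""
        + " xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
        + "<rdf:Description rdf:about=\"\" pdfaid:part=\"3\" pdfaid:conformance=\"B\" />"
        + "<rdf:Description rdf:about=\"\" pdf:Producer=\"WeasyPrint 64.1\" />"
        + "</rdf:RDF>";

    // Parses the packet namespace-aware and returns the first rdf:Description.
    static Element firstDescription(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in the JDK
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return (Element) doc.getElementsByTagNameNS(RDF_NS, "Description").item(0);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XMP packet", e);
        }
    }

    public static void main(String[] args) {
        Element description = firstDescription(XMP);

        // The element itself carries no xmlns:pdfaid attribute ...
        System.out.println(description.hasAttribute("xmlns:pdfaid")); // false

        // ... yet the prefix resolves, because the declaration on rdf:RDF is in scope.
        System.out.println(description.lookupNamespaceURI("pdfaid"));
        System.out.println(description.getAttributeNS(PDFAID_NS, "part")); // 3
    }
}
```

A parser that only inspects the xmlns attributes present on rdf:Description
itself would miss the pdfaid declaration here, which matches the symptom in
the report.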
> DomXmpParser incorrectly expects namespaces on attribute level
> --------------------------------------------------------------
>
> Key: PDFBOX-5976
> URL: https://issues.apache.org/jira/browse/PDFBOX-5976
> Project: PDFBox
> Issue Type: Bug
> Components: XmpBox
> Affects Versions: 2.0.33, 3.0.4 PDFBox
> Reporter: Jochen Stärk
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: xml
> Fix For: 2.0.34, 3.0.5 PDFBox, 4.0.0
>
> Attachments: AN-10005_v28_2025-03-19-2.pdf,
> AN-10005_v28_2025-03-19x-1.pdf
>
>
> When trying to determine the PDF/A version with code like
> {{try (PDDocument document = Loader.loadPDF(new File("AN-10005_v28_2025-03-19.pdf"))) {}}
> {{  PDDocumentCatalog catalog = document.getDocumentCatalog();}}
> {{  PDMetadata metadata = catalog.getMetadata();}}
> {{  DomXmpParser xmpParser = new DomXmpParser();}}
> {{  XMPMetadata xmp = xmpParser.parse(metadata.createInputStream());}}
> {{  PDFAIdentificationSchema pdfaSchema = xmp.getPDFAIdentificationSchema();}}
> {{  if (pdfaSchema != null) {}}
> {{    System.out.println("It's a PDF A-" + pdfaSchema.getPart());}}
> {{  }}}
> {{} catch (XmpParsingException e) {}}
> {{  e.printStackTrace();}}
> {{} catch (IOException e) {}}
> {{  e.printStackTrace();}}
> {{}}}
> on the attached (and valid) PDF/A-3b file AN-10005_v28_2025-03-19-2.pdf, PDFBox
> incorrectly fails with:
>
> {{org.apache.xmpbox.xml.XmpParsingException: Schema is not set in this document : http://www.aiim.org/pdfa/ns/id/}}
> {{ at org.apache.xmpbox.xml.DomXmpParser.checkPropertyDefinition(DomXmpParser.java:920)}}
> {{ at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRootAttr(DomXmpParser.java:276)}}
> {{ at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:247)}}
> {{ at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:201)}}
> {{ at de.usegroup.Main.main(Main.java:25)}}
>
> After manipulating the metadata stream with iText RUPS from
> {{<rdf:RDF xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
> xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"
> xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><rdf:Description
> rdf:about="" pdfaid:part="3" pdfaid:conformance="B" /><rdf:Description
> rdf:about="" pdf:Producer="WeasyPrint 64.1" /></rdf:RDF>}}
> to
> {{ <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">}}
> {{ <rdf:Description rdf:about=""}}
> {{ xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"}}
> {{ xmlns:pdf="http://ns.adobe.com/pdf/1.3/"}}
> {{ xmlns:xmp="http://ns.adobe.com/xap/1.0/"}}
> {{ pdfaid:conformance="B"}}
> {{ pdfaid:part="3"}}
> {{ pdf:Producer="WeasyPrint 64.1; modified using iText® Core 7.2.5
> (AGPL version) ©2000-2023 iText Group NV"}}
> {{ xmp:ModifyDate="2025-03-21T08:16:58+01:00"/>}}
> {{ </rdf:RDF>}}
> with the namespace definitions moved into the rdf:Description element
> (AN-10005_v28_2025-03-19x-1.pdf), parsing works.
> The point is that declaring the namespaces on the root element, rdf:RDF,
> should be sufficient, i.e. the first example should also work.
>
> While searching for similar issues, I had the impression that this may be
> related to PDFBOX-2913.