[ 
https://issues.apache.org/jira/browse/PDFBOX-3471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433376#comment-15433376
 ] 

Maruan Sahyoun commented on PDFBOX-3471:
----------------------------------------

{{root.removeChild(node)}} modifies the live NodeList, i.e. the list no longer 
contains the number of entries it had when the loop was entered, so the next 
iteration lands on the wrong index and a node is skipped. A proposed fix is 
something like this:

{code}
    private void removeComments(Node root)
    {
        // will hold the nodes which are to be deleted
        List<Node> forDeletion = new ArrayList<Node>();
        
        NodeList nl = root.getChildNodes();
        
        if (nl.getLength()<=1) 
        {
            // There is at most one child node, so we leave it in place
            return;
        }
        
        for (int i = 0; i < nl.getLength(); i++) 
        {
            Node node = nl.item(i);
            if (node instanceof Comment)
            {
                // comments to be deleted
                forDeletion.add(node);
            }
            else if (node instanceof Text)
            {
                if (node.getTextContent().trim().isEmpty())
                {
                    // TODO: verify why this is necessary
                    // empty text nodes to be deleted
                    forDeletion.add(node);
                }
            }
            else if (node instanceof Element)
            {
                // clean child
                removeComments(node);
            } // else do nothing
        }

        // now remove the child nodes
        for (Node node : forDeletion)
        {
            root.removeChild(node);
        }
    }
{code}

which makes sure that all nodes are visited and the removal is done outside the 
loop.
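
The difference can be reproduced with the JDK's built-in DOM parser alone. The 
following is only a minimal sketch and not the XmpBox code: the class name 
{{LiveNodeListDemo}}, its method names and the sample XML are hypothetical and 
chosen for illustration. It removes comments once inside the loop and once 
after it, and reports how many comments survive in each case.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Comment;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of PDFBox or XmpBox.
final class LiveNodeListDemo
{
    // Assumed sample input: two adjacent comments followed by one element.
    static final String XML = "<root><!--a--><!--b--><child/></root>";

    static Node parseRoot()
    {
        try
        {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement();
        }
        catch (Exception e)
        {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    static int countComments(Node root)
    {
        int count = 0;
        NodeList nl = root.getChildNodes();
        for (int i = 0; i < nl.getLength(); i++)
        {
            if (nl.item(i) instanceof Comment)
            {
                count++;
            }
        }
        return count;
    }

    // Removal inside the loop: the live NodeList shrinks, so the node that
    // slides into index i is never examined.
    static int removeInsideLoop(Node root)
    {
        NodeList nl = root.getChildNodes();
        for (int i = 0; i < nl.getLength(); i++)
        {
            Node node = nl.item(i);
            if (node instanceof Comment)
            {
                root.removeChild(node);
            }
        }
        return countComments(root);
    }

    // Removal after the loop: every node is visited before the list changes.
    static int removeAfterLoop(Node root)
    {
        List<Node> forDeletion = new ArrayList<Node>();
        NodeList nl = root.getChildNodes();
        for (int i = 0; i < nl.getLength(); i++)
        {
            Node node = nl.item(i);
            if (node instanceof Comment)
            {
                forDeletion.add(node);
            }
        }
        for (Node node : forDeletion)
        {
            root.removeChild(node);
        }
        return countComments(root);
    }

    public static void main(String[] args)
    {
        // prints "inside loop, comments left: 1"
        System.out.println("inside loop, comments left: " + removeInsideLoop(parseRoot()));
        // prints "after loop, comments left: 0"
        System.out.println("after loop, comments left: " + removeAfterLoop(parseRoot()));
    }
}
```

With removal inside the loop, the second comment moves to index 0 once the 
first one is removed, while the loop counter has already advanced to 1, so that 
comment is never inspected.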

> XMP parsing fails if XMP contain comments
> -----------------------------------------
>
>                 Key: PDFBOX-3471
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3471
>             Project: PDFBox
>          Issue Type: Bug
>          Components: XmpBox
>    Affects Versions: 2.0.2
>            Reporter: Petras
>         Attachments: PDFBOX-3471_XmpParsingIgnoringComments.patch
>
>
> DomXmpParser fails on the following valid XMP:
> {code:xml}
> <?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
> <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.1.0-jc003">
>     <!-- PDF/A standard version (1 or 2) and conformance level (A, B or U) -->
>     <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
>         <rdf:Description rdf:about = ""
>                          xmlns:pdfaid = "http://www.aiim.org/pdfa/ns/id/">
>             <pdfaid:part>1</pdfaid:part>
>             <pdfaid:conformance>B</pdfaid:conformance>
>         </rdf:Description>
>     </rdf:RDF>
> </x:xmpmeta>
> <?xpacket end="w"?>
> {code}
> DomXmpParser finds the comment node and fails:
> {code}
> org.apache.xmpbox.xml.XmpParsingException: More than one element found in x:xmpmeta
>       at org.apache.xmpbox.xml.DomXmpParser.findDescriptionsParent(DomXmpParser.java:750)
>       at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:183)
>       at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:111)
> ...
> {code}


