OK, a new capability has been added to spark-xml-utils (1.3.0) to address this request.

Essentially, you can now specify 'processor' features through a new getInstance function. The list of features that can be set is here: http://www.saxonica.com/html/documentation/javadoc/net/sf/saxon/lib/FeatureKeys.html. Since we are leveraging the s9api Processor under the covers, only the features relevant to it make sense to use.

To address your request of completely ignoring the Doctype declaration in the XML, you would need to do the following:

import java.util.HashMap;
import net.sf.saxon.lib.FeatureKeys;

HashMap<String, Object> featureMap = new HashMap<String, Object>();
featureMap.put(FeatureKeys.ENTITY_RESOLVER_CLASS, "com.somepackage.IgnoreDoctype");

// The first parameter is the xpath expression
// The second parameter is the hashmap for the namespace mappings (in this case there are none)
// The third parameter is the hashmap for the processor features
XPathProcessor proc = XPathProcessor.getInstance("/books/book", null, featureMap);

The following evaluation should now work OK:

proc.evaluateString("<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE books SYSTEM \"sample.dtd\"><books><book><title lang=\"en\">Some Book</title><author>Some Author</author><year>2005</year><price>29.99</price></book></books>");

You would then define the following class (and make sure it is included in your application):

package com.somepackage;

import java.io.ByteArrayInputStream;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class IgnoreDoctype implements EntityResolver {

    public InputSource resolveEntity(java.lang.String publicId, java.lang.String systemId)
            throws SAXException, java.io.IOException {
        // Ignore everything and hand the parser an empty document in place of the DTD
        return new InputSource(new ByteArrayInputStream("<?xml version='1.0' encoding='UTF-8'?>".getBytes()));
    }
}

Lastly, you will need to include the saxon9he jar file (the open source version). This approach works for XPath, XQuery, and XSLT.

Hope this helps.
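For anyone who wants to see the EntityResolver trick in isolation, here is a minimal, self-contained sketch using only the JDK's built-in SAX parser (no Saxon or Spark required). The class name IgnoreDoctypeDemo and the file name missing.dtd are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class IgnoreDoctypeDemo {

    // Parses the XML with a resolver that swallows the external DTD request.
    // Returns true when the parse succeeds.
    public static boolean parseIgnoringDoctype(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Hand the parser an empty stream instead of letting it
                // go fetch the DTD named in the DOCTYPE declaration
                return new InputSource(new ByteArrayInputStream(new byte[0]));
            }
        };
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return true;
    }

    public static void main(String[] args) throws Exception {
        // The DOCTYPE points at a DTD that does not exist on disk
        String xml = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE books SYSTEM \"missing.dtd\">"
                + "<books><book><title>Some Book</title></book></books>";
        System.out.println(parseIgnoringDoctype(xml) ? "parsed ok" : "failed");
    }
}
```

Without the resolveEntity override, the default parser attempts to load the external DTD even with validation off, which is exactly the "No such file or directory" failure shown in the stack trace further down this thread.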
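For the other option mentioned in this thread (resolving 'local' copies of the DTDs instead of ignoring them), a resolver along these lines could work. This is a sketch, not part of spark-xml-utils: the class name LocalDtdResolver and the system property dtd.local.dir are assumptions. Note that since ENTITY_RESOLVER_CLASS is given a class name, the resolver is assumed to need a no-arg constructor, so the directory is configured externally here:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: looks for a local copy of the requested DTD and
// falls back to an empty DTD when no copy is found.
public class LocalDtdResolver implements EntityResolver {

    // Directory searched for local DTD copies; read from a system property
    // so the class keeps a no-arg constructor
    private final Path localDir =
            Paths.get(System.getProperty("dtd.local.dir", "/tmp/dtds"));

    public InputSource resolveEntity(String publicId, String systemId)
            throws IOException {
        // Keep only the file name from the system id (which may be a URL)
        String name = systemId.substring(systemId.lastIndexOf('/') + 1);
        Path local = localDir.resolve(name);
        if (Files.exists(local)) {
            // Serve the cached local copy of the DTD
            return new InputSource(Files.newInputStream(local));
        }
        // No local copy: behave like IgnoreDoctype and return an empty DTD
        return new InputSource(new ByteArrayInputStream(new byte[0]));
    }
}
```

On a Spark cluster the local directory would have to exist on each worker (or be populated from somewhere like S3 beforehand), which is the caching idea described in the earlier message below.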
When I get a chance, I will update the spark-xml-utils github site with details on the new getInstance function and some more information on the various features.

Darin.

________________________________
From: Darin McBeath <ddmcbe...@yahoo.com.INVALID>
To: "user@spark.apache.org" <user@spark.apache.org>
Sent: Tuesday, December 1, 2015 11:51 AM
Subject: Re: Turning off DTD Validation using XML Utils package - Spark

The problem isn't really with DTD validation (by default, validation is disabled). The underlying problem is that the DTD can't be found (as indicated in your stack trace below). The underlying parser will try to retrieve the DTD (regardless of validation) because things such as entities could be expressed in the DTD.

I will explore providing access to some of the underlying 'processor' configurations. For example, you could provide your own EntityResolver class that could either completely ignore the Doctype declaration (return a 'dummy' DTD that is completely empty), or find 'local' versions (on the workers, or in S3 and then cached locally for performance).

I will post an update when the code has been adjusted.

Darin.

----- Original Message -----
From: Shivalik <shivalik.malho...@outlook.com>
To: user@spark.apache.org
Sent: Tuesday, December 1, 2015 8:15 AM
Subject: Turning off DTD Validation using XML Utils package - Spark

Hi Team,

I've been using the XML Utils library (http://spark-packages.org/package/elsevierlabs-os/spark-xml-utils) to parse XML using XPath in a Spark job. One problem I am facing is with DTDs. My XML file has a doctype tag included in it. I want to turn off DTD validation using this library, since I don't have access to the DTD file. Has someone faced this problem before? Please help.
The exception I am getting is as below:

stage 0.0 (TID 0, localhost): com.elsevier.spark_xml_utils.xpath.XPathException: I/O error reported by XML parser processing null: <path>/filename.dtd (No such file or directory)
    at com.elsevier.spark_xml_utils.xpath.XPathProcessor.evaluate(XPathProcessor.java:301)
    at com.elsevier.spark_xml_utils.xpath.XPathProcessor.evaluateString(XPathProcessor.java:219)
    at com.thomsonreuters.xmlutils.XMLParser.lambda$0(XMLParser.java:31)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Turning-off-DTD-Validation-using-XML-Utils-package-Spark-tp25534.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org