[HippoCMS-dev] Patches for HippoMultiValueXMLPropertyExtractor and HippoSimpleXmlExtractor

Nico Tromp Wed, 08 Oct 2008 14:44:23 -0700

Hi all,

as part of the project I'm currently working on, we needed extractors that
only act on a document depending on the namespace of that document. (See the
earlier posts
http://www.nabble.com/Setting-properties-per-document-namespace-td19380940.html
or http://www.nabble.com/Mixed-content-and-extractors-td19319614.html.) I
finally made some changes to the Hippo extractors so that they can take the
namespace of the document into account. If you specify a so-called target
namespace, the properties are only set if the root element of the document
has that same namespace. If you don't specify a target namespace, the
properties are set on every document, as before. The target namespace is set
as an attribute on the 'configuration' element within the 'extractor'
element. I think it should have been an attribute of the 'extractor' element
itself, but that would have meant a lot more changes and I just wanted it to
work.

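To make the rule concrete, here is a minimal sketch in plain Java. The class
and method names (TargetNamespaceRule, shouldExtract) are made up for
illustration and do not appear in the patch; the sketch only mirrors the
decision described above and does not depend on Slide or the Hippo classes.

```java
// Hypothetical helper, not part of the patch: it only mirrors the decision rule described above.
final class TargetNamespaceRule {

    /**
     * @param targetNamespace   value of the 'targetNamespace' attribute, or null when it is not configured
     * @param documentNamespace namespace URI of the document's root element (may be null or empty)
     * @return true when properties should be extracted for the document
     */
    static boolean shouldExtract(String targetNamespace, String documentNamespace) {
        // No target namespace configured: keep the original behaviour and extract for every document.
        if (targetNamespace == null) {
            return true;
        }
        // Otherwise extract only on an exact match of the namespace URI.
        return targetNamespace.equals(documentNamespace);
    }

    public static void main(String[] args) {
        String demo = "http://www.hippocms.org/demo";
        System.out.println(shouldExtract(demo, demo));
        System.out.println(shouldExtract(demo, "http://example.org/other"));
        System.out.println(shouldExtract(null, "http://example.org/other"));
    }
}
```

Note that the comparison is an exact string match on the namespace URI, so a
trailing slash or a difference in case counts as a different namespace.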

If you like the idea, please feel free to use the code. And maybe the Hippo
guys can apply the patches to their code so everybody can benefit from them.

Please see the attached diff files for the patches.


Have fun

Nico Tromp

Index: HippoMultiValueXMLPropertyExtractor.java
===================================================================
--- HippoMultiValueXMLPropertyExtractor.java	(revision 13329)
+++ HippoMultiValueXMLPropertyExtractor.java	(working copy)
@@ -22,47 +22,80 @@
 import java.util.List;
 import java.util.Map;
 
+import org.apache.slide.common.Domain;
 import org.apache.slide.extractor.ExtractorException;
+import org.apache.slide.util.conf.Configuration;
+import org.apache.slide.util.conf.ConfigurationException;
 import org.jdom.Document;
 import org.jdom.JDOMException;
 import org.jdom.input.SAXBuilder;
 import org.jdom.xpath.XPath;
 
-
 /**
- * HippoMultiValueXMLPropertyExtractor
+ * HippoMultiValueXMLPropertyExtractor. The extractor is capable of extracting the properties based on the document's
+ * namespace. If the namespace of a document equals that of the extractor (the target namespace) the properties will be
+ * extracted. If the document's namespace does not match that of the extractor the properties will not be extracted. If no
+ * target namespace is specified the properties will be extracted for every document (the original mode of operation). The
+ * target namespace can be set as an attribute of the <code>configuration</code> element within an <code>extractor</code>
+ * element. <br>
+ * In the example below the property <code>demo_name</code> will only be extracted if the namespace of a document equals
+ * http://www.hippocms.org/demo.
+ * 
+ * <pre>
+ * &lt;extractor classname=&quot;nl.hippo.slide.extractor.HippoMultiValueXMLPropertyExtractor&quot; uri=&quot;/files/default.preview/content&quot; content-type=&quot;text/xml | text/xml; charset=UTF-8 | application/xml&quot;&gt; 
+ *   &lt;configuration targetNamespace=&quot;http://www.hippocms.org/demo&quot;&gt;
+ *     &lt;instruction property=&quot;demo_name&quot; namespace=&quot;http://hippo.nl/cms/1.0&quot; xpath=&quot;//d:demoname/text()&quot;/&gt;
+ *   &lt;/configuration&gt;
+ * &lt;/extractor&gt;
+ * </pre>
+ * 
+ * <br>
+ * In the example below the property <code>demo_name</code> will be extracted for every document.
+ * 
+ * <pre>
+ * &lt;extractor classname=&quot;nl.hippo.slide.extractor.HippoMultiValueXMLPropertyExtractor&quot; uri=&quot;/files/default.preview/content&quot; content-type=&quot;text/xml | text/xml; charset=UTF-8 | application/xml&quot;&gt; 
+ *   &lt;configuration&gt;
+ *     &lt;instruction property=&quot;demo_name&quot; namespace=&quot;http://hippo.nl/cms/1.0&quot; xpath=&quot;//d:demoname/text()&quot;/&gt;
+ *   &lt;/configuration&gt;
+ * &lt;/extractor&gt;
+ * </pre>
+ * 
+ * @author Nico Tromp
  */
 public class HippoMultiValueXMLPropertyExtractor extends MultiValueXMLPropertyExtractor {
 
     private SAXBuilder m_builder;
-    
+    private Object targetNamespace;
+
     public HippoMultiValueXMLPropertyExtractor(String uri, String contentType, String namespace) {
         super(uri, contentType, namespace);
         m_builder = new SAXBuilder();
 
         m_builder.setExpandEntities(false);
         m_builder.setValidation(false);
-        
+
         // disable external dtd's
         m_builder.setFeature("http://xml.org/sax/features/validation", false);
         m_builder.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
-        
+
         // disable enitities including
         m_builder.setFeature("http://xml.org/sax/features/external-general-entities", false);
         m_builder.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
     }
-    
+
     public Map extract(InputStream content) throws ExtractorException {
         Map properties = new HashMap();
         try {
             Document document = m_builder.build(content);
-            for (Iterator i = instructions.iterator(); i.hasNext();) {
-                Instruction instruction = (Instruction) i.next();
-                XPath xPath = instruction.getxPath();
-                List nodeList = xPath.selectNodes(document);
-                Object propertyValue = filter(nodeList, instruction);
-                if (propertyValue != null) {
-                    properties.put(instruction.getPropertyName(), propertyValue);
+            if (extractProperties(document)) {
+                for (Iterator i = instructions.iterator(); i.hasNext();) {
+                    Instruction instruction = (Instruction) i.next();
+                    XPath xPath = instruction.getxPath();
+                    List nodeList = xPath.selectNodes(document);
+                    Object propertyValue = filter(nodeList, instruction);
+                    if (propertyValue != null) {
+                        properties.put(instruction.getPropertyName(), propertyValue);
+                    }
                 }
             }
         } catch (IOException e) {
@@ -72,4 +105,27 @@
         }
         return properties;
     }
+
+    /**
+     * Determines if the extractors must be run for a document.
+     * 
+     * @param document
+     *            the document for which must be determined if the extractors must be run.
+     * @return <code>true</code> if the extractors must be run for this document, <code>false</code> otherwise.
+     */
+    private boolean extractProperties(Document document) {
+        String documentNamespace = document.getRootElement().getNamespaceURI();
+        Domain.debug("Document namespace [" + documentNamespace + "]");
+        return targetNamespace == null || targetNamespace.equals(documentNamespace);
+    }
+
+    public void configure(Configuration config) throws ConfigurationException {
+        super.configure(config);
+        try {
+            targetNamespace = config.getAttribute("targetNamespace");
+            Domain.debug("Targetnamespace = [" + targetNamespace + "]");
+        } catch (Throwable e) {
+            Domain.debug("Attribute 'targetNamespace' not set, properties will be extracted from all documents.");
+        }
+    }
 }

Index: HippoSimpleXmlExtractor.java
===================================================================
--- HippoSimpleXmlExtractor.java	(revision 13329)
+++ HippoSimpleXmlExtractor.java	(working copy)
@@ -29,6 +29,8 @@
 import org.apache.slide.common.Domain;
 import org.apache.slide.extractor.ExtractorException;
 import org.apache.slide.extractor.SimpleXmlExtractor;
+import org.apache.slide.util.conf.Configuration;
+import org.apache.slide.util.conf.ConfigurationException;
 import org.apache.slide.util.logger.Logger;
 import org.apache.xpath.XPathAPI;
 import org.apache.xpath.objects.XObject;
@@ -37,11 +39,40 @@
 import org.xml.sax.SAXException;
 
 /**
- * A rewrite/improvement of the SimpleXmlExtractor
+ * A rewrite/improvement of the SimpleXmlExtractor. The extractor is capable of extracting the properties based on the
+ * document's namespace. If the namespace of a document equals that of the extractor (the target namespace) the properties
+ * will be extracted. If the document's namespace does not match that of the extractor the properties will not be extracted.
+ * If no target namespace is specified the properties will be extracted for every document (the original mode of
+ * operation). The target namespace can be set as an attribute of the <code>configuration</code> element within an
+ * <code>extractor</code> element. <br>
+ * In the example below the property <code>demo_name</code> will only be extracted if the namespace of a document equals
+ * http://www.hippocms.org/demo.
+ * 
+ * <pre>
+ * &lt;extractor classname=&quot;nl.hippo.slide.extractor.HippoSimpleXmlExtractor&quot; uri=&quot;/files/default.preview/content&quot; content-type=&quot;text/xml | text/xml; charset=UTF-8 | application/xml&quot;&gt; 
+ *   &lt;configuration targetNamespace=&quot;http://www.hippocms.org/demo&quot;&gt;
+ *     &lt;instruction property=&quot;demo_name&quot; namespace=&quot;http://hippo.nl/cms/1.0&quot; xpath=&quot;//d:demoname&quot;/&gt;
+ *   &lt;/configuration&gt;
+ * &lt;/extractor&gt;
+ * </pre>
+ * 
+ * <br>
+ * In the example below the property <code>demo_name</code> will be extracted for every document.
+ * 
+ * <pre>
+ * &lt;extractor classname=&quot;nl.hippo.slide.extractor.HippoSimpleXmlExtractor&quot; uri=&quot;/files/default.preview/content&quot; content-type=&quot;text/xml | text/xml; charset=UTF-8 | application/xml&quot;&gt; 
+ *   &lt;configuration&gt;
+ *     &lt;instruction property=&quot;demo_name&quot; namespace=&quot;http://hippo.nl/cms/1.0&quot; xpath=&quot;//d:demoname&quot;/&gt;
+ *   &lt;/configuration&gt;
+ * &lt;/extractor&gt;
+ * </pre>
+ * 
+ * @author Nico Tromp
  */
 public class HippoSimpleXmlExtractor extends SimpleXmlExtractor {
 
     private static final String LOG_CHANNEL = HippoSimpleXmlExtractor.class.getName();
+    private Object targetNamespace;
 
     public HippoSimpleXmlExtractor(String uri, String contentType, String namespace) {
         super(uri, contentType, namespace);
@@ -78,19 +109,21 @@
                 dbf.setAttribute("http://apache.org/xml/features/dom/defer-node-expansion", Boolean.FALSE);
             } catch (IllegalArgumentException e) {
                 if (Domain.isEnabled(Logger.INFO)) {
-                    Domain.log("HippoSimpleXmlExtractor: Unable to disable external entity resolving: " + e.getMessage(), LOG_CHANNEL, Logger.INFO);
+                    Domain.log("HippoSimpleXmlExtractor: Unable to disable defer node expansion: " + e.getMessage(), LOG_CHANNEL, Logger.INFO);
                 }
             }
 
             DocumentBuilder db = dbf.newDocumentBuilder();
             Document d = db.parse(content);
-            for (Iterator i = instructions.iterator(); i.hasNext();) {
-                Instruction instruction = (Instruction) i.next();
-                XObject x = XPathAPI.eval(d, "string(" + instruction.getxPath().getXPath() + ")");
-                if (x instanceof XString) {
-                    properties.put(instruction.getPropertyName(), ((XString) x).toString());
-                } else {
-                    properties.put(instruction.getPropertyName(), "");
+            if (extractProperties(d)) {
+                for (Iterator i = instructions.iterator(); i.hasNext();) {
+                    Instruction instruction = (Instruction) i.next();
+                    XObject x = XPathAPI.eval(d, "string(" + instruction.getxPath().getXPath() + ")");
+                    if (x instanceof XString) {
+                        properties.put(instruction.getPropertyName(), ((XString) x).toString());
+                    } else {
+                        properties.put(instruction.getPropertyName(), "");
+                    }
                 }
             }
         } catch (IOException e) {
@@ -100,8 +133,31 @@
         } catch (ParserConfigurationException e) {
             throw new ExtractorException("HippoSimpleXmlExtractor: ParserConfigurationException while extracting properties from content: " + e.getMessage());
         } catch (TransformerException e) {
-            throw new ExtractorException("HippoSimpleXmlExtractor: TransformerException while extracting properties fromk content: " + e.getMessage());
+            throw new ExtractorException("HippoSimpleXmlExtractor: TransformerException while extracting properties from content: " + e.getMessage());
         }
         return properties;
     }
+
+    /**
+     * Determines if the extractors must be run for a document.
+     * 
+     * @param document
+     *            the document for which must be determined if the extractors must be run.
+     * @return <code>true</code> if the extractors must be run for this document, <code>false</code> otherwise.
+     */
+    private boolean extractProperties(Document document) {
+        String documentNamespace = document.getDocumentElement().getNamespaceURI();
+        Domain.debug("Document namespace [" + documentNamespace + "]");
+        return targetNamespace == null || targetNamespace.equals(documentNamespace);
+    }
+
+    public void configure(Configuration config) throws ConfigurationException {
+        super.configure(config);
+        try {
+            targetNamespace = config.getAttribute("targetNamespace");
+            Domain.debug("Targetnamespace = [" + targetNamespace + "]");
+        } catch (Throwable e) {
+            Domain.debug("Attribute 'targetNamespace' not set, properties will be extracted from all documents.");
+        }
+    }
 }

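One thing worth checking for HippoSimpleXmlExtractor: the namespace check
relies on getNamespaceURI() of the DOM root element. With the JDK's
DocumentBuilderFactory this returns null unless the factory is namespace
aware, and factories are not namespace aware by default. The part of
extract() that sets up the factory is not shown in the diff, so whether
setNamespaceAware(true) is already called there is an assumption to verify.
The sketch below uses a made-up RootNamespaceProbe class (not part of the
patch) and only standard JAXP calls to show the difference.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

// Hypothetical probe class for illustration only; it is not part of the patch.
final class RootNamespaceProbe {

    /** Parses the XML and returns the namespace URI reported for the root element. */
    static String rootNamespace(String xml, boolean namespaceAware) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // Without this flag the DOM nodes carry no namespace information.
            dbf.setNamespaceAware(namespaceAware);
            Document d = dbf.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return d.getDocumentElement().getNamespaceURI();
        } catch (Exception e) {
            // Keep the sketch simple: wrap parser and I/O problems in an unchecked exception.
            throw new IllegalStateException("Unable to parse sample document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<d:doc xmlns:d=\"http://www.hippocms.org/demo\"/>";
        System.out.println("namespace aware:     " + rootNamespace(xml, true));
        System.out.println("not namespace aware: " + rootNamespace(xml, false));
    }
}
```

If the factory in extract() turns out not to be namespace aware, a configured
target namespace would never match and no properties would be extracted for
any document.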
********************************************
Hippocms-dev: Hippo CMS development public mailinglist

Searchable archives can be found at:
MarkMail: http://hippocms-dev.markmail.org
Nabble: http://www.nabble.com/Hippo-CMS-f26633.html
