[ https://issues.apache.org/jira/browse/NUTCH-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298420#comment-14298420 ]
Hudson commented on NUTCH-1918:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #2956 (See [https://builds.apache.org/job/Nutch-trunk/2956/])
NUTCH-1918 TikaParser specifies a default namespace when generating DOM (jnioche: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1655966)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMBuilder.java
* /nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java

> TikaParser specifies a default namespace when generating DOM
> ------------------------------------------------------------
>
>                 Key: NUTCH-1918
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1918
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>            Reporter: Julien Nioche
>             Fix For: 1.10
>
>         Attachments: NUTCH-1918.patch
>
>
> The DOM generated by parse-tika differs from the one generated by parse-html.
> Ideally we should be able to use either parser with the same XPath
> expressions.
> This is related to [NUTCH-1592], but this time, instead of being a matter of
> uppercase element names, the problem comes from the namespace used.
> This issue has been investigated and fixed in storm-crawler
> [https://github.com/DigitalPebble/storm-crawler/pull/58].
> Here is what Guillaume explained there:
> bq. When parsing the content, Tika creates a properly formatted XHTML
> document: all elements are created within the namespace XHTML.
> bq. However in XPath 1.0, there's no concept of default namespace so XPath
> expressions such as //BODY doesn't match anything. To make this work we
> should use //ns1:BODY and define a NamespaceContext which associates ns1 with
> "http://www.w3.org/1999/xhtml"
> bq. To keep the XPathExpressions simpler, I modified the DOMBuilder which is
> our SaxHandler used to convert the SAX Events into a DOM tree to ignore a
> "default name space" and the ParserBolt initializes it with the XHTML
> namespace. This way //BODY matches.
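
The XPath behaviour described in the quoted explanation can be reproduced with the JDK's javax.xml APIs alone. The following is a minimal sketch, not code from the Nutch or storm-crawler patches: the class name XPathNamespaceDemo, the helpers buildDoc and count, and the two-element document are illustrative assumptions. The document built without a namespace only approximates what a DOM builder that ignores the XHTML namespace might produce.

```java
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical demo class; none of these names come from Nutch itself.
public class XPathNamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Builds <HTML><BODY/></HTML>, either inside the XHTML namespace or in no namespace. */
    static Document buildDoc(boolean useNamespace) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().newDocument();
        String ns = useNamespace ? XHTML_NS : null;
        Element html = doc.createElementNS(ns, "HTML");
        Element body = doc.createElementNS(ns, "BODY");
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    /** Counts the nodes matched by expr, optionally binding the prefix ns1 to XHTML. */
    static int count(Document doc, String expr, boolean bindPrefix) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        if (bindPrefix) {
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "ns1".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "ns1" : null;
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    String prefix = getPrefix(uri);
                    return prefix == null
                            ? Collections.<String>emptyIterator()
                            : Collections.singletonList(prefix).iterator();
                }
            });
        }
        NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document namespaced = buildDoc(true);   // elements in the XHTML namespace
        Document plain = buildDoc(false);       // elements in no namespace

        // An unprefixed name test only selects elements in no namespace.
        System.out.println(count(namespaced, "//BODY", false));
        // A prefix bound to the XHTML namespace selects the namespaced element.
        System.out.println(count(namespaced, "//ns1:BODY", true));
        // With no namespace on the elements, the plain expression matches.
        System.out.println(count(plain, "//BODY", false));
    }
}
```

Run as written, this should print 0, 1 and 1. The first two lines show the two options available when the DOM keeps the XHTML namespace; the third shows why dropping the namespace while building the DOM lets the same unprefixed expressions work for both parsers.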