[ https://issues.apache.org/jira/browse/NUTCH-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298420#comment-14298420 ]

Hudson commented on NUTCH-1918:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #2956 (See [https://builds.apache.org/job/Nutch-trunk/2956/])
NUTCH-1918 TikaParser specifies a default namespace when generating DOM 
(jnioche: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1655966)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMBuilder.java
* /nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java


> TikaParser specifies a default namespace when generating DOM
> ------------------------------------------------------------
>
>                 Key: NUTCH-1918
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1918
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>            Reporter: Julien Nioche
>             Fix For: 1.10
>
>         Attachments: NUTCH-1918.patch
>
>
> The DOM generated by parse-tika differs from the one generated by parse-html. 
> Ideally we should be able to use either parser with the same XPath 
> expressions.
> This is related to [NUTCH-1592], but this time the problem is not letter 
> case but the namespace used. 
> This issue has been investigated and fixed in storm-crawler 
> [https://github.com/DigitalPebble/storm-crawler/pull/58].
> Here is what Guillaume explained there:
> bq. When parsing the content, Tika creates a properly formatted XHTML 
> document: all elements are created within the namespace XHTML.
> bq. However, in XPath 1.0 there is no concept of a default namespace, so XPath 
> expressions such as //BODY don't match anything. To make this work we 
> should use //ns1:BODY and define a NamespaceContext which associates ns1 with 
> "http://www.w3.org/1999/xhtml".
> bq. To keep the XPath expressions simpler, I modified the DOMBuilder, the 
> SAX handler we use to convert SAX events into a DOM tree, so that it ignores 
> a "default namespace"; the ParserBolt initializes it with the XHTML 
> namespace. This way //BODY matches.
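
For illustration, here is a minimal Java sketch of the XPath behaviour
described in the quote above. It is not code from Nutch, Tika or
storm-crawler: the class name XhtmlXPathDemo, the SAMPLE document and the
count helper are made up for this example, and it relies only on the JDK's
javax.xml APIs. It shows that an unprefixed //BODY finds nothing in a DOM
whose elements live in the XHTML namespace, while //ns1:BODY matches once a
NamespaceContext binds ns1 to that namespace.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of Nutch or storm-crawler.
class XhtmlXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hand-written sample standing in for a document whose elements are all
    // in the XHTML namespace (upper-case names to match the quoted //BODY).
    static final String SAMPLE =
        "<HTML xmlns=\"" + XHTML_NS + "\"><BODY><P>hello</P></BODY></HTML>";

    // Parse with namespace awareness so elements keep their namespace URI.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xml)));
    }

    // Count the nodes matched by expr; optionally bind ns1 to XHTML first.
    static int count(Document doc, String expr, boolean bindNs1) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        if (bindNs1) {
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "ns1".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                @Override
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "ns1" : null;
                }
                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML_NS.equals(uri)
                        ? Collections.singletonList("ns1").iterator()
                        : Collections.<String>emptyIterator();
                }
            });
        }
        NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        // Unprefixed names only match elements in no namespace.
        System.out.println("//BODY     -> " + count(doc, "//BODY", false));
        // With ns1 bound to the XHTML namespace the element is found.
        System.out.println("//ns1:BODY -> " + count(doc, "//ns1:BODY", true));
    }
}
```

The fix described in the quote takes the other route: rather than asking
every XPath user to bind a prefix, the DOM is built without the XHTML
namespace so that plain //BODY works.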



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
