Greg Padiasek created NUTCH-1749:
------------------------------------

             Summary: Title duplicated in document body
                 Key: NUTCH-1749
                 URL: https://issues.apache.org/jira/browse/NUTCH-1749
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.7
            Reporter: Greg Padiasek


The HTML parser plugin inserts the document title into the document content.
Since the title alone can be retrieved via DOMContentUtils.getTitle() and the
content is retrieved via DOMContentUtils.getText(), there is no need to
duplicate the title in the content. When the title is included in the content,
it becomes difficult or impossible to extract the document body without it.
This matters when a user wants to index or display the body and the title
separately.

Attached is a patch that prevents the HTML parser plugin from including the
title in the document content.
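
To illustrate the behaviour described above, here is a minimal, self-contained
sketch. It is not the attached patch and does not use the Nutch classes: the
class TitleFreeText and the methods collectText() and extract() are
hypothetical names, and a plain JDK XML parser stands in for the plugin's HTML
parser. It only shows how skipping the <title> subtree during the text walk
keeps the title out of the extracted body.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TitleFreeText {

    // Hypothetical helper (not a Nutch API): appends the text of every text
    // node under `node` to `sb`, optionally skipping any <title> subtree.
    static void collectText(Node node, StringBuilder sb, boolean skipTitle) {
        if (skipTitle
                && node.getNodeType() == Node.ELEMENT_NODE
                && "title".equalsIgnoreCase(node.getNodeName())) {
            return; // do not descend into <title>
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), sb, skipTitle);
        }
    }

    // Parses a well-formed XHTML string and returns its text content.
    static String extract(String xhtml, boolean skipTitle) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        StringBuilder sb = new StringBuilder();
        collectText(doc.getDocumentElement(), sb, skipTitle);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title>My Title</title></head>"
                + "<body><p>Body text.</p></body></html>";
        System.out.println(extract(page, false)); // My Title Body text.
        System.out.println(extract(page, true));  // Body text.
    }
}
```

With skipTitle set to false the title is mixed into the content, which is the
situation this issue reports; with it set to true the body can be indexed or
displayed on its own.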



--
This message was sent by Atlassian JIRA
(v6.2#6252)
