[ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elisabeth Adler updated NUTCH-585:
----------------------------------

    Attachment: blacklist_whitelist_plugin.patch

Based on the suggestions/code above, I have created a plugin to blacklist or 
whitelist html elements (blacklist_whitelist_plugin.patch). This was based on 
the need for not indexing header/footer/navigation, so the user gets really 
only relevant results, e.g. even if the term shows up in the navigation.

The elements to be parsed (or not) can be defined by using CSS-like selectors. 
A new field called "strippedContent" is available in the index which can be 
used for searching. Links are still crawled and parsed from the "content" 
field, allowing all pages to be parsed. The full documentation is in the 
README.txt in the patch.

> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-585
>                 URL: https://issues.apache.org/jira/browse/NUTCH-585
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>         Environment: All operating systems
>            Reporter: Andrea Spinelli
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: blacklist_whitelist_plugin.patch, 
> nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
>
> We are using nutch to index our own web sites; we would like not to index 
> certain parts of our pages, because we know they are not relevant (for 
> instance, there are several links to change the background color) and 
> generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML 
> comments, like
> <!-- START-IGNORE -->
> ... ignored part ...
> <!-- STOP-IGNORE -->
> We feel this might be useful to someone else, maybe factorizing the comment 
> strings as constants in the configuration files (say parser.html.ignore.start 
> and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet.  Looking forward for any 
> expression of  interest - or for an explanation why waht we are doing is 
> plain wrong!

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to