[ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210498#comment-13210498
 ] 

Lewis John McGibbney commented on NUTCH-585:
--------------------------------------------

I like this contribution, Elisabeth. Is there any way it could be updated to 
trunk with the following suggestions?
1) Please rename the package names to org.apache.nutch.blah.blah
2) In your ivy.xml please change the ivy-configuration.xml to
{code}
  <configurations>
      <include file="../../..//ivy/ivy-configurations.xml"/>
  </configurations>
{code}
This is Eclipse-specific.
3) Would it be possible to change the CHANGES.txt to package.html and store it 
in the lowest-level folder within the java directory?
4) It would really put the cherry on top if we could get a test case; 
this would be a big +1.
5) I think the name is maybe a bit long, but I am fine keeping it if you 
think it is appropriate, as it is your patch after all.

Thank you for the contribution.
                
> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-585
>                 URL: https://issues.apache.org/jira/browse/NUTCH-585
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>         Environment: All operating systems
>            Reporter: Andrea Spinelli
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: blacklist_whitelist_plugin.patch, 
> nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
>
> We are using nutch to index our own web sites; we would like not to index 
> certain parts of our pages, because we know they are not relevant (for 
> instance, there are several links to change the background color) and 
> generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML 
> comments, like
> <!-- START-IGNORE -->
> ... ignored part ...
> <!-- STOP-IGNORE -->
> We feel this might be useful to someone else, perhaps with the comment 
> strings factored out as constants in the configuration files (say 
> parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. Looking forward to any 
> expression of interest, or to an explanation of why what we are doing is 
> plain wrong!
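
For illustration only, the comment-delimited ignore behaviour described above
could be sketched roughly as follows. This is a minimal sketch and not the
code from any of the attached patches: the class IgnoreBlocks and the method
stripIgnored are hypothetical names, the markers are passed in as plain
arguments (they could be read from the suggested parser.html.ignore.start and
parser.html.ignore.stop properties), and it assumes the markers appear as
ordinary HTML comments in the raw page text. A real plugin would more likely
work on the parsed DOM than on a string.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: removes everything between a
// start-marker comment and the next stop-marker comment, markers included.
final class IgnoreBlocks {

    static String stripIgnored(String html, String startMarker, String stopMarker) {
        // Pattern.quote keeps marker text literal even if it contains regex
        // metacharacters; ".*?" is reluctant so each start marker pairs with
        // the nearest following stop marker; DOTALL lets "." span newlines.
        Pattern block = Pattern.compile(
            "<!--\\s*" + Pattern.quote(startMarker) + "\\s*-->"
                + ".*?"
                + "<!--\\s*" + Pattern.quote(stopMarker) + "\\s*-->",
            Pattern.DOTALL);
        // A start marker with no matching stop marker is left untouched.
        return block.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>keep</p>\n"
            + "<!-- START-IGNORE -->\n<a href=\"#\">change colour</a>\n<!-- STOP-IGNORE -->\n"
            + "<p>also keep</p>";
        System.out.println(stripIgnored(page, "START-IGNORE", "STOP-IGNORE"));
    }
}
```

Matching on the raw text keeps the sketch short, but it would also strip
markers that happen to appear inside scripts or attribute values, which is one
reason a DOM-based filter may be preferable.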

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
