Thank you for the hint.

I looked into this further and found the following:

http://mail-archives.apache.org/mod_mbox/nutch-user/201102.mbox/%[email protected]%3E

Can somebody tell me exactly which file I have to add the following filter to?

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {


        String content = parse.getText();
        System.out.println("Content : "+content);
        System.out.println("Contains : "+content.contains("nutch"));
        if(content.contains("nutch")){
            System.out.println("Nutch keyword found! Hence not indexing the doc :)");
            return null;
        }

        return doc;
    }

As an example, I simply want to exclude documents containing the word
"nutch".
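
To make the intent concrete, here is a minimal sketch of just the keyword
check, pulled out into a plain Java helper so it can be tried without a Nutch
build. The class name KeywordExclusion and the method shouldExclude are my own
hypothetical names and are not part of Nutch. Unlike the snippet above, this
version ignores case and tolerates a null text, which is an assumption about
what I would want, not something taken from the linked mail.

```java
import java.util.Locale;

// Hypothetical helper (not part of Nutch): decides whether a document's
// extracted text contains a keyword and should therefore be skipped.
public final class KeywordExclusion {

    private KeywordExclusion() {
        // static utility, no instances
    }

    // Returns true when content contains keyword, ignoring case.
    // A null content, or a null/empty keyword, never excludes a document.
    public static boolean shouldExclude(String content, String keyword) {
        if (content == null || keyword == null || keyword.isEmpty()) {
            return false;
        }
        String haystack = content.toLowerCase(Locale.ROOT);
        String needle = keyword.toLowerCase(Locale.ROOT);
        return haystack.contains(needle);
    }

    public static void main(String[] args) {
        System.out.println(shouldExclude("Apache Nutch is a crawler", "nutch"));
        System.out.println(shouldExclude("Some unrelated page", "nutch"));
    }
}
```

Inside the filter method, the idea would be to return null when
KeywordExclusion.shouldExclude(parse.getText(), "nutch") is true and to
return doc otherwise.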

I have also read
http://www.attuneinfocom.com/blog/how-to-build-and-deploy-plugin-with-apache-nutch.html
and http://florianhartl.com/nutch-plugin-tutorial.html

Thank you!

Domi
