Eyeris Rodriguez Rueda created NUTCH-2387:
---------------------------------------------

             Summary: Nutch should not index document with "noindex" meta
                 Key: NUTCH-2387
                 URL: https://issues.apache.org/jira/browse/NUTCH-2387
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.13
         Environment: Linux Mint 18
            Reporter: Eyeris Rodriguez Rueda
             Fix For: 1.14


I'm using Nutch 1.12 in local mode with Solr 4.10.3.
For some reason, I have found that Nutch indexes documents that have a "noindex"
robots meta tag.
I ran the crawl script for a complete cycle:
bin/crawl -i urls/ crawl/ -2
with this URL:
https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/ 
After repeated testing the problem persists: approximately 200 documents with
this robots meta tag are incorrectly indexed.
I have read the configure method in the IndexerMapReduce.java class; it has a
line for that property, but for some reason it does not work as expected:
this.deleteRobotsNoIndex = job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX, false);   (line 97)
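
One thing worth checking is the default value in that call. The following is a minimal, self-contained sketch (not Nutch code) of how a boolean flag read with a default of false behaves when the property is never set. The property name "indexer.delete.robots.noindex" is my understanding of what INDEXER_DELETE_ROBOTS_NOINDEX resolves to and should be treated as an assumption; getBoolean here is a stand-in for the Hadoop Configuration method, and shouldDelete is a hypothetical helper introduced only for illustration.

```java
import java.util.Locale;
import java.util.Properties;

public class NoIndexFlagSketch {
    // Assumed property name behind INDEXER_DELETE_ROBOTS_NOINDEX (not confirmed in this report).
    static final String INDEXER_DELETE_ROBOTS_NOINDEX = "indexer.delete.robots.noindex";

    // Stand-in for Configuration.getBoolean(name, defaultValue):
    // returns the default when the property is absent.
    static boolean getBoolean(Properties conf, String name, boolean defaultValue) {
        String value = conf.getProperty(name);
        return value == null ? defaultValue : Boolean.parseBoolean(value.trim());
    }

    // Hypothetical helper: a document is dropped only when the flag is enabled
    // AND its robots meta content contains "noindex".
    static boolean shouldDelete(Properties conf, String robotsMeta) {
        boolean deleteRobotsNoIndex = getBoolean(conf, INDEXER_DELETE_ROBOTS_NOINDEX, false);
        return deleteRobotsNoIndex
                && robotsMeta != null
                && robotsMeta.toLowerCase(Locale.ROOT).contains("noindex");
    }

    public static void main(String[] args) {
        Properties conf = new Properties();

        // Property not set: the default (false) applies, so the document is kept.
        System.out.println(shouldDelete(conf, "noindex,follow"));

        // Property explicitly enabled: the noindex document is dropped.
        conf.setProperty(INDEXER_DELETE_ROBOTS_NOINDEX, "true");
        System.out.println(shouldDelete(conf, "noindex,follow"));
    }
}
```

If the real indexer follows the same pattern, documents with a "noindex" robots meta tag would be kept unless the property is explicitly set to true in the configuration. I have not confirmed that this explains the roughly 200 documents above.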



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
