Hi Felix,

I tried to reproduce the problem. The parse-metatags plugin only duplicates the "robots" metatag: it adds it as "metatag.robots" but keeps the original "robots".
That is the case using the current master:

- with parse-metatags and metatags.names="robots" the ParseData object contains:

    Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8 metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow generator=WordPress 3.1 robots=noindex,nofollow

  metatag.robots is even added twice, but most importantly "robots" is still present

- without:

    Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8 generator=WordPress 3.1 robots=noindex,nofollow

Deleting the robots=noindex documents works as expected for both settings.

> Or is it and I should file a report/patch?

Yes, please open an issue to fix this on
  https://issues.apache.org/jira/projects/NUTCH

It could be that there is some additional condition which I didn't hit.
Can you also share the document for which it does not work?

Thanks,
Sebastian

On 5/13/19 11:34 AM, Felix von Zadow wrote:
> Hi all!
>
> So I was trying to use the option indexer.delete.robots.noindex (exclude page
> when <meta robots="noindex"> is encountered).
>
> However, the page I'm testing with is still being indexed. I have
> parse-metatags and index-metadata activated and
> indexer.delete.robots.noindex=true, metatags.names="robots" and
> index.parse.md="metatag.robots".
>
> Looking at IndexerMapReduce.java (#257) [1], the field that is being checked
> is "robots" and not "metatag.robots". It does work as expected when I change
> it to "metatag.robots":
>
> Before:
> Indexing 3/3 documents
> Deleting 0 documents
> Indexer: number of documents indexed, deleted, or skipped:
> Indexer: 3 indexed (add/update)
>
> After:
> Indexing 2/2 documents
> Deleting 0 documents
> Indexer: number of documents indexed, deleted, or skipped:
> Indexer: 1 deleted (robots=noindex)
> Indexer: 2 indexed (add/update)
>
>
> Am I missing something and this is not actually a bug but rather some
> misconfiguration on my part?
> Or is it and I should file a report/patch?
>
> Thanks!
> Felix
>
>
> [1]
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257
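
To illustrate the point discussed in this thread, here is a minimal stand-alone sketch. It assumes the indexer simply looks up one metadata key and searches its value for "noindex". The class RobotsNoIndexSketch and both of its helpers are hypothetical and use a plain Map rather than Nutch's ParseData, so this is a simplified model and not the actual code in IndexerMapReduce.java. The sample values are the ones from the ParseData dump above.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Simplified stand-alone model of the noindex check; not the actual Nutch code. */
public class RobotsNoIndexSketch {

  /** Returns true if the value stored under the given key contains "noindex". */
  static boolean hasNoIndex(Map<String, String> parseMeta, String key) {
    String value = parseMeta.get(key);
    return value != null && value.toLowerCase(Locale.ROOT).contains("noindex");
  }

  /** Hypothetical helper: parse metadata as reported with parse-metatags active. */
  static Map<String, String> sampleParseMeta() {
    Map<String, String> meta = new LinkedHashMap<>();
    meta.put("CharEncodingForConversion", "utf-8");
    meta.put("OriginalCharEncoding", "utf-8");
    meta.put("metatag.robots", "noindex,nofollow"); // copy added by parse-metatags
    meta.put("generator", "WordPress 3.1");
    meta.put("robots", "noindex,nofollow");         // original key is still present
    return meta;
  }

  public static void main(String[] args) {
    Map<String, String> meta = sampleParseMeta();

    // Both keys carry the same value, so either lookup detects noindex.
    System.out.println("robots: " + hasNoIndex(meta, "robots"));
    System.out.println("metatag.robots: " + hasNoIndex(meta, "metatag.robots"));

    // Assumed failure scenario: the original key is missing for some reason.
    meta.remove("robots");
    System.out.println("robots (key missing): " + hasNoIndex(meta, "robots"));
  }
}
```

If a document's parse metadata carried only "metatag.robots" and no "robots" key, a check on "robots" would not fire in this model, which would match the behaviour Felix reported. Whether that situation can actually occur in Nutch is exactly the open question, hence the request for the failing document.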