Hi Felix,

I tried to reproduce the problem. The parse-metatags plugin only duplicates the
"robots" metatag: it adds it as "metatag.robots" but keeps the original "robots".

That is the case using the current master:

- with parse-metatags and metatags.names="robots" the ParseData object contains:

Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
metatag.robots=noindex,nofollow metatag.robots=noindex,nofollow 
generator=WordPress 3.1
robots=noindex,nofollow

metatag.robots is even added twice but, most importantly, "robots" is still present.

- without:

Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8 
generator=WordPress 3.1
robots=noindex,nofollow


Deleting the robots=noindex documents works as expected for both settings.
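
To make the comparison concrete, here is a minimal, self-contained sketch of
such a check. It is not the code from IndexerMapReduce.java: the class
RobotsNoIndexSketch, the helper isNoIndex and the plain Map standing in for
the parse metadata are all hypothetical, and the comma-separated matching is
an assumption. The third case (only "metatag.robots" present) is also
hypothetical; I did not observe it.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class RobotsNoIndexSketch {

  // Hypothetical helper (not Nutch code): true if the value stored under
  // `key` contains the directive "noindex" in a comma-separated list.
  public static boolean isNoIndex(Map<String, String> parseMeta, String key) {
    String value = parseMeta.get(key);
    if (value == null) {
      return false;
    }
    for (String directive : value.toLowerCase(Locale.ROOT).split(",")) {
      if (directive.trim().equals("noindex")) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Metadata as seen with parse-metatags and metatags.names="robots".
    Map<String, String> withPlugin = new HashMap<>();
    withPlugin.put("robots", "noindex,nofollow");
    withPlugin.put("metatag.robots", "noindex,nofollow");

    // Metadata as seen without the plugin.
    Map<String, String> withoutPlugin = new HashMap<>();
    withoutPlugin.put("robots", "noindex,nofollow");

    // Hypothetical case: only the prefixed key is present.
    Map<String, String> onlyPrefixed = new HashMap<>();
    onlyPrefixed.put("metatag.robots", "noindex,nofollow");

    System.out.println(isNoIndex(withPlugin, "robots"));           // true
    System.out.println(isNoIndex(withoutPlugin, "robots"));        // true
    System.out.println(isNoIndex(onlyPrefixed, "robots"));         // false
    System.out.println(isNoIndex(onlyPrefixed, "metatag.robots")); // true
  }
}
```

With both dumps above the "robots" key is present, so a check on "robots"
matches in either setting. Only a document whose metadata lacks the plain
"robots" key would need a check on "metatag.robots".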

> Or is it and I should file a report/patch?

Yes, please open an issue to fix this on
    https://issues.apache.org/jira/projects/NUTCH

It could be that there is some additional condition which I didn't hit.

Can you also share the document for which it does not work?


Thanks,
Sebastian



On 5/13/19 11:34 AM, Felix von Zadow wrote:
> Hi all!
> 
> So I was trying to use the option indexer.delete.robots.noindex (exclude page 
> when <meta robots="noindex"> is encountered).
> 
> However, the page I'm testing with is still being indexed. I have 
> parse-metatags and index-metadata activated and 
> indexer.delete.robots.noindex=true, metatags.names="robots" and 
> index.parse.md="metatag.robots".
> 
> Looking at IndexerMapReduce.java (#257) [1], the field that is being checked 
> is "robots" and not "metatag.robots". It does work as expected when I change 
> it to "metatag.robots":
> 
> Before:
> Indexing 3/3 documents
> Deleting 0 documents
> Indexer: number of documents indexed, deleted, or skipped:
> Indexer:      3  indexed (add/update)
> 
> After:
> Indexing 2/2 documents
> Deleting 0 documents
> Indexer: number of documents indexed, deleted, or skipped:
> Indexer:      1  deleted (robots=noindex)
> Indexer:      2  indexed (add/update)
> 
> 
> Am I missing something and this is not actually a bug but rather some 
> misconfiguration on my part?
> Or is it and I should file a report/patch?
> 
> Thanks!
> Felix
> 
> 
> [1] 
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257
> 
> 
