[jira] [Commented] (NUTCH-2456) Redirected documents are not indexed

Sebastian Nagel (JIRA) Tue, 07 Nov 2017 07:56:17 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242247#comment-16242247
 ]


Sebastian Nagel commented on NUTCH-2456:
----------------------------------------

For every item in a redirect chain URL -> target_1 -> target_2 -> ... -> target_n,
a new CrawlDatum is created and stored in the segment. After running "updatedb",
these CrawlDatum objects are added to the CrawlDb, and an index job will get them
as input. The dbDatum should only be null if the CrawlDb isn't updated (or is
updated with -noAdditions) before indexing. Is this a possible reason?
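
To make the scenario concrete, here is a toy model of it in Java. It is only a sketch, not Nutch code: plain strings stand in for CrawlDatum records, and the class and method names (RedirectUpdateSketch, updateDb, missingDbDatum) are hypothetical. The only behavior it assumes is that "updatedb" adds newly seen URLs to the CrawlDb unless -noAdditions is given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: strings stand in for CrawlDb entries and segment CrawlDatum records.
final class RedirectUpdateSketch {

    // Hypothetical stand-in for "updatedb": merges the URLs recorded in a segment
    // into a copy of the CrawlDb. With noAdditions, URLs that are not already
    // known are skipped, so redirect targets never reach the CrawlDb.
    static Set<String> updateDb(Set<String> crawlDb, List<String> segmentUrls, boolean noAdditions) {
        Set<String> updated = new HashSet<>(crawlDb);
        if (!noAdditions) {
            updated.addAll(segmentUrls);
        }
        return updated;
    }

    // Segment URLs for which an index job would find no CrawlDb entry
    // (the "dbDatum is null" case), in segment order.
    static List<String> missingDbDatum(Set<String> crawlDb, List<String> segmentUrls) {
        List<String> missing = new ArrayList<>();
        for (String url : segmentUrls) {
            if (!crawlDb.contains(url)) {
                missing.add(url);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> crawlDb = new HashSet<>(Arrays.asList("http://example.com/url"));
        List<String> segment = Arrays.asList(
                "http://example.com/url",
                "http://example.com/target_1",
                "http://example.com/target_2");

        // Normal update: every redirect target gets a CrawlDb entry.
        System.out.println(missingDbDatum(updateDb(crawlDb, segment, false), segment));
        // Update with noAdditions: the redirect targets have no CrawlDb entry.
        System.out.println(missingDbDatum(updateDb(crawlDb, segment, true), segment));
    }
}
```

In this model the redirect targets lack a CrawlDb entry only when the update is skipped or run with noAdditions, which is the situation asked about above.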

> Redirected documents are not indexed
> ------------------------------------
>
>                 Key: NUTCH-2456
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2456
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Critical
>
> If http.redirect.max is set to a positive value, the Fetcher will follow 
> redirects, creating a new CrawlDatum.
> If the redirected URL is fetched and parsed, indexing it hits a special 
> case: dbDatum is null. This means that in 
> [https://github.com/apache/nutch/blob/6199492f5e1e8811022257c88dbf63f1e1c739d0/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L259]
>  the document is not indexed, as it is assumed it only has inlinks (actually 
> it has everything but dbDatum).
> I'm not sure what the correct fix is here. It seems to me the condition 
> should use AND instead of OR anyway, but I may not understand the original 
> intent. It is clear that it is too strict as is.
> However, the code following that line assumes all 4 objects are not null, so 
> a patch would need to change more than just the condition.
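
The difference between the OR and AND forms of the condition discussed in the quoted description can be sketched as follows. This is an illustration under assumptions, not the code at the linked line: only dbDatum is named in the description, so the other three parameter names (fetchDatum, parseText, parseData) and the class IndexGuardSketch are placeholders for the "4 objects" it mentions.

```java
// Illustration only: a simplified null guard over four inputs, not the Nutch source.
final class IndexGuardSketch {

    // OR form: the document is skipped as soon as any one input is missing.
    static boolean skipWithOr(Object fetchDatum, Object dbDatum, Object parseText, Object parseData) {
        return fetchDatum == null || dbDatum == null || parseText == null || parseData == null;
    }

    // AND form: the document is skipped only when all four inputs are missing.
    static boolean skipWithAnd(Object fetchDatum, Object dbDatum, Object parseText, Object parseData) {
        return fetchDatum == null && dbDatum == null && parseText == null && parseData == null;
    }

    public static void main(String[] args) {
        Object present = new Object();
        // Redirect target as described in the issue: everything is present except dbDatum.
        System.out.println("OR skips:  " + skipWithOr(present, null, present, present));
        System.out.println("AND skips: " + skipWithAnd(present, null, present, present));
    }
}
```

With the AND form the redirect target passes the guard, but any code after it would then have to tolerate a null dbDatum, which is the point made in the description's last sentence.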



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
