[ https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278291#comment-16278291 ]

ASF GitHub Bot commented on NUTCH-2456:
---------------------------------------

sebastian-nagel closed pull request #240: NUTCH-2456 - Redirected documents are not indexed
URL: https://github.com/apache/nutch/pull/240
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/src/java/org/apache/nutch/indexer/IndexerMapReduce.java b/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
index cb6e1215e..00829c486 100644
--- a/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
+++ b/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
@@ -238,9 +238,9 @@ public void reduce(Text key, Iterator<NutchWritable> values,
     }
 
     // Whether to delete GONE or REDIRECTS
-    if (delete && fetchDatum != null && dbDatum != null) {
+    if (delete && fetchDatum != null) {
       if (fetchDatum.getStatus() == CrawlDatum.STATUS_FETCH_GONE
-          || dbDatum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
+          || dbDatum != null && dbDatum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
         reporter.incrCounter("IndexerStatus", "deleted (gone)", 1);
         output.collect(key, DELETE_ACTION);
         return;
@@ -248,28 +248,27 @@ public void reduce(Text key, Iterator<NutchWritable> values,
 
       if (fetchDatum.getStatus() == CrawlDatum.STATUS_FETCH_REDIR_PERM
           || fetchDatum.getStatus() == CrawlDatum.STATUS_FETCH_REDIR_TEMP
-          || dbDatum.getStatus() == CrawlDatum.STATUS_DB_REDIR_PERM
-          || dbDatum.getStatus() == CrawlDatum.STATUS_DB_REDIR_TEMP) {
+          || dbDatum != null && dbDatum.getStatus() == CrawlDatum.STATUS_DB_REDIR_PERM
+          || dbDatum != null && dbDatum.getStatus() == CrawlDatum.STATUS_DB_REDIR_TEMP) {
         reporter.incrCounter("IndexerStatus", "deleted (redirects)", 1);
         output.collect(key, DELETE_ACTION);
         return;
       }
     }
 
-    if (fetchDatum == null || dbDatum == null || parseText == null
-        || parseData == null) {
+    if (fetchDatum == null || parseText == null || parseData == null) {
       return; // only have inlinks
     }
 
     // Whether to delete pages marked as duplicates
-    if (delete && dbDatum.getStatus() == CrawlDatum.STATUS_DB_DUPLICATE) {
+    if (delete && dbDatum != null && dbDatum.getStatus() == CrawlDatum.STATUS_DB_DUPLICATE) {
       reporter.incrCounter("IndexerStatus", "deleted (duplicates)", 1);
       output.collect(key, DELETE_ACTION);
       return;
     }
 
     // Whether to skip DB_NOTMODIFIED pages
-    if (skip && dbDatum.getStatus() == CrawlDatum.STATUS_DB_NOTMODIFIED) {
+    if (skip && dbDatum != null && dbDatum.getStatus() == CrawlDatum.STATUS_DB_NOTMODIFIED) {
       reporter.incrCounter("IndexerStatus", "skipped (not modified)", 1);
       return;
     }
@@ -309,22 +308,24 @@ public void reduce(Text key, Iterator<NutchWritable> values,
     doc.add("boost", Float.toString(boost));
 
     try {
-      // Indexing filters may also be interested in the signature
-      fetchDatum.setSignature(dbDatum.getSignature());
-      
-      // extract information from dbDatum and pass it to
-      // fetchDatum so that indexing filters can use it
-      final Text url = (Text) dbDatum.getMetaData().get(
-          Nutch.WRITABLE_REPR_URL_KEY);
-      if (url != null) {
-        // Representation URL also needs normalization and filtering.
-        // If repr URL is excluded by filters we still accept this document
-        // but represented by its primary URL ("key") which has passed URL
-        // filters.
-        String urlString = filterUrl(normalizeUrl(url.toString()));
-        if (urlString != null) {
-          url.set(urlString);
-          fetchDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, url);
+      if (dbDatum != null) {
+        // Indexing filters may also be interested in the signature
+        fetchDatum.setSignature(dbDatum.getSignature());
+        
+        // extract information from dbDatum and pass it to
+        // fetchDatum so that indexing filters can use it
+        final Text url = (Text) dbDatum.getMetaData().get(
+            Nutch.WRITABLE_REPR_URL_KEY);
+        if (url != null) {
+          // Representation URL also needs normalization and filtering.
+          // If repr URL is excluded by filters we still accept this document
+          // but represented by its primary URL ("key") which has passed URL
+          // filters.
+          String urlString = filterUrl(normalizeUrl(url.toString()));
+          if (urlString != null) {
+            url.set(urlString);
+            fetchDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, url);
+          }
         }
       }
       // run indexing filters
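
One detail worth checking when reading the patch: the new guards rely on Java
operator precedence, where && binds tighter than ||, so a condition of the
form "a || b != null && c" parses as "a || (b != null && c)". Below is a
minimal, self-contained sketch of the null-safe GONE check with that grouping
spelled out; the stub Datum class and status constants are illustrative
stand-ins, not the real CrawlDatum API:

public class GoneGuardSketch {

  // Illustrative stand-ins for status codes; the real constants live in
  // org.apache.nutch.crawl.CrawlDatum.
  static final byte STATUS_FETCH_GONE = 1;
  static final byte STATUS_DB_GONE = 2;

  static class Datum {
    private final byte status;
    Datum(byte status) { this.status = status; }
    byte getStatus() { return status; }
  }

  // Same shape as the patched check in IndexerMapReduce.reduce(). Because
  // && binds tighter than ||, this parses as
  //   fetchGone || (dbDatum != null && dbGone)
  // so the check now runs even when dbDatum is null (the redirect case),
  // and the inline null test keeps dbDatum.getStatus() NPE-safe.
  static boolean deleteGone(Datum fetchDatum, Datum dbDatum) {
    return fetchDatum.getStatus() == STATUS_FETCH_GONE
        || dbDatum != null && dbDatum.getStatus() == STATUS_DB_GONE;
  }

  public static void main(String[] args) {
    System.out.println(deleteGone(new Datum(STATUS_FETCH_GONE), null)); // true
    System.out.println(deleteGone(new Datum((byte) 0), null));          // false
    System.out.println(deleteGone(new Datum((byte) 0), new Datum(STATUS_DB_GONE))); // true
  }
}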


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Allow to index pages/URLs not contained in CrawlDb
> --------------------------------------------------
>
>                 Key: NUTCH-2456
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2456
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Critical
>
> If http.redirect.max is set to a positive value, the Fetcher will follow
> redirects, creating a new CrawlDatum.
> If the redirected URL is fetched and parsed, indexing hits a special case:
> its dbDatum is null. This means that in
> [https://github.com/apache/nutch/blob/6199492f5e1e8811022257c88dbf63f1e1c739d0/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L259]
> the document is not indexed, as it is assumed to only have inlinks
> (actually it has everything but a dbDatum).
> I'm not sure what the correct fix is here. It seems to me the condition
> should use AND instead of OR anyway, but I may not understand the original
> intent. It is clear that, as written, the condition is too strict.
> However, the code following that line assumes all four objects are
> non-null, so a patch would need to change more than just the condition.
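
For illustration, a minimal sketch of the early return described above; the
stub objects and class here are hypothetical, and only the shape of the two
conditions follows IndexerMapReduce:

public class RedirectSkipSketch {

  public static void main(String[] args) {
    // A URL reached only via a followed redirect (http.redirect.max > 0):
    // fetch and parse data exist, but there is no CrawlDb entry yet.
    Object fetchDatum = new Object();
    Object parseText = new Object();
    Object parseData = new Object();
    Object dbDatum = null;

    // Old condition: true for the redirect case, so the document was
    // silently dropped as if it only had inlinks.
    boolean oldSkip = fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null;

    // Patched condition: dbDatum may be null, so the document is indexed.
    boolean newSkip = fetchDatum == null || parseText == null
        || parseData == null;

    System.out.println("old: " + oldSkip + ", new: " + newSkip); // old: true, new: false
  }
}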



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
