[jira] Updated: (NUTCH-393) Indexer doesn't handle null documents returned by filters

2007-04-14 Thread Eelco Lempsink (JIRA)

 [ https://issues.apache.org/jira/browse/NUTCH-393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eelco Lempsink updated NUTCH-393:
-

Affects Version/s: 0.9.0

 Indexer doesn't handle null documents returned by filters
 -

              Key: NUTCH-393
              URL: https://issues.apache.org/jira/browse/NUTCH-393
          Project: Nutch
       Issue Type: Bug
       Components: indexer
 Affects Versions: 0.8.1, 0.9.0
         Reporter: Eelco Lempsink
      Attachments: NUTCH-393.patch


 Plugins (like IndexingFilter) may return a null value, but this isn't handled 
 by the Indexer.  A trivial adjustment is all it takes:
 @@ -237,6 +237,7 @@
          if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }
          return;
        }
 +      if (doc == null) return;
  
        float boost = 1.0f;
        // run scoring filters

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-273) When a page is redirected, the original url is NOT updated.

2006-12-22 Thread Eelco Lempsink (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-273?page=all ]

Eelco Lempsink updated NUTCH-273:
-

Attachment: Fetcher.java-489586.diff

Let's not overcomplicate this issue.  At the moment, two problems with 
different priorities are mixed in one issue.

Problem 1, blocker: The status of the URL that causes the redirect isn't 
updated.  Fixing that is not hard; a one-line patch is attached.  Hopefully it 
can be applied soon.

Problem 2, minor: Should redirects be fetched immediately or not?  One argument 
for fetching them immediately is that otherwise the redirectCount would have to 
be moved into the CrawlDatum (metadata).  If it's possible (in Jira), I suggest 
splitting this problem off into a separate issue.
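
To make the two problems concrete, here is a rough sketch in Java.  It is not 
the attached patch and does not use Nutch's real classes: RedirectSketch, 
Entry, Status, MAX_REDIRECTS and the "redirectCount" metadata key are all 
hypothetical names chosen for illustration.  It assumes the deferred variant 
of problem 2, where the redirect target is queued for a later fetch and so the 
count has to travel with the per-URL record.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a crawl database; not Nutch's CrawlDb or CrawlDatum.
final class RedirectSketch {
  enum Status { UNFETCHED, FETCHED, REDIRECTED }

  // Hypothetical per-URL record with free-form metadata.
  static final class Entry {
    Status status = Status.UNFETCHED;
    final Map<String, String> metadata = new HashMap<>();
  }

  // Assumed limit on the length of a redirect chain.
  static final int MAX_REDIRECTS = 3;

  final Map<String, Entry> db = new HashMap<>();

  // Records a redirect from one URL to another.
  // Returns true if the target was queued, false if the chain is too long.
  boolean onRedirect(String from, String to) {
    // Problem 1: the redirecting URL itself must get a status update.
    Entry src = db.computeIfAbsent(from, k -> new Entry());
    src.status = Status.REDIRECTED;

    // Problem 2 (deferred fetch): the count is carried in the metadata.
    int count = Integer.parseInt(src.metadata.getOrDefault("redirectCount", "0")) + 1;
    if (count > MAX_REDIRECTS) {
      return false;
    }
    Entry dst = db.computeIfAbsent(to, k -> new Entry());
    dst.metadata.put("redirectCount", Integer.toString(count));
    return true;
  }
}
```

If redirects are instead followed immediately inside the fetcher, the count 
can stay in a local variable and no metadata entry is needed, which is the 
trade-off described under problem 2.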


 When a page is redirected, the original url is NOT updated.
 ---

              Key: NUTCH-273
              URL: http://issues.apache.org/jira/browse/NUTCH-273
          Project: Nutch
       Issue Type: Bug
       Components: fetcher
 Affects Versions: 0.8
      Environment: n/a
         Reporter: Lukas Vlcek
         Priority: Blocker
      Attachments: Fetcher.java-489586.diff


 [Excerpt from the mailing list, sender: Andrzej Bialecki]
 When a page is redirected, the original url is NOT updated - so, CrawlDB will 
 never know that a redirect occurred, it won't even know that a fetch 
 occurred... This looks like a bug.
 In 0.7 this was recorded in the segment, and then it would affect the Page 
 status during updatedb. It should do so in 0.8, too...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-393) Indexer doesn't handle null documents returned by filters

2006-11-07 Thread Eelco Lempsink (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-393?page=comments#action_12447939 ]

Eelco Lempsink commented on NUTCH-393:
--

I'm not sure I agree with that.  After running a document through a set of 
filters, you'd expect that all of them actually ran.  If one did not, that is 
an error and should be treated as one.  For instance, your index might depend 
on all numbers and all non-English words being stripped.  If one of those two 
filters hits an exception but the other one still runs, your index will become 
dirty.
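
A small sketch of the difference, in Java.  This is an illustration only and 
not Nutch code: FilterRuns, strict and lenient are hypothetical names, and 
plain string functions stand in for real indexing filters.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical helper contrasting two ways of running a filter chain.
final class FilterRuns {

  // Strict: an exception in any filter propagates, so the caller can reject
  // the whole document instead of indexing a half-filtered one.
  static String strict(String text, List<UnaryOperator<String>> filters) {
    String result = text;
    for (UnaryOperator<String> f : filters) {
      result = f.apply(result);
    }
    return result;
  }

  // Lenient: a failing filter is skipped and the remaining ones still run,
  // so the output may violate what the index assumes about its content.
  static String lenient(String text, List<UnaryOperator<String>> filters) {
    String result = text;
    for (UnaryOperator<String> f : filters) {
      try {
        result = f.apply(result);
      } catch (RuntimeException e) {
        // Swallowed on purpose to show the problem.
      }
    }
    return result;
  }
}
```

With a number-stripping filter that throws and a second filter that succeeds, 
lenient() returns text that still contains the numbers, while strict() lets 
the exception reach the caller, which can then skip the document.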






[jira] Created: (NUTCH-393) Indexer doesn't handle null documents returned by filters

2006-10-27 Thread Eelco Lempsink (JIRA)
Indexer doesn't handle null documents returned by filters
-

             Key: NUTCH-393
             URL: http://issues.apache.org/jira/browse/NUTCH-393
         Project: Nutch
      Issue Type: Bug
      Components: indexer
Affects Versions: 0.8.1
        Reporter: Eelco Lempsink


Plugins (like IndexingFilter) may return a null value, but this isn't handled 
by the Indexer.  A trivial adjustment is all it takes:


@@ -237,6 +237,7 @@
         if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }
         return;
       }
+      if (doc == null) return;
 
       float boost = 1.0f;
       // run scoring filters






[jira] Updated: (NUTCH-393) Indexer doesn't handle null documents returned by filters

2006-10-27 Thread Eelco Lempsink (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-393?page=all ]

Eelco Lempsink updated NUTCH-393:
-

Attachment: NUTCH-393.patch

Here's a complete patch against the latest revision to fix this issue.

Note that adjusting Indexer.java alone is not enough: the loop in 
IndexingFilters.java that executes each filter must also stop as soon as 
doc == null.

This means that once a filter decides to drop the document, no later filter 
can undo that decision.
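
The behaviour described above can be sketched roughly as follows.  This is not 
the contents of NUTCH-393.patch: DocFilter and FilterChain are hypothetical 
stand-ins, and a plain Map is used in place of the real document type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an indexing filter; not Nutch's real interface.
interface DocFilter {
  // Returns the (possibly modified) document, or null to drop it.
  Map<String, String> filter(Map<String, String> doc);
}

// Hypothetical stand-in for the class that runs all configured filters.
final class FilterChain {
  private final List<DocFilter> filters;

  FilterChain(List<DocFilter> filters) {
    this.filters = new ArrayList<>(filters);
  }

  // Runs each filter in order and stops at the first one that returns null.
  Map<String, String> run(Map<String, String> doc) {
    for (DocFilter f : filters) {
      doc = f.filter(doc);
      if (doc == null) {
        return null; // dropped: later filters never see the document
      }
    }
    return doc;
  }
}
```

The caller then needs the matching check, as in the one-line change to the 
Indexer: if run() returns null, skip the document instead of indexing it.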

