[jira] Updated: (NUTCH-393) Indexer doesn't handle null documents returned by filters
[ https://issues.apache.org/jira/browse/NUTCH-393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eelco Lempsink updated NUTCH-393:
---------------------------------

    Affects Version/s: 0.9.0

Indexer doesn't handle null documents returned by filters
---------------------------------------------------------

Key: NUTCH-393
URL: https://issues.apache.org/jira/browse/NUTCH-393
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 0.8.1, 0.9.0
Reporter: Eelco Lempsink
Attachments: NUTCH-393.patch

Plugins (like IndexingFilter) may return a null value, but this isn't handled by the Indexer. A trivial adjustment is all it takes:

@@ -237,6 +237,7 @@
         if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }
         return;
       }
+      if (doc == null) return;
 
       float boost = 1.0f;
       // run scoring filters

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-273) When a page is redirected, the original url is NOT updated.
[ http://issues.apache.org/jira/browse/NUTCH-273?page=all ]

Eelco Lempsink updated NUTCH-273:
---------------------------------

    Attachment: Fetcher.java-489586.diff

Let's not overcomplicate this issue. At the moment, two different problems with different priorities are mixed in one issue.

Problem 1, blocker: The status of the URL causing the redirect isn't updated. Fixing that is not hard; attached is a one-liner patch. Hopefully this can be applied soon.

Problem 2, minor: Should redirects be fetched immediately or not? One argument for fetching them immediately is that otherwise the redirectCount would have to be moved into the CrawlDatum (metadata).

If it's possible (in Jira), I suggest that problem 2 be split off into a separate issue.

When a page is redirected, the original url is NOT updated.
-----------------------------------------------------------

Key: NUTCH-273
URL: http://issues.apache.org/jira/browse/NUTCH-273
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8
Environment: n/a
Reporter: Lukas Vlcek
Priority: Blocker
Attachments: Fetcher.java-489586.diff

[Excerpt from maillist, sender: Andrzej Bialecki]
When a page is redirected, the original url is NOT updated - so, CrawlDB will never know that a redirect occurred, it won't even know that a fetch occurred... This looks like a bug.
In 0.7 this was recorded in the segment, and then it would affect the Page status during updatedb. It should do so in 0.8, too...

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-393) Indexer doesn't handle null documents returned by filters
[ http://issues.apache.org/jira/browse/NUTCH-393?page=comments#action_12447939 ]

Eelco Lempsink commented on NUTCH-393:
--------------------------------------

I'm not sure I agree with that. After running a document through a set of filters, you'd expect that all filters ran. If they didn't, that's an exception. For instance, your index might depend on all numbers and non-English words being stripped. If one of those filters hits an exception but the other one runs, your index will become dirty.

Indexer doesn't handle null documents returned by filters
---------------------------------------------------------

Key: NUTCH-393
URL: http://issues.apache.org/jira/browse/NUTCH-393
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 0.8.1
Reporter: Eelco Lempsink
Attachments: NUTCH-393.patch

Plugins (like IndexingFilter) may return a null value, but this isn't handled by the Indexer. A trivial adjustment is all it takes:

@@ -237,6 +237,7 @@
         if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }
         return;
       }
+      if (doc == null) return;
 
       float boost = 1.0f;
       // run scoring filters
[jira] Created: (NUTCH-393) Indexer doesn't handle null documents returned by filters
Indexer doesn't handle null documents returned by filters
---------------------------------------------------------

Key: NUTCH-393
URL: http://issues.apache.org/jira/browse/NUTCH-393
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 0.8.1
Reporter: Eelco Lempsink

Plugins (like IndexingFilter) may return a null value, but this isn't handled by the Indexer. A trivial adjustment is all it takes:

@@ -237,6 +237,7 @@
         if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }
         return;
       }
+      if (doc == null) return;
 
       float boost = 1.0f;
       // run scoring filters
[jira] Updated: (NUTCH-393) Indexer doesn't handle null documents returned by filters
[ http://issues.apache.org/jira/browse/NUTCH-393?page=all ]

Eelco Lempsink updated NUTCH-393:
---------------------------------

    Attachment: NUTCH-393.patch

Here's a complete patch against the latest revision to fix this issue. Note that Indexer.java is not the only file that must be adjusted: the loop in IndexingFilters.java that executes each filter must also stop when doc == null. This means that once a filter decides to drop the document, no other filter can undo that action.

Indexer doesn't handle null documents returned by filters
---------------------------------------------------------

Key: NUTCH-393
URL: http://issues.apache.org/jira/browse/NUTCH-393
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 0.8.1
Reporter: Eelco Lempsink
Attachments: NUTCH-393.patch

Plugins (like IndexingFilter) may return a null value, but this isn't handled by the Indexer. A trivial adjustment is all it takes:

@@ -237,6 +237,7 @@
         if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }
         return;
       }
+      if (doc == null) return;
 
       float boost = 1.0f;
       // run scoring filters
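
For readers without the attachment at hand, the behaviour described in the message above (stop the filter loop as soon as a filter returns null, and have the caller skip a null result) could look roughly like the following. This is only a sketch and not the contents of NUTCH-393.patch: DocFilter, FilterChainSketch, runFilters and index are hypothetical names, a plain Map stands in for the indexed document, and the real IndexingFilter interface takes more arguments than shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an indexing filter plugin (not the Nutch interface).
interface DocFilter {
    // Returns the (possibly modified) document, or null to drop it from the index.
    Map<String, String> filter(Map<String, String> doc);
}

final class FilterChainSketch {

    // Runs the filters in order and stops at the first one that returns null,
    // so no later filter is ever handed a null document or can undo the drop.
    static Map<String, String> runFilters(List<DocFilter> filters, Map<String, String> doc) {
        for (DocFilter f : filters) {
            doc = f.filter(doc);
            if (doc == null) {
                return null;
            }
        }
        return doc;
    }

    // Caller-side guard, analogous to the "if (doc == null) return;" line in the issue.
    // Returns true if the document was added to the (in-memory, illustrative) index.
    static boolean index(List<DocFilter> filters, Map<String, String> doc,
                         List<Map<String, String>> index) {
        Map<String, String> filtered = runFilters(filters, doc);
        if (filtered == null) {
            return false; // a filter dropped the document, so skip it
        }
        index.add(filtered);
        return true;
    }

    public static void main(String[] args) {
        List<DocFilter> filters = new ArrayList<>();
        // Drops documents that have no content field or an empty one.
        filters.add(d -> d.getOrDefault("content", "").isEmpty() ? null : d);
        // Adds a marker field; only reached when the first filter kept the document.
        filters.add(d -> { d.put("filtered", "true"); return d; });

        List<Map<String, String>> index = new ArrayList<>();

        Map<String, String> empty = new HashMap<>();
        empty.put("url", "http://example.org/empty");

        Map<String, String> full = new HashMap<>();
        full.put("url", "http://example.org/full");
        full.put("content", "some text");

        System.out.println("empty doc indexed: " + index(filters, empty, index));
        System.out.println("full doc indexed:  " + index(filters, full, index));
        System.out.println("index size:        " + index.size());
    }
}
```

Returning early from the loop is what makes a drop final: filters later in the chain never see the document, which matches the note that no other filter can undo the decision.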