Prakhar Chaube created NUTCH-3164:
-------------------------------------
Summary: Generic exceptions in catch block may lead to deletion of
links from crawldb
Key: NUTCH-3164
URL: https://issues.apache.org/jira/browse/NUTCH-3164
Project: Nutch
Issue Type: Bug
Components: crawldb
Affects Versions: 1.22
Reporter: Prakhar Chaube
In CrawlDbFilter.java (lines ~107-121), both the URL normalization and URL
filtering blocks catch Exception instead of the specific checked exceptions
declared by URLNormalizers.normalize() (MalformedURLException) and
URLFilters.filter() (URLFilterException).
try {
  url = normalizers.normalize(url, scope);
} catch (Exception e) {
  LOG.warn("Skipping {}: ", url, e);
  url = null;
}

try {
  url = filters.filter(url);
} catch (Exception e) {
  LOG.warn("Skipping {}: ", url, e);
  url = null;
}
*Problem:*
Any {{RuntimeException}} (e.g., {{NullPointerException}},
{{IllegalArgumentException}}, {{ArrayIndexOutOfBoundsException}})
thrown by a buggy normalizer/filter plugin gets caught, logged as a WARN, and
the URL is silently set to null and counted as "filtered." This has two
consequences:
# *Silent data loss:* legitimate URLs are dropped from CrawlDb not because
they failed normalization/filtering, but because of an unrelated bug in a
plugin. The operator sees a WARN log, but the URL is gone and nothing
distinguishes a "bad URL" from a "broken plugin."
# *Bug masking:* a {{RuntimeException}} typically indicates a programming
error. Swallowing it makes faulty normalizer/filter implementations
significantly harder to detect and diagnose, especially at scale, where WARN
logs get lost in the noise.
Raising as critical since this can lead to data loss.
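One possible remedy, shown here only as a minimal sketch, is to narrow each catch to the checked exception the call declares, so that a {{RuntimeException}} from a plugin propagates instead of being counted as a filtered URL. The sketch does not depend on Nutch: the Normalizer and Filter interfaces, the nested URLFilterException class, and the process() helper are hypothetical stand-ins for the real types, and the logging is reduced to System.err.

```java
import java.net.MalformedURLException;

public class NarrowCatchSketch {

  /** Hypothetical stand-in for Nutch's checked URLFilterException. */
  static class URLFilterException extends Exception {
    URLFilterException(String msg) { super(msg); }
  }

  /** Hypothetical stand-in for URLNormalizers.normalize(url, scope). */
  interface Normalizer {
    String normalize(String url, String scope) throws MalformedURLException;
  }

  /** Hypothetical stand-in for URLFilters.filter(url). */
  interface Filter {
    String filter(String url) throws URLFilterException;
  }

  /**
   * Returns the normalized and filtered URL, or null if the URL is rejected.
   * Only the declared checked exceptions mean "skip this URL"; any
   * RuntimeException thrown by a plugin propagates to the caller.
   */
  static String process(String url, String scope, Normalizer normalizer, Filter filter) {
    try {
      url = normalizer.normalize(url, scope);
    } catch (MalformedURLException e) {
      System.err.println("Skipping " + url + ": " + e);
      return null;
    }
    if (url == null) {
      return null; // rejected by normalization
    }
    try {
      url = filter.filter(url);
    } catch (URLFilterException e) {
      System.err.println("Skipping " + url + ": " + e);
      return null;
    }
    return url;
  }

  public static void main(String[] args) {
    Normalizer trim = (u, s) -> u.trim();
    Filter buggy = u -> { throw new NullPointerException("plugin bug"); };
    try {
      process(" http://example.org/ ", "crawldb", trim, buggy);
    } catch (NullPointerException e) {
      // The plugin bug surfaces here instead of silently dropping the URL.
      System.err.println("Plugin failure propagated: " + e.getMessage());
    }
  }
}
```

With this shape, a malformed URL is still skipped, while a plugin bug fails loudly. Whether the job should abort, or instead keep the URL and increment a separate error counter, is a design decision this sketch leaves open.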