Prakhar Chaube created NUTCH-3164:
-------------------------------------

             Summary: Generic exceptions in catch block may lead to deletion of 
links from crawldb
                 Key: NUTCH-3164
                 URL: https://issues.apache.org/jira/browse/NUTCH-3164
             Project: Nutch
          Issue Type: Bug
          Components: crawldb
    Affects Versions: 1.22
            Reporter: Prakhar Chaube


In CrawlDbFilter.java (lines ~107-121), both the URL normalization and URL 
filtering blocks catch Exception instead of the specific checked exceptions 
declared by URLNormalizers.normalize() (MalformedURLException) and 
URLFilters.filter() (URLFilterException).

try {
    url = normalizers.normalize(url, scope);
} catch (Exception e) {
    LOG.warn("Skipping {}: ", url, e);
    url = null;
}

try {
    url = filters.filter(url);
} catch (Exception e) {
    LOG.warn("Skipping {}: ", url, e);
    url = null;
}



*Problem:*

Any {{RuntimeException}} (e.g., {{NullPointerException}}, 
{{IllegalArgumentException}}, {{ArrayIndexOutOfBoundsException}}) 
thrown by a buggy normalizer/filter plugin gets caught, logged as a WARN, and 
the URL is silently nulled out and counted as "filtered." This has two 
consequences:
 # *Silent data loss*: legitimate URLs are dropped from CrawlDb not because 
they failed normalization/filtering, but because of an unrelated bug in a 
plugin. The operator sees a WARN log, but the URL is gone with no distinction 
between "bad URL" and "broken plugin."
 # *Bug masking*: {{RuntimeException}}s typically indicate programming 
errors. Swallowing them makes it significantly harder to detect and diagnose 
faulty normalizer/filter implementations, especially at scale where WARN logs 
get lost in noise.
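
One possible direction, as a minimal sketch rather than the actual CrawlDbFilter code: catch only the checked exceptions each call declares, so that a {{RuntimeException}} from a plugin propagates instead of being counted as a filtered URL. The {{Normalizer}}, {{Filter}} and {{UrlFilterException}} types and the {{process}} method below are hypothetical stand-ins for the Nutch classes, introduced only so the snippet compiles on its own. Whether a propagated exception should fail the task or be counted separately is left open here.

```java
import java.net.MalformedURLException;
import java.util.Locale;

class NarrowCatchSketch {

  /** Hypothetical stand-in for Nutch's checked URLFilterException. */
  static class UrlFilterException extends Exception {
    UrlFilterException(String msg) { super(msg); }
  }

  /** Hypothetical stand-in for URLNormalizers.normalize(url, scope). */
  interface Normalizer {
    String normalize(String url, String scope) throws MalformedURLException;
  }

  /** Hypothetical stand-in for URLFilters.filter(url). */
  interface Filter {
    String filter(String url) throws UrlFilterException;
  }

  /**
   * Returns the normalized and filtered URL, or null if the URL was rejected.
   * Only the declared checked exceptions are treated as "skip this URL";
   * unchecked exceptions from a buggy plugin are deliberately not caught.
   */
  static String process(String url, String scope, Normalizer normalizer, Filter filter) {
    try {
      url = normalizer.normalize(url, scope);
    } catch (MalformedURLException e) {
      System.err.println("Skipping " + url + ": " + e);
      url = null;
    }
    if (url != null) {
      try {
        url = filter.filter(url);
      } catch (UrlFilterException e) {
        System.err.println("Skipping " + url + ": " + e);
        url = null;
      }
    }
    return url;
  }

  public static void main(String[] args) {
    Normalizer lower = (u, s) -> u.toLowerCase(Locale.ROOT);
    Filter accept = u -> u;
    // Assumption: this filter simulates a plugin bug, not a legitimate rejection.
    Filter buggy = u -> { throw new IllegalStateException("plugin bug"); };

    System.out.println(process("HTTP://EXAMPLE.COM/", "crawldb", lower, accept));
    try {
      process("http://example.com/", "crawldb", lower, buggy);
    } catch (IllegalStateException e) {
      // The plugin bug surfaces to the caller instead of silently dropping the URL.
      System.out.println("Plugin failure surfaced: " + e.getMessage());
    }
  }
}
```

With the narrowed catch clauses, a malformed URL or a filter error still results in the URL being skipped, while a programming error in a plugin becomes visible to the caller.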

 

Raising as critical since this can lead to data loss.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
