[ 
https://issues.apache.org/jira/browse/NUTCH-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18073574#comment-18073574
 ] 

Prakhar Chaube commented on NUTCH-3164:
---------------------------------------

[~snagel] did you get a chance to review my comment? Thanks!

> Generic exceptions in catch block may lead to deletion of links from crawldb
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-3164
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3164
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.22
>            Reporter: Prakhar Chaube
>            Priority: Critical
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In CrawlDbFilter.java (lines ~107-121), both the URL normalization and URL 
> filtering blocks catch Exception instead of the specific checked exceptions 
> declared by URLNormalizers.normalize() (MalformedURLException) and 
> URLFilters.filter() (URLFilterException).
>
> try {
>     url = normalizers.normalize(url, scope);
> } catch (Exception e) {
>     LOG.warn("Skipping {}: ", url, e);
>     url = null;
> }
> try {
>     url = filters.filter(url);
> } catch (Exception e) {
>     LOG.warn("Skipping {}: ", url, e);
>     url = null;
> }
>
> *Problem:*
> Any {{RuntimeException}} (e.g., {{NullPointerException}}, 
> {{IllegalArgumentException}}, {{ArrayIndexOutOfBoundsException}}) 
> thrown by a buggy normalizer/filter plugin is caught and logged as a WARN, and 
> the URL is silently set to null and counted as "filtered." This has two 
> consequences:
>  # *Silent data loss:* legitimate URLs are dropped from CrawlDb not because 
> they failed normalization/filtering, but because of an unrelated bug in a 
> plugin. The operator sees a WARN log but the URL is gone with no distinction 
> between "bad URL" and "broken plugin."
>  # *Bug masking:* {{RuntimeException}}s typically indicate programming 
> errors. Swallowing them makes it significantly harder to detect and diagnose 
> faulty normalizer/filter implementations, especially at scale where WARN logs 
> get lost in noise.
>  
> Raising as critical since this can lead to data loss.
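
One possible shape of the narrowed catch blocks, as a self-contained sketch rather than a patch against CrawlDbFilter.java. The {{Normalizer}} and {{Filter}} interfaces and {{FilterException}} below are hypothetical stand-ins for URLNormalizers, URLFilters and URLFilterException so the example compiles without Nutch on the classpath; the {{url == null}} guard before filtering is also an assumption. Only the declared checked exceptions turn the URL into null, while a RuntimeException from a buggy plugin propagates to the caller.

```java
import java.net.MalformedURLException;

public class NarrowCatchSketch {

  /** Hypothetical stand-in for the checked exception declared by a URL filter. */
  static class FilterException extends Exception {
    FilterException(String msg) { super(msg); }
  }

  /** Hypothetical stand-in for a normalizer chain. */
  interface Normalizer {
    String normalize(String url) throws MalformedURLException;
  }

  /** Hypothetical stand-in for a filter chain; returns null to reject a URL. */
  interface Filter {
    String filter(String url) throws FilterException;
  }

  /** Returns the normalized, accepted URL, or null if it was skipped. */
  static String process(String url, Normalizer normalizer, Filter filter) {
    try {
      url = normalizer.normalize(url);
    } catch (MalformedURLException e) {
      // Expected failure: the URL itself is bad, so skipping it is correct.
      System.err.println("Skipping " + url + ": " + e);
      return null;
    }
    if (url == null) {
      return null; // normalizer rejected the URL; nothing left to filter
    }
    try {
      return filter.filter(url);
    } catch (FilterException e) {
      // Expected failure declared by the filter contract.
      System.err.println("Skipping " + url + ": " + e);
      return null;
    }
    // No catch (Exception e): a RuntimeException from a plugin is not swallowed.
  }

  public static void main(String[] args) {
    Filter accept = u -> u;
    Normalizer trim = u -> u.trim();
    Normalizer malformed = u -> { throw new MalformedURLException("bad url"); };
    Normalizer buggy = u -> { throw new IllegalStateException("plugin bug"); };

    System.out.println(process(" http://example.org/ ", trim, accept));
    System.out.println(process("htp:/x", malformed, accept));
    try {
      process("http://example.org/", buggy, accept);
    } catch (IllegalStateException e) {
      System.out.println("plugin bug surfaced: " + e.getMessage());
    }
  }
}
```

Whether a propagated RuntimeException should fail the task or be counted separately from filtered URLs is a design decision this sketch leaves open.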



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
