[ 
https://issues.apache.org/jira/browse/NUTCH-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083023#comment-18083023
 ] 

ASF GitHub Bot commented on NUTCH-3164:
---------------------------------------

prakharchaube opened a new pull request, #918:
URL: https://github.com/apache/nutch/pull/918

   ## Summary
   
     Both catch blocks in `CrawlDbFilter.map()` caught generic `Exception` and 
set `url = null`, which silently dropped URLs both for legitimate filtering 
reasons and for plugin
     programming errors (NPE, etc.). The latter masked plugin bugs as ordinary 
filtering decisions.
     
     ## Changes
   
     **Normalizer block**
     - `MalformedURLException` — the only legitimate reason to drop. Tracked 
via `ErrorTracker` (`ErrorType.URL`) and no longer increments 
`urlsFilteredCounter`, which conflated
     filtering with malformed input.
     - `RuntimeException` — logged at ERROR and tracked, URL is **not** dropped 
so plugin bugs do not silently delete data.
     
     **Filter block**
     - `URLFilterException` — per the `URLFilter` contract, reserved for 
internal filter failures (rejection is signaled by returning `null`). Logged at 
ERROR and tracked, URL is
     **not** dropped.
     - `RuntimeException` — same handling as above.
     
     All error paths now use `ErrorTracker` for categorized counters and log at 
ERROR rather than WARN, per [@lewismc's recommendation in the JIRA 
     discussion](https://issues.apache.org/jira/browse/NUTCH-3164).
     
     ## JIRA
   
     [NUTCH-3164](https://issues.apache.org/jira/browse/NUTCH-3164)
     
     ## Test plan
   
     - [x] `ant compile` passes locally
     - [ ] No new unit tests in this PR; happy to add one if reviewers want a 
mock plugin throwing each exception type (flag as follow-up otherwise)
     
     ## Out of scope (separate tickets if desired)
     
     - Same `catch (Exception)` pattern exists in `Injector.filterNormalize` 
and ~20 other call sites of `normalizers.normalize` / `filters.filter` — per 
lewismc comment, those
     should be rolled out separately.
     - Whether to track normalized URLs as a metric system-wide.




> Generic exceptions in catch block may lead to deletion of links from crawldb
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-3164
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3164
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.22
>            Reporter: Prakhar Chaube
>            Priority: Critical
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In CrawlDbFilter.java (lines ~107-121), both the URL normalization and URL 
> filtering blocks catch Exception instead of the specific checked exceptions 
> declared by URLNormalizers.normalize() (MalformedURLException) and 
> URLFilters.filter() (URLFilterException).
> try {
>     url = normalizers.normalize(url, scope);
> } catch (Exception e) {
>     LOG.warn("Skipping {}: ", url, e);
>     url = null;
> }
> try {
>     url = filters.filter(url);
> } catch (Exception e) {
>     LOG.warn("Skipping {}: ", url, e);
>     url = null;
> }
> *Problem:*
> Any {{RuntimeException}} (e.g., {{{}NullPointerException{}}}, 
> {{{}IllegalArgumentException{}}}, {{{}ArrayIndexOutOfBoundsException{}}}) 
> thrown by a buggy normalizer/filter plugin gets caught, logged as a WARN, and 
> the URL is silently nulled out — counted as "filtered." This has two 
> consequences:
>  # *Silent data loss* — legitimate URLs are dropped from CrawlDb not because 
> they failed normalization/filtering, but because of an unrelated bug in a 
> plugin. The operator sees a WARN log but the URL is gone with no distinction 
> between "bad URL" and "broken plugin."
>  # *Bug masking* — {{{}RuntimeException{}}}s typically indicate programming 
> errors. Swallowing them makes it significantly harder to detect and diagnose 
> faulty normalizer/filter implementations, especially at scale where WARN logs 
> get lost in noise.
>  
> Raising as critical since this can lead to data loss.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to