[
https://issues.apache.org/jira/browse/NUTCH-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083023#comment-18083023
]
ASF GitHub Bot commented on NUTCH-3164:
---------------------------------------
prakharchaube opened a new pull request, #918:
URL: https://github.com/apache/nutch/pull/918
## Summary
Both catch blocks in `CrawlDbFilter.map()` caught generic `Exception` and
set `url = null`, which silently dropped URLs both for legitimate filtering
reasons and for plugin
programming errors (NPE, etc.). The latter masked plugin bugs as ordinary
filtering decisions.
## Changes
**Normalizer block**
- `MalformedURLException` — the only legitimate reason to drop. Tracked
via `ErrorTracker` (`ErrorType.URL`) and no longer increments
`urlsFilteredCounter`, which conflated
filtering with malformed input.
- `RuntimeException` — logged at ERROR and tracked, URL is **not** dropped
so plugin bugs do not silently delete data.
**Filter block**
- `URLFilterException` — per the `URLFilter` contract, reserved for
internal filter failures (rejection is signaled by returning `null`). Logged at
ERROR and tracked, URL is
**not** dropped.
- `RuntimeException` — same handling as above.
All error paths now use `ErrorTracker` for categorized counters and log at
ERROR rather than WARN, per [@lewismc's recommendation in the JIRA
discussion](https://issues.apache.org/jira/browse/NUTCH-3164).
## JIRA
[NUTCH-3164](https://issues.apache.org/jira/browse/NUTCH-3164)
## Test plan
- [x] `ant compile` passes locally
- [ ] No new unit tests in this PR; happy to add one if reviewers want a
mock plugin throwing each exception type (flag as follow-up otherwise)
## Out of scope (separate tickets if desired)
- Same `catch (Exception)` pattern exists in `Injector.filterNormalize`
and ~20 other call sites of `normalizers.normalize` / `filters.filter` — per
lewismc comment, those
should be rolled out separately.
- Whether to track normalized URLs as a metric system-wide.
> Generic exceptions in catch block may lead to deletion of links from crawldb
> ----------------------------------------------------------------------------
>
> Key: NUTCH-3164
> URL: https://issues.apache.org/jira/browse/NUTCH-3164
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 1.22
> Reporter: Prakhar Chaube
> Priority: Critical
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> In CrawlDbFilter.java (lines ~107-121), both the URL normalization and URL
> filtering blocks catch Exception instead of the specific checked exceptions
> declared by URLNormalizers.normalize() (MalformedURLException) and
> URLFilters.filter() (URLFilterException).
> try {
> url = normalizers.normalize(url, scope);
> } catch (Exception e) {
> LOG.warn("Skipping {}: ", url, e);
> url = null;
> }
> try {
> url = filters.filter(url);
> } catch (Exception e) {
> LOG.warn("Skipping {}: ", url, e);
> url = null;
> }
> *Problem:*
> Any {{RuntimeException}} (e.g., {{{}NullPointerException{}}},
> {{{}IllegalArgumentException{}}}, {{{}ArrayIndexOutOfBoundsException{}}})
> thrown by a buggy normalizer/filter plugin gets caught, logged as a WARN, and
> the URL is silently nulled out — counted as "filtered." This has two
> consequences:
> # *Silent data loss* — legitimate URLs are dropped from CrawlDb not because
> they failed normalization/filtering, but because of an unrelated bug in a
> plugin. The operator sees a WARN log but the URL is gone with no distinction
> between "bad URL" and "broken plugin."
> # *Bug masking* — {{{}RuntimeException{}}}s typically indicate programming
> errors. Swallowing them makes it significantly harder to detect and diagnose
> faulty normalizer/filter implementations, especially at scale where WARN logs
> get lost in noise.
>
> Raising as critical since this can lead to data loss.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)