[ https://issues.apache.org/jira/browse/NUTCH-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18071746#comment-18071746 ]
Sebastian Nagel commented on NUTCH-3164:
----------------------------------------
Thanks, [~prakharchaube], for reporting the issue!
What is your proposed solution?
1. Fail the job by passing the exception through?
2. Or log it more prominently, and possibly report it in a job counter of
unexpected exceptions?
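
To make the two options concrete, here is a minimal, self-contained sketch. It is not the actual CrawlDbFilter code: UrlStep, failFast, countAndSkip and the unexpectedExceptions counter are hypothetical stand-ins (an AtomicLong in place of a Hadoop job counter), and only MalformedURLException is modelled as the declared checked exception.

```java
import java.net.MalformedURLException;
import java.util.concurrent.atomic.AtomicLong;

class CatchPolicySketch {

  /** Hypothetical stand-in for a normalizer or filter plugin call. */
  interface UrlStep {
    String apply(String url) throws MalformedURLException;
  }

  /** Hypothetical stand-in for a job counter of unexpected exceptions. */
  static final AtomicLong unexpectedExceptions = new AtomicLong();

  /** Option 1: catch only the declared checked exception; runtime exceptions propagate. */
  static String failFast(UrlStep step, String url) {
    try {
      return step.apply(url);
    } catch (MalformedURLException e) {
      return null; // the URL itself is bad: skip it
    }
  }

  /** Option 2: still skip the URL, but count plugin failures separately. */
  static String countAndSkip(UrlStep step, String url) {
    try {
      return step.apply(url);
    } catch (MalformedURLException e) {
      return null; // the URL itself is bad: skip it
    } catch (RuntimeException e) {
      unexpectedExceptions.incrementAndGet(); // visible after the job, unlike a WARN line
      return null;
    }
  }

  public static void main(String[] args) {
    // A plugin with a programming error, simulated by a runtime exception.
    UrlStep buggy = url -> { throw new IllegalStateException("plugin bug"); };
    System.out.println(countAndSkip(buggy, "http://example.org/"));
    System.out.println(unexpectedExceptions.get());
  }
}
```

With failFast, the same buggy step would throw out of the method and, in a real job, fail the task. With countAndSkip the URL is still dropped, but the operator can tell "bad URL" from "broken plugin" by checking the counter.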
> Generic exceptions in catch block may lead to deletion of links from crawldb
> ----------------------------------------------------------------------------
>
> Key: NUTCH-3164
> URL: https://issues.apache.org/jira/browse/NUTCH-3164
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 1.22
> Reporter: Prakhar Chaube
> Priority: Critical
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> In CrawlDbFilter.java (lines ~107-121), both the URL normalization and URL
> filtering blocks catch Exception instead of the specific checked exceptions
> declared by URLNormalizers.normalize() (MalformedURLException) and
> URLFilters.filter() (URLFilterException).
> try {
>   url = normalizers.normalize(url, scope);
> } catch (Exception e) {
>   LOG.warn("Skipping {}: ", url, e);
>   url = null;
> }
> try {
>   url = filters.filter(url);
> } catch (Exception e) {
>   LOG.warn("Skipping {}: ", url, e);
>   url = null;
> }
> *Problem:*
> Any {{RuntimeException}} (e.g., {{NullPointerException}},
> {{IllegalArgumentException}}, {{ArrayIndexOutOfBoundsException}})
> thrown by a buggy normalizer/filter plugin gets caught, logged as a WARN, and
> the URL is silently nulled out and counted as "filtered." This has two
> consequences:
> # *Silent data loss*: legitimate URLs are dropped from CrawlDb not because
> they failed normalization/filtering, but because of an unrelated bug in a
> plugin. The operator sees a WARN log but the URL is gone with no distinction
> between "bad URL" and "broken plugin."
> # *Bug masking*: {{RuntimeException}}s typically indicate programming
> errors. Swallowing them makes it significantly harder to detect and diagnose
> faulty normalizer/filter implementations, especially at scale where WARN logs
> get lost in noise.
>
> Raising as critical since this can lead to data loss.