[ https://issues.apache.org/jira/browse/NUTCH-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18071752#comment-18071752 ]
Prakhar Chaube commented on NUTCH-3164:
---------------------------------------
Hi Sebastian,
this is what I propose:
1. For expected exceptions such as MalformedURLException, which normalize()
throws intentionally, we continue to skip the URL. For filter(), a
URLFilterException does not directly indicate that the URL is inappropriate,
so I am not sure that skipping the URL when it is thrown is a good option.
Let me know.
2. Add another catch block for generic exceptions, where we log and increment
the counter but do not set url = null.
Let me know your thoughts.
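To make the proposal concrete, here is a minimal, self-contained sketch of the normalization step only. It is not the actual CrawlDbFilter code: the Normalizer interface, the normalizeOrKeep method and both counters are hypothetical names used for illustration, and System.err stands in for the logger. The filter step would follow the same shape once the URLFilterException question above is settled.

```java
import java.net.MalformedURLException;
import java.util.concurrent.atomic.AtomicLong;

public class NormalizeSketch {

  /** Hypothetical stand-in for URLNormalizers.normalize(url, scope). */
  public interface Normalizer {
    String normalize(String url) throws MalformedURLException;
  }

  // Hypothetical counters for illustration; not existing Nutch counter names.
  public static final AtomicLong malformedSkipped = new AtomicLong();
  public static final AtomicLong unexpectedErrors = new AtomicLong();

  /**
   * Returns the normalized URL, null if the URL is malformed, or the
   * original URL unchanged if the normalizer failed unexpectedly.
   */
  public static String normalizeOrKeep(String url, Normalizer normalizer) {
    try {
      return normalizer.normalize(url);
    } catch (MalformedURLException e) {
      // Expected failure: the URL itself is bad, so skip it as before.
      System.err.println("Skipping " + url + ": " + e);
      malformedSkipped.incrementAndGet();
      return null;
    } catch (Exception e) {
      // Unexpected failure (likely a plugin bug): log and count it,
      // but keep the URL instead of setting it to null.
      System.err.println("Normalizer failed for " + url + ", keeping it: " + e);
      unexpectedErrors.incrementAndGet();
      return url;
    }
  }

  public static void main(String[] args) {
    // A normalizer with a bug: the URL is kept and the error is counted.
    String kept = normalizeOrKeep("http://example.org/", u -> {
      throw new IllegalStateException("plugin bug");
    });
    // A normalizer rejecting a malformed URL: the URL is skipped (null).
    String skipped = normalizeOrKeep("not a url", u -> {
      throw new MalformedURLException("no protocol");
    });
    System.out.println("kept=" + kept + ", skipped=" + skipped);
  }
}
```

With this shape, a broken plugin shows up in a separate counter rather than in the count of filtered URLs, and the affected URL stays in the CrawlDb.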
> Generic exceptions in catch block may lead to deletion of links from crawldb
> ----------------------------------------------------------------------------
>
> Key: NUTCH-3164
> URL: https://issues.apache.org/jira/browse/NUTCH-3164
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 1.22
> Reporter: Prakhar Chaube
> Priority: Critical
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> In CrawlDbFilter.java (lines ~107-121), both the URL normalization and URL
> filtering blocks catch Exception instead of the specific checked exceptions
> declared by URLNormalizers.normalize() (MalformedURLException) and
> URLFilters.filter() (URLFilterException).
> try {
>   url = normalizers.normalize(url, scope);
> } catch (Exception e) {
>   LOG.warn("Skipping {}: ", url, e);
>   url = null;
> }
> try {
>   url = filters.filter(url);
> } catch (Exception e) {
>   LOG.warn("Skipping {}: ", url, e);
>   url = null;
> }
> *Problem:*
> Any {{RuntimeException}} (e.g., {{NullPointerException}},
> {{IllegalArgumentException}}, {{ArrayIndexOutOfBoundsException}})
> thrown by a buggy normalizer/filter plugin gets caught, logged as a WARN, and
> the URL is silently nulled out and counted as "filtered." This has two
> consequences:
> # *Silent data loss*: legitimate URLs are dropped from CrawlDb not because
> they failed normalization/filtering, but because of an unrelated bug in a
> plugin. The operator sees a WARN log but the URL is gone with no distinction
> between "bad URL" and "broken plugin."
> # *Bug masking*: {{RuntimeException}}s typically indicate programming
> errors. Swallowing them makes it significantly harder to detect and diagnose
> faulty normalizer/filter implementations, especially at scale where WARN logs
> get lost in noise.
>
> Raising as critical since this can lead to data loss.