[ 
https://issues.apache.org/jira/browse/NUTCH-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18074113#comment-18074113
 ] 

Lewis John McGibbney commented on NUTCH-3164:
---------------------------------------------

[~prakharchaube] good catch. A few comments
 # If url == null then that is not a warning, it is an error. We should use the 
[ErrorTracker|https://nutch.apache.org/documentation/javadoc/api/org/apache/nutch/metrics/ErrorTracker.html].
 # Regarding URL normalization: your proposal to skip sounds good. The 
urlsFilteredCounter is not the correct counter to increment so I think this is 
also a bug which we can fix in this case. Currently we don't count and track 
normalized URLs within the metrics system. It would be good for us to decide on 
whether we wish to track normalized URLs in this way. Currently it looks like 
[URL normalization is performed in at least 20 
files|https://github.com/search?q=repo%3Aapache%2Fnutch+%22normalizers.normalize%22&type=code]
 so whatever decision we make here could be rolled out across those files in a 
separate ticket.
 # Regarding URL filtering: I propose we log the error and track the error in 
ErrorTracker. The filtering decision (appropriate vs. inappropriate URL) is 
communicated through the return value of URLFilter.filter():
 ## Returns the URL string means it was accepted
 ## Returns null means it was rejected
...this is to say that no exception is involved in normal accept/reject logic. 
Any filter that decides a URL is unwanted simply returns null — that is the 
entire contract defined by the interface. URLFilterException is reserved for 
the case where the filter couldn't even make a decision due to an internal 
error. Hence my suggestion to use ErrorTracker. After some investigation it 
looks like the current implementations don't even throw it — they tend to log 
the operational error and return null, which conflates "I failed" with "I 
rejected this URL." That is arguably a separate design problem, but I would say 
it confirms (agrees with you) that URLFilterException was never intended to 
mean "inappropriate URL."

> Generic exceptions in catch block may lead to deletion of links from crawldb
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-3164
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3164
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.22
>            Reporter: Prakhar Chaube
>            Priority: Critical
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In CrawlDbFilter.java (lines ~107-121), both the URL normalization and URL 
> filtering blocks catch Exception instead of the specific checked exceptions 
> declared by URLNormalizers.normalize() (MalformedURLException) and 
> URLFilters.filter() (URLFilterException).
> try {
>     url = normalizers.normalize(url, scope);
> } catch (Exception e) {
>     LOG.warn("Skipping {}: ", url, e);
>     url = null;
> }
> try {
>     url = filters.filter(url);
> } catch (Exception e) {
>     LOG.warn("Skipping {}: ", url, e);
>     url = null;
> }
> *Problem:*
> Any {{RuntimeException}} (e.g., {{{}NullPointerException{}}}, 
> {{{}IllegalArgumentException{}}}, {{{}ArrayIndexOutOfBoundsException{}}}) 
> thrown by a buggy normalizer/filter plugin gets caught, logged as a WARN, and 
> the URL is silently nulled out — counted as "filtered." This has two 
> consequences:
>  # *Silent data loss* — legitimate URLs are dropped from CrawlDb not because 
> they failed normalization/filtering, but because of an unrelated bug in a 
> plugin. The operator sees a WARN log but the URL is gone with no distinction 
> between "bad URL" and "broken plugin."
>  # *Bug masking* — {{{}RuntimeException{}}}s typically indicate programming 
> errors. Swallowing them makes it significantly harder to detect and diagnose 
> faulty normalizer/filter implementations, especially at scale where WARN logs 
> get lost in noise.
>  
> Raising as critical since this can lead to data loss.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to