lewismc opened a new pull request, #882: URL: https://github.com/apache/nutch/pull/882
See [NUTCH-3142](https://issues.apache.org/jira/browse/NUTCH-3142) for background.

This PR implements **Missing Error Context** (recommendation #8) from the Nutch Hadoop Metrics Analysis report. It introduces a centralized `ErrorTracker` utility that categorizes errors by type and emits structured Hadoop counters, replacing the previous approach of counting errors without categorization.

## Changes

### New Files

- **`src/java/org/apache/nutch/metrics/ErrorTracker.java`** - Thread-safe error categorization utility that:
  - Defines 8 error categories: `NETWORK`, `PROTOCOL`, `PARSING`, `URL`, `SCORING`, `INDEXING`, `TIMEOUT`, `OTHER`
  - Automatically categorizes exceptions based on type and class name
  - Supports cached counters for performance in hot paths
  - Provides both local accumulation (`recordError`/`emitCounters`) and direct increment (`incrementCounters`) APIs
- **`src/test/org/apache/nutch/metrics/TestErrorTracker.java`** - Comprehensive test suite with 26 tests covering:
  - Exception categorization for all error types
  - Nutch-specific exceptions (ProtocolException, ParseException, ScoringFilterException, etc.)
  - Cached counter initialization and usage
  - Thread safety
  - Nested cause chain handling

### Modified Files

#### Metrics Constants (`NutchMetrics.java`)

- Added standard error counter constants: `ERROR_TOTAL`, `ERROR_NETWORK_TOTAL`, `ERROR_PROTOCOL_TOTAL`, `ERROR_PARSING_TOTAL`, `ERROR_URL_TOTAL`, `ERROR_SCORING_TOTAL`, `ERROR_INDEXING_TOTAL`, `ERROR_TIMEOUT_TOTAL`, `ERROR_OTHER_TOTAL`
- Removed redundant component-specific error counters (which I initially introduced in #871); these are now handled by `ErrorTracker`

#### Component Integrations

| Component | File | Changes |
|-----------|------|---------|
| Fetcher | `FetcherThread.java`, `Fetcher.java` | Integrated `ErrorTracker` for fetch error categorization |
| Parser | `ParseSegment.java` | Added error tracking for parsing and scoring exceptions |
| Indexer | `IndexerMapReduce.java` | Replaced `errorsScoringFilterCounter` and `errorsIndexingFilterCounter` with `ErrorTracker` |
| Generator | `Generator.java` | Replaced URL filter and malformed URL counters with `ErrorTracker` |
| Injector | `Injector.java` | Added error tracking for URL processing exceptions |
| CrawlDb | `CrawlDbReducer.java` | Added error tracking for scoring filter exceptions |
| HostDb | `UpdateHostDbMapper.java`, `ResolverThread.java` | Replaced `malformedUrlCounter` with `ErrorTracker`; added DNS resolution error tracking |
| Sitemap | `SitemapProcessor.java` | Added error tracking for sitemap processing exceptions |
| WARC | `WARCExporter.java` | Replaced `exceptionCounter` and `invalidUriCounter` with `ErrorTracker` |

#### Dependencies (`ivy/ivy.xml`)

- Added `mockito-core` and `mockito-junit-jupiter` (v5.18.0) as test dependencies. I had considered doing this in some previous PRs but didn't want to introduce new dependencies to the project. In this case, it made for much cleaner, more intuitive tests.

## Benefits

1. **Better Debugging**: Errors are now categorized by type, making it easier to identify patterns
2. **Reduced Counter Cardinality**: Uses a fixed set of error categories (~10 counters) instead of an unbounded number of component-specific counters
3. **Consistent API**: All components use the same error tracking mechanism
4. **Performance**: Cached counters avoid repeated lookups in hot paths, consistent with #878
5. **Thread Safety**: `ConcurrentHashMap` ensures safe concurrent access

I've incorporated these new counters locally into the [nutch-grafana-resources](https://github.com/lewismc/nutch-grafana-resources) collector configuration and dashboards and will push those updates separately.

This patch is best tested by inspecting the Hadoop counters in STDOUT or the job logs.
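
For readers unfamiliar with the approach, the following is a minimal, self-contained sketch of how categorization by exception type and class name, cause-chain walking, and thread-safe local accumulation could fit together. It is not the `ErrorTracker` code from this PR: the class `ErrorCategorySketch`, the methods `categorize`, `record`, and `snapshot`, the matching order, and the `errors_*_total` counter names are all assumptions made for illustration, and the sketch returns a plain map where the real utility emits Hadoop counters.

```java
import java.net.ConnectException;
import java.net.MalformedURLException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch only; not the ErrorTracker class added by this PR. */
final class ErrorCategorySketch {

  /** The eight categories listed in the PR description. */
  public enum ErrorType {
    NETWORK, PROTOCOL, PARSING, URL, SCORING, INDEXING, TIMEOUT, OTHER
  }

  // Local per-category tallies; ConcurrentHashMap plus LongAdder keeps updates thread-safe.
  private final Map<ErrorType, LongAdder> counts = new ConcurrentHashMap<>();

  /** Walks the cause chain and returns the first category that is not OTHER. */
  public static ErrorType categorize(Throwable t) {
    int depth = 0;
    // The depth limit guards against pathological (cyclic) cause chains.
    for (Throwable cur = t; cur != null && depth < 16; cur = cur.getCause(), depth++) {
      ErrorType type = categorizeSingle(cur);
      if (type != ErrorType.OTHER) {
        return type;
      }
    }
    return ErrorType.OTHER;
  }

  private static ErrorType categorizeSingle(Throwable t) {
    // Type-based checks for well-known JDK exceptions (assumed ordering).
    if (t instanceof SocketTimeoutException) return ErrorType.TIMEOUT;
    if (t instanceof MalformedURLException) return ErrorType.URL;
    if (t instanceof UnknownHostException || t instanceof ConnectException) {
      return ErrorType.NETWORK;
    }
    // Name-based fallback, e.g. for exception classes not on this sketch's classpath.
    String name = t.getClass().getSimpleName();
    if (name.contains("Timeout")) return ErrorType.TIMEOUT;
    if (name.contains("Protocol")) return ErrorType.PROTOCOL;
    if (name.contains("Pars")) return ErrorType.PARSING;
    if (name.contains("Scoring")) return ErrorType.SCORING;
    if (name.contains("Indexing")) return ErrorType.INDEXING;
    return ErrorType.OTHER;
  }

  /** Accumulates one error locally (stand-in for the local accumulation path). */
  public void record(Throwable t) {
    counts.computeIfAbsent(categorize(t), k -> new LongAdder()).increment();
  }

  /** Returns counter name to count, including an overall total (hypothetical names). */
  public Map<String, Long> snapshot() {
    Map<String, Long> out = new TreeMap<>();
    long total = 0;
    for (Map.Entry<ErrorType, LongAdder> e : counts.entrySet()) {
      long n = e.getValue().sum();
      out.put("errors_" + e.getKey().name().toLowerCase(Locale.ROOT) + "_total", n);
      total += n;
    }
    out.put("errors_total", total);
    return out;
  }

  public static void main(String[] args) {
    ErrorCategorySketch tracker = new ErrorCategorySketch();
    tracker.record(new UnknownHostException("example.invalid"));
    tracker.record(new RuntimeException("wrapped", new SocketTimeoutException("read timed out")));
    tracker.record(new IllegalStateException("unexpected"));
    System.out.println(tracker.snapshot());
  }
}
```

Because the set of categories is fixed by the enum, the number of emitted counters stays bounded no matter how many components report errors, which is the cardinality benefit described above.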

