[
https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051427#comment-18051427
]
ASF GitHub Bot commented on NUTCH-2455:
---------------------------------------
lewismc opened a new pull request, #888:
URL: https://github.com/apache/nutch/pull/888
This PR proposes a fix for
[NUTCH-2455](https://issues.apache.org/jira/browse/NUTCH-2455) and
supersedes https://github.com/apache/nutch/pull/254/.
It implements scalable HostDb integration in the Generator
using MapReduce secondary sorting, eliminating the need to load the entire
HostDb into memory.
## Problem
The previous implementation loaded the entire HostDb into memory at reducer
startup. For crawls with millions of hosts, this caused:
- High memory consumption (O(HostDb size) per reducer)
- OutOfMemoryError for large HostDbs
- Startup latency while loading data
## Solution
Use MapReduce secondary sorting to stream HostDb entries through the
pipeline:
1. **Composite Key (`FloatTextPair`)**: Combines score and hostname to
enable sorting
2. **Custom Comparator (`ScoreHostKeyComparator`)**: Ensures HostDb entries
arrive before CrawlDb entries
3. **MultipleInputs**: Reads both HostDb and CrawlDb in a single MapReduce
job
4. **Streaming Reducer**: Processes HostDb entries as they arrive, no
preloading required
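The streaming idea in step 4 can be sketched in plain Java, without Hadoop. This is a minimal sketch under stated assumptions, not the PR's reducer: `StreamingReducerSketch`, `Rec`, `select`, and `defaultMax` are hypothetical names, the input is assumed to be already sorted so that HostDb records precede CrawlDb records, and the per-host limit is a plain integer rather than the result of a JEXL expression.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a reducer that consumes a pre-sorted stream.
final class StreamingReducerSketch {

  // One record of the merged stream (assumption: simplified from the real value types).
  static final class Rec {
    final boolean fromHostDb;
    final String host;
    final String url;   // null for HostDb records
    final int maxCount; // per-host limit; only meaningful for HostDb records

    private Rec(boolean fromHostDb, String host, String url, int maxCount) {
      this.fromHostDb = fromHostDb;
      this.host = host;
      this.url = url;
      this.maxCount = maxCount;
    }

    static Rec hostDb(String host, int maxCount) {
      return new Rec(true, host, null, maxCount);
    }

    static Rec crawlDb(String host, String url) {
      return new Rec(false, host, url, 0);
    }
  }

  // Single pass over the stream: HostDb records set limits, CrawlDb records are
  // selected until their host's limit is reached. Nothing is preloaded from disk.
  static List<String> select(List<Rec> sortedInput, int defaultMax) {
    Map<String, Integer> limits = new HashMap<>();
    Map<String, Integer> counts = new HashMap<>();
    List<String> selected = new ArrayList<>();
    for (Rec r : sortedInput) {
      if (r.fromHostDb) {
        limits.put(r.host, r.maxCount);
        continue;
      }
      int limit = limits.getOrDefault(r.host, defaultMax);
      int seen = counts.getOrDefault(r.host, 0);
      if (seen < limit) {
        counts.put(r.host, seen + 1);
        selected.add(r.url);
      }
    }
    return selected;
  }
}
```

In this sketch the reducer only remembers the limits for hosts that actually reach it, which is the property the secondary sort is meant to provide; how the PR itself tracks per-host state may differ.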
## Key Components
### FloatTextPair
```java
public static class FloatTextPair
    implements WritableComparable<FloatTextPair> {
  public FloatWritable first; // score (negative for HostDb)
  public Text second;         // hostname (empty for CrawlDb)
  // write(), readFields() and compareTo() omitted for brevity
}
```
### ScoreHostKeyComparator
Sorting order:
1. HostDb entries first (non-empty hostname), sorted by hostname
2. CrawlDb entries second (empty hostname), sorted by score descending
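The ordering above can be illustrated with a plain `java.util.Comparator`. This is a sketch only: `SortOrderSketch` and `Key` are hypothetical stand-ins for the Hadoop-based `FloatTextPair` and `ScoreHostKeyComparator`, and the real comparator may well operate on serialized bytes instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java stand-in for the composite key and its sort comparator.
final class SortOrderSketch {

  static final class Key {
    final float score; // only meaningful for CrawlDb entries in this sketch
    final String host; // non-empty for HostDb entries, empty for CrawlDb entries

    Key(float score, String host) {
      this.score = score;
      this.host = host;
    }

    boolean isHostDb() {
      return !host.isEmpty();
    }
  }

  // HostDb keys first (by hostname), then CrawlDb keys by score descending.
  static final Comparator<Key> ORDER = (a, b) -> {
    if (a.isHostDb() != b.isHostDb()) {
      return a.isHostDb() ? -1 : 1;
    }
    if (a.isHostDb()) {
      return a.host.compareTo(b.host);
    }
    return Float.compare(b.score, a.score); // higher score sorts earlier
  };

  // Returns a sorted copy so the caller's list is left untouched.
  static List<Key> sorted(List<Key> keys) {
    List<Key> copy = new ArrayList<>(keys);
    copy.sort(ORDER);
    return copy;
  }

  public static void main(String[] args) {
    List<Key> keys = Arrays.asList(
        new Key(0.5f, ""),
        new Key(-Float.MAX_VALUE, "b.example.org"),
        new Key(2.0f, ""),
        new Key(-Float.MAX_VALUE, "a.example.org"));
    for (Key k : sorted(keys)) {
      System.out.println(k.isHostDb() ? "HostDb  " + k.host : "CrawlDb " + k.score);
    }
  }
}
```

Running `main` lists the two HostDb keys alphabetically, followed by the CrawlDb keys from highest to lowest score.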
### HostDbReaderMapper
Reads the HostDb and emits each entry under a sentinel key (the most negative
finite score plus the hostname) so that it sorts ahead of the CrawlDb
entries:
```java
context.write(new FloatTextPair(-Float.MAX_VALUE, hostname), entry);
```
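A note on the sentinel value: `-Float.MAX_VALUE` is the most negative finite `float`, whereas `Float.MIN_VALUE` is the smallest positive one and would not work as a "lowest score" marker. The check below is a small illustration of that difference; `SentinelScoreSketch` and `sortsBefore` are hypothetical names, and the PR description does not spell out how the comparator uses the score component.

```java
// Illustrates why -Float.MAX_VALUE, not Float.MIN_VALUE, is the lowest finite score.
final class SentinelScoreSketch {

  static final float HOSTDB_SENTINEL = -Float.MAX_VALUE;

  // True if 'candidate' orders strictly before 'score' in ascending float order.
  static boolean sortsBefore(float candidate, float score) {
    return Float.compare(candidate, score) < 0;
  }

  public static void main(String[] args) {
    // The sentinel is below any ordinary finite score, including negative ones.
    System.out.println(sortsBefore(HOSTDB_SENTINEL, -1000f));
    // Float.MIN_VALUE is a tiny positive number, so it is not below zero.
    System.out.println(sortsBefore(Float.MIN_VALUE, 0f));
  }
}
```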
## Configuration
| Property | Description |
|----------|-------------|
| `generate.hostdb` | Path to HostDb (enables feature) |
| `generate.max.count.expr` | JEXL expression for per-host URL limit |
| `generate.fetch.delay.expr` | JEXL expression for per-host fetch delay |
### Example JEXL Expressions
```xml
<!
```
> Use secondary sorting for memory-efficient HostDb integration in Generator
> --------------------------------------------------------------------------
>
> Key: NUTCH-2455
> URL: https://issues.apache.org/jira/browse/NUTCH-2455
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Affects Versions: 1.13
> Reporter: Markus Jelsma
> Priority: Major
> Fix For: 1.22
>
> Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> ??The correct solution would be to use <host,score> pairs as keys in the
> Selector job, with a partitioner and secondary sorting so that all keys with
> same host end up in the same call of the reducer. If values can also hold a
> HostDb entry and the sort comparator guarantees that the HostDb entry
> (entries if partitioned by domain or IP) comes in front of all CrawlDb
> entries. But that would be a substantial improvement...??