tanmay created NUTCH-3170:
-----------------------------
Summary: Improve in-memory outlink deduplication: Replace
HashSet/hashCode with BloomFilter
Key: NUTCH-3170
URL: https://issues.apache.org/jira/browse/NUTCH-3170
Project: Nutch
Issue Type: Improvement
Components: fetcher
Environment: strong text
Reporter: tanmay
Currently, when the property fetcher.follow.outlinks.depth > 0 is used, the
Fetcher uses an in-memory Set<Integer> named alreadyFetched to track which
outlinks have already been followed during the current execution cycle. It
stores url.toString().hashCode() (a 32-bit Java hash) inside a HashSet.
As mentioned in the comments in FetchItemQueue.java (lines 56-58), the core
perspective behind using hashCode() instead of the raw String was strictly to
save memory and avoid OutOfMemory exceptions at scale:
// keep track of duplicates if fetcher.follow.outlinks.depth > 0. Some urls may
// not get followed due to hash collisions. Hashing is used to reduce memory
usage.
While we know that Nutch's map-reduce CrawlDB (updatedb) handles these
collision omissions reliably in the next round, the 32-bit HashCode
implementation presents an unnecessary bottleneck during the immediate fetch
cycle due to inevitable hash collisions, which cause valid URLs to be skipped
arbitrarily.
*Proposed Solution:* We can vastly improve both the memory efficiency and the
collision rate by replacing the HashSet<Integer> with a Bloom Filter.
Why Bloom Filter is the perfect fit for this project:
Unmatched Space Efficiency: A HashSet<Integer> incurs standard Java object
overheads (~32 bytes per entry). A Bloom Filter operates strictly on bits.
Tracking 1,000,000 URLs translates to barely ~1-2 MBs of RAM usage. It strictly
aligns with the original developer's goal of "reducing memory usage".
Controlled Collision Rates: Instead of blindly accepting the high collision
rate of a single 32-bit hash, a Bloom Filter allows us to explicitly define a
False Positive Probability (e.g., 0.001% or 0.01%). This makes the immediate
fetch optimization highly deterministic.
Speed: K-hashes in a Bloom Filter are computed efficiently with O(k) complexity.
I am aware that a Bloom Filter comes with a specific trade-off: elements cannot
be deleted. However, since the FetcherQueue short-term cache only requires
contains() and add() operations (without any removals during a fetcher cycle),
a Bloom Filter is the ideal data structure for this use case.
I would like to contribute a patch for this.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)