tanmay created NUTCH-3170:
-----------------------------

             Summary: Improve in-memory outlink deduplication: Replace 
HashSet/hashCode with BloomFilter
                 Key: NUTCH-3170
                 URL: https://issues.apache.org/jira/browse/NUTCH-3170
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
         Environment: strong text
            Reporter: tanmay


Currently, when the property fetcher.follow.outlinks.depth > 0 is used, the 
Fetcher uses an in-memory Set<Integer> named alreadyFetched to track which 
outlinks have already been followed during the current execution cycle. It 
stores url.toString().hashCode() (a 32-bit Java hash) inside a HashSet.

As mentioned in the comments in FetchItemQueue.java (lines 56-58), the core 
perspective behind using hashCode() instead of the raw String was strictly to 
save memory and avoid OutOfMemory exceptions at scale:
// keep track of duplicates if fetcher.follow.outlinks.depth > 0. Some urls may 
// not get followed due to hash collisions. Hashing is used to reduce memory 
usage.
While we know that Nutch's map-reduce CrawlDB (updatedb) handles these 
collision omissions reliably in the next round, the 32-bit HashCode 
implementation presents an unnecessary bottleneck during the immediate fetch 
cycle due to inevitable hash collisions, which cause valid URLs to be skipped 
arbitrarily.

*Proposed Solution:* We can vastly improve both the memory efficiency and the 
collision rate by replacing the HashSet<Integer> with a Bloom Filter.

Why Bloom Filter is the perfect fit for this project:

Unmatched Space Efficiency: A HashSet<Integer> incurs standard Java object 
overheads (~32 bytes per entry). A Bloom Filter operates strictly on bits. 
Tracking 1,000,000 URLs translates to barely ~1-2 MBs of RAM usage. It strictly 
aligns with the original developer's goal of "reducing memory usage".

Controlled Collision Rates: Instead of blindly accepting the high collision 
rate of a single 32-bit hash, a Bloom Filter allows us to explicitly define a 
False Positive Probability (e.g., 0.001% or 0.01%). This makes the immediate 
fetch optimization highly deterministic.
Speed: K-hashes in a Bloom Filter are computed efficiently with O(k) complexity.

I am aware that a Bloom Filter comes with a specific trade-off: elements cannot 
be deleted. However, since the FetcherQueue short-term cache only requires 
contains() and add() operations (without any removals during a fetcher cycle), 
a Bloom Filter is the ideal data structure for this use case.

I would like to contribute a patch for this.





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to