Actually, some key-value stores use a Bloom filter for a similar purpose.
What is your queue size? And what is the redirect rate?
If most redirects are not cross-domain and the average number of URLs per
domain is not very big, a fixed-size cache in FetchItemQueue may
help. But this would mean a lot of changes in the fetcher.
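
To make the fixed-size cache idea concrete, here is a rough sketch of a
bounded "seen URLs" set with least-recently-used eviction. SeenUrlCache and
markIfNew are hypothetical names, not existing Nutch classes or methods, and
how it would be wired into FetchItemQueue is left open.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: remembers at most `capacity` URLs
// and evicts the least recently used one when a new URL would exceed that.
class SeenUrlCache {
    private final Map<String, Boolean> seen;

    SeenUrlCache(final int capacity) {
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    // Returns true if the URL was not cached (and records it),
    // false if it is still in the cache, i.e. a likely duplicate.
    synchronized boolean markIfNew(String url) {
        if (seen.get(url) != null) {
            return false;
        }
        seen.put(url, Boolean.TRUE);
        return true;
    }
}
```

A redirect target would be fetched only when markIfNew returns true. Evicted
URLs can still be fetched twice, so this only reduces duplicates, it does not
rule them out.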
On Tue 18 Oct 2011 05:01:06 PM MSK, Markus Jelsma wrote:
That sounds creepy indeed. It would still need a similar amount of RAM plus
network overhead. Would a Bloom filter be useful at all? It takes a lot less
space, and I can live with a non-deterministic approach.
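
For reference, a minimal sketch of what such a Bloom filter could look like.
UrlBloomFilter is a hypothetical class, not something Nutch provides, and the
sizes passed to the constructor are assumptions that would need tuning. It
can report a URL as seen when it was not (a false positive, so that URL
would be skipped), but never the other way around.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical helper, not part of Nutch: a fixed-size Bloom filter for URLs.
class UrlBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    UrlBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // 32-bit FNV-1a hash, with the offset basis perturbed by a seed.
    private static int fnv1a(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    // Double hashing: the i-th bit position is derived from two base hashes.
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Sets the bits for this URL. Returns true if at least one bit was
    // previously clear, meaning the URL was definitely not added before.
    synchronized boolean add(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0);
        int h2 = fnv1a(data, 0x5bd1e995);
        boolean wasNew = false;
        for (int i = 0; i < numHashes; i++) {
            int idx = index(h1, h2, i);
            if (!bits.get(idx)) {
                wasNew = true;
                bits.set(idx);
            }
        }
        return wasNew;
    }

    // Returns false only if the URL was definitely never added.
    synchronized boolean mightContain(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0);
        int h2 = fnv1a(data, 0x5bd1e995);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Memory is fixed at numBits bits no matter how many URLs are added; the cost
is a false-positive rate that grows as the filter fills up.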
On Tuesday 18 October 2011 01:45:20 Sergey A Volkov wrote:
Hi
I think some external key-value storage could replace the map. They are fast
enough, and the overhead would be insignificant (for many threads).
But this is a very creepy solution.
Sergey Volkov.
On Tue 18 Oct 2011 03:15:33 AM MSK, Markus Jelsma wrote:
Anyone?
Hi,
With a value > 0 for http.redirect.max there's a possibility of
fetching and parsing duplicates. This is especially true for fetch
lists with many domains, even with just a few (10+) records per
domain/host queue.
Assuming there's only one thread per queue, how can we use
http.redirect.max and prevent fetch and parse of duplicates?
I'm not a big fan of keeping a map of fetched records in memory, as it'll
blow up the heap. We also cannot safely remove a record from the fetch
queue, as the queue feeder may not have finished and duplicates may still
enter a queue.
Any thoughts?
Thanks,
Markus