> Actually some kv storages use bloom filter for similar purpose.
>
> What is your queue size? And what is redirect rate?

There are 2500-5000 domain queues per fetcher and 20,000-40,000 fetch
items. We usually have around 8 URLs per domain. The redirect rate is
quite low. It doesn't happen that often, so it's not a very big deal,
just an inconvenience and something we might want to optimize.

> If most redirects are not crossdomain and average number of urls per
> domain is not very big some fixed size cache in FetchItemQueue may
> help. But this leads to lots of changes in fetcher.

I haven't seen crossdomain redirects yet, but it's possible. Just like
false positives, this is something we could live with.

Thanks for sharing your thoughts.

> On Tue 18 Oct 2011 05:01:06 PM MSK, Markus Jelsma wrote:
> > That sounds creepy indeed. It would still need a similar amount of RAM
> > plus network overhead. Would a bloom filter be useful at all? It takes a
> > lot less space and I can live with a non-deterministic approach.
> >
> > On Tuesday 18 October 2011 01:45:20 Sergey A Volkov wrote:
> >> Hi
> >>
> >> I think some external key-value storage may replace map. They are fast
> >> enough and overhead will be insignificant (for many threads)
> >> But this is very creepy solution.
> >>
> >> Sergey Volkov.
> >>
> >> On Tue 18 Oct 2011 03:15:33 AM MSK, Markus Jelsma wrote:
> >>> Anyone?
> >>>
> >>>> Hi,
> >>>>
> >>>> With a value > 0 for http.redirect.max there's a possibility for
> >>>> fetching and parsing duplicates, this is especially true for fetch
> >>>> lists with many domains, even with just a few (+10) records per
> >>>> domain/host queue.
> >>>>
> >>>> Assuming there's only one thread per queue, how can we use
> >>>> http.redirect.max and prevent fetch and parse of duplicates?
> >>>>
> >>>> I'm not a big fan of keeping a map of fetched records in memory as
> >>>> it'll blow up the heap. We can also not safely remove a record from
> >>>> the fetch queue as the queue feeder may not have finished and
> >>>> duplicates may still enter a queue.
> >>>>
> >>>> Any thoughts?
> >>>>
> >>>> Thanks,
> >>>> Markus
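
For reference, the Bloom filter idea from this thread could look roughly
like the sketch below. It is only an illustration, not code from the
Nutch fetcher: the class name SeenUrlFilter, the method names and the
FNV-1a based hashing are all assumptions made for the example. With the
upper figure of 40,000 fetch items and an assumed 1% false-positive rate,
the standard sizing formula gives about 383,000 bits, which is under
50 KB. A false positive would mean that a redirect target is wrongly
treated as already fetched and skipped.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical helper (not part of Nutch): fixed-size Bloom filter over URL strings. */
final class SeenUrlFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SeenUrlFilter(int expectedItems, double falsePositiveRate) {
        if (expectedItems <= 0 || falsePositiveRate <= 0.0 || falsePositiveRate >= 1.0) {
            throw new IllegalArgumentException("need expectedItems > 0 and 0 < rate < 1");
        }
        // Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions.
        double ln2 = Math.log(2);
        this.numBits = Math.max(64,
                (int) Math.ceil(-expectedItems * Math.log(falsePositiveRate) / (ln2 * ln2)));
        this.numHashes = Math.max(1, (int) Math.round((double) numBits / expectedItems * ln2));
        this.bits = new BitSet(numBits);
    }

    /** Marks the URL as seen. Returns true if it was possibly seen before (may be a false positive). */
    synchronized boolean checkAndAdd(String url) {
        long hash = fnv1a64(url.getBytes(StandardCharsets.UTF_8));
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32) | 1; // odd step, so the probe sequence never stalls on zero
        boolean seen = true;
        for (int i = 0; i < numHashes; i++) {
            // Double hashing: the i-th bit index is derived from h1 + i * h2.
            int idx = Math.floorMod(h1 + i * h2, numBits);
            if (!bits.get(idx)) {
                seen = false;
                bits.set(idx);
            }
        }
        return seen;
    }

    /** Returns false if the URL was definitely never added, true if it possibly was. */
    synchronized boolean mightContain(String url) {
        long hash = fnv1a64(url.getBytes(StandardCharsets.UTF_8));
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32) | 1;
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) {
                return false;
            }
        }
        return true;
    }

    /** 64-bit FNV-1a hash of the given bytes. */
    private static long fnv1a64(byte[] data) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }
}
```

In such a setup the fetcher would call checkAndAdd() on every URL it is
about to fetch, including redirect targets, and skip the fetch when the
call returns true. The methods are synchronized because several fetcher
threads would share one filter.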

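The fixed-size cache per FetchItemQueue suggested in the thread could be
sketched as a small LRU set. This is an assumption about how it might be
done, not existing Nutch code: RecentUrlCache and its methods are
hypothetical names, and wiring it into FetchItemQueue is left out. With
around 8 URLs per domain queue, a capacity of a few dozen entries per
queue would already cover same-domain redirects.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Nutch): bounded LRU set of recently seen URLs. */
final class RecentUrlCache {
    private final Map<String, Boolean> recent;

    RecentUrlCache(final int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.recent = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                // Evict the least recently used URL once the cache is over capacity.
                return size() > capacity;
            }
        };
    }

    /** Marks the URL as seen. Returns true if it was still in the cache. */
    synchronized boolean checkAndAdd(String url) {
        return recent.put(url, Boolean.TRUE) != null;
    }

    /** Number of URLs currently held. */
    synchronized int size() {
        return recent.size();
    }
}
```

Unlike a Bloom filter this never reports a URL that was not added, but it
forgets old URLs once the capacity is exceeded, so duplicates can slip
through on large queues or on cross-domain redirects.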
