> Actually, some key-value stores use a Bloom filter for a similar purpose.
> 
> What is your queue size? And what is the redirect rate?

There are 2500-5000 domain queues per fetcher and 20,000-40,000 fetch items. 
We usually have around 8 URLs per domain. The redirect rate is quite low; it 
doesn't happen often, so it's not a big deal, just an inconvenience and 
something we might want to optimize.
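
To make the Bloom filter idea concrete, here is a minimal sketch in plain Java. 
It is not Nutch code: the class name UrlBloomFilter, the checkAndAdd method and 
the FNV-1a based hashing are all assumptions made for illustration. With the 
standard sizing formulas, 40,000 items at a 1% false positive rate need roughly 
48 KB of bits, far less than a map of URL strings.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical helper, not part of Nutch: a small Bloom filter over URL strings.
final class UrlBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    // expectedItems and falsePositiveRate are sizing assumptions chosen by the caller.
    UrlBloomFilter(int expectedItems, double falsePositiveRate) {
        double ln2 = Math.log(2);
        // Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        this.numBits = Math.max(1,
            (int) Math.ceil(-expectedItems * Math.log(falsePositiveRate) / (ln2 * ln2)));
        this.numHashes = Math.max(1,
            (int) Math.round((double) numBits / expectedItems * ln2));
        this.bits = new BitSet(numBits);
    }

    // Returns true if the URL was possibly seen before (false positives can
    // happen, false negatives cannot). The URL is recorded either way.
    synchronized boolean checkAndAdd(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1; // odd stride for double hashing
        boolean seen = true;
        for (int i = 0; i < numHashes; i++) {
            int idx = Math.floorMod(h1 + i * h2, numBits);
            if (!bits.get(idx)) {
                seen = false;
                bits.set(idx);
            }
        }
        return seen;
    }

    // 32-bit FNV-1a with a caller-supplied starting value.
    private static int fnv1a(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Approximate memory used by the bit array.
    int sizeInBytes() {
        return (numBits + 7) / 8;
    }
}
```

A fetcher thread would call checkAndAdd(url) before following a redirect and 
skip the target when it returns true. A false positive then means an 
occasional skipped URL, which is the trade-off discussed in this thread.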

> 
> If most redirects are not cross-domain and the average number of URLs per
> domain is not very big, a fixed-size cache in FetchItemQueue may
> help. But this leads to lots of changes in the fetcher.

I haven't seen cross-domain redirects yet, but they're possible. Just like 
false positives, this is something we could live with.
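
For comparison, the fixed-size cache per queue could look roughly like the 
sketch below. Again this is not Nutch code: RecentUrlCache and checkAndAdd are 
hypothetical names, and how such a cache would be wired into FetchItemQueue is 
left open. It only remembers the most recent URLs of one queue, so it misses 
cross-domain redirects and anything already evicted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: remembers the last N URLs of one queue.
final class RecentUrlCache {
    private final Map<String, Boolean> recent;

    RecentUrlCache(final int capacity) {
        // accessOrder = true makes the map drop the least recently used entry
        // once the capacity is exceeded.
        this.recent = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    // Returns true if the URL is still in the cache. The URL is recorded either way.
    synchronized boolean checkAndAdd(String url) {
        return recent.put(url, Boolean.TRUE) != null;
    }
}
```

With around 8 URLs per domain, a capacity of a few dozen entries per queue 
would probably be enough, but that number is a guess.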

Thanks for sharing your thoughts.

> 
> On Tue 18 Oct 2011 05:01:06 PM MSK, Markus Jelsma wrote:
> > That sounds creepy indeed. It would still need a similar amount of RAM
> > plus network overhead. Would a Bloom filter be useful at all? It takes a
> > lot less space and I can live with a non-deterministic approach.
> > 
> > On Tuesday 18 October 2011 01:45:20 Sergey A Volkov wrote:
> >> Hi
> >> 
> >> I think some external key-value storage may replace the map. They are fast
> >> enough and the overhead will be insignificant (for many threads).
> >> But this is a very creepy solution.
> >> 
> >> Sergey Volkov.
> >> 
> >> On Tue 18 Oct 2011 03:15:33 AM MSK, Markus Jelsma wrote:
> >>> Anyone?
> >>> 
> >>>> Hi,
> >>>> 
> >>>> With a > 0 value for http.redirect.max there's a possibility of
> >>>> fetching and parsing duplicates. This is especially true for fetch
> >>>> lists with many domains, even with just a few (+10) records per
> >>>> domain/host queue.
> >>>> 
> >>>> Assuming there's only one thread per queue, how can we use
> >>>> http.redirect.max and prevent fetch and parse of duplicates?
> >>>> 
> >>>> I'm not a big fan of keeping a map of fetched records in memory as
> >>>> it'll blow up the heap. We can also not safely remove a record from
> >>>> the fetch queue as the queue feeder may not have finished and
> >>>> duplicates may still enter a queue.
> >>>> 
> >>>> Any thoughts?
> >>>> 
> >>>> Thanks,
> >>>> Markus
