[ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961009#comment-13961009
 ] 

Julien Nioche commented on NUTCH-1687:
--------------------------------------

I like the idea, but I am a bit concerned about the potential impact of calling

it = Iterables.cycle(queues.keySet()).iterator();

whenever a new FetchItemQueue is added. It would be called a lot at the 
beginning of a fetch, when most of the queues are created, and we would create 
lots of iterators that get replaced straight away.

What about doing this lazily, generating a new iterator only when 
getFetchItem() is called and at least one FetchItemQueue has been added since 
the current iterator was created?
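
A minimal sketch of that lazy approach, using only the JDK rather than Guava. 
The class name LazyRoundRobin, the methods addQueue() and nextQueueId(), and 
the rebuilds counter are hypothetical names chosen for illustration; they are 
not part of the patch or of the Nutch Fetcher. Queues are reduced to their ids 
to keep the example short.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the fetcher's queue container (not Nutch code).
class LazyRoundRobin {
    // Queue ids in insertion order; stands in for queues.keySet().
    private final Set<String> queueIds = new LinkedHashSet<>();
    private Iterator<String> it;   // current pass over the queue ids
    private boolean dirty = true;  // true if a queue was added since 'it' was built
    int rebuilds = 0;              // how often an addition forced a new iterator

    // Adding a queue only sets a flag; no iterator is created here.
    public synchronized void addQueue(String id) {
        if (queueIds.add(id)) {
            dirty = true;
        }
    }

    // Returns the next queue id in round-robin order, or null if there are no queues.
    public synchronized String nextQueueId() {
        if (queueIds.isEmpty()) {
            return null;
        }
        if (dirty) {
            // At least one queue was added: rebuild once, restarting at the first queue.
            it = queueIds.iterator();
            dirty = false;
            rebuilds++;
        } else if (!it.hasNext()) {
            // End of a pass with no additions: wrap around to the first queue.
            it = queueIds.iterator();
        }
        return it.next();
    }

    public static void main(String[] args) {
        LazyRoundRobin rr = new LazyRoundRobin();
        rr.addQueue("host-a");
        rr.addQueue("host-b");
        for (int i = 0; i < 4; i++) {
            System.out.println(rr.nextQueueId());
        }
        System.out.println("rebuilds: " + rr.rebuilds);
    }
}
```

With this shape, any number of additions between two calls costs a single 
iterator creation, and an addition makes the next call restart at the first 
queue, as in the current code.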

I agree that in the middle of a fetch, queues get added far less often than 
getFetchItem() is called, so not having to create an iterator on each call, as 
we currently do, would definitely be a plus.

In extreme cases, when a fetchlist contains a large diversity of hostnames / 
domains, we could end up creating a new iterator for every new URL and would 
always start at the first queue. That is what we currently do, so the new 
approach would not be any worse.

What do you think?

Also, why not use Iterators.cycle() directly? 

Thanks

> Pick queue in Round Robin
> -------------------------
>
>                 Key: NUTCH-1687
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1687
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Tien Nguyen Manh
>            Priority: Minor
>             Fix For: 1.9
>
>         Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
>
>
> Currently we always look for a queue to pick a URL from by scanning from the 
> start of the queue list, so queues at the start of the list have a better 
> chance of being picked first. This can cause a long-tail problem, where only 
> a few queues, each holding many URLs, are left at the end.
> public synchronized FetchItem getFetchItem() {
>       final Iterator<Map.Entry<String, FetchItemQueue>> it =
>         queues.entrySet().iterator(); // ==> always restarts the search from the start
>       while (it.hasNext()) {
> ....
> I think it is better to pick queues in round robin. That would reduce the 
> time needed to find an available queue and ensure that all queues are picked 
> in turn, and if we use TopN during generation there would be no long-tail 
> queues at the end.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
