Re: a lot of threads spinwaiting
Thanks a lot for all your answers, this really is an active community.

Roland, I had that problem once; it's not the case here. I'll try to look into the crawldb, though HBase is not as friendly for filtering as I would like. I'm still a newbie there.

Regards,
JC

--
View this message in context: http://lucene.472066.n3.nabble.com/a-lot-of-threads-spinwaiting-tp4043801p4044084.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: a lot of threads spinwaiting
Hi JC,

I think Marcus already answered about politeness :) But without delay it will be worse :)

Do these missing URLs match one of the filtering regexes? Take a look at .../conf/regex-urlfilter.txt. I had a problem with this regex:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

It will just silently drop all URLs with GET parameters.

--Roland

On 01.03.2013 15:08, jc wrote:
> Hi Roland and lufeng,
> [...]
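If those query-string URLs are actually wanted, the usual fix is to comment that rule out (or narrow it) in conf/regex-urlfilter.txt. A sketch, not taken from the thread; adjust to your own filter file:

```
# skip URLs containing certain characters as probable queries, etc.
# Disabled here so that URLs with GET parameters are kept; re-enable
# or narrow it (e.g. -[*!@]) if query URLs flood the crawldb.
# -[?*!@=]

# accept anything else (keep a catch-all accept as the last rule)
+.
```

Rules are applied top to bottom and the first match wins, so the position of the catch-all `+.` at the end matters.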
RE: a lot of threads spinwaiting
Hi,

Regarding politeness, 3 threads per queue is not really polite :)

Cheers

-----Original message-----
> From: jc
> Sent: Fri 01-Mar-2013 15:08
> To: user@nutch.apache.org
> Subject: Re: a lot of threads spinwaiting
>
> Hi Roland and lufeng,
> [...]
Re: a lot of threads spinwaiting
Hi Roland and lufeng,

Thank you very much for your replies. I already tested lufeng's advice, with results pretty much as expected.

By the way, my Nutch installation is based on version 2.1 with HBase as crawldb storage.

Roland, maybe the fetcher.server.delay param has something to do with that as well. I set it to 3 secs; would setting it to 0 be impolite?

All the info you provided has helped me a lot. Only one issue remains unfixed: there are more than 60 URLs from different hosts in my seed file, but only 20 queues. It may seem that the other 40 hosts have no more URLs to generate, but I really haven't seen any URL coming from those hosts since the creation of the crawldb.

Based on my limited experience, the following params should allow 60 queues for my vertical crawl. Am I missing something?

topN = 1 million
fetcher.threads.per.queue = 3
fetcher.threads.per.host = 3 (just in case; I remember you told me to use per.queue instead)
fetcher.threads.fetch = 200
seed urls of different hosts = 60 or more (regex-urlfilter.txt allows only urls from these hosts; they're all there, I checked)
crawldb record count > 1 million

Thanks again for all your help.

Regards,
JC

--
View this message in context: http://lucene.472066.n3.nabble.com/a-lot-of-threads-spinwaiting-tp4043801p4043988.html
Sent from the Nutch - User mailing list archive at Nabble.com.
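For reference, the fetcher parameters listed above normally live in conf/nutch-site.xml (overriding nutch-default.xml); topN and depth are crawl-command arguments, not properties. A sketch using the values from this message — the property names exist in Nutch, but whether these values produce 60 queues is exactly what the thread is debating:

```xml
<!-- Sketch of the fetcher settings discussed above; the values are
     the ones jc mentions, not recommendations. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>200</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>3</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>3.0</value>
</property>
```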
Re: a lot of threads spinwaiting
Hi jc,

and one thing to add: check the robots.txt file of your crawled hosts; maybe they are limiting your fetches with delays:

Crawl-delay: 10

--Roland

On 01.03.2013 03:32, feng lu wrote:
> Hi jc
> [...]
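To illustrate Roland's point: a host can impose its own politeness delay on crawlers via robots.txt. A minimal example of such a file — note that Crawl-delay is a de-facto extension rather than part of the original robots exclusion standard, and Nutch honours it only up to the fetcher.max.crawl.delay limit:

```
# Ask all crawlers to wait 10 seconds between requests to this host.
User-agent: *
Crawl-delay: 10
```

With even a few such hosts in the queue list, most of the 200 fetcher threads would sit idle (spinwaiting) between permitted requests.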
Re: a lot of threads spinwaiting
Hi jc,

<< I don't understand why there are 19 queues, is it maybe that only 19 websites are being fetched? >>

Because each queue handles FetchItems which come from the same Queue ID (be it a proto/hostname, proto/IP, or proto/domain pair), and the Queue ID is created based on the queueMode argument. So there may be 19 different Queue IDs in FetchItemQueues.

<< Anyways, why is it that there are 194 spinwaiting out of 200 active threads? >>

First of all, I see that the parameter "fetcher.threads.per.host" has been replaced by "fetcher.threads.per.queue" in Nutch 1.6. There are 200 fetching threads that can fetch items from any host, but all the remaining items belong to the 19 hosts, and the total remaining URL count is 1. Each queue holds items with the same Queue ID. So the log indicates that only 6 threads are fetching and the other 13 have finished; maybe the other 13 queues were too small to take much time.

Thanks
lufeng

On Fri, Mar 1, 2013 at 6:44 AM, jc wrote:

> Hi guys,
>
> I'm sorry if this question has been answered before; I looked but didn't
> find anything.
>
> This is my scenario (only the relevant settings, I think):
>
> seed urls: about 60 homepages from different domains
> generate.max.count = 1
> fetcher.threads.per.host = 3 (I'm trying to be polite here :-))
> partition.url.mode = byHost
> fetcher.threads.fetch = 200
> fetcher.threads.per.queue = 1
> topN = 100
> depth = 1
>
> Since the very beginning I've got a lot of spinwaiting threads (I'm not
> sure if those are threads, because it doesn't really say in the log):
>
> 194/200 spinwaiting/active, 166 pages, 3 errors, 4.7 3.8 pages/s, 1471 1412
> kb/s, 1 URLs in 19 queues
>
> I don't understand why there are 19 queues; is it maybe that only 19
> websites are being fetched? Anyways, why is it that there are 194
> spinwaiting out of 200 active threads?
>
> Thanks a lot in advance for your time.
>
> Regards,
> jc
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/a-lot-of-threads-spinwaiting-tp4043801.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Don't Grow Old, Grow Up... :-)
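lufeng's explanation of Queue IDs can be sketched in a few lines. Nutch does this in Java inside FetchItemQueues; the Python below is purely illustrative of the idea: under byHost mode the queue key is protocol plus hostname, so many seed URLs collapse into as many queues as there are distinct hosts, and per-queue politeness limits leave the surplus fetcher threads spinwaiting.

```python
from urllib.parse import urlparse
from collections import defaultdict

def queue_id(url):
    """Sketch of a byHost-style fetch queue key: protocol + hostname,
    lowercased. (Nutch's byIP/byDomain modes would key differently.)"""
    p = urlparse(url)
    return f"{p.scheme}://{p.hostname}"

urls = [
    "http://example.com/a",
    "http://example.com/b",       # same host -> same queue
    "https://example.org/index",
    "http://example.net/",
]

queues = defaultdict(list)
for u in urls:
    queues[queue_id(u)].append(u)

# 4 URLs but only 3 queues: politeness delays apply per queue, so
# fetcher threads beyond the per-queue cap have nothing to do and wait.
print(len(queues))  # → 3
```

The same collapsing explains jc's "60 seeds but only 19-20 queues" only if several seeds share a host (or were filtered out earlier); otherwise the missing hosts point to URL filtering, as Roland suggested.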