Re: Fetching of URLs from seed list ends up with only a small portion of them indexed by Solr
Hi Amit,

As I answered before, there is a config parameter to activate the crawling of redirections (db_redir_temp 4,770, db_redir_perm 56,810). You have to activate this in nutch-site.xml; please have a look at nutch-default.xml to find out which one it is. Only the pages with status db_fetched will be indexed.

Regards,
Stefan

On 02.03.2013 01:01, Amit Sela wrote:
> I am using the crawl script that executes Solr indexing with:
>
> $bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
>
> and then executes Solr dedup:
>
> $bin/nutch solrdedup $SOLRURL
>
> I think it has something to do with the CrawlDb job. The job counters show:
>
> db_redir_temp 4,770
> db_redir_perm 56,810
> db_notmodified 5,343
> db_unfetched 27,385
> db_gone 3,741
> db_fetched 22,065
>
> On Thu, Feb 28, 2013 at 10:02 PM, kiran chitturi wrote:
>> This looks odd. From what I know, the successfully parsed documents are sent to Solr. Did you check the logs for any exceptions?
>>
>> What command are you using to index?
>>
>> On Thu, Feb 28, 2013 at 1:51 PM, Amit Sela wrote:
>>> Hi everyone,
>>>
>>> I'm running with Nutch 1.6 and Solr 3.6.2. I'm trying to crawl only the seed list (depth 1) and it seems that the process ends with only ~255 of the URLs indexed in Solr.
>>>
>>> Seed list is about 120K. Fetcher map input is 117K, where success is 62K and temp_moved 45K. Parse shows success of 62K. CrawlDb after the fetch shows db_redir_perm=56K, db_unfetched=27K and db_fetched=22K.
>>>
>>> And finally IndexerStatus shows 20K documents added. What am I missing?
>>>
>>> Thanks!
>>>
>>> my nutch-site.xml includes:
>>>
>>> plugin.includes: protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)i
>>> metatags.names: keywords;Keywords;description;Description
>>> index.parse.md: metatag.keywords,metatag.Keywords,metatag.description,metatag.Description
>>> db.update.additions.allowed: false
>>> generate.count.mode: domain
>>> partition.url.mode: byDomain
>>> file.content.limit: 262144
>>> http.content.limit: 262144
>>> parse.filter.urls: true
>>> parse.normalize.urls: true
>>
>> --
>> Kiran Chitturi
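The config parameter Stefan hints at is presumably `http.redirect.max` in nutch-default.xml: with its default of 0, the fetcher does not follow redirects immediately but only records the targets for a later fetch round, so a depth-1 crawl never reaches them. A sketch of overriding it in nutch-site.xml (the value 3 is an illustrative choice, not a recommendation):

```
<!-- Sketch: follow up to 3 redirects immediately during fetch.
     With the default of 0, redirect targets are only queued for a
     later round, which a depth-1 crawl never runs. -->
<property>
  <name>http.redirect.max</name>
  <value>3</value>
</property>
```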
Re: Fetching of URLs from seed list ends up with only a small portion of them indexed by Solr
I am using the crawl script that executes Solr indexing with:

$bin/nutch solrindex $SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT

and then executes Solr dedup:

$bin/nutch solrdedup $SOLRURL

I think it has something to do with the CrawlDb job. The job counters show:

db_redir_temp 4,770
db_redir_perm 56,810
db_notmodified 5,343
db_unfetched 27,385
db_gone 3,741
db_fetched 22,065

On Thu, Feb 28, 2013 at 10:02 PM, kiran chitturi wrote:
> This looks odd. From what I know, the successfully parsed documents are sent to Solr. Did you check the logs for any exceptions?
>
> What command are you using to index?
>
> On Thu, Feb 28, 2013 at 1:51 PM, Amit Sela wrote:
>> Hi everyone,
>>
>> I'm running with Nutch 1.6 and Solr 3.6.2. I'm trying to crawl only the seed list (depth 1) and it seems that the process ends with only ~255 of the URLs indexed in Solr.
>>
>> Seed list is about 120K. Fetcher map input is 117K, where success is 62K and temp_moved 45K. Parse shows success of 62K. CrawlDb after the fetch shows db_redir_perm=56K, db_unfetched=27K and db_fetched=22K.
>>
>> And finally IndexerStatus shows 20K documents added. What am I missing?
>>
>> Thanks!
>>
>> my nutch-site.xml includes:
>>
>> plugin.includes: protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)i
>> metatags.names: keywords;Keywords;description;Description
>> index.parse.md: metatag.keywords,metatag.Keywords,metatag.description,metatag.Description
>> db.update.additions.allowed: false
>> generate.count.mode: domain
>> partition.url.mode: byDomain
>> file.content.limit: 262144
>> http.content.limit: 262144
>> parse.filter.urls: true
>> parse.normalize.urls: true
>
> --
> Kiran Chitturi
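The status counters quoted above can be reproduced directly from the CrawlDb. A sketch, assuming a standard Nutch 1.x crawl directory layout:

```
# Prints status counts (db_fetched, db_unfetched, db_redir_perm, ...)
# for every URL known to the CrawlDb.
bin/nutch readdb $CRAWL_PATH/crawldb -stats
```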
Re: Problem compiling FeedParser plugin with Nutch 2.1 source
Well, in addition to obtaining links from the feed content to continue your crawl, the feed plugin also provides an indexing filter to index feed documents with the following specific fields: author, tags, published, updated, and the actual feed. Just to confirm, the feed plugin also uses ROME as the underlying parser library.

On Thursday, February 28, 2013, Anand Bhagwat wrote:
> Thanks for the quick reply.
>
> Actually I needed some plugin for Atom feed parsing, so while searching the source I found FeedParser, but it was giving compilation errors. Later I tried the Tika parser and was able to parse the Atom feed. I am not sure if I am missing something. Basically the Tika parser extracted URLs and created new entries in the database, and later when I ran the fetch job again I was able to fetch those URLs.
>
> So the question is: does FeedParser provide some additional functionality which is missing in the Tika parser? As far as I know the Tika parser uses ROME, which is a well-known library for parsing feeds.
>
> Regards,
> Anand.
>
> On 1 March 2013 03:38, kiran chitturi wrote:
>> Lewis,
>>
>> On the same note, the following plugins need to be ported; I found this when I tried to build 2.x with Eclipse:
>>
>> i) Feed
>> ii) parse-swf
>> iii) parse-ext
>> iv) parse-zip
>> v) parse-metatags (I wrote a patch for this earlier, NUTCH-1478)
>>
>> The above plugins need to be ported to build 2.x successfully with plugins.
>>
>> On Thu, Feb 28, 2013 at 4:58 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
>>> Honestly, I think we should get this fixed. Can someone please explain to me why we don't build every plugin within Nutch 2.x? I think we should.
>>>
>>> On Thu, Feb 28, 2013 at 12:58 PM, kiran chitturi wrote:
>>>> This is a problem with the feed plugin. It is not yet ported to 2.x.
>>>> The FeedIndexingFilter class extends IndexingFilter, whose interface and methods changed from 1.x to 2.x.
>>>>
>>>> I fixed a similar one in parse-metatags, which extends the ParseFilter interface.
>>>>
>>>> NUTCH-874 was opened for these issues, but we still do not know which plugins need to be ported due to the API changes.
>>>>
>>>> https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>>>>
>>>> On Thu, Feb 28, 2013 at 3:26 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
>>>>> This shouldn't be happening, but we are aware (the Jira instance reflects this) that there are some existing compatibility issues with Nutch 2.x HEAD. IIRC Kiran had a patch integrated which dealt with some of these issues. What I have to ask is: what JDK are you using? I use 1.6.0_25 (I really need to upgrade) on my laptop, and we run the Apache Nutch nightly builds for both 1.x trunk and the 2.x branch on the latest 1.7 version of Java. Unless I have broken my code whilst writing some patches, my code compiles flawlessly locally, and as a project we do not have regular compiler issues with our development nightly builds.
>>>>>
>>>>> On Wed, Feb 27, 2013 at 10:15 PM, Anand Bhagwat <abbhagwa...@gmail.com> wrote:
>>>>>> Hi,
>>>>>> I want to use the FeedParser plugin which comes as part of the Nutch 2.1 distribution. When I try to build it, it gives compilation errors. I think it uses some classes from Nutch 1.6 which are not available. Any suggestions as to how I can resolve this issue?
>>>>>>
>>>>>> [javac] /home/adminibm/Documents/workspace-sts-3.1.0.RELEASE/nutch2/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java:28: cannot find symbol
>>>>>> [javac] symbol  : class CrawlDatum
>>>>>> [javac] location: package org.apache.nutch.crawl
>>>>>> [javac] import org.apache.nutch.crawl.Cra

--
Lewis
Re: a lot of threads spinwaiting
Thanks a lot for all your answers, this really is an active community.

Roland, I had that problem once; it's not the case here. I'll try to look into the crawldb, though HBase is not as friendly for filtering as I would like it to be. I'm still a newbie there.

Regards,
JC

--
View this message in context: http://lucene.472066.n3.nabble.com/a-lot-of-threads-spinwaiting-tp4043801p4044084.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: a lot of threads spinwaiting
Hi JC,

I think Marcus already answered about politeness :) But without a delay it will be worse :)

Do these missing URLs match one of the filtering regexes? Take a look at .../conf/regex-urlfilter.txt. I had a problem with this regex:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

It will just silently drop all URLs with GET parameters.

--Roland

On 01.03.2013 15:08, jc wrote:
> Hi Roland and lufeng,
>
> Thank you very much for your replies. I already tested lufeng's advice, with results pretty much as expected.
>
> By the way, my Nutch installation is based on version 2.1 with HBase as crawldb storage.
>
> Roland, maybe the fetcher.server.delay param has something to do with that as well. I set it to 3 secs; would setting it to 0 be impolite?
>
> All the info you provided has helped me a lot. Only one issue remains unfixed: there are more than 60 URLs from different hosts in my seed file, and only 20 queues. It may seem that all the other 40 hosts have no more URLs to generate, but I really haven't seen any URL coming from those hosts since the creation of the crawldb.
>
> Based on my limited experience, the following params should allow 60 queues for my vertical crawl; am I missing something?
>
> topN = 1 million
> fetcher.threads.per.queue = 3
> fetcher.threads.per.host = 3 (just in case; I remember you told me to use per.queue instead)
> fetcher.threads.fetch = 200
> seed urls of different hosts = 60 or more (regex-urlfilter.txt allows only urls from these hosts; they're all there, I checked)
> crawldb record count > 1 million
>
> Thanks again for all your help.
>
> Regards,
> JC
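If URLs with GET parameters should be kept, one common adjustment (hypothetical for this setup) is to narrow the character class in conf/regex-urlfilter.txt so that `?` and `=` are no longer grounds for rejection:

```
# default rule: skip URLs containing certain characters as probable queries
# -[?*!@=]
# relaxed variant: still skip *, ! and @, but keep query-string URLs
-[*!@]
```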
RE: a lot of threads spinwaiting
Hi,

Regarding politeness, 3 threads per queue is not really polite :)

Cheers

-----Original message-----
> From: jc
> Sent: Fri 01-Mar-2013 15:08
> To: user@nutch.apache.org
> Subject: Re: a lot of threads spinwaiting
>
> Hi Roland and lufeng,
>
> Thank you very much for your replies. I already tested lufeng's advice, with results pretty much as expected.
>
> By the way, my Nutch installation is based on version 2.1 with HBase as crawldb storage.
>
> Roland, maybe the fetcher.server.delay param has something to do with that as well. I set it to 3 secs; would setting it to 0 be impolite?
>
> All the info you provided has helped me a lot. Only one issue remains unfixed: there are more than 60 URLs from different hosts in my seed file, and only 20 queues. It may seem that all the other 40 hosts have no more URLs to generate, but I really haven't seen any URL coming from those hosts since the creation of the crawldb.
>
> Based on my limited experience, the following params should allow 60 queues for my vertical crawl; am I missing something?
>
> topN = 1 million
> fetcher.threads.per.queue = 3
> fetcher.threads.per.host = 3 (just in case; I remember you told me to use per.queue instead)
> fetcher.threads.fetch = 200
> seed urls of different hosts = 60 or more (regex-urlfilter.txt allows only urls from these hosts; they're all there, I checked)
> crawldb record count > 1 million
>
> Thanks again for all your help.
>
> Regards,
> JC
Re: a lot of threads spinwaiting
Hi Roland and lufeng,

Thank you very much for your replies. I already tested lufeng's advice, with results pretty much as expected.

By the way, my Nutch installation is based on version 2.1 with HBase as crawldb storage.

Roland, maybe the fetcher.server.delay param has something to do with that as well. I set it to 3 secs; would setting it to 0 be impolite?

All the info you provided has helped me a lot. Only one issue remains unfixed: there are more than 60 URLs from different hosts in my seed file, and only 20 queues. It may seem that all the other 40 hosts have no more URLs to generate, but I really haven't seen any URL coming from those hosts since the creation of the crawldb.

Based on my limited experience, the following params should allow 60 queues for my vertical crawl; am I missing something?

topN = 1 million
fetcher.threads.per.queue = 3
fetcher.threads.per.host = 3 (just in case; I remember you told me to use per.queue instead)
fetcher.threads.fetch = 200
seed urls of different hosts = 60 or more (regex-urlfilter.txt allows only urls from these hosts; they're all there, I checked)
crawldb record count > 1 million

Thanks again for all your help.

Regards,
JC
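For reference, JC's queue-related parameters expressed as nutch-site.xml properties (a sketch using the values quoted above, not recommendations; property names are the standard Nutch ones, but check your version's nutch-default.xml):

```
<property>
  <name>partition.url.mode</name>
  <value>byHost</value>  <!-- one fetch queue per host -->
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>3</value>  <!-- concurrent connections per host; >1 is less polite -->
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>200</value>  <!-- total fetcher threads across all queues -->
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>3.0</value>  <!-- seconds between requests to the same host -->
</property>
```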
Re: a lot of threads spinwaiting
Hi jc,

One thing to add: check the robots.txt file of your crawled hosts; maybe they are limiting your fetches with delays:

Crawl-delay: 10

--Roland

On 01.03.2013 03:32, feng lu wrote:
> Hi jc,
>
>> I don't understand why there are 19 queues, is it maybe that only 19 websites are being fetched?
>
> Because each queue handles FetchItems which come from the same queue ID (be it a proto/hostname, proto/IP, or proto/domain pair), and the queue ID is created based on the queueMode argument. So there may be 19 different queue IDs in FetchItemQueues.
>
>> Anyways, why is it that there are 194 spinwaiting out of 200 active threads?
>
> First of all, note that the parameter "fetcher.threads.per.host" has been replaced by "fetcher.threads.per.queue" in Nutch 1.6. There are 200 fetching threads that can fetch items from any host; however, all remaining items are from the 19 different hosts, and the total URL count is 1. Each queue comes from the same queue ID. So the logs indicate that only 6 threads are fetching and another 13 threads have finished fetching; maybe the other 13 queues were too small to take much time.
>
> Thanks
> lufeng
>
> On Fri, Mar 1, 2013 at 6:44 AM, jc wrote:
>> Hi guys,
>>
>> I'm sorry if this question has been answered before; I looked but didn't find anything. This is my scenario (only the relevant settings, I think):
>>
>> seed urls: about 60 homepages from different domains
>> generate.max.count = 1
>> fetcher.threads.per.host = 3 (I'm trying to be polite here :-))
>> partition.url.mode = byHost
>> fetcher.threads.fetch = 200
>> fetcher.threads.per.queue = 1
>> topN = 100
>> depth = 1
>>
>> Since the very beginning I've had a lot of spinwaiting threads (I'm not sure if those are threads, because it doesn't really say in the log):
>>
>> 194/200 spinwaiting/active, 166 pages, 3 errors, 4.7 3.8 pages/s, 1471 1412 kb/s, 1 URLs in 19 queues
>>
>> I don't understand why there are 19 queues. Is it maybe that only 19 websites are being fetched?
>>
>> Anyways, why is it that there are 194 spinwaiting out of 200 active threads?
>>
>> Thanks a lot in advance for your time.
>>
>> Regards,
>> jc