Re: Connection refused

2012-04-10 Thread tamanjit.bin...@yahoo.co.in
Hi Markus, I also thought along similar lines. But I am able to wget the page as well as ping it from the server where I run the crawl, without any issues. Do you still think it's a firewall issue?
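
For reference, the kind of connectivity check described can be run from the crawl host like this (example.com stands in for the actual site, which the message does not name):

  # check that the page is reachable without downloading it
  wget --spider http://example.com/
  # basic ICMP reachability
  ping -c 4 example.com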

Connection refused

2012-04-10 Thread tamanjit.bin...@yahoo.co.in
Hi, I have been using Nutch for quite some time now. All had been working fine; I crawl some sites once a fortnight. It worked fine till now, except I can't seem to make it work for the last couple of days. I am getting the following exception when I run the bin/nutch crawl command: 2012-04-10 11:03:38

Re: some questions about the crawling with Nutch

2011-07-17 Thread tamanjit.bin...@yahoo.co.in
Hi, What the optimal parameters are would require some experimentation. But with the right db.fetch.interval.max between two fetches (in nutch-default.xml) and a scheduled daily crawl, you would be able to crawl through all of the pages eventually. Here you may like to restrict the crawls to the dom
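
For reference, the property mentioned is set in conf/nutch-default.xml (or overridden in nutch-site.xml); the value shown here, 14 days in seconds, is only an illustration to match the fortnightly schedule, not the shipped default:

  <property>
    <name>db.fetch.interval.max</name>
    <!-- maximum interval between two fetches of the same page, in seconds -->
    <value>1209600</value>
  </property>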

Re: Is it possible to crawl yahoo answer?

2011-07-17 Thread tamanjit.bin...@yahoo.co.in
I am not sure if I was able to convey what I meant, and I guess it was a bit confusing now that I re-read my previous comment. You are supposed to comment out the line -[?*!@=] in crawl-urlfilter.txt. This will let Nutch crawl through URLs with special characters. Please write back once you have done this.
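
For reference, the relevant stanza in conf/crawl-urlfilter.txt looks roughly like this once the rule is disabled (the comment line is from the stock file; prefixing the -[?*!@=] rule with # stops Nutch from skipping URLs that contain those characters):

  # skip URLs containing certain characters as probable queries, etc.
  # -[?*!@=]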

Re: Is it possible to crawl yahoo answer?

2011-07-15 Thread tamanjit.bin...@yahoo.co.in
Don't think that should be a problem. Though I still feel you would have to try it to actually know, because I am not sure if it is going to crawl an encrypted URL (experts, please help here). Just make sure the following line is commented out in crawl-urlfilter.txt: # skip URLs containing certain ch

Re: No more urls to fetch

2011-06-29 Thread tamanjit.bin...@yahoo.co.in
Although I am not sure if this is the real solution, my understanding is that the problem occurs at the time of reading from the crawldb. I feel the crawldb files may have been corrupted (I am not sure here). So I deleted the crawldb folder and it worked, though it starts from scratch. As in, it re-crawls all th
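
If you try the workaround described above, a minimal sketch might look like this (assuming the default crawl/ layout used elsewhere in this thread; the backup step is an added precaution, since deleting the crawldb discards all fetch history):

  # keep a copy in case the crawldb turns out to be fine after all
  cp -r crawl/crawldb crawl/crawldb.bak
  # remove the possibly corrupted crawldb; the next crawl starts from scratch
  rm -r crawl/crawldb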

Re: No more urls to fetch

2011-06-29 Thread tamanjit.bin...@yahoo.co.in
I forgot to mention that no changes were made in either crawl-urlfilter.txt or regex-urlfilter.txt between a successful crawl and a crawl with the message "no more urls to fetch". The crawl parameters were: rootUrlDir = urls/$folder/urls.txt, threads = 10, depth = 1, indexer = lucene, topN = 1500. Injector: starting Injector:
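
For reference, a crawl with those parameters would be launched roughly like this under Nutch 1.1 (the -dir output path is an assumption; $folder stands in for the actual directory, as in the original message):

  bin/nutch crawl urls/$folder/urls.txt -dir crawl/$folder -threads 10 -depth 1 -topN 1500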

No more urls to fetch

2011-06-29 Thread tamanjit.bin...@yahoo.co.in
Hi, I was going through past threads and found that the problem I face has been faced by many others, but mostly it has either been ignored or left unresolved. I use Nutch 1.1. My crawl has been working fine mostly (though I am still getting the hang of how all the screws work). I have a particular

Re: Crawl algo

2011-06-19 Thread tamanjit.bin...@yahoo.co.in
Thanks for the link. Things are much clearer now.

Re: Crawling - basic questions.

2011-06-19 Thread tamanjit.bin...@yahoo.co.in
Hey, thanks for the response. I had lost hope of one. :-) 3. Firstly, yes I am using Solr for indexing, so whatever you have said makes a lot of sense. For 404 pages, which are not picked up in the crawl, I am doing a manual delete as of now, but it is a pain. I am thinking of some ways to get this autom
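
For reference, the kind of manual delete described might look like this against Solr's XML update handler (the Solr URL and document id are hypothetical placeholders; in the stock Nutch schema the Solr uniqueKey is the page URL):

  curl 'http://localhost:8983/solr/update?commit=true' \
    -H 'Content-Type: text/xml' \
    -d '<delete><id>http://example.com/removed-page.html</id></delete>'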

Crawl algo

2011-06-15 Thread tamanjit.bin...@yahoo.co.in
Perhaps a naive question: during a crawl, if I state topN as, say, 100, does that mean the first 100 links that Nutch gets on a particular page? Or does it fetch as per the page rank? Either way, does it mean that it would always fetch the same links from a page?

Re: Index not getting cleaned up

2011-06-15 Thread tamanjit.bin...@yahoo.co.in
That explains it. Thanks.

Index not getting cleaned up

2011-06-14 Thread tamanjit.bin...@yahoo.co.in
Hi, when I run the cleaner script, i.e.

  for f in $FILES; do
    echo "Running $f ..."
    bin/nutch solrclean crawl/$f/crawldb/ http://solrip
    echo "Finished $f .."
  done

though the log says: 2011-06-15 12:06:02,007 INFO solr.SolrClean - SolrClean: starting at 2011-06-15 12:06:02 2011-06-15