Hi Markus,
I also thought along similar lines, but I am able to wget the page as well as
ping it from the server where I run the crawl, without any issues. Do you still
think it's a firewall issue?
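For reference, the checks were along these lines (example.com stands in for the
actual site being crawled):

  # basic reachability from the crawl server
  ping -c 4 example.com
  # fetch the page directly, bypassing Nutch entirely
  wget http://example.com/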
Hi,
I have been using Nutch for quite some time now, and all had been working fine. I
crawl some sites once a fortnight. It worked fine till now, except I can't
seem to make it work for the last couple of days. I am getting the following
exception when I run the bin/nutch crawl command:
2012-04-10 11:03:38
Hi,
What the optimal parameters are would require some experimentation.
But with the right db.fetch.interval.max between two fetches (in the
nutch-default.xml) and a scheduled daily crawl you would be able to crawl through
all of the pages eventually. Here you may like to restrict the crawls to the
domains you actually care about.
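A rough sketch of what that setup could look like; the interval, domain and
schedule below are only examples, and overrides are normally put in
conf/nutch-site.xml rather than edited into nutch-default.xml:

  <!-- conf/nutch-site.xml: cap the interval between re-fetches of a page
       at 14 days (value is in seconds) -->
  <property>
    <name>db.fetch.interval.max</name>
    <value>1209600</value>
  </property>

  # conf/crawl-urlfilter.txt: restrict the crawl to one domain
  # (example.com is a placeholder)
  +^http://([a-z0-9]*\.)*example.com/

  # crontab entry for a daily crawl at 02:00 (paths and parameters are placeholders)
  0 2 * * * /opt/nutch/bin/nutch crawl /opt/nutch/urls -dir /opt/nutch/crawl -depth 3 -topN 1000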
I am not sure if I was able to convey what I meant, and I guess it was a bit
confusing now that I re-read my previous comment.
You are supposed to comment out the line
-[?*!@=]
This will let Nutch crawl through URLs containing special characters.
Please get back to me once you have done this.
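With that change, the relevant section of conf/crawl-urlfilter.txt (the same
rule also appears in regex-urlfilter.txt) would look roughly like this:

  # skip URLs containing certain characters as probable queries, etc.
  # -[?*!@=]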
I don't think that should be a problem, though I still feel you would have to
try it to actually know, because I am not sure if it is going to crawl an
encrypted (https) URL (experts, please help here).
Just make sure the line "-[?*!@=]" is commented out in crawl-urlfilter.txt (it
is the one under the comment "# skip URLs containing certain characters as
probable queries, etc.").
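If https is the concern: as far as I know, the stock protocol-http plugin in
Nutch 1.x does not handle https, so one thing to try (a sketch, not something I
have tested here) is to use protocol-httpclient in plugin.includes and make sure
the URL filter accepts https URLs:

  <!-- conf/nutch-site.xml: keep your existing plugin.includes value,
       but use protocol-httpclient in place of protocol-http;
       the value below is only illustrative -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  # conf/crawl-urlfilter.txt: accept both http and https for your domain
  # (example.com is a placeholder)
  +^https?://([a-z0-9]*\.)*example.com/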
Although I am not sure if this is the real solution, my understanding of the
problem is that it happens at the time of reading from the crawldb.
I feel the crawldb files may have been corrupted (I am not sure here). So I
deleted the crawldb folder and it worked, though it starts from scratch,
i.e. it re-crawls all the pages.
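In case it helps anyone, what I did was roughly this (the crawl directory name
is a placeholder, and keeping a backup first is just a precaution):

  # keep a copy of the possibly corrupted crawldb instead of deleting it outright
  mv crawl/mysite/crawldb crawl/mysite/crawldb.bak.$(date +%Y%m%d)
  # the next bin/nutch crawl run rebuilds the crawldb from the seed URLs,
  # which is why everything gets re-crawled from scratch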
I forgot to mention that no changes were made to either crawl-urlfilter.txt or
regex-urlfilter.txt between a successful crawl and a crawl that ends with the
message "no more URLs to fetch".
rootUrlDir = urls/$folder/urls.txt
threads = 10
depth = 1
indexer=lucene
topN = 1500
Injector: starting
Injector:
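For context, those parameters correspond to an invocation roughly like this
(the -dir value is an assumption; the rest is taken from the output above):

  bin/nutch crawl urls/$folder/urls.txt -dir crawl/$folder -threads 10 -depth 1 -topN 1500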
Hi,
I was going through past threads and found that the problem I face has been
faced by many others, but mostly it has either been ignored or left unresolved.
I use Nutch 1.1. My crawl has been working fine mostly (though I am still
getting the hang of how all the screws work).
I have a particular
Thanks for the link. Things are much clearer now.
Hey, thanks for the response. I had lost hope of getting one. :-)
3. Firstly, yes, I am using Solr for indexing, so whatever you have said
makes a lot of sense. For 404 pages, which are not picked up in the crawl, I am
doing a manual delete as of now, but it is a pain. I am thinking of some
ways to get this automated.
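One possible way to automate this (a sketch; the crawl path and Solr URL are
placeholders) is Nutch's solrclean command, which, as I understand it, deletes
documents that the crawldb marks as gone (e.g. 404) from the Solr index:

  bin/nutch solrclean crawl/mysite/crawldb http://localhost:8983/solr/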
Perhaps a naive question:
During a crawl, if I state topN as, say, 100, does that mean the first 100 links
that Nutch finds on a particular page? Or does it fetch them as per the page
rank? Either way, does it mean that it would always fetch the same links from a
page?
That explains it. Thanks.
Hi,
When I run the cleaner script, i.e.

for f in $FILES    # each $f is a per-site folder under crawl/
do
  echo "Running $f ...";
  bin/nutch solrclean crawl/$f/crawldb/ http://solrip
  echo "Finished $f ..";
done
Though the log says:
2011-06-15 12:06:02,007 INFO solr.SolrClean - SolrClean: starting at
2011-06-15 12:06:02
2011-06-15