Hi,

Take a look here: http://issues.apache.org/jira/browse/NUTCH-100

If you have further questions...

Regards,

Gal

Kai Hagemeister wrote:
Hello,

I have a few basic questions and hope that somebody can assist.
I'm trying to search several different domains. It seems fairly simple to
crawl one specific domain (intranet search), which is defined in the
configuration file, but this seems to be limited to that single domain.
I could also crawl the web (web search) by supplying different URLs
via a urlfile, but I want to search complete domains without going
outside them.
So, if I hand over the URLs bla.com and blub.net, only pages from these
domains should be fetched. I tried setting the parameter follow
outsitelinks to 0, but then links inside the domain were ignored as well.
Is there a way to accomplish this? I mean a way other than changing the
source code :-).
Furthermore, I created a directory db for the database and one for
segments. Then I started Tomcat from the parent directory of segments. The
Java class seems to look for a child directory named segments under the
current working directory. The problem: after each update of the index I
have to restart Tomcat :-(. It gets worse: each time I start the processes
I must delete the database and the segments.
How do I set up a reasonable fetching cycle? Could somebody give an
example?
My idea would be to put the following snippet in an endless loop and run
it with nohup:

bin/nutch generate db segments -topN 1000
s1=`ls -d segments/2* | tail -1`
bin/nutch fetch $s1
bin/nutch updatedb db $s1
bin/nutch index $s1
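
A minimal sketch of how the snippet above might be wrapped so that a failed
step stops the cycle instead of running the remaining commands against a
stale segment. The function names latest_segment and run_cycle and the NUTCH
variable are hypothetical, introduced only for illustration; the commands and
the -topN 1000 value are copied unchanged from the snippet. It assumes, as the
segments/2* glob implies, that segment directory names sort chronologically.

```shell
#!/bin/sh
# Path to the nutch launcher; overridable (hypothetical convenience variable).
NUTCH="${NUTCH:-bin/nutch}"

# Print the last segment directory in sorted order, or nothing if none exist.
latest_segment() {
    ls -d "$1"/2* 2>/dev/null | tail -1
}

# Run one generate/fetch/updatedb/index pass; return non-zero on the first failure.
run_cycle() {
    "$NUTCH" generate db segments -topN 1000 || return 1
    s1=$(latest_segment segments)
    [ -n "$s1" ] || return 1          # no segment found: nothing to fetch
    "$NUTCH" fetch "$s1" || return 1
    "$NUTCH" updatedb db "$s1" || return 1
    "$NUTCH" index "$s1" || return 1
}
```

It could then be driven with something like
`while run_cycle; do sleep 3600; done` under nohup; the one-hour pause is an
arbitrary placeholder, not a recommended value.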

Would this be advisable? And can somebody explain the meaning of -topN 1000?
Is there no way other than restarting Tomcat?
I would appreciate any assistance.
Best regards
Kai






_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
