Hi,

Take a look here: http://issues.apache.org/jira/browse/NUTCH-100
If you have further questions...

Regards,
Gal

Kai Hagemeister wrote:
Hello,

I have a few basic questions and hope that somebody can assist.

I'm trying to search different domains. It seems fairly simple to crawl one specific domain (intranet search), which is defined in the configuration file. But this seems to be limited to that one specified domain. I could also search the whole web (web search) by supplying different URLs via a URL file. But I want to search complete domains without going outside them. So, if I hand over the URLs bla.com and blub.net, only pages from these domains should be fetched. I tried setting the parameter follow outsitelinks to 0, but then links inside the domain were ignored as well. Is there a way to accomplish this task? I mean a way other than changing the source code :-).

Furthermore, I created a directory db for the database and one for segments. Then I started Tomcat from the parent directory of segments. The Java class seems to look for a child directory named segments relative to the current position. The problem: after each update of the index I have to restart Tomcat :-(. It gets worse: each time I start the processes, I must delete the database and the segments.

How do I set up a reasonable fetching cycle? Could somebody give an example? My idea would be to put the following snippet in an endless loop and call it with nohup:

    bin/nutch generate db segments -topN 1000
    s1=`ls -d segments/2* | tail -1`
    bin/nutch fetch $s1
    bin/nutch updatedb db $s1
    bin/nutch index $s1

Would this be advisable? And can somebody explain the meaning of -topN 1000? Is there no other way than restarting Tomcat?

I would appreciate any assistance.

Best regards
Kai
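For illustration, the endless-loop idea above could be sketched roughly as follows. This is only a sketch under assumptions, not a recommended setup: the variable names NUTCH, DB, SEGMENTS and PAUSE, the helper functions, and the "loop" argument are all made up for this example; only the five commands come from the snippet above. It stops when any step fails instead of looping blindly, and it does not address the Tomcat restart question.

```shell
#!/bin/sh
# Sketch of a repeated generate/fetch/updatedb/index cycle.
# NUTCH, DB, SEGMENTS and PAUSE are hypothetical names chosen for this example.
NUTCH=${NUTCH:-bin/nutch}
DB=${DB:-db}
SEGMENTS=${SEGMENTS:-segments}
PAUSE=${PAUSE:-3600}   # seconds to wait between cycles (arbitrary choice)

# Print the newest segment directory under $1.
# Assumes segment names are timestamps starting with "2", as in the snippet above,
# so a plain lexical sort puts the newest one last.
latest_segment() {
    ls -d "$1"/2* 2>/dev/null | sort | tail -1
}

# Run one cycle; return non-zero as soon as any step fails.
run_cycle() {
    "$NUTCH" generate "$DB" "$SEGMENTS" -topN 1000 || return 1
    s1=$(latest_segment "$SEGMENTS")
    [ -n "$s1" ] || return 1
    "$NUTCH" fetch "$s1" || return 1
    "$NUTCH" updatedb "$DB" "$s1" || return 1
    "$NUTCH" index "$s1" || return 1
}

# Loop only when explicitly asked to, so the functions can be tried on their own.
if [ "${1:-}" = "loop" ]; then
    while run_cycle; do
        sleep "$PAUSE"
    done
fi
```

Saved under a name such as crawl-loop.sh (hypothetical), it could be started with: nohup sh crawl-loop.sh loop &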
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
