> -----Original Message-----
> From: Thomas Delnoij [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, July 25, 2006 2:53 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: Injecting Into Intranet Crawl
>
> For stuff like this best use whole web concepts as explained
> in the tutorial.
>
> Rgrds, Thomas
The tutorial suggests using a segment of the DMOZ directory, which really doesn't work for me, as I only want to index a specific collection of sites. But that tutorial does use the "inject" command, which may actually be useful.

From the CommandLine Options page in the wiki I find:

    Usage: bin/nutch inject (-local | -ndfs <namenode:port>) <db_dir>
           (-urlfile <url_file> | -dmozfile <dmoz_file>)
           [-subset <subsetDenominator>] [-includeAdultMaterial]
           [-skew skew] [-noDmozDesc] [-topicFile <topic list file>]
           [-topic <topic> [-topic <topic> [...]]]

So I would use something like:

    bin/nutch inject crawl.out urls.txt

Where "crawl.out" is the result of my original crawl and "urls.txt" is my original list of home pages. Or is "urls.txt" supposed to be a file containing only the list of home pages to be injected?

There's no description of what each of the options represents in the wiki like there is for the "crawl" command, so I have to guess. My assumptions based on that usage line are:

1 - My urls.txt file will be modified by the inject command, and
2 - My crawl.out directory will be updated with index information from the injected site.

I think I may have to run some additional commands to get the index updated, but I'm not 100% sure. Maybe the maintenance shell script from http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine is what I need to rescan all of the sites I want indexed?

Assuming that I eventually get the syntax of the inject command correct, I still have to ask about conf/crawl-urlfilter.txt, because I modified that file to allow only the URIs that I want crawled. Does the inject command modify that file, or do I have to add the new domains manually?

Many thanks!
rjsjr

> On 7/25/06, Robert Sanford <[EMAIL PROTECTED]> wrote:
> > I'm running version 0.7.2 and I'm using the Intranet crawl where I
> > specify a list of site root URIs in a text file along with a list of
> > regex for allowed URIs.
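Reading the usage string literally, my best guess at the actual invocation is the sketch below. To be clear, this is only a guess: the "crawl.out/db" path is an assumption about where the WebDB lives inside the crawl output directory, and the generate/fetch/updatedb/index steps afterward are borrowed from the whole-web tutorial cycle, which I haven't verified against an intranet-style crawl.

```
# Guess at the 0.7.x inject call -- -local and -urlfile taken from the
# usage string above; "crawl.out/db" is an assumed WebDB location.
bin/nutch inject -local crawl.out/db -urlfile urls.txt

# Then, presumably, a fetch cycle over the newly injected URLs plus a
# re-index, along the lines of the whole-web tutorial:
bin/nutch generate crawl.out/db crawl.out/segments
segment=crawl.out/segments/<newest>   # whatever directory generate just created
bin/nutch fetch $segment
bin/nutch updatedb crawl.out/db $segment
bin/nutch index $segment
```

If someone can confirm or correct the db path and the follow-up steps, that would answer most of my question.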
> >
> > The question that I have is how to inject a new site into the crawl.
> >
> > If I simply add a site URI into the file I have to completely restart
> > the crawl and can't use the same output directory as I used previously,
> > and when that finishes I have to copy over the old one and then
> > restart my app server. That doesn't make sense... I really want to
> > just give it a new site root and have it added to the index.
> >
> > Is that possible using the intranet config option?
> >
> > rjsjr
> >
>