Just run a normal crawl on Hadoop, then use:
bin/hadoop dfs -get crawldir local_path
to copy the crawl data to the local filesystem once the crawl is done.
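
As a rough sketch, the copy step could be wrapped in a small helper like the one below. This is an illustration only, not something shipped with Nutch or Hadoop: the function name copy_crawl_to_local and the HADOOP_BIN variable are made up for this example, and it assumes the crawl has already finished and is run from the Hadoop install directory.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the `dfs -get` step shown above.
# HADOOP_BIN and copy_crawl_to_local are hypothetical names for this example.

# Path to the hadoop launcher; override it if Hadoop lives elsewhere.
HADOOP_BIN="${HADOOP_BIN:-bin/hadoop}"

# Copy a finished crawl directory out of the DFS to the local filesystem.
#   $1 - crawl directory in the DFS (e.g. crawldir)
#   $2 - destination path on the local machine
copy_crawl_to_local() {
  if [ "$#" -ne 2 ]; then
    echo "usage: copy_crawl_to_local <dfs_crawl_dir> <local_path>" >&2
    return 2
  fi
  "$HADOOP_BIN" dfs -get "$1" "$2"
}
```

After sourcing the script, the call would look like `copy_crawl_to_local crawldir local_path`, which runs the same command as above.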

- Espen

hzhong wrote:
> Hello,
> 
> I currently have Nutch running on Hadoop.  However, for one specific crawl,
> I would like to store the data on a local machine instead of putting it on
> Hadoop.
> 
> I basically modified Crawl.java to change the filesystem to local:
> Configuration conf = NutchConfiguration.create();
> conf.addDefaultResource("crawl-tool.xml");
> FileSystem localFs = FileSystem.getNamed("local", conf);                      
> JobConf job = new NutchJob(localFs.getConf());
> 
> Path dir = new Path(some_local_path_on_the_machine);
> Path crawlDb = new Path(dir + "/crawldb");
> Path linkDb = new Path(dir + "/linkdb");
> Path segments = new Path(dir + "/segments");
> Path indexes = new Path(dir + "/indexes");
> Path index = new Path(dir + "/index");
> Path rootURL = new Path(local_path_on_the_machine);
> 
> Injector injector = new Injector(conf);
> Generator generator = new Generator(conf);
> Fetcher fetcher = new Fetcher(conf);
> ParseSegment parseSegment = new ParseSegment(conf);
> CrawlDb crawlDbTool = new CrawlDb(conf);
> LinkDb linkDbTool = new LinkDb(conf);
> Indexer indexer = new Indexer(conf);
> DeleteDuplicates dedup = new DeleteDuplicates(conf);
> IndexMerger merger = new IndexMerger(conf);
> 
> // initialize crawlDb
> injector.inject(crawlDb, rootURL);
> // ... and so on
> 
> I keep getting either
> Injector: starting
> Injector: crawlDb: crawl_db path
> Injector: urlDir: url path
> Injector: Converting injected urls to crawl db entries.
> Connection refused
> 
> or 
> 
> Injector: starting
> Injector: crawlDb: crawldb path
> Injector: urlDir: url path
> Injector: Converting injected urls to crawl db entries.
> Input path doesnt exist : url path
> 
> However, the url path does exist.  
> 
> Can someone give me pointers as to what's going on?  Or perhaps give me
> pointers on how to store data on a local machine?  I am not sure if this is
> the correct way of putting the data on the local machine.  
> 
> Thank you very much.
> 
> Hanna


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general