Hi,

I'm using Nutch to index a single site.  I need to crawl/fetch/index the 
staging version of the site and then use the resulting index to search the 
production site.  The problem is that the staging and production sites have 
different URLs, for example:

  Staging:
    http://STAGING.example.com/foo/bar.html

  Production:
    http://WWW.example.com/foo/bar.html
 

What I'd like to be able to do is index the staging site and then just push 
the index to production and have it work for production searches.  Obviously, 
the links stored in the index would be wrong (STAGING.example.com vs. 
WWW.example.com).

What is the best way to accomplish this?

One thing I was thinking was to index the staging site, then open up CrawlDb 
and LinkDb (any others?), loop through them and write out a new version of 
those files, changing the keys (URLs) along the way, for instance from 
http://STAGING.example.com/foo/bar.html to http://WWW.example.com/foo/bar.html
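
To make the key rewrite concrete, here is a minimal sketch of just the 
URL-mapping step, written in Python for brevity.  It does not read or write 
Nutch's CrawlDb or LinkDb formats; the name rewrite_host and the HOST_MAP 
table are hypothetical and only illustrate the per-key transformation I have 
in mind:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping from staging host to production host.
# Hostnames are case-insensitive, so keys are stored in lowercase.
HOST_MAP = {"staging.example.com": "www.example.com"}


def rewrite_host(url, host_map=HOST_MAP):
    """Return url with its host swapped according to host_map.

    Scheme, port, path, query and fragment are kept as they are.
    URLs whose host is not in host_map are returned unchanged.
    Any user:password@ prefix is dropped (assumed not to be used here).
    """
    parts = urlsplit(url)
    host = parts.hostname  # urlsplit lowercases this; None if no host
    if host is None or host not in host_map:
        return url
    new_netloc = host_map[host]
    if parts.port is not None:
        new_netloc += ":%d" % parts.port
    return urlunsplit(parts._replace(netloc=new_netloc))


if __name__ == "__main__":
    print(rewrite_host("http://STAGING.example.com/foo/bar.html"))
```

The same function could in principle be applied at any of the stages I ask 
about further down (while rewriting the db files, at index time, or when 
displaying results); which of those is the right place is exactly my question.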

Has anyone done this?  Does this sound realistic/doable?
Is there a better/faster/easier way?
  e.g. changing URLs immediately at fetch/parse/index time?
  e.g. changing URLs on the fly at search time when displaying results?


Thanks,
Otis

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share


