[EMAIL PROTECTED] wrote:
> Hi,
>
> I'm using Nutch to index a single site. I have a need to crawl/fetch/index
> the staging version of the site and then use the resulting index for
> searching the production site. The problem is that staging and production
> sites have different URLs, for example:
>
> Staging:
> http://STAGING.example.com/foo/bar.html
>
> Production:
> http://WWW.example.com/foo/bar.html
>
> What I'd like to be able to do is index the staging site and then just
> push the index to production and have it work for production searches.
> Obviously, the links stored in the index would be wrong (STAGING.example.com
> vs. WWW.example.com).
>
> What is the best way to accomplish this?
>
> One thing I was thinking was to index the staging site, then open up CrawlDb
> and LinkDb (any others?), loop through them and write out a new version of
> those files, changing the keys (URLs) along the way, for instance from
> http://STAGING.example.com/foo/bar.html to http://WWW.example.com/foo/bar.html
>
> Has anyone done this? Does this sound realistic/doable?
> Is there a better/faster/easier way?
> e.g. changing URLs immediately at fetch/parse/index time?
> e.g. changing URLs on the fly at search time when displaying results?

There is another option: when fetching, configure Nutch to use a URL-rewriting proxy. The proxy rewrites your requests for www.example.com to staging.example.com on the fly, gets the response, and returns the content, so the URLs recorded in the crawl are already the production ones. The only thing left to do then would be to rewrite absolute outlinks contained in the content, from staging to www, but this can be done in URLNormalizers.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
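
For illustration, here is a minimal sketch of the staging-to-www outlink rewrite mentioned above. It is plain Java and does not use any Nutch classes: the class name HostRewriter and the method rewrite are hypothetical, and the host names are the ones from the question. In a real setup this logic would presumably live inside a URLNormalizer plugin (or be expressed as a rule for Nutch's regex-based normalizer, if that is enough for your URLs); that wiring is not shown here.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Nutch: maps staging URLs to production URLs.
final class HostRewriter {
    // Host names taken from the question; adjust for your own sites.
    static final String STAGING_HOST = "staging.example.com";
    static final String PRODUCTION_HOST = "www.example.com";

    // Returns the URL with the staging host replaced by the production host.
    // URLs on any other host, relative URLs, and unparsable URLs are returned unchanged.
    static String rewrite(String url) {
        final URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return url; // leave malformed input alone
        }
        String host = uri.getHost();
        if (host == null || !host.equalsIgnoreCase(STAGING_HOST)) {
            return url;
        }
        // Rebuild from the raw (still percent-encoded) components so that
        // nothing except the host changes.
        StringBuilder out = new StringBuilder();
        out.append(uri.getScheme()).append("://");
        if (uri.getRawUserInfo() != null) {
            out.append(uri.getRawUserInfo()).append('@');
        }
        out.append(PRODUCTION_HOST);
        if (uri.getPort() != -1) {
            out.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            out.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            out.append('#').append(uri.getRawFragment());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints http://www.example.com/foo/bar.html
        System.out.println(rewrite("http://STAGING.example.com/foo/bar.html"));
    }
}
```

The host comparison ignores case, so STAGING.example.com and staging.example.com are treated the same, and links to other hosts pass through untouched.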
