[EMAIL PROTECTED] wrote:
> Hi,
> 
> I'm using Nutch to index a single site.  I need to crawl/fetch/index 
> the staging version of the site and then use the resulting index for 
> searching the production site.  The problem is that the staging and 
> production sites have different URLs, for example:
> 
>   Staging:
>     http://STAGING.example.com/foo/bar.html
> 
>   Production:
>     http://WWW.example.com/foo/bar.html
>  
> 
> What I'd like to be able to do is index the staging site and then just 
> push the index to production and have it work for production searches.  
> Obviously, the links stored in the index would be wrong (STAGING.example.com 
> vs. WWW.example.com).
> 
> What is the best way to accomplish this?
> 
> One thing I was thinking was to index the staging site, then open up CrawlDb 
> and LinkDb (any others?), loop through them and write out a new version of 
> those files, changing the keys (URLs) along the way, for instance from 
> http://STAGING.example.com/foo/bar.html to http://WWW.example.com/foo/bar.html
> 
> Has anyone done this?  Does this sound realistic/doable?
> Is there a better/faster/easier way?
>   e.g. changing URLs immediately at fetch/parse/index time?
>   e.g. changing URLs on the fly at search time when displaying results?

There is another option: when fetching, configure Nutch to use a 
URL-rewriting proxy. Nutch requests www.example.com URLs, and the proxy 
rewrites each request on the fly to staging.example.com, gets the 
response, and returns the content. The only thing left to do then is to 
rewrite absolute outlinks contained in the content, from staging to www, 
and this can be done in URLNormalizers.
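
To make the host mapping concrete, here is a rough sketch of the rewrite 
logic in Python. It is not Nutch code: a real normalizer would be a Nutch 
plugin, and the function names (swap_host, to_production, to_staging) are 
made up for illustration. It assumes the two sites differ only in host 
name, and it drops any user:password part of the URL.

```python
from urllib.parse import urlsplit, urlunsplit

# Host names taken from the example in the question.
STAGING_HOST = "staging.example.com"
PRODUCTION_HOST = "www.example.com"


def swap_host(url: str, old_host: str, new_host: str) -> str:
    """Return url with old_host replaced by new_host; other URLs pass through."""
    parts = urlsplit(url)
    # .hostname is already lower-cased, so STAGING.example.com matches too.
    if parts.hostname != old_host:
        return url
    netloc = new_host
    if parts.port is not None:
        # Keep an explicit port if the original URL had one.
        netloc = f"{new_host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


def to_production(url: str) -> str:
    """What the normalizer step would do to absolute outlinks."""
    return swap_host(url, STAGING_HOST, PRODUCTION_HOST)


def to_staging(url: str) -> str:
    """What the rewriting proxy would do to each incoming request URL."""
    return swap_host(url, PRODUCTION_HOST, STAGING_HOST)


if __name__ == "__main__":
    print(to_production("http://STAGING.example.com/foo/bar.html"))
    print(to_staging("http://WWW.example.com/foo/bar.html"))
```

The proxy applies the www-to-staging direction to requests, and the 
normalizer applies the staging-to-www direction to outlinks, so the 
crawl data only ever contains production URLs.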

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general