Hi,
I'm using Nutch to index a single site. I need to crawl/fetch/index the
staging version of the site and then use the resulting index to search
the production site. The problem is that the staging and production sites have
different URLs, for example:
Staging:
http://STAGING.example.com/foo/bar.html
Production:
http://WWW.example.com/foo/bar.html
What I'd like to be able to do is index the staging site and then just push
the index to production and have it work for production searches. Obviously,
the links stored in the index would be wrong (STAGING.example.com vs.
WWW.example.com).
What is the best way to accomplish this?
One thing I was thinking was to index the staging site, then open up CrawlDb
and LinkDb (any others?), loop through them and write out a new version of
those files, changing the keys (URLs) along the way, for instance from
http://STAGING.example.com/foo/bar.html to http://WWW.example.com/foo/bar.html
Has anyone done this? Does this sound realistic/doable?
Is there a better/faster/easier way?
e.g. changing URLs immediately at fetch/parse/index time?
e.g. changing URLs on the fly at search time when displaying results?
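Whichever stage the rewrite happens at (rewriting the CrawlDb/LinkDb keys, at fetch/parse/index time, or when displaying results), the core of it is the same staging-to-production URL mapping. Here is a minimal sketch of just that mapping, in plain Python rather than against any Nutch API. The names HOST_MAP and to_production are made up for illustration, and the host names are the ones from the example above:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping from staging host to production host.
# Host names are case-insensitive, so keys are kept in lower case.
HOST_MAP = {"staging.example.com": "www.example.com"}


def to_production(url, host_map=HOST_MAP):
    """Return url with its host swapped according to host_map.

    Scheme, port, path, query and fragment are preserved. URLs whose
    host is not in host_map are returned unchanged. Any user:password
    part of the URL is dropped (an assumption of this sketch).
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    new_host = host_map.get(host)
    if new_host is None:
        return url  # not a staging URL, leave it alone
    # Keep an explicit port if the original URL had one.
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(to_production("http://STAGING.example.com/foo/bar.html"))
```

Parsing the URL and replacing only the host, rather than doing a string replace on the whole URL, avoids touching a path or query string that happens to contain the staging host name.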
Thanks,
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general