Hi, I'm new to Nutch and am trying to get my head around some basics... I need to index two sites, one of which is under my control, into a single search.
The first site, which is under my control, I have run a complete 'seed' crawl over, and I would like to update the index daily. To avoid recrawling the whole site, I have set up a 'what's new/changed' page which I want to crawl daily to pick up any changes. I then want to merge this with the complete crawl to produce an up-to-date index. (I tried the recrawl script from the wiki, but it didn't seem to do what I wanted.)

I have merged the two indexes in the following way:

- created a new directory mergedcrawl
- copied seedcrawl/indexes/part-00000/* to mergedcrawl/indexes/part-00000
- copied changedcrawl/indexes/part-00000/* to mergedcrawl/indexes/part-00001
- ran 'bin/nutch dedup indexes' on mergedcrawl
- ran 'bin/nutch merge index indexes' on mergedcrawl
- copied /segments/* from both crawls into the mergedcrawl

With searcher.dir pointing to the new directory, the search seems to return results from both indexes successfully. Is this the correct way to do this?

The second site is not under my control, so I need to find an alternative way to keep the index up to date. Am I correct in thinking that simply recrawling the whole site is the easiest way to do this, or is there a way to index only the modified pages?

Finally, I seem to have a problem with identical pages that have different URLs, e.g.

http://website/
http://website/default.htm

I was under the impression that these would be removed by the dedup process, but this does not seem to be working. Is there something I'm missing? (I have a similar problem with the external site, as it carries session ids around in the URL which change, although the content of the duplicate pages is identical.)

Sorry for the long post. Any help is appreciated!

--
View this message in context: http://www.nabble.com/Quick-questions---merging-deduping-tf3267849.html#a9084405
Sent from the Nutch - User mailing list archive at Nabble.com.
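P.S. To make the duplicate-URL problem concrete, here is a minimal Python sketch of the kind of normalization I would expect to collapse such URLs into one. This is only an illustration and not Nutch code: the function name canonical_url, the session parameter names, and the list of default page names are all assumptions of mine, and session ids embedded in the path (rather than the query string) are not handled.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical names, chosen for illustration; adjust for the real sites.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}
DEFAULT_PAGES = {"default.htm", "index.htm", "index.html"}


def canonical_url(url):
    """Return a canonical form of url so trivially different URLs compare equal."""
    parts = urlsplit(url)
    path = parts.path or "/"

    # Treat a trailing default document as the directory itself:
    # /default.htm becomes /
    head, _, tail = path.rpartition("/")
    if tail.lower() in DEFAULT_PAGES:
        path = head + "/"

    # Drop session-id query parameters, keeping the others in order.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]

    # Lower-case scheme and host, and discard any fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


if __name__ == "__main__":
    print(canonical_url("http://website/"))                        # http://website/
    print(canonical_url("http://website/default.htm"))             # http://website/
    print(canonical_url("http://website/page.htm?sid=abc&id=5"))   # http://website/page.htm?id=5
```

With something like this applied before URLs are compared, both of the example URLs above would map to http://website/, and pages differing only in a session id would share one URL.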
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
