Hi,

I'm new to Nutch and am trying to get my head around some basics. I need
to index two sites, one of which is under my control, into a single search.

For the first site, which is under my control, I have run a complete 'seed'
crawl and would like to update the index daily. To avoid recrawling the whole
site, I have set up a 'what's new/changed' page which I want to crawl daily to
pick up any changes. I then want to merge this with the complete crawl to
produce an up-to-date index. (I tried the recrawl script from the wiki, but
it didn't seem to do what I wanted.)

I have merged the two indexes in the following way:

- created a new directory mergedcrawl
- copied seedcrawl/indexes/part-00000/* to mergedcrawl/indexes/part-00000
- copied changedcrawl/indexes/part-00000/* to mergedcrawl/indexes/part-00001
- ran 'bin/nutch dedup indexes' on mergedcrawl
- ran 'bin/nutch merge index indexes' on mergedcrawl
- copied segments/* from both crawls into mergedcrawl/segments
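
As a script, the steps above look roughly like the sketch below. The function
name merge_crawls and the NUTCH variable are made up for illustration. The two
nutch commands are the ones from the list; passing them paths under the merged
directory is my assumption about how they should be invoked, and the segment
directory names from the two crawls are assumed not to clash.

```shell
#!/bin/sh
# Sketch of the manual merge described in the list above.
# NUTCH is a hypothetical override; it defaults to bin/nutch as in the post.
NUTCH="${NUTCH:-bin/nutch}"

# merge_crawls SEED_DIR CHANGED_DIR MERGED_DIR  (hypothetical helper)
merge_crawls() {
    seed="$1"
    changed="$2"
    merged="$3"

    # New directory with one index part per source crawl, plus segments.
    mkdir -p "$merged/indexes/part-00000" \
             "$merged/indexes/part-00001" \
             "$merged/segments" || return 1

    # Seed index becomes part-00000, changed index becomes part-00001.
    cp -R "$seed/indexes/part-00000/." "$merged/indexes/part-00000/" || return 1
    cp -R "$changed/indexes/part-00000/." "$merged/indexes/part-00001/" || return 1

    # Dedup, then merge, as in the list. The path arguments are an assumption.
    "$NUTCH" dedup "$merged/indexes" || return 1
    "$NUTCH" merge "$merged/index" "$merged/indexes" || return 1

    # Segments from both crawls go side by side in the merged directory.
    cp -R "$seed/segments/." "$merged/segments/" || return 1
    cp -R "$changed/segments/." "$merged/segments/" || return 1
}
```

It would be called as 'merge_crawls seedcrawl changedcrawl mergedcrawl' from
the Nutch install directory.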

With searcher.dir pointing to the new directory, the search seems to return
results from both indexes successfully. Is this the correct way to do this?

The second site is not under my control, so I need to find an alternative
way to keep the index up to date. Am I correct in thinking that simply
recrawling the whole site is the easiest way to do this - or is there a way
to only index modified pages?

Finally, I seem to have a problem with identical pages that have different
URLs, e.g.:

http://website/
http://website/default.htm

I was under the impression that these would be removed by the dedup process,
but this does not seem to be working. Is there something I'm missing? (I
also have a similar problem with the external site, as it carries session IDs
around in the URL which change, although the content of the duplicate pages
is identical.)

Sorry for the long post - any help is appreciated!

