Hi,

We are in a somewhat similar situation, so we would be really happy to hear any suggestions on this.

Incremental crawling doesn't seem to work for us either: the same URLs
keep being crawled over and over (on a daily basis!).

have you tried these settings or similar?

db.fetch.schedule.class = org.apache.nutch.crawl.AdaptiveFetchSchedule
db.update.additions.allowed = true
db.ignore.internal.links = false
db.ignore.external.links = true  (because we are intranet-only)
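For what it's worth, here is roughly how those settings would look in nutch-site.xml. Take this as a sketch only: the adaptive-schedule tuning properties (inc_rate, max_interval) and the example values are assumptions on my part, so verify the exact property names and defaults against the nutch-default.xml shipped with your version.

```xml
<!-- Sketch of a nutch-site.xml fragment; verify property names
     against your version's nutch-default.xml. -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value> <!-- intranet-only crawl -->
</property>
<property>
  <!-- assumed tuning knob: how fast the re-fetch interval grows
       when a page comes back unmodified -->
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value>
</property>
<property>
  <!-- assumed tuning knob: cap the re-fetch interval (seconds) -->
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <value>2592000</value> <!-- 30 days -->
</property>
```

The idea with the adaptive schedule is that pages that rarely change get re-fetched less and less often, which should cut down on the same URLs being crawled daily.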



>
>
> Currently we crawl every two days, create a new index, and then merge
> it with the earlier index. For one, it takes too long, as mergesegs
> seems to take time proportional to the size of both indexes combined.
> An equally problematic issue is that mergesegs fails a significant
> portion of the time, and the probability becomes higher with size. The
> problems exist whether the merge is done within Hadoop or outside.
>
> Two questions:
> (a) Has anybody been able to do a Nutch merge predictably, irrespective
> of the size? Any tips? We are trying to merge data for up to 200K URLs
> at a time.
>
> (b) How can we do incremental indexing, where we add data from the
> latest crawl, but there is only one index that keeps growing? I saw a
> lot of older posts regarding incremental indexing and no clear answers.
>
> Thanks in advance for your help.
>
> Shreekanth
>
> --
> View this message in context:
> http://old.nabble.com/Growing-the-index-%3A-Merging-vs-incremental-tp26228341p26228341.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>