[ http://issues.apache.org/jira/browse/NUTCH-49?page=comments#action_12355864 ]
byron miller commented on NUTCH-49:
---
Can something like this be adapted to use the regex filter as well? It would be
nice to be able to say "new only" and match URLs of type x or link x.
Hi Doug,
I copy a working index and merge the original and the copy together.
Then I run dedup over the merged index. Shouldn't the dedup tool
remove the duplicates in the merged index?
Thanks,
Stefan
On 24.10.2005 at 21:25, Doug Cutting wrote:
It works for me. It currently only deletes
Hi,
here is a shell script that reproduces the problem.
We noticed that after dedup, the merged index has fewer documents
than the original index.
Number of Documents in
Original Index: 42
Dedup Index: 17
Could there be a mistake somewhere in the script or in the process itself?
Regards,
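
The expectation behind this report can be sketched outside Nutch. The script below is only an illustration under stated assumptions: the "url hash" record format and the dedup_records helper are hypothetical, and Nutch's dedup tool may use different keys and tie-breaking rules. It "merges" a small record list with a copy of itself and then drops repeated URLs and repeated content hashes.

```shell
#!/bin/sh
# Hypothetical illustration only: each record is "<url> <content-hash>".
# This is not how Nutch stores or deduplicates index entries.

# Keep the first record seen for each URL, then the first seen for each hash.
dedup_records() {
  awk '!seen_url[$1]++' | awk '!seen_hash[$2]++'
}

original='http://a.example/ h1
http://b.example/ h2
http://c.example/ h3'

# "Merge" the index with a copy of itself by concatenating the records twice.
merged=$(printf '%s\n%s\n' "$original" "$original")

original_count=$(printf '%s\n' "$original" | wc -l | tr -d ' ')
merged_count=$(printf '%s\n' "$merged" | wc -l | tr -d ' ')
dedup_count=$(printf '%s\n' "$merged" | dedup_records | wc -l | tr -d ' ')

echo "original=$original_count merged=$merged_count dedup=$dedup_count"
```

With these three unique records the deduplicated count equals the original count. If the deduplicated count comes out lower, as in the numbers reported here, one thing worth checking is whether the original already contained records sharing a URL or a content hash; that is a guess, not something established in this thread.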
Stefan Groschupf wrote:
I copy a working index and merge the original and the copy together.
Then I run dedup over the merged index. Shouldn't the dedup tool remove
the duplicates in the merged index?
I usually dedup before index merge, so that the merged index contains no
duplicates. The
I am by no means a Nutch expert yet, but this is how I merged two
separate segments so I could search through them:
Step 1:
$ bin/nutch mergesegs -local -o testmerge -i \
    ../crawls/foo/segments/20051018224434/ \
    ../crawls/bar/segments/20051018225505/
bunch of stuff happens
This creates a segment
Thanks so much, Graham. This should do it.
A related question: after the merge, is it possible to build the new webdb
as well? The link data for the merged db can differ from that of the two
original dbs. To get accurate page ranking, the link data should be
updated.
AJ
On 10/25/05,
If you merge two segments, the page ranks are off. You have to build a new webdb,
calculate page rank, and then build one more segment.
Thank you,
Andrey
-Original Message-
From: AJ Chen [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 25, 2005 2:02 PM
To: nutch-dev@lucene.apache.org
How do you build a new webdb from the merged segment/index? Could you provide
detailed steps for the process you described? Thanks.
AJ
On 10/25/05, Andrey Ilinykh [EMAIL PROTECTED] wrote:
If you merge two segments, the page ranks are off. You have to build a new webdb,
calculate page rank and then