[jira] Commented: (NUTCH-49) Flag for generate to fetch only new pages to complement the -refetchonly flag

2005-10-25 Thread byron miller (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-49?page=comments#action_12355864 ] byron miller commented on NUTCH-49: --- Can something like this be adapted to use the regex filter as well? It would be nice to say "new only" and match URLs of x type or x link

Re: status dedub

2005-10-25 Thread Stefan Groschupf
Hi Doug, I copy a working index and merge the original and the old together. Then I run dedup over this index. Shouldn't the dedup tool remove the duplicates in the merged index? Thanks, Stefan On 24.10.2005 at 21:25, Doug Cutting wrote: It works for me. It currently only deletes

Re: status dedub

2005-10-25 Thread Marko Bauhardt
Hi, here is a shell script that reproduces the problem. We noticed that after running dedup on the merged index we have fewer documents than in the original index. Number of documents in original index: 42. Dedup index: 17. Could there be a mistake somewhere in the script or in the process itself? Regards,

Re: status dedub

2005-10-25 Thread Doug Cutting
Stefan Groschupf wrote: I copy a working index and merge the original and the old together. Then I run dedup over this index. Shouldn't the dedup tool remove the duplicates in the merged index? I usually dedup before index merge, so that the merged index contains no duplicates. The
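To make the dedup-then-merge versus merge-then-dedup question concrete, here is a minimal sketch. It assumes a toy rule in which duplicates are documents that share a URL or share an MD5 hash of their content; the function `dedup`, the sample URLs, and the "later entry wins" policy are all hypothetical and are not taken from the Nutch source.

```python
import hashlib


def dedup(docs):
    """Keep one document per URL and one per content hash.

    `docs` is a list of (url, content) pairs in index order. A later entry
    replaces an earlier one with the same URL. This is an illustrative rule
    only, not a description of what the Nutch dedup tool actually does.
    """
    by_url = {}
    for url, content in docs:
        by_url[url] = content  # later duplicates of a URL replace earlier ones

    seen_hashes = set()
    kept = []
    for url, content in by_url.items():
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # same content already kept under another URL
        seen_hashes.add(digest)
        kept.append((url, content))
    return kept


# Hypothetical index: two URLs happen to serve identical content.
original = [
    ("http://a.example/", "alpha"),
    ("http://b.example/", "beta"),
    ("http://c.example/mirror", "alpha"),
]

# Merging the index with a copy of itself, then deduplicating.
merged = original + list(original)

print(len(original), len(dedup(original)), len(dedup(merged)))
```

Under this toy rule, deduplicating the merged index gives the same result as deduplicating the original alone, and that result can hold fewer documents than the raw original whenever the original already contained duplicates.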

RE: merge indices from multiple webdb

2005-10-25 Thread Graham Stead
I am by no means a Nutch expert yet, but this is how I merged two separate segments so I could search through them:
Step 1:
$ bin/nutch mergesegs -local -o testmerge -i ../crawls/foo/segments/20051018224434/ ../crawls/bar/segments/20051018225505/
[bunch of stuff happens]
This creates a segment

Re: merge indices from multiple webdb

2005-10-25 Thread AJ Chen
Thanks so much, Graham. This should do it. A related question: after the merge, is it possible to build the new webdb as well? The link data for the merged db can differ from that of the two original dbs. To get accurate page ranking, the link data should be updated. AJ On 10/25/05,

RE: merge indices from multiple webdb

2005-10-25 Thread Andrey Ilinykh
If you merge two segments, the page ranks are off. You have to build a new webdb, calculate page rank, and then build one more segment again. Thank you, Andrey -Original Message- From: AJ Chen [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 25, 2005 2:02 PM To: nutch-dev@lucene.apache.org
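The point about page ranks being off after a merge can be illustrated with a toy link graph. This sketch uses a plain count of distinct inlinks as a stand-in for a link-based score, which is an assumption for illustration and not how Nutch computes page rank; the crawl names and URLs are hypothetical.

```python
from collections import Counter


def inlink_counts(links):
    """Count distinct inlinks per target from (source, target) pairs."""
    return Counter(target for _source, target in set(links))


# Two hypothetical crawls, each with its own link data.
crawl_foo = [("foo/1", "shared/page"), ("foo/2", "shared/page")]
crawl_bar = [("bar/1", "shared/page"), ("bar/1", "bar/2")]

score_foo = inlink_counts(crawl_foo)
score_bar = inlink_counts(crawl_bar)
# Recomputing over the combined link data, as rebuilding the db would allow.
score_merged = inlink_counts(crawl_foo + crawl_bar)

print(score_foo["shared/page"], score_bar["shared/page"], score_merged["shared/page"])
# prints: 2 1 3
```

Neither per-crawl score matches the score computed over the combined links, which is why scores carried over from the separate crawls would be stale in a merged segment.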

Re: merge indices from multiple webdb

2005-10-25 Thread AJ Chen
How do you build a new webdb from the merged segment/index? Could you provide detailed steps for the process you described? Thanks. AJ On 10/25/05, Andrey Ilinykh [EMAIL PROTECTED] wrote: If you merge two segments, the page ranks are off. You have to build a new webdb, calculate page rank, and then