Hi Antony,
I decided to first delete all duplicates from the master (iW) and then to add
all the temporary indices (other).
Any other opinions?
Best regards
Karsten
<code>
public static synchronized void merge(IndexWriter iW, Directory[] other,
        final String uniqueID_FieldName) throws IOException {
    final Term firstFieldTerm = new Term(uniqueID_FieldName, "");
    boolean rollback = true;
    try {
        Term[] possibleDuplicates;
        for (Directory toAddDir : other) {
            IndexReader toAddIR = IndexReader.open(toAddDir);
            try {
                int indexSize = toAddIR.numDocs();
                possibleDuplicates = new Term[indexSize];
                int cnt = 0;
                TermEnum possibleDuplicateTerms = toAddIR.terms(firstFieldTerm);
                try {
                    Term possibleDuplicateTerm = possibleDuplicateTerms.term();
                    while (possibleDuplicateTerm != null) {
                        // Terms are enumerated in field order; stop once we
                        // have left the unique-ID field.
                        if (!possibleDuplicateTerm.field().equals(uniqueID_FieldName)) {
                            break;
                        }
                        if (moreThanOneDocument(toAddIR, possibleDuplicateTerm)) {
                            System.out.println("warning: unique id is not unique: "
                                    + possibleDuplicateTerm);
                        }
                        assert cnt < indexSize :
                                "more unique-id terms than documents";
                        possibleDuplicates[cnt++] = possibleDuplicateTerm;
                        possibleDuplicateTerms.next();
                        possibleDuplicateTerm = possibleDuplicateTerms.term();
                    }
                } finally {
                    possibleDuplicateTerms.close();
                }
                if (indexSize != cnt) {
                    possibleDuplicates = Arrays.copyOf(possibleDuplicates, cnt);
                    System.out.println("log: " + indexSize + " != " + cnt);
                }
            } finally {
                toAddIR.close();
            }
            // Delete from the master every document the incoming index
            // will re-add, so the merge cannot create duplicates.
            iW.deleteDocuments(possibleDuplicates);
        }
        iW.addIndexes(other);
        rollback = false;
    } finally {
        if (rollback) {
            iW.abort();
        } else {
            iW.flush();
        }
    }
}

public static boolean moreThanOneDocument(IndexReader iR, Term term)
        throws IOException {
    TermDocs tDoc = iR.termDocs(term);
    try {
        // True if at least two documents carry this term.
        return tDoc.next() && tDoc.next();
    } finally {
        tDoc.close();
    }
}
</code>
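The order of operations above (delete by unique ID first, then add the whole batch) can be shown with plain collections, no Lucene required. This is only a minimal sketch of the idea; `mergeBatch`, `master`, and `batch` are hypothetical stand-ins, with a map from unique ID to document content playing the role of an index:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MergeSketch {

    // Merge a temporary batch into a master, keyed by unique ID:
    // first delete master entries whose ID appears in the batch,
    // then add every batch entry -- the same order as merge() above.
    static Map<String, String> mergeBatch(Map<String, String> master,
                                          Map<String, String> batch) {
        Map<String, String> result = new LinkedHashMap<>(master);
        result.keySet().removeAll(batch.keySet()); // delete duplicates first
        result.putAll(batch);                      // then add the batch
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> master = new LinkedHashMap<>();
        master.put("id1", "old-1");
        master.put("id2", "old-2");
        Map<String, String> batch = new LinkedHashMap<>();
        batch.put("id2", "new-2");
        batch.put("id3", "new-3");
        // id2 is replaced by the batch version, id1 survives, id3 is new.
        System.out.println(mergeBatch(master, batch));
    }
}
```

Doing the deletes before the adds means a duplicate ID always resolves to the newest (batch) version, which matches what the real merge() intends.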
Antony Bowesman wrote:
>
> I am creating several temporary batches of indexes in separate indices and
> will periodically merge those batches into a set of master indices. I'm using
> IndexWriter#addIndexesNoOptimize(), but the problem that gives me is that the
> master may already contain the index for that document, so I get a duplicate.
>
> Duplicates are prevented in the temporary index because, when adding
> Documents, I call IndexWriter#deleteDocuments(Term) with my UID before I add
> the Document.
>
> I have two choices:
>
> a) merge indexes then clean up any duplicates in the master (or vice versa).
> Probably IndexWriter.deleteDocuments(Term[]) would suit here, with all the
> UIDs of the incoming documents.
>
> b) iterate through the Documents in the temporary index and add them to the
> master.
>
> b sounds worse, as it seems an IndexWriter's Analyzer cannot be null, and I
> guess there's a penalty in assembling the Document from the reader.
>
> Any views?
> Antony
>
--
View this message in context:
http://www.nabble.com/Merging-indexes---which-is-best-option--tp19325185p19380709.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.