solr-duplicate post management

2009-01-11 Thread S.Selvam Siva
Hi,

I have 6 fields in my Solr schema:
   1) id (unique key)
   2) urlid
   3) url
and so on, up to 6).

We have been posting 300,000 to 400,000 (3 to 4 lakh) .xml files per day,
about 50% of which are duplicate posts.

What I need is to log the existing urlid and the new urlid (the two will of
course differ) whenever an .xml file with an already indexed id (the unique
key) is posted.

I want to do this by modifying the Solr source. Which file do I need to
modify to get these details into the log?

I tried DirectUpdateHandler2.java (which removes the duplicate entries),
but my efforts were in vain.


-- 
Yours,
S.Selvam


Re: solr-duplicate post management

2009-01-21 Thread Chris Hostetter

: what i need is ,to log the existing urlid and new urlid(of course both will
: not be same) ,when a .xml file of same id(unique field) is posted.
: 
: I want to make this by modifying the solr source.Which file do i need to
: modify so that i could get the above details in log ?
: 
: I tried with DirectUpdateHandler2.java(which removes the duplicate
: entries),but efforts in vein.

DirectUpdateHandler2.java (on the trunk) delegates to Lucene-Java's 
IndexWriter.updateDocument method when you have a uniqueKey and you aren't 
allowing duplicates -- this method doesn't give you any way to access the 
old document(s) that had that existing key.

The easiest way to make a change like what you are interested in might be 
an UpdateProcessor that does a lookup/search for the uniqueKey of each 
document about to be added to see if it already exists.  that's probably 
about as efficient as you can get, and would be nicely encapsulated.
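
A minimal sketch of the lookup-before-add idea described above, in plain
Java with an in-memory map standing in for the index. The names
DuplicateLoggingProcessor, processAdd and duplicateLog are hypothetical and
are not Solr's API; in a real setup the same logic would presumably live in
a subclass of Solr's UpdateRequestProcessor, and the map lookup would be a
search on the uniqueKey field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an update processor; this is not Solr's API.
class DuplicateLoggingProcessor {
    // Stand-in for the index: uniqueKey ("id") -> stored "urlid".
    private final Map<String, String> urlidById = new HashMap<>();
    // Collected log lines in the form "id,oldUrlid,newUrlid".
    private final List<String> duplicateLog = new ArrayList<>();

    /** Looks the id up before adding; records old and new urlid if it exists. */
    void processAdd(String id, String urlid) {
        if (urlidById.containsKey(id)) {
            // The id is already indexed: log the existing and the new urlid.
            duplicateLog.add(id + "," + urlidById.get(id) + "," + urlid);
        }
        // Overwrite the old entry, as an add with an existing uniqueKey would.
        urlidById.put(id, urlid);
    }

    List<String> duplicateLog() {
        return duplicateLog;
    }
}

class DuplicateLoggingDemo {
    public static void main(String[] args) {
        DuplicateLoggingProcessor processor = new DuplicateLoggingProcessor();
        processor.processAdd("doc1", "u100");
        processor.processAdd("doc2", "u200");
        processor.processAdd("doc1", "u101"); // same id, different urlid
        for (String line : processor.duplicateLog()) {
            System.out.println(line);
        }
    }
}
```

Keeping the check in its own class, in front of the add, is what makes the
approach encapsulated: the code that performs the actual update is not
touched.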

You might also want to take a look at SOLR-799, where some work is being 
done to create UpdateProcessors that can do "near duplicate" detection...

http://wiki.apache.org/solr/Deduplication
https://issues.apache.org/jira/browse/SOLR-799






-Hoss



Re: solr-duplicate post management

2009-01-22 Thread S.Selvam Siva
On Thu, Jan 22, 2009 at 7:12 AM, Chris Hostetter
wrote:

>
> : what i need is ,to log the existing urlid and new urlid(of course both will
> : not be same) ,when a .xml file of same id(unique field) is posted.
> :
> : I want to make this by modifying the solr source.Which file do i need to
> : modify so that i could get the above details in log ?
> :
> : I tried with DirectUpdateHandler2.java(which removes the duplicate
> : entries),but efforts in vein.
>
> DirectUpdateHandler2.java (on the trunk) delegates to Lucene-Java's
> IndexWriter.updateDocument method when you have a uniqueKey and you aren't
> allowing duplicates -- this method doesn't give you any way to access the
> old document(s) that had that existing key.
>
> The easiest way to make a change like what you are interested in might be
> an UpdateProcessor that does a lookup/search for the uniqueKey of each
> document about to be added to see if it already exists.  that's probably
> about as efficient as you can get, and would be nicely encapsulated.
>
> You might also want to take a look at SOLR-799, where some work is being
> done to create UpdateProcessors that can do "near duplicate" detection...
>
> http://wiki.apache.org/solr/Deduplication
> https://issues.apache.org/jira/browse/SOLR-799
>
>
>
>
>
>
> -Hoss
>

Thank you for your response. I will try it out.



-- 
Yours,
S.Selvam


Re: solr-duplicate post management

2009-01-24 Thread S.Selvam Siva
On Thu, Jan 22, 2009 at 2:33 PM, S.Selvam Siva wrote:

>
>
> On Thu, Jan 22, 2009 at 7:12 AM, Chris Hostetter wrote:
>
>>
>> : what i need is ,to log the existing urlid and new urlid(of course both will
>> : not be same) ,when a .xml file of same id(unique field) is posted.
>> :
>> : I want to make this by modifying the solr source.Which file do i need to
>> : modify so that i could get the above details in log ?
>> :
>> : I tried with DirectUpdateHandler2.java(which removes the duplicate
>> : entries),but efforts in vein.
>>
>> DirectUpdateHandler2.java (on the trunk) delegates to Lucene-Java's
>> IndexWriter.updateDocument method when you have a uniqueKey and you aren't
>> allowing duplicates -- this method doesn't give you any way to access the
>> old document(s) that had that existing key.
>>
>> The easiest way to make a change like what you are interested in might be
>> an UpdateProcessor that does a lookup/search for the uniqueKey of each
>> document about to be added to see if it already exists.  that's probably
>> about as efficient as you can get, and would be nicely encapsulated.
>>
>> You might also want to take a look at SOLR-799, where some work is being
>> done to create UpdateProcessors that can do "near duplicate" detection...
>>
>> http://wiki.apache.org/solr/Deduplication
>> https://issues.apache.org/jira/browse/SOLR-799
>>
>>
>>
>>
>>
>>
>> -Hoss
>>
>
>

Hi, I added some code to *DirectUpdateHandler2.java's doDeletions()* (Solr
1.2.0) and got the result I wanted: logging each duplicate post, i.e. the
old and new field values of the duplicate.


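   // existing doc, about to be deleted
   Document d1 = searcher.doc(prev);
   // new doc with the same id
   Document d2 = searcher.doc(tdocs.doc());
   String oldname = d1.get("name");
   String id1 = d1.get("id");
   String newname = d2.get("name");
   // id of the new doc; for a duplicate it has the same value as id1
   String id2 = d2.get("id");
   // one log line per duplicate: id, old name, new name
   out3.write(id1 + "," + oldname + "," + newname + "\n");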

But I don't know whether this will affect Solr's performance.
Any comments on the performance of the above solution are welcome.
-- 
Yours,
S.Selvam


Re: solr-duplicate post management

2009-01-26 Thread Chris Hostetter

: Hi, i added some code to *DirectUpdateHandler2.java's doDeletions()* (solr
: 1.2.0) ,and got the solution i wanted.(logging duplicate post entry-i.e old
: field and new field of duplicate post)
: 
: 
:Document d1=searcher.doc(prev);//existing doc to be deleted
:Document d2=searcher.doc(tdocs.doc());//new doc
:String oldname=d1.get("name");
:String id1=d1.get("id");
:String newname=d2.get("name");
:String id2=d1.get("id");
:out3.write(id1+","+oldname+","+newname+"\n");
: 
: But i dont know ,wether the performance of solr will be affected by this.
: Any comment on the performance issue for the above solution is welcome...

It's probably going to be painfully slow -- you're probably going to be a 
lot better off avoiding searcher.doc and using the FieldCache instead, 
but there are trade-offs there as well; it's largely going to depend on 
how often you're doing adds vs. commits.
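
To illustrate the adds vs. commits trade-off, here is a rough model in plain
Java. It is not Lucene's FieldCache API: UrlidCache, urlidFor, onCommit and
builds are hypothetical names, and the HashMap copy merely stands in for the
single full pass over the index that populating a per-field cache requires.
The assumption being illustrated is that such a cache is cheap to read once
built but has to be rebuilt after every commit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a per-field cache; this is not Lucene's FieldCache API.
class UrlidCache {
    // Stand-in for the committed index: id -> urlid.
    private final Map<String, String> committedIndex;
    // Lazily built copy of the one field we care about; null means "not built".
    private Map<String, String> cache;
    // How many times the expensive full pass was paid for.
    private int builds = 0;

    UrlidCache(Map<String, String> committedIndex) {
        this.committedIndex = committedIndex;
    }

    /** Cheap once the cache exists; the first call after a commit rebuilds it. */
    String urlidFor(String id) {
        if (cache == null) {
            cache = new HashMap<>(committedIndex); // one full pass over the index
            builds++;
        }
        return cache.get(id);
    }

    /** A commit exposes a new view of the index, so the old cache is dropped. */
    void onCommit() {
        cache = null;
    }

    int builds() {
        return builds;
    }
}

class UrlidCacheDemo {
    public static void main(String[] args) {
        Map<String, String> index = new HashMap<>();
        index.put("doc1", "u100");
        index.put("doc2", "u200");

        UrlidCache cache = new UrlidCache(index);
        cache.urlidFor("doc1");
        cache.urlidFor("doc2");
        cache.urlidFor("doc1");
        System.out.println("builds before commit: " + cache.builds());

        cache.onCommit();
        cache.urlidFor("doc1");
        System.out.println("builds after commit: " + cache.builds());
    }
}
```

In this model many adds between commits share one build, while committing
after nearly every add pays for a full rebuild each time.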

BTW: as I mentioned before, it probably makes more sense to implement this 
in an UpdateProcessor instead of hacking DirectUpdateHandler2 ... that way 
you'll be able to upgrade Solr without worrying about losing/redoing 
your changes.




-Hoss



Re: solr-duplicate post management

2009-01-26 Thread S.Selvam Siva
On Tue, Jan 27, 2009 at 5:03 AM, Chris Hostetter
wrote:

>
> : Hi, i added some code to *DirectUpdateHandler2.java's doDeletions()* (solr
> : 1.2.0) ,and got the solution i wanted.(logging duplicate post entry-i.e old
> : field and new field of duplicate post)
> :
> :
> :Document d1=searcher.doc(prev);//existing doc to be deleted
> :Document d2=searcher.doc(tdocs.doc());//new doc
> :String oldname=d1.get("name");
> :String id1=d1.get("id");
> :String newname=d2.get("name");
> :String id2=d1.get("id");
> :out3.write(id1+","+oldname+","+newname+"\n");
> :
> : But i dont know ,wether the performance of solr will be affected by this.
> : Any comment on the performance issue for the above solution is welcome...
>
> it's probably going to be painfully slow -- you're probably going to be a
> lot better off avoiding the use of searcher.doc and instead stick with
> using the FieldCache, but there are trade offs there as well, it's largely
> going to depend on how often you're doing adds vs. commits.
>
> BTW: as i mentioned before, it probably make more sense to implement this
> in an UpdateProcessor instead of hacking DirectUpdateHandler2 ... that way
> you'll be able to upgrade Solr without worryiing about losing/redocing
> your changes.
>
>
>
>
> -Hoss
>


Thanks a lot, Chris Hostetter.

I realize I should move this into an UpdateProcessor for the best
performance, but I am new to Solr (I started working with it a few months
ago) and found modifying DirectUpdateHandler2 a bit easier.
Also, because finding duplicate posts is urgent for us right now, I made
the above modification to DirectUpdateHandler2.

Note: for your information, we are committing after every 1000 posts.
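
One consequence of committing only every 1000 posts, sketched under the
assumption that a searcher sees only committed documents: a lookup-based
duplicate check would also have to consult the posts added since the last
commit, or it would miss duplicates that arrive within the same batch. The
plain-Java model below is illustrative only; BatchedDuplicateChecker, add,
visibleToSearcher and commitEvery are hypothetical names, and the two maps
stand in for the committed index and the uncommitted adds.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of duplicate checking with batched commits; not Solr code.
class BatchedDuplicateChecker {
    // Stand-in for what a searcher can see: id -> urlid of committed posts.
    private final Map<String, String> committed = new HashMap<>();
    // Posts added since the last commit, not yet visible to a searcher.
    private final Map<String, String> pending = new HashMap<>();
    private final int commitEvery;
    private int addsSinceCommit = 0;

    BatchedDuplicateChecker(int commitEvery) {
        this.commitEvery = commitEvery;
    }

    /** Adds a post; returns the urlid it replaces, or null if the id is new. */
    String add(String id, String urlid) {
        // Check the uncommitted adds first, then the committed index.
        String previous = pending.containsKey(id) ? pending.get(id) : committed.get(id);
        pending.put(id, urlid);
        addsSinceCommit++;
        if (addsSinceCommit >= commitEvery) {
            commit();
        }
        return previous;
    }

    /** True only for ids that a committed-only lookup would find. */
    boolean visibleToSearcher(String id) {
        return committed.containsKey(id);
    }

    private void commit() {
        committed.putAll(pending);
        pending.clear();
        addsSinceCommit = 0;
    }
}

class BatchedDuplicateDemo {
    public static void main(String[] args) {
        BatchedDuplicateChecker checker = new BatchedDuplicateChecker(1000);
        checker.add("doc1", "u100");
        // Still uncommitted, so a committed-only lookup would not find it.
        System.out.println("visible: " + checker.visibleToSearcher("doc1"));
        // The pending map still lets the second post be recognised as a duplicate.
        System.out.println("replaced urlid: " + checker.add("doc1", "u101"));
    }
}
```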



-- 
Yours,
S.Selvam