Re: How to run the solr dedup for the document which match 80% or match almost.
Hi Lance,

This is slightly out of context, but I still want to ask you. I implemented the TextProfileSignature dedupe as suggested, but I came across something weird while implementing it. I am testing it with two documents and trying to index them. Please see the content below:

<<<<<<<<>>>>>>>>>>
I bought a Toyota Camry in 2007. After driven 6km, Test02 my engine oil light starts flash after change engine oil and just drive 5000Km during I use brake. I went to Toyota to ask a , it is said the normal engine Test03 oil consumption is 0.4 to 0.5L/1000Km. Test04 If so, Toyota recommends 6000Km for each engine oil change. If so, after driving 6000Km,Test05 the engine oil consumption is 3Litre. But each time, the dealer just put 4 Litre oil in. That means there is just 1 Litre in engine after driving 6000Km. Test06 Does anybody have standard engine oil consumption? As I searched, even in some undeveloped countries, it is just 0.3Litre/1000Km.
<<<<<<<<>>>>>>>>>>

If I keep adding distinct test words (test01, test02, test03, and so on) to the second document, Solr still recognizes the second document as a duplicate. But if I add any test word more than once (for example test11 or test07), the document count becomes 2 and the dedupe no longer works.

1) Is this the default behavior, or is there something to fix?
2) Can you please also tell me what the threshold limit for dedupe is?
3) "QUANT = QUANT_RATE * maxFreq, where QUANT_RATE is 0.01f by default, and maxFreq is the maximum token frequency. If maxFreq is higher than 1, then QUANT is always higher than 2."
Can you please clarify the explanation above? I mean, if QUANT_RATE = 0.01f and the frequency is less than 100, how can QUANT be an integer?

Regards,
Vibhor

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-run-the-solr-dedup-for-the-document-which-match-80-or-match-almost-tp3614239p3628221.html
Sent from the Solr - User mailing list archive at Nabble.com.
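
The behaviour described in the message above is what one would expect from a quantized token profile. Below is a minimal Python sketch of that idea, not Solr's actual source. It assumes that QUANT is rounded to an integer and raised to 2 whenever maxFreq is greater than 1 (one reading of the quoted explanation), that tokens are lowercased runs of letters and digits, and that very short tokens are skipped. The names `quant_for`, `quantized_profile`, `signature` and the `min_token_len` cutoff are illustrative only.

```python
import hashlib
from collections import Counter


def quant_for(max_freq, quant_rate=0.01):
    """Integer QUANT for a given maximum token frequency (assumed rounding)."""
    quant = int(max_freq * quant_rate + 0.5)  # round half up to an integer
    if quant < 2:
        # Assumed floor: 2 when any token repeats, otherwise 1.
        quant = 2 if max_freq > 1 else 1
    return quant


def quantized_profile(text, quant_rate=0.01, min_token_len=2):
    """Return (token, quantized count) pairs, most frequent first."""
    tokens, current = [], []
    for ch in text + " ":  # trailing space flushes the last token
        if ch.isalnum():
            current.append(ch.lower())
        else:
            if len(current) > min_token_len:  # assumed cutoff for short tokens
                tokens.append("".join(current))
            current = []
    counts = Counter(tokens)
    max_freq = max(counts.values(), default=0)
    quant = quant_for(max_freq, quant_rate)
    profile = {}
    for token, count in counts.items():
        rounded = (count // quant) * quant  # round down to a multiple of QUANT
        if rounded >= quant:  # counts below QUANT are discarded
            profile[token] = rounded
    return sorted(profile.items(), key=lambda item: (-item[1], item[0]))


def signature(text):
    """MD5 over the quantized profile; equal profiles give equal signatures."""
    lines = "\n".join(f"{token} {count}" for token, count in quantized_profile(text))
    return hashlib.md5(lines.encode("utf-8")).hexdigest()


base = "engine oil engine oil toyota toyota"
same = signature(base) == signature(base + " test01 test02")
changed = signature(base) != signature(base + " test07 test07")
```

Under these assumptions, words that occur only once (test01, test02) fall below QUANT and never reach the profile, so the signature is unchanged. A word that occurs twice (test07 test07) survives quantization and changes the signature. This would also mean there is no percentage threshold as such: two documents either produce the same profile or they do not.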
Re: How to run the solr dedup for the document which match 80% or match almost.
Hi,

I implemented the TextProfileSignature dedupe as suggested, but I came across something weird while implementing it. I am testing it with two documents and trying to index them. Please see the content below:

<<<<<<<<>>>>>>>>>>
I bought a Toyota Camry in 2007. After driven 6km, Test02 my engine oil light starts flash after change engine oil and just drive 5000Km during I use brake. I went to Toyota to ask a , it is said the normal engine Test03 oil consumption is 0.4 to 0.5L/1000Km. Test04 If so, Toyota recommends 6000Km for each engine oil change. If so, after driving 6000Km,Test05 the engine oil consumption is 3Litre. But each time, the dealer just put 4 Litre oil in. That means there is just 1 Litre in engine after driving 6000Km. Test06 Does anybody have standard engine oil consumption? As I searched, even in some undeveloped countries, it is just 0.3Litre/1000Km.
<<<<<<<<>>>>>>>>>>

If I keep adding distinct test words (test01, test02, test03, and so on) to the second document, Solr still recognizes the second document as a duplicate. But if I add any test word more than once (for example test11 or test07), the document count becomes 2 and the dedupe no longer works.

1) Is this the default behavior, or is there something to fix?
2) Can you please also tell me what the threshold limit for dedupe is?
3) "QUANT = QUANT_RATE * maxFreq, where QUANT_RATE is 0.01f by default, and maxFreq is the maximum token frequency. If maxFreq is higher than 1, then QUANT is always higher than 2."
Can you please clarify the explanation above? I mean, if QUANT_RATE = 0.01f and the frequency is less than 100, how can QUANT be an integer?

Regards,
Vibhor
Re: How to run the solr dedup for the document which match 80% or match almost.
You would have to implement this yourself in your indexing code. Solr has an analysis plugin which does the analysis for your text and then returns the result, but does not query or index. You can use this to calculate the fuzzy hash, then search against the index. You might be able to code this in an UpdateRequestProcessor.

On Tue, Dec 27, 2011 at 9:45 PM, vibhoreng04 wrote:
> Hi Shashi,
>
> That's correct! But I need something for index-time comparison. Can cosine
> similarity compare the incrementally indexed files against the already
> indexed documents?
>
> Regards,
> Vibhor

--
Lance Norskog
goks...@gmail.com
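
The flow suggested above (compute a fuzzy hash, search the index for it, and only then index) could look roughly like the following Python sketch. Everything here is hypothetical: `InMemoryIndex` is a stand-in for the Solr index, `toy_signature` is a deliberately crude stand-in for a real fuzzy hash, and none of these names correspond to Solr classes or to the UpdateRequestProcessor API.

```python
import hashlib


def toy_signature(text):
    """Stand-in fuzzy hash: MD5 over the sorted set of words that repeat."""
    words = text.lower().split()
    repeated = sorted({word for word in words if words.count(word) >= 2})
    return hashlib.md5(" ".join(repeated).encode("utf-8")).hexdigest()


class InMemoryIndex:
    """Hypothetical stand-in for the index, keyed by signature."""

    def __init__(self):
        self.docs_by_signature = {}

    def find(self, sig):
        return self.docs_by_signature.get(sig)

    def add(self, sig, doc):
        self.docs_by_signature[sig] = doc


def index_unless_duplicate(index, doc, signature_fn=toy_signature):
    """Index doc unless a document with the same signature already exists.

    Returns (indexed, doc_id), where doc_id is the id of the newly indexed
    document or of the existing document that blocked it.
    """
    sig = signature_fn(doc["content"])
    existing = index.find(sig)
    if existing is not None:
        return False, existing["id"]
    index.add(sig, doc)
    return True, doc["id"]


index = InMemoryIndex()
first = index_unless_duplicate(index, {"id": "a", "content": "oil oil engine engine light"})
second = index_unless_duplicate(index, {"id": "b", "content": "engine oil engine oil brake"})
third = index_unless_duplicate(index, {"id": "c", "content": "brake brake pads pads"})
```

In this sketch the second document is rejected because it shares its repeated words with the first, while the third is indexed. In a real setup the lookup step would be a query against the signature field rather than a dictionary access.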
Re: How to run the solr dedup for the document which match 80% or match almost.
Hi Shashi,

That's correct! But I need something for index-time comparison. Can cosine similarity compare the incrementally indexed files against the already indexed documents?

Regards,
Vibhor
Re: How to run the solr dedup for the document which match 80% or match almost.
You can also look at cosine similarity (or related metrics) to measure document similarity.

On Tue, Dec 27, 2011 at 6:51 AM, vibhoreng04 wrote:
> Hi iorixxx,
>
> Thanks for the quick update. I hope I can take it from here!
>
> Regards,
> Vibhor
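
As an illustration of the cosine similarity suggestion, here is a minimal Python sketch over raw term-frequency vectors. The whitespace tokenization, the function names, and the 0.8 cutoff (chosen only to mirror the "80% match" in the subject line) are assumptions for the example, not anything Solr provides out of the box.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    freq_a = Counter(text_a.lower().split())
    freq_b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(freq_a[term] * freq_b[term] for term in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def is_near_duplicate(text_a, text_b, threshold=0.8):
    """True when the similarity reaches the (assumed) threshold."""
    return cosine_similarity(text_a, text_b) >= threshold


score = cosine_similarity("engine oil light", "engine oil brake")
```

Unlike a signature comparison, this gives a graded score between 0 and 1, so a cutoff such as 0.8 or 0.9 can be applied directly. The cost is that each new document has to be compared against candidate documents rather than looked up by a single hash.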
Re: How to run the solr dedup for the document which match 80% or match almost.
Hi iorixxx,

Thanks for the quick update. I hope I can take it from here!

Regards,
Vibhor
Re: How to run the solr dedup for the document which match 80% or match almost.
> I am doing dedup for my Solr instance, which works on the content and the
> url fields. My question is: if I want to eliminate the records which are
> 80% or 90% matching in the content field, how should I proceed?
> I have already changed the part of my solrconfig.xml which is required for
> the dedup (the update request processor chain), and that part is working
> fine.

You can use TextProfileSignature, which is a fuzzy hashing implementation, instead of Lookup3Signature.
How to run the solr dedup for the document which match 80% or match almost.
Hi,

I am doing dedup for my Solr instance, which works on the content and the url fields. My question is: if I want to eliminate the records which are 80% or 90% matching in the content field, how should I proceed?

I have already changed the part of my solrconfig.xml which is required for the dedup (the update request processor chain), and that part is working fine.

Regards,
Vibhor