On Thu, Jan 9, 2014 at 5:39 PM, Cristian Bichis <cri...@imagis.ro> wrote:
> Hi Mikhail,
>
> I've seen the deduplication part as well, but I have some concerns:
>
> 1. Is deduplication supposed to also work in a check-only request (one
> that does not actually try to add a new record to the index)? That is,
> can I just check whether there "could be" duplicates of some text?

That wiki page mentions a special signature field which is added to documents; try searching against it.

> 2. As far as I have seen, deduplication has some bottlenecks when
> comparing extremely similar items (e.g. just one character of
> difference). I can't find the pages mentioning this now, but I am
> concerned this might not be reliable.

I suppose that MD5Signature is sensitive to a single-character difference, while TextProfileSignature ignores small diffs. Try experimenting with both.

> Cristian
>
>> Hello Cristian,
>>
>> Have you seen http://wiki.apache.org/solr/Deduplication ?
>>
>> On Thu, Jan 9, 2014 at 5:01 PM, Cristian Bichis <cri...@imagis.ro> wrote:
>>
>>> Hi,
>>>
>>> I have one app where the search part is currently based on something
>>> other than Solr. However, as scale, demand, and complexity grow, I am
>>> looking at Solr as a potentially better fit, including for some
>>> features currently implemented in the scripting layer (so not part of
>>> search at the moment). I am not very familiar with Solr at this point;
>>> I am in an early evaluation stage.
>>>
>>> One of the current app features is to detect whether there are records
>>> in the index similar to a potential new record, and which records those
>>> are. In other words, to check for duplicates (which are not necessarily
>>> identical but would be very close to the original). The comparison is
>>> made on a description field, which can contain a couple of hundred
>>> words (and the words are NOT in English) for each record. Of course,
>>> the comparison could be made more complex in the future, comparing 2-3
>>> fields (a title, the description, additional keywords, etc.).
>>> Currently this feature is implemented directly in PHP using
>>> similar_text, which for us has an advantage over levenshtein because it
>>> gives a straight % match score, and we can decide whether a record is a
>>> duplicate based on the % score returned by similar_text (e.g. if over
>>> an 80% match, then it is a duplicate). The fact that I have a score
>>> (filtering limit) for each compared record helps me decide/tweak the
>>> threshold I consider the boundary between duplicates and non-duplicates
>>> (I may decide the comparison is too strict and lower the threshold to
>>> 75%).
>>>
>>> Using levenshtein (in PHP) would require additional processing, so the
>>> performance benefit would be lost to this overhead. Moreover, in the
>>> longer term, any PHP implementation of this feature would be a
>>> performance bottleneck, so it is not really a solution.
>>>
>>> I am looking to move this "slow" operation into a more efficient
>>> environment; that is why I considered moving this feature into the
>>> search layer.
>>>
>>> I want to know if anyone has an efficient (working) Solr-based solution
>>> for this case. I am not sure whether fuzzy search would be enough; I
>>> haven't made a test case for this (yet).
>>>
>>> Thank you,
>>> Cristian

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics
<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>
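The signature-based deduplication discussed above is configured as an update request processor chain in solrconfig.xml. A minimal sketch, following the conventions on the Deduplication wiki page (the field name `signature` and the choice of `TextProfileSignature` are illustrative; adjust to your schema, and set `overwriteDupes` to `false` if you only want to detect duplicates rather than replace them):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Field that receives the computed signature -->
    <str name="signatureField">signature</str>
    <!-- false: keep duplicates in the index; detect them by querying the signature field -->
    <bool name="overwriteDupes">false</bool>
    <!-- Field(s) the signature is computed from -->
    <str name="fields">description</str>
    <!-- TextProfileSignature is fuzzy; MD5Signature is exact -->
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The schema also needs a stored, indexed `signature` field; a check-only workflow can then search that field for an incoming document's signature before deciding whether to index it.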
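For comparison, the thresholding approach described in the original message can be sketched outside PHP as well. The example below is a hedged illustration in Python using `difflib.SequenceMatcher`, whose `ratio()` returns a 0..1 similarity score; the algorithm is not identical to PHP's `similar_text` percentage, but the decide-by-threshold logic is the same (the 80% cutoff comes from the message above):

```python
from difflib import SequenceMatcher


def is_duplicate(a: str, b: str, threshold: float = 0.80) -> bool:
    """Return True when the two texts score at least `threshold` similar.

    SequenceMatcher.ratio() gives a 0..1 score, comparable in spirit
    (though not in algorithm) to PHP's similar_text percentage.
    """
    return SequenceMatcher(None, a, b).ratio() >= threshold


# A one-character difference scores very high; unrelated text scores low.
print(is_duplicate("the quick brown fox", "the quick brown fax"))   # True
print(is_duplicate("the quick brown fox", "an entirely new text"))  # False
```

This is fine for one-off comparisons, but it still compares the candidate against every indexed record, which is exactly the scaling problem the signature-field approach avoids.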