On Thu, Jan 9, 2014 at 5:39 PM, Cristian Bichis <cri...@imagis.ro> wrote:
> Hi Mikhail,
>
> I've seen the deduplication part as well, but I have some concerns:
>
> 1. Is deduplication supposed to also work in a check-only request (one
> that does not actually try to add a new record to the index)? That is,
> can I just check whether there "could be" duplicates of some text?

That wiki page mentions a special signature field which is added to documents; try searching against it.

> 2. As far as I have seen, deduplication has some bottlenecks when
> comparing extremely similar items (e.g. just one character of
> difference). I can't find the pages mentioning this now, but I am
> concerned this might not be reliable.

I suppose that MD5Signature is sensitive to a single-character difference, while TextProfileSignature ignores small diffs. Try experimenting with both.

> Cristian
>
>> Hello Cristian,
>>
>> Have you seen http://wiki.apache.org/solr/Deduplication ?
>>
>> On Thu, Jan 9, 2014 at 5:01 PM, Cristian Bichis <cri...@imagis.ro> wrote:
>>
>>> Hi,
>>>
>>> I have one app where the search part is currently based on something
>>> other than Solr. However, as scale, demand, and complexity grow, I am
>>> looking at Solr as a potentially better fit, including for some
>>> features currently implemented in the scripting layer (so not part of
>>> search at the moment). I am not very familiar with Solr at this point;
>>> I am in an early evaluation stage.
>>>
>>> One of the current app features is to detect whether there are records
>>> in the index similar to a potential new record, and which records those
>>> are. In other words, to check for duplicates (which are not necessarily
>>> identical but would be very close to the original). The comparison is
>>> made on a description field, which can contain a couple of hundred
>>> words (and the words are NOT in English) for each record. Of course,
>>> the comparison could be made more complex in the future, comparing 2-3
>>> fields (a title, the description, additional keywords, etc.).
>>> Currently this feature is implemented directly in PHP using
>>> similar_text, which for us has an advantage over levenshtein because it
>>> gives a straight % match score, and we can decide whether a record is a
>>> duplicate based on the % score returned by similar_text (e.g. if over
>>> an 80% match, then it is a duplicate). The fact that I have a score
>>> (filtering limit) for each compared record helps me decide/tweak the
>>> threshold I consider the boundary between duplicates and non-duplicates
>>> (I may decide the comparison is too strict and lower the threshold to
>>> 75%).
>>>
>>> Using levenshtein (in PHP) would require additional processing, so the
>>> performance benefit would be lost to this overhead. Moreover, in the
>>> longer term, any PHP implementation of this feature would be a
>>> performance bottleneck, so it is not really a solution.
>>>
>>> I am looking to move this "slow" operation into a more efficient
>>> environment; that is why I considered moving this feature into the
>>> search layer.
>>>
>>> I want to know if anyone has an efficient (working) Solr-based solution
>>> for this case. I am not sure whether fuzzy search would be enough; I
>>> haven't made a test case for this (yet).
>>>
>>> Thank you,
>>> Cristian

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics
<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>
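The signature-based deduplication discussed above is configured as an update request processor chain in solrconfig.xml. A minimal sketch, following the conventions on the Deduplication wiki page (the field name `signature` and the choice of `TextProfileSignature` are illustrative; adjust to your schema, and set `overwriteDupes` to `false` if you only want to detect duplicates rather than replace them):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Field that receives the computed signature -->
    <str name="signatureField">signature</str>
    <!-- false: keep duplicates in the index; detect them by querying the signature field -->
    <bool name="overwriteDupes">false</bool>
    <!-- Field(s) the signature is computed from -->
    <str name="fields">description</str>
    <!-- TextProfileSignature is fuzzy; MD5Signature is exact -->
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The schema also needs a stored, indexed `signature` field; a check-only workflow can then search that field for an incoming document's signature before deciding whether to index it.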
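For comparison, the thresholding approach described in the original message can be sketched outside PHP as well. The example below is a hedged illustration in Python using `difflib.SequenceMatcher`, whose `ratio()` returns a 0..1 similarity score; the algorithm is not identical to PHP's `similar_text` percentage, but the decide-by-threshold logic is the same (the 80% cutoff comes from the message above):

```python
from difflib import SequenceMatcher


def is_duplicate(a: str, b: str, threshold: float = 0.80) -> bool:
    """Return True when the two texts score at least `threshold` similar.

    SequenceMatcher.ratio() gives a 0..1 score, comparable in spirit
    (though not in algorithm) to PHP's similar_text percentage.
    """
    return SequenceMatcher(None, a, b).ratio() >= threshold


# A one-character difference scores very high; unrelated text scores low.
print(is_duplicate("the quick brown fox", "the quick brown fax"))   # True
print(is_duplicate("the quick brown fox", "an entirely new text"))  # False
```

This is fine for one-off comparisons, but it still compares the candidate against every indexed record, which is exactly the scaling problem the signature-field approach avoids.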