How do you define similarity? Different similarity measures suit different use cases. In Solr, whether two company names are treated as similar depends on the index-time analyzer and tokenizer you use: the same pair of names can match under one analysis chain and not under another.
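As a rough illustration of why the definition matters, here is a minimal Python sketch (not Solr code) that scores the same pair of names with two different measures. The function names `char_similarity` and `token_jaccard` are hypothetical, and neither measure is meant as a recommendation; they only show that the answer depends on the definition.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Whitespace-token Jaccard similarity in [0, 1]."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty names are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


pair = ("Ganguly Corp", "Ganguly Corporation")
print(char_similarity(*pair))  # character-based score
print(token_jaccard(*pair))    # token-based score
```

For this pair the character-based score comes out fairly high, while the token-based score is low because only one whole token is shared. That is the same kind of disagreement you would see between two different analysis chains.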
This looks like a data deduplication problem. I'm fairly sure the join works only on exact matches. Consider creating an "identity" collection that maps the different names to a unique identity key. Technically, this collection could then be joined with each of the two datasets, and those results could be joined again.

Rahul

On Jul 11, 2018, 4:42 PM -0400, Aroop Ganguly <aroopgang...@icloud.com.invalid>, wrote:
> Hi Team
>
> This is what I want to do:
> 1. I have 2 datasets of the schema id-number and company-name
> 2. I want to ultimately be able to link (join or any other means) the 2 data
> sets based on the similarity between the company-name fields of the 2 data
> sets.
>
> Example:
>
> Dataset 1
>
> Id | Company Name
> ---|------------------------
> 1  | Aroop Inc
> 2  | Ganguly & Ganguly Corp
>
> Dataset 2
>
> Yo Revenue | Company Name
> -----------|---------------------
> 1K         | aroop and sons
> 2K         | Ganguly Corp
> 3K         | Ganguly and Ganguly
> 2K         | Aroop Inc.
> 6K         | Ganguly Corporation
>
> I want to be able to get a join in the end, based on a smart similarity score
> between the company names in the 2 data sets.
>
> Final Dataset
>
> Id | Company Name           | Revenue | Matched Company Name from Dataset2 | Similarity Score
> ---|------------------------|---------|------------------------------------|-----------------
> 1  | Aroop Inc              | 2K      | Aroop Inc.                         | 99%
> 2  | Ganguly & Ganguly Corp | 3K      | Ganguly and Ganguly                | 75%
>
> How should I proceed? (I have preprocessed the data sets to lowercase them and
> remove non-essential words like pronouns and acronyms like LTD or Co.)
>
> Thanks
> Aroop
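To make the identity-collection idea from the reply above concrete, here is a minimal Python sketch under stated assumptions. It is not Solr code. The `IDENTITY` map, its keys `C1` and `C2`, and the helper names are hypothetical, and the decision about which name variants belong together is taken by hand from the example tables in the question. In practice that mapping is the hard part and would have to come from whatever similarity definition you settle on.

```python
# Hypothetical identity map: each known name variant -> one identity key.
# The groupings mirror the "Final Dataset" example in the question.
IDENTITY = {
    "aroop inc": "C1",
    "aroop inc.": "C1",
    "ganguly & ganguly corp": "C2",
    "ganguly and ganguly": "C2",
}


def identity_key(name):
    """Return the identity key for a company name, or None if unmapped."""
    return IDENTITY.get(name.strip().lower())


# Rows copied from the example datasets in the question.
dataset1 = [(1, "Aroop Inc"), (2, "Ganguly & Ganguly Corp")]
dataset2 = [
    ("1K", "aroop and sons"),
    ("2K", "Ganguly Corp"),
    ("3K", "Ganguly and Ganguly"),
    ("2K", "Aroop Inc."),
    ("6K", "Ganguly Corporation"),
]


def join_on_identity(left, right):
    """Exact-match join of the two datasets on the shared identity key."""
    by_key = {}
    for revenue, name in right:
        key = identity_key(name)
        if key is not None:  # names with no identity are skipped
            by_key.setdefault(key, []).append((revenue, name))
    rows = []
    for id_, name in left:
        for revenue, matched in by_key.get(identity_key(name), []):
            rows.append((id_, name, revenue, matched))
    return rows


for row in join_on_identity(dataset1, dataset2):
    print(row)
```

The point of the design is that the fuzzy decision is made once, when the identity map is built, so the join itself can stay an exact match on the key.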