Thanks for your answer, Rahul. I think I have explained similarity with the example, assuming the natural order. I would assume this is a common task for people who use Solr and build search-based systems. I am basically looking for any design patterns that people use to achieve the results shown in the example below.
Please do not take "join" too literally. It has to be a smart join, and I think your approach seems like a step towards vectorizing each name. Thanks. Are there any other ways that people have tackled such problems?

> On Jul 15, 2018, at 2:51 PM, Rahul Singh <rahul.xavier.si...@gmail.com> wrote:
>
> How do you define similarity? There are various different methods that work
> for different methods. In Solr, depending on which index-time analyzer /
> tokenizer you are using, it will treat one company name as similar in one
> scenario and not in another.
>
> This seems like a case of data deduplication; the join, I'm pretty sure, works
> on exact matches.
>
> Consider creating an "identity" collection where you map the different names
> to a unique identity key. This could then technically be joined on two
> datasets and then those could be joined again.
>
> Rahul
> On Jul 11, 2018, 4:42 PM -0400, Aroop Ganguly
> <aroopgang...@icloud.com.invalid>, wrote:
>> Hi Team
>>
>> This is what I want to do:
>> 1. I have 2 datasets of the schema id-number and company-name
>> 2. I want to ultimately be able to link (join or any other means) the 2
>> datasets based on the similarity between the company-name fields of the 2
>> datasets.
>>
>> Example:
>>
>> Dataset 1
>> ---------
>> Id | Company Name
>> ---|-----------------------
>> 1  | Aroop Inc
>> 2  | Ganguly & Ganguly Corp
>>
>>
>> Dataset 2
>> ---------
>> Yo Revenue | Company Name
>> -----------|--------------------
>> 1K         | aroop and sons
>> 2K         | Ganguly Corp
>> 3K         | Ganguly and Ganguly
>> 2K         | Aroop Inc.
>> 6K         | Ganguly Corporation
>>
>>
>> I want to be able to get a join in the end, based on a smart similarity
>> score between the company names in the 2 datasets.
>>
>> Final Dataset
>> -------------
>> Id | Company Name           | Revenue | Matched Company Name from Dataset2 | Similarity Score
>> ---|------------------------|---------|------------------------------------|-----------------
>> 1  | Aroop Inc              | 2K      | Aroop Inc.                         | 99%
>> 2  | Ganguly & Ganguly Corp | 3K      | Ganguly and Ganguly                | 75%
>>
>> How should I proceed? (I have preprocessed the datasets to lowercase them and
>> remove non-essential words like pronouns and acronyms like LTD or Co.)
>>
>> Thanks
>> Aroop
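
For reference, the kind of fuzzy join described in the example tables above can be sketched outside Solr in plain Python. This is only a minimal illustration, not a Solr feature and not necessarily the approach anyone in this thread uses. It assumes the standard library's `difflib.SequenceMatcher` as the similarity measure, and the names `LEGAL_SUFFIXES`, `normalize`, `similarity`, `fuzzy_join` and the `threshold` value are hypothetical. Because the sketch strips tokens such as "Inc" and "Corp" before comparing, its scores will not reproduce the illustrative 99% and 75% figures in the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal-form tokens to ignore; extend it for real data.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "ltd", "co", "llc"}


def normalize(name):
    """Lowercase, spell out '&', strip punctuation, drop legal-form tokens."""
    text = name.lower().replace("&", " and ")
    tokens = re.findall(r"[a-z0-9]+", text)
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def similarity(a, b):
    """Return a score in [0, 1] between two company names."""
    na, nb = normalize(a), normalize(b)
    if not na or not nb:
        # Nothing left to compare after normalization.
        return 0.0
    return SequenceMatcher(None, na, nb).ratio()


def fuzzy_join(left, right, threshold=0.8):
    """Pair each (id, name) in left with its best (revenue, name) in right.

    Rows whose best score falls below the (assumed) threshold keep the
    score but get None for the matched revenue and name.
    """
    rows = []
    for left_id, left_name in left:
        best = max(right, key=lambda r: similarity(left_name, r[1]))
        score = similarity(left_name, best[1])
        if score >= threshold:
            rows.append((left_id, left_name, best[0], best[1], score))
        else:
            rows.append((left_id, left_name, None, None, score))
    return rows


dataset1 = [(1, "Aroop Inc"), (2, "Ganguly & Ganguly Corp")]
dataset2 = [
    ("1K", "aroop and sons"),
    ("2K", "Ganguly Corp"),
    ("3K", "Ganguly and Ganguly"),
    ("2K", "Aroop Inc."),
    ("6K", "Ganguly Corporation"),
]

for row in fuzzy_join(dataset1, dataset2):
    print(row)
```

This compares every name in one dataset against every name in the other, which is fine for small inputs only. For larger datasets, a common variation is to first fetch a handful of candidates per name from a search index and score only those.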