Thanks for your answer, Rahul. I think the example below explains what I mean
by similarity, assuming the natural order.
I would assume this is a common task for people who use Solr and build
search-based systems.
I am basically looking for any design patterns that people use to achieve the
results shown in the example below.

Please do not take "join" too literally. It has to be a smart join, and I think
your approach seems like a step towards vectorizing each name. Thanks.
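
To make the identity-key idea concrete, here is a minimal sketch in plain
Python (not Solr). The NOISE word list and the identity_key helper are
hypothetical names made up for illustration; a real list of ignorable words
would have to come from the actual data.

```python
import re

# Hypothetical noise words; an assumption for illustration only.
NOISE = {"inc", "corp", "corporation", "co", "ltd", "llc", "and"}


def identity_key(name: str) -> str:
    """Reduce a company name to a normalized key of sorted tokens."""
    # Lowercase, split on anything that is not a letter or digit
    # (this also drops "&" and trailing dots).
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    # Drop noise words and sort so that word order does not matter.
    return " ".join(sorted(t for t in tokens if t not in NOISE))
```

With this key, "Ganguly & Ganguly Corp" and "Ganguly and Ganguly" map to the
same identity, while "Ganguly Corp" maps to a different one. It only catches
names that become identical after normalization, so it is a first step rather
than a similarity score.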

Are there any other ways that people have tackled such problems?
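
For reference, this is roughly the behaviour I am after, sketched outside Solr
with Python's standard difflib module. It is only an illustration under
assumptions: the similarity and best_matches helpers are hypothetical names,
the character-based ratio is just one possible similarity measure, and the
scores it produces will not equal the percentages in my example.

```python
from difflib import SequenceMatcher

# The two example datasets from the original question.
dataset1 = [(1, "Aroop Inc"), (2, "Ganguly & Ganguly Corp")]
dataset2 = [
    ("1K", "aroop and sons"),
    ("2K", "Ganguly Corp"),
    ("3K", "Ganguly and Ganguly"),
    ("2K", "Aroop Inc."),
    ("6K", "Ganguly Corporation"),
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive character similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matches(left, right):
    """For each row on the left, pick the most similar name on the right."""
    rows = []
    for id_, name in left:
        revenue, match = max(right, key=lambda r: similarity(name, r[1]))
        rows.append((id_, name, revenue, match, similarity(name, match)))
    return rows


for row in best_matches(dataset1, dataset2):
    print(row)
```

This compares every name against every other name, so it would not scale to
large datasets without some kind of candidate selection first, which is where
I was hoping Solr could help.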


> On Jul 15, 2018, at 2:51 PM, Rahul Singh <rahul.xavier.si...@gmail.com> wrote:
> 
> How do you define similarity? There are various methods that work for
> different scenarios. In Solr, depending on which index-time analyzer /
> tokenizer you are using, it will treat one company name as similar in one
> scenario and not in another.
> 
> This seems like a case of data deduplication; I’m pretty sure the join works
> on exact matches.
> 
> Consider creating an “identity” collection where you map the different names
> to a unique identity key. Technically, this could then be joined with each of
> the two datasets, and those results could then be joined again.
> 
> Rahul
> On Jul 11, 2018, 4:42 PM -0400, Aroop Ganguly 
> <aroopgang...@icloud.com.invalid>, wrote:
>> Hi Team
>> 
>> This is what I want to do:
>> 1. I have 2 datasets of the schema id-number and company-name
>> 2. I want to ultimately be able to link (by a join or any other means) the 2
>> data sets based on the similarity between the company-name fields of the 2
>> data sets.
>> 
>> Example:
>> 
>> Dataset 1
>> ---------
>> Id | Company Name
>> ---|------------------------
>> 1  | Aroop Inc
>> 2  | Ganguly & Ganguly Corp
>> 
>> 
>> Dataset 2
>> ---------
>> Yo Revenue | Company Name
>> -----------|---------------------
>> 1K         | aroop and sons
>> 2K         | Ganguly Corp
>> 3K         | Ganguly and Ganguly
>> 2K         | Aroop Inc.
>> 6K         | Ganguly Corporation
>> 
>> 
>> 
>> I want to be able to get a join in the end, based on a smart similarity 
>> score between the company names in the 2 data sets.
>> 
>> Final Dataset
>> Id | Company Name           | Revenue | Matched Company Name from Dataset2 | Similarity Score
>> ---|------------------------|---------|------------------------------------|-----------------
>> 1  | Aroop Inc              | 2K      | Aroop Inc.                         | 99%
>> 2  | Ganguly & Ganguly Corp | 3K      | Ganguly and Ganguly                | 75%
>> 
>> How should I proceed? (I have preprocessed the data sets to lowercase them
>> and remove non-essential words, such as pronouns and abbreviations like LTD
>> or Co.)
>> 
>> Thanks
>> Aroop
