How do you define similarity? Different methods suit different use cases. In 
Solr, depending on which index-time analyzer / tokenizer you are using, two 
company names may be treated as similar in one configuration and not in 
another.
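
To make that concrete, here is a rough sketch in plain Python (not Solr) of two 
ways to score the same pair of names. The function names are made up for this 
example, and neither measure is what Solr computes internally; the point is 
only that the choice of tokenization changes the answer.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated, lowercased tokens."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Same pair of names, two different notions of "similar".
print(char_similarity("Aroop Inc", "Aroop Inc."))  # high: one extra character
print(token_jaccard("Aroop Inc", "Aroop Inc."))    # low: "inc" != "inc."
```

With whitespace tokens, "Inc" and "Inc." are different terms, so the token 
score drops even though the strings differ by a single character.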

This seems like a case of data deduplication. I'm pretty sure the join only 
works on exact matches.

Consider creating an "identity" collection where you map the different names 
to a unique identity key. Each of the two datasets could then be joined to that 
collection on the key, and through it to each other.
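
A minimal sketch of the idea, again in plain Python rather than Solr: a dict 
stands in for the identity collection, and the join is an exact match on the 
derived key. The identity_key function and the SUFFIXES list are assumptions 
made up for this example; real data would need a more careful normalization.

```python
import re

# Hypothetical list of legal suffixes to ignore when building the key.
SUFFIXES = {"inc", "corp", "corporation", "ltd", "co"}


def identity_key(name: str) -> str:
    """Normalize a company name into a key usable for exact-match joins."""
    text = name.lower().replace("&", " and ")
    tokens = re.findall(r"[a-z0-9]+", text)
    return " ".join(t for t in tokens if t not in SUFFIXES)


def join_on_identity(left, right):
    """Join (id, name) rows to (revenue, name) rows on the identity key."""
    by_key = {}
    for revenue, name in right:
        by_key.setdefault(identity_key(name), []).append((revenue, name))
    return [
        (id_, name, revenue, matched)
        for id_, name in left
        for revenue, matched in by_key.get(identity_key(name), [])
    ]


dataset1 = [(1, "Aroop Inc"), (2, "Ganguly & Ganguly Corp")]
dataset2 = [
    ("1K", "aroop and sons"),
    ("2K", "Ganguly Corp"),
    ("3K", "Ganguly and Ganguly"),
    ("2K", "Aroop Inc."),
    ("6K", "Ganguly Corporation"),
]

# The "identity collection": every observed name mapped to its key.
identity = {name: identity_key(name) for _, name in dataset1 + dataset2}

for row in join_on_identity(dataset1, dataset2):
    print(row)
```

Because the join itself is exact, all of the fuzziness lives in how the key is 
built, which is the part you would tune.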

Rahul
On Jul 11, 2018, 4:42 PM -0400, Aroop Ganguly 
<aroopgang...@icloud.com.invalid>, wrote:
> Hi Team
>
> This is what I want to do:
> 1. I have 2 datasets of the schema id-number and company-name
> 2. I want to ultimately be able to link (join or any other means) the 2 data 
> sets based on the similarity between the company-name fields of the 2 data 
> set.
>
> Example:
>
> Dataset 1
> ————————
> Id | Company Name
> —| —————————————
> 1 | Aroop Inc
> 2 | Ganguly & Ganguly Corp
>
>
> Dataset 2
> ————————
> Yo Revenue | Company Name
> — ————— |————————
> 1K | aroop and sons
> 2K | Ganguly Corp
> 3K | Ganguly and Ganguly
> 2K | Aroop Inc.
> 6K | Ganguly Corporation
>
>
>
> I want to be able to get a join in the end, based on a smart similarity score 
> between the company names in the 2 data sets.
>
> Final Dataset
> Id | Company Name           | Revenue | Matched Company Name from Dataset2 | Similarity Score
> ---|------------------------|---------|------------------------------------|-----------------
> 1  | Aroop Inc              | 2K      | Aroop Inc.                         | 99%
> 2  | Ganguly & Ganguly Corp | 3K      | Ganguly and Ganguly                | 75%
>
> How should I proceed? (I have preprocessed the data sets to lowercase it and 
> remove non essential words like pronouns and acronyms like LTD or Co. )
>
> Thanks
> Aroop
