I think you need some sort of fuzzy join.
Is it always the case that one title is a substring of the other?
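
If the substring assumption does hold, the matching rule could be sketched roughly as below. This is plain Python rather than Spark code, and the names (fuzzy_join, group_by_canonical, the sample rows) are made up for illustration. It assumes each row is a dict with a "title" field, and that the longer of two matching titles is the canonical one.

```python
from collections import defaultdict


def normalize(title):
    """Lower-case and trim a title so comparisons ignore case and padding."""
    return title.strip().lower()


def fuzzy_join(left, right):
    """Pair rows whose titles match when one is a substring of the other.

    Hypothetical helper: compares every left row with every right row.
    """
    pairs = []
    for l_row in left:
        l_title = normalize(l_row["title"])
        for r_row in right:
            r_title = normalize(r_row["title"])
            # Skip empty titles: "" is a substring of every string.
            if not l_title or not r_title:
                continue
            if l_title in r_title or r_title in l_title:
                pairs.append((l_row, r_row))
    return pairs


def group_by_canonical(pairs):
    """Group joined pairs under the longer (assumed canonical) title."""
    groups = defaultdict(list)
    for l_row, r_row in pairs:
        key = max(normalize(l_row["title"]), normalize(r_row["title"]), key=len)
        groups[key].append((l_row, r_row))
    return dict(groups)


# Made-up sample rows, only to show the shape of the data.
table1 = [{"title": "doctor", "n": 1}, {"title": "engineer", "n": 2}]
table2 = [{"title": "doc", "n": 10}, {"title": "eng", "n": 20},
          {"title": "nurse", "n": 30}]

joined = fuzzy_join(table1, table2)
grouped = group_by_canonical(joined)
```

The nested loop compares every pair of rows, so it only illustrates the rule. At TB scale you would more likely normalize each title to a shared key first, so that the join becomes an ordinary equality join followed by the group by.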

On Tue, Mar 15, 2016 at 6:46 AM, Suniti Singh <suniti.si...@gmail.com>
wrote:

> Hi All,
>
> I have two tables with the same schema but different data. I have to join the
> tables on one column and then do a group by on that same column.
>
> The data in that column might not match exactly between the two tables.
> (Example: the column name is "title", with Table1.title = "doctor" and
> Table2.title = "doc". "doctor" and "doc" are actually the same title.)
>
> From a performance point of view, with data volume in the TB range, I am not
> sure I can achieve this with a SQL statement. What would be the best
> approach to solving this problem? Should I look at the MLlib APIs?
>
> Spark gurus, any pointers?
>
> Thanks,
> Suniti
>
>
>


-- 

*Regards,*
Wail Alkowaileet
