I think you need some sort of fuzzy join. Is it always the case that one title is a substring of the other?
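
If the substring assumption does hold, one possible approach is to map every title to a canonical form first, and then do an ordinary equality join and group by on that canonical title. Below is a minimal sketch of the idea in plain Python rather than Spark. The helper names (canonicalize, fuzzy_join_counts) and the sample rows are made up for illustration, and the sketch assumes the set of distinct titles is small enough to compare pairwise even when the tables themselves are large.

```python
from collections import defaultdict


def canonicalize(titles):
    """Map each title to the longest known title that contains it as a substring.

    Assumption: a short form such as "doc" is always a substring of its full
    form ("doctor"). If a short form matches several longer titles, the longest
    one wins, with ties broken alphabetically, which may not be what you want.
    """
    # Longest titles first, so the first containing title found is the longest.
    unique = sorted(set(titles), key=lambda s: (-len(s), s))
    # Every title contains itself, so next() always finds a match.
    return {t: next(c for c in unique if t in c) for t in unique}


def fuzzy_join_counts(table1, table2):
    """Inner-join two tables on canonical title and count rows per side."""
    mapping = canonicalize([row["title"] for row in table1 + table2])
    counts = defaultdict(lambda: [0, 0])
    for row in table1:
        counts[mapping[row["title"]]][0] += 1
    for row in table2:
        counts[mapping[row["title"]]][1] += 1
    # Keep only titles present in both tables (inner-join semantics).
    return {k: tuple(v) for k, v in counts.items() if v[0] and v[1]}


# Hypothetical sample data.
table1 = [{"title": "doctor"}, {"title": "doctor"}, {"title": "nurse"}]
table2 = [{"title": "doc"}, {"title": "engineer"}]
print(fuzzy_join_counts(table1, table2))
```

The pairwise comparison only runs over distinct titles, not over rows, so the expensive part stays small. In Spark, the resulting mapping could probably be kept as a small lookup table and joined against both large tables, but I have not tested that at TB scale.
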

On Tue, Mar 15, 2016 at 6:46 AM, Suniti Singh <suniti.si...@gmail.com> wrote:

> Hi All,
>
> I have two tables with same schema but different data. I have to join the
> tables based on one column and then do a group by the same column name.
>
> now the data in that column in two table might/might not exactly match.
> (Ex - column name is "title". Table1. title = "doctor" and Table2. title
> = "doc") doctor and doc are actually same titles.
>
> From performance point of view where i have data volume in TB , i am not
> sure if i can achieve this using the sql statement. What would be the best
> approach of solving this problem. Should i look for MLLIB apis?
>
> Spark Gurus any pointers?
>
> Thanks,
> Suniti

--
*Regards,*
Wail Alkowaileet