You can achieve this the regular RDD way. Add one extra stage to the
pipeline that standardizes all the values (for example, replacing "doc"
with "doctor") in all the columns before the join.
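
To make the extra stage concrete, here is a minimal sketch in plain Python.
It is only an illustration under assumptions: the synonym table and the
helper names (TITLE_SYNONYMS, standardize_title, standardize_rows,
join_and_count) are hypothetical, and the join/group-by is emulated locally
on small lists rather than run on a cluster.

```python
from collections import defaultdict

# Hypothetical synonym table; a real one would come from domain knowledge.
TITLE_SYNONYMS = {"doc": "doctor", "dr": "doctor"}


def standardize_title(title):
    """Map a raw title to its canonical form: trim, lowercase, synonym lookup."""
    key = title.strip().lower()
    return TITLE_SYNONYMS.get(key, key)


def standardize_rows(rows):
    """The extra stage: re-key each (title, value) row by its canonical title."""
    return [(standardize_title(title), value) for title, value in rows]


def join_and_count(left, right):
    """Local stand-in for join + group by: count joined pairs per title."""
    right_by_title = defaultdict(list)
    for title, value in right:
        right_by_title[title].append(value)
    counts = defaultdict(int)
    for title, _ in left:
        counts[title] += len(right_by_title.get(title, []))
    # Drop titles that matched nothing, as an inner join would.
    return {title: n for title, n in counts.items() if n > 0}


# Toy data standing in for the two tables (assumed shape: (title, value)).
table1 = [("doctor", 1), ("nurse", 2)]
table2 = [("doc", 10), ("Doc ", 11), ("nurse", 20)]

result = join_and_count(standardize_rows(table1), standardize_rows(table2))
print(result)
```

In PySpark the same standardize_title function could be applied to each
table with rdd.map(lambda row: (standardize_title(row[0]), row[1])) before
calling join on the re-keyed RDDs, so the join itself stays a plain
equality join.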

Thanks
Best Regards

On Tue, Mar 15, 2016 at 9:16 AM, Suniti Singh <suniti.si...@gmail.com>
wrote:

> Hi All,
>
> I have two tables with the same schema but different data. I have to join
> the tables on one column and then group by that same column.
>
> The data in that column might not match exactly between the two tables.
> (Example: the column name is "title". Table1.title = "doctor" and
> Table2.title = "doc". "doctor" and "doc" are actually the same title.)
>
> From a performance point of view, with data volume in the TB range, I am
> not sure whether I can achieve this using a SQL statement. What would be
> the best approach to solving this problem? Should I look at the MLlib APIs?
>
> Spark gurus, any pointers?
>
> Thanks,
> Suniti
>
>
>
