Hi All,
I have two tables with the same schema but different data. I have to join the
tables on one column and then group by that same column.
The data in that column might not match exactly between the two tables. (Ex
- column name is "title". Table1.title = "doctor" and Table2. ti
You can achieve this the normal RDD way. Add one extra stage to the
pipeline that standardizes the values (for example, replacing
doc with doctor) in all the columns before the join.
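
A minimal sketch of that extra stage, written in plain Python rather than Spark so it runs on its own. The ALIASES table, the row layout (dicts with a "title" key), and the function names are assumptions for illustration, not something from this thread. In Spark the same standardize function could be applied in a map step on each RDD before the join.

```python
from collections import defaultdict

# Hypothetical alias table: variant spelling -> canonical title.
ALIASES = {"doc": "doctor"}

def standardize(title):
    """Trim, lowercase, and map known variants to the canonical title."""
    key = title.strip().lower()
    return ALIASES.get(key, key)

def join_and_group(table1, table2):
    """Inner-join two lists of row dicts on the standardized title,
    grouping the matching rows from each side under that title."""
    groups = defaultdict(lambda: ([], []))
    for row in table1:
        groups[standardize(row["title"])][0].append(row)
    for row in table2:
        groups[standardize(row["title"])][1].append(row)
    # Keep only titles that appear on both sides (inner join semantics).
    return {t: g for t, g in groups.items() if g[0] and g[1]}
```

With one row titled "doctor" in the first table and one titled "doc" in the second, both rows end up in the same "doctor" group, while titles present on only one side are dropped.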
Thanks
Best Regards
On Tue, Mar 15, 2016 at 9:16 AM, Suniti Singh
wrote:
> Hi All,
>
> I ha
Is it always the case that one title is a substring of another? -- Not
always. One title can have values like D.O.C, doctor_{areacode},
doc_{dep,areacode}
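
For values shaped like the ones above, a rule-based cleanup might be enough before falling back to a fuzzy join. This is only a sketch under assumptions: everything after the first underscore is treated as a suffix that is not part of the title, titles are assumed to be ASCII letters once punctuation is removed, and the ALIASES table and canonical_title name are hypothetical.

```python
import re

# Hypothetical alias table: cleaned-up variant -> canonical title.
ALIASES = {"doc": "doctor"}

def canonical_title(raw):
    """Reduce a raw title such as 'D.O.C' or 'doctor_415' to a canonical form."""
    # Assumption: the part after the first underscore is an area code or
    # department suffix, so it is dropped.
    base = raw.strip().lower().split("_", 1)[0]
    # Remove dots and any other non-letter characters.
    base = re.sub(r"[^a-z]", "", base)
    return ALIASES.get(base, base)
```

Under these assumptions "D.O.C", "doctor_415", and "doc_{cardio,415}" all reduce to "doctor". Titles that follow other patterns would need their own rules or an approximate string match.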
On Mon, Mar 14, 2016 at 10:39 PM, Wail Alkowaileet
wrote:
> I think you need some sort of fuzzy join?
> Is it always the case that one titl
The data in the title column is different, so correcting it
requires first finding out what the correct data is and then replacing it.
Finding the correct data could be tedious, but if some mechanism is in place
that can group the partially matched data, then it might help to do
the furt