Re: Compare a column in two different tables/find the distance between column data
The data in the title column differs between the two tables, so correcting it requires finding out what the correct value is and then replacing the variants. Finding the correct value could be tedious, but a mechanism that groups partially matching values would help with the further processing. I am kind of stuck.

On Tue, Mar 15, 2016 at 10:50 AM, Suniti Singh wrote:

> Is it always the case that one title is a substring of another? -- Not
> always. One title can have values like D.O.C, doctor_{areacode},
> doc_{dep,areacode}
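One way to group partially matching values is to measure the edit distance between each title and a short list of known-good titles. The following is a minimal sketch in plain Python, not a Spark job: the sample titles, the canonical list, and the `max_distance` threshold of 3 are all assumptions chosen for illustration, and `group_by_closest` is a hypothetical helper name.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def group_by_closest(titles, canonical, max_distance=3):
    """Assign each title to the nearest canonical title, or to None if too far."""
    groups = defaultdict(list)
    for title in titles:
        lowered = title.lower()
        best = min(canonical, key=lambda c: levenshtein(lowered, c))
        key = best if levenshtein(lowered, best) <= max_distance else None
        groups[key].append(title)
    return dict(groups)


# Hypothetical sample values, loosely based on the examples in this thread.
groups = group_by_closest(
    ["doc", "doctor", "Doctr", "nurse", "nurs", "pilot"],
    canonical=["doctor", "nurse"],
)
```

Values that land under the `None` key are the ones that would still need manual review. Comparing every title against a small canonical list scales far better than comparing all pairs of titles. Spark SQL also ships a built-in `levenshtein` column function, which may be a better fit at TB scale than a Python loop.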
Re: Compare a column in two different tables/find the distance between column data
> Is it always the case that one title is a substring of another?

Not always. One title can have values like D.O.C, doctor_{areacode}, doc_{dep,areacode}.

On Mon, Mar 14, 2016 at 10:39 PM, Wail Alkowaileet wrote:

> I think you need some sort of fuzzy join?
> Is it always the case that one title is a substring of another?
>
> Regards,
> Wail Alkowaileet
Re: Compare a column in two different tables/find the distance between column data
You can achieve this the normal RDD way. Add one extra stage to the pipeline that properly standardizes all the values (for example, replacing "doc" with "doctor") for all the columns before the join.

Thanks
Best Regards

On Tue, Mar 15, 2016 at 9:16 AM, Suniti Singh wrote:

> Hi All,
>
> I have two tables with same schema but different data. I have to join the
> tables based on one column and then do a group by the same column name.
>
> now the data in that column in two table might/might not exactly match.
> (Ex - column name is "title". Table1.title = "doctor" and Table2.title
> = "doc") doctor and doc are actually same titles.
>
> From performance point of view where i have data volume in TB, i am not
> sure if i can achieve this using the sql statement. What would be the best
> approach of solving this problem. Should i look for MLLIB apis?
>
> Spark Gurus any pointers?
>
> Thanks,
> Suniti
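The standardize-then-join idea can be sketched in plain Python as follows. This is only an illustration under assumptions: the `ALIASES` table, the `standardize_title` helper, and the sample rows are hypothetical, and the suffix handling assumes variants look like the `doctor_{areacode}` examples mentioned in this thread.

```python
import re
from collections import Counter

# Hypothetical alias table; in practice it would come from a reviewed mapping.
ALIASES = {"doc": "doctor", "dr": "doctor"}


def standardize_title(raw: str) -> str:
    """Map variants such as 'D.O.C' or 'doc_408' to one canonical title."""
    base = raw.strip().lower().split("_", 1)[0]  # drop '_{areacode}' style suffixes
    base = re.sub(r"[^a-z]", "", base)           # drop dots and other punctuation
    return ALIASES.get(base, base)


# Hypothetical rows of (title, payload) from the two tables.
table1 = [("doctor", "a"), ("D.O.C", "b"), ("nurse", "c")]
table2 = [("doc", "x"), ("doctor_408", "y"), ("doc_er,408", "z"), ("nurse", "w")]

# Extra stage: standardize the join column on both sides.
left = [(standardize_title(t), v) for t, v in table1]
right = [(standardize_title(t), v) for t, v in table2]

# Join on the standardized title, then count joined rows per title.
joined = [(lt, lv, rv) for lt, lv in left for rt, rv in right if lt == rt]
counts = Counter(title for title, _, _ in joined)
```

In a Spark pipeline the same function could be applied in a map stage over each RDD, or registered as a UDF for DataFrames, so that the join itself stays an ordinary equality join. The nested-loop join here is only for readability and is not how a large join would be executed.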
Compare a column in two different tables/find the distance between column data
Hi All,

I have two tables with the same schema but different data. I have to join the tables on one column and then do a group by on that same column.

The data in that column might or might not match exactly between the two tables. (Ex - the column name is "title". Table1.title = "doctor" and Table2.title = "doc".) "doctor" and "doc" are actually the same title.

From a performance point of view, with data volume in the TB range, I am not sure I can achieve this using a SQL statement. What would be the best approach to solving this problem? Should I look at the MLlib APIs?

Spark gurus, any pointers?

Thanks,
Suniti