How about this?

select a.*,
       nvl(b.col, nvl(c.col, 'some default'))
from driving_table a
left outer join lookup1 b on a.id = b.id
left outer join lookup2 c on a.id = c.id
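In DataFrame terms that would be roughly the following (a minimal sketch against the Spark 1.5 Scala API; driving, lookup1 and lookup2 are hypothetical DataFrames for the tables above, and id/col stand in for your real fields; coalesce() is the functions-package equivalent of the nvl chain):

import org.apache.spark.sql.functions.{coalesce, lit}

// Two left outer joins, then fall through b.col -> c.col -> default.
// coalesce() returns its first non-null argument, like the nvl chain.
val result = driving
  .join(lookup1, driving("id") === lookup1("id"), "left_outer")
  .join(lookup2, driving("id") === lookup2("id"), "left_outer")
  .select(
    driving("*"),
    coalesce(lookup1("col"), lookup2("col"), lit("some default")).as("col"))

Everything stays in a single plan, so there is no filter/unionAll fork for the optimizer to deal with.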
On Fri, Dec 30, 2016 at 9:55 AM, Sesterhenn, Mike <msesterh...@cars.com> wrote:

> Hi all,
>
> I'm writing an ETL process with Spark 1.5, and I was wondering about the
> best way to do something.
>
> A lot of the fields I am processing require an algorithm similar to this:
>
>     Join the input dataframe to a lookup table.
>     if (that lookup fails (the joined fields are null)) {
>         Look up in some other table to join some other fields.
>     }
>
> With DataFrames, it seems the only way to do this is something like:
>
>     Join the input dataframe to a lookup table.
>     if (that lookup fails (the joined fields are null)) {
>         SPLIT the dataframe into two DFs via DataFrame.filter()
>         (one group with the successful lookup, the other failed).
>         For the failed lookups: {
>             Look up in some other table to grab some other fields.
>         }
>         MERGE the dataframe splits back together via DataFrame.unionAll().
>     }
>
> As you might imagine, I'm seeing some really large execution plans in the
> Spark UI, and the processing time seems way out of proportion with the
> size of the dataset (~250GB in 9 hours).
>
> Is this the best approach to implement an algorithm like this? Note also
> that some fields I am implementing require multiple staged split/merge
> steps due to cascading lookup joins.
>
> Thanks,
>
> Michael Sesterhenn
> msesterh...@cars.com

--
Best Regards,
Ayan Guha
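For comparison, the split/merge pattern described in the quoted message comes out roughly like this in Spark 1.5 Scala (a minimal sketch; input, lookup1, lookup2, id and col are hypothetical stand-ins for the real tables and fields):

// Stage 1: primary lookup.
val stage1 = input
  .join(lookup1, input("id") === lookup1("id"), "left_outer")
  .select(input("*"), lookup1("col").as("col"))

// SPLIT: rows where the lookup succeeded vs. rows where it came back null.
val hits   = stage1.filter(stage1("col").isNotNull)
val misses = stage1.filter(stage1("col").isNull).drop("col")

// Stage 2: fallback lookup, applied to the misses only.
val rescued = misses
  .join(lookup2, misses("id") === lookup2("id"), "left_outer")
  .select(misses("*"), lookup2("col").as("col"))

// MERGE: both branches now share a schema, so unionAll is legal.
val merged = hits.unionAll(rescued)

Each filter() branch carries its own copy of the upstream plan, and every cascading lookup forks it again, which is consistent with the very large execution plans in the UI; the single-plan join/coalesce version avoids the fork entirely.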