Thanks, but is nvl() in Spark 1.5?  I can't find it in spark.sql.functions 
(http://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.functions$)


Reading about the Oracle nvl function, it seems similar to the na 
functions.  I'm not sure it will help though, because what I need is to do a second 
join after the first join fails.
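
For what it's worth, the closest thing I can find in the 1.5 functions API is 
coalesce().  If I understand your suggestion, it would come out something like this 
(untested sketch, with made-up DataFrame and column names):

import org.apache.spark.sql.functions.{coalesce, lit}

// left-join both lookups once, then keep the first non-null value (nvl-style)
val out = driving
  .join(lookup1, driving("id") === lookup1("id"), "left_outer")
  .join(lookup2, driving("id") === lookup2("id"), "left_outer")
  .withColumn("resolved", coalesce(lookup1("col"), lookup2("col"), lit("some default")))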

________________________________
From: ayan guha <guha.a...@gmail.com>
Sent: Thursday, December 29, 2016 11:06 PM
To: Sesterhenn, Mike
Cc: user@spark.apache.org
Subject: Re: Best way to process lookup ETL with Dataframes

How about this -

select a.*, nvl(b.col,nvl(c.col,'some default'))
from driving_table a
left outer join lookup1 b on a.id = b.id
left outer join lookup2 c on a.id = c.id

?
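
Untested sketch of the same thing from Spark, assuming a HiveContext (nvl resolves as 
a Hive function there; with a plain SQLContext, coalesce should behave the same) and 
that your DataFrames are registered as temp tables:

driving.registerTempTable("driving_table")
lookup1DF.registerTempTable("lookup1")
lookup2DF.registerTempTable("lookup2")

val out = sqlContext.sql("""
  select a.*, nvl(b.col, nvl(c.col, 'some default')) as col
  from driving_table a
  left outer join lookup1 b on a.id = b.id
  left outer join lookup2 c on a.id = c.id
""")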

On Fri, Dec 30, 2016 at 9:55 AM, Sesterhenn, Mike 
<msesterh...@cars.com> wrote:

Hi all,


I'm writing an ETL process with Spark 1.5, and I was wondering about the best way to 
do something.


A lot of the fields I am processing require an algorithm similar to this:


Join input dataframe to a lookup table.

if (that lookup fails (the joined fields are null)) {

    Lookup into some other table to join some other fields.

}


With DataFrames, it seems the only way to do this is something like the following:


Join input dataframe to a lookup table.

if (that lookup fails (the joined fields are null)) {

   *SPLIT the dataframe into two DFs via DataFrame.filter()

      (one group with successful lookups, the other with failed ones).*

   For failed lookup:  {

       Lookup into some other table to grab some other fields.

   }

   *MERGE the dataframe splits back together via DataFrame.unionAll().*

}
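
In code that ends up looking roughly like this (heavily simplified; the real lookups 
involve multiple key and value columns):

import org.apache.spark.sql.functions.col

// first lookup
val joined = input.join(lookup1, input("id") === lookup1("id"), "left_outer")
  .select(input("id"), input("data"), lookup1("value").as("value"))

// split on whether the first lookup matched
val matched = joined.filter(col("value").isNotNull)
val missed  = joined.filter(col("value").isNull).drop("value")

// second lookup only for the rows that missed the first
val recovered = missed.join(lookup2, missed("id") === lookup2("id"), "left_outer")
  .select(missed("id"), missed("data"), lookup2("value").as("value"))

// merge the splits back together
val result = matched.unionAll(recovered)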


As you might imagine, I'm seeing some really large execution plans in the Spark UI, 
and the processing time seems way out of proportion to the size of the dataset 
(~250GB in 9 hours).


Is this the best approach to implement an algorithm like this?  Note also that 
some fields I am implementing require multiple staged split/merge steps due to 
cascading lookup joins.


Thanks,


Michael Sesterhenn

msesterh...@cars.com




--
Best Regards,
Ayan Guha
