Thanks Ayan, but I was hoping to remove the dependency on a file and just
use an in-memory list or dictionary.

From the reading I've done today, it seems the concept of a bespoke async
method doesn't really apply in Spark, since the cluster deals with
distributing the workload.


Am I mistaken?
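For reference, a rough sketch of what the in-memory version might look like,
assuming spark is an existing SparkSession and the bucket/key names below are
made-up placeholders:

keys = [
    "s3a://my-bucket/raw/2017/02/11/",
    "s3a://my-bucket/raw/2017/02/12/",
]

# spark.read.json accepts a list of paths, so if every key is read the
# same way, one call covers the whole in-memory list in a single job.
df = spark.read.json(keys)

# The in-memory analogue of the sc.textFile suggestion below would be to
# parallelize the list directly instead of reading it from a file.
paths_rdd = spark.sparkContext.parallelize(keys)

Whether the per-key work can run inside a map over that RDD depends on what
the function does; anything that calls back into spark.read has to stay on
the driver.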

Regards
Sam
On Sun, 12 Feb 2017 at 12:13, ayan guha <guha.a...@gmail.com> wrote:

You can store the list of keys (I believe you use them in the source file
path, right?) in a file, one key per line. Then you can read the file using
sc.textFile (so you will get an RDD of file paths) and then apply your
function as a map.

r = sc.textFile(list_file).map(your_function)
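Spelled out a little more, a minimal sketch of that flow; the list file
location and the body of your_function are placeholders, not code from the
actual pipeline:

list_file = "s3a://my-bucket/config/keys.txt"   # one S3 path per line

def your_function(path):
    # Runs on the executors, so only plain Python is available here;
    # spark.read and other SparkSession calls only work on the driver.
    # The real body would do the per-key work; this just echoes the path.
    return path.strip()

r = sc.textFile(list_file).map(your_function)
print(r.collect())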

HTH

On Sun, Feb 12, 2017 at 10:04 PM, Sam Elamin <hussam.ela...@gmail.com>
wrote:

Hey folks

Really simple question here. I currently have an ETL pipeline that reads
from S3 and saves the data to an end store.


I have to read from a list of keys in S3, but I am doing a raw extract and
then saving. Only some of the extracts need a simple transformation, but
overall the code looks the same.


I abstracted this logic away into a method that takes an S3 path, does the
common transformations, and saves the output to the end store.
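Presumably something along these lines; this is only a guess at the shape of
that method, with the transformation and output location as placeholders:

def extract_and_save(s3_path, output_path):
    # Hypothetical version of the abstracted method: read one key,
    # apply the common transformation, save to the end store.
    df = spark.read.json(s3_path)           # schema is inferred
    transformed = df.dropDuplicates()       # stand-in for the real transformation
    transformed.write.mode("overwrite").parquet(output_path)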


But the job takes about 10 minutes or so because I'm iterating over the list
of keys one at a time.

Is it possible to do this asynchronously?

FYI, I'm using spark.read.json to read from S3 because it infers my schema.
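A rough sketch of one way to overlap the per-key jobs is to submit them from
several driver threads with concurrent.futures; the keys and the
extract_and_save helper here are the hypothetical placeholders from the
sketch above:

from concurrent.futures import ThreadPoolExecutor

keys = ["s3a://my-bucket/raw/a/", "s3a://my-bucket/raw/b/"]   # placeholder paths

def run_one(path):
    # Each call triggers its own Spark job; jobs submitted from separate
    # driver threads can be scheduled concurrently by the cluster.
    extract_and_save(path, path.replace("/raw/", "/clean/"))

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(run_one, keys))

How much the jobs actually overlap depends on cluster resources and the
scheduler mode (spark.scheduler.mode=FAIR makes them share the cluster more
evenly).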

Regards
Sam




-- 
Best Regards,
Ayan Guha
