Yup I ended up doing just that thank you both On Sun, 12 Feb 2017 at 18:33, Miguel Morales <[email protected]> wrote:
> You can parallelize the collection of s3 keys and then pass that to your > map function so that files are read in parallel. > > Sent from my iPhone > > On Feb 12, 2017, at 9:41 AM, Sam Elamin <[email protected]> wrote: > > thanks Ayan but i was hoping to remove the dependency on a file and just > use in memory list or dictionary > > So from the reading I've done today it seems.the concept of a bespoke > async method doesn't really apply in spsrk since the cluster deals with > distributing the work load > > > Am I mistaken? > > Regards > Sam > On Sun, 12 Feb 2017 at 12:13, ayan guha <[email protected]> wrote: > > You can store the list of keys (I believe you use them in source file > path, right?) in a file, one key per line. Then you can read the file using > sc.textFile (So you will get a RDD of file paths) and then apply your > function as a map. > > r = sc.textFile(list_file).map(your_function) > > HTH > > On Sun, Feb 12, 2017 at 10:04 PM, Sam Elamin <[email protected]> > wrote: > > Hey folks > > Really simple question here. I currently have an etl pipeline that reads > from s3 and saves the data to an endstore > > > I have to read from a list of keys in s3 but I am doing a raw extract then > saving. Only some of the extracts have a simple transformation but overall > the code looks the same > > > I abstracted away this logic into a method that takes in an s3 path does > the common transformations and saves to source > > > But the job takes about 10 mins or so because I'm iteratively going down a > list of keys > > Is it possible to asynchronously do this? > > FYI I'm using spark.read.json to read from s3 because it infers my schema > > Regards > Sam > > > > > -- > Best Regards, > Ayan Guha > >
