You can just write the files out directly (and idempotently) from your map/mapPartitions functions. A map function is just arbitrary code that you run, after all.
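A minimal sketch of that idea, in plain Python so it runs without a cluster: this is the kind of function you could pass to `rdd.mapPartitionsWithIndex`. The directory name, file-naming scheme, and the `write_partition` helper are all illustrative assumptions, not something from the thread; the write-to-temp-then-rename step is what makes re-runs (task retries) idempotent.

```python
import os
import tempfile

# Assumed local directory on each executor; purely illustrative.
OUTPUT_DIR = os.path.join(tempfile.gettempdir(), "step_b_output")

def write_partition(index, records):
    """Write one partition's records to a deterministic file, idempotently.

    Writing to a temp file first and then renaming means a retried task
    either leaves the complete file in place or nothing at all.
    """
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    final_path = os.path.join(OUTPUT_DIR, "part-%05d.txt" % index)
    fd, tmp_path = tempfile.mkstemp(dir=OUTPUT_DIR)
    with os.fdopen(fd, "w") as f:
        for rec in records:
            f.write("%s\n" % rec)
    os.rename(tmp_path, final_path)  # atomic on POSIX within one filesystem
    # Yield the path so the driver (or a later step) can collect the list.
    yield final_path

# With a real SparkContext this would be wired up roughly as:
#   paths = rdd.mapPartitionsWithIndex(write_partition).collect()
```

Step E could then build a new RDD from the collected paths (e.g. with `sc.parallelize(paths)` plus a reader) rather than copying the files anywhere first.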
On Thu, Dec 15, 2016 at 11:33 AM, Chawla,Sumit <sumitkcha...@gmail.com> wrote:
> Any suggestions on this one?
>
> Regards
> Sumit Chawla
>
>
> On Tue, Dec 13, 2016 at 8:31 AM, Chawla,Sumit <sumitkcha...@gmail.com>
> wrote:
>
>> Hi All
>>
>> I have a workflow with different steps in my program. Let's say these are
>> steps A, B, C, D. Step B produces some temp files on each executor node.
>> How can I add another step E which consumes these files?
>>
>> I understand the easiest choice is to copy all these temp files to a
>> shared location, and then step E can create another RDD from them and
>> work on that. But I am trying to avoid this copy. I was wondering if
>> there is any way I can queue up these files for E as they are being
>> generated on the executors. Is there any possibility of creating a dummy
>> RDD at the start of the program, and then pushing these files into this
>> RDD from each executor?
>>
>> I take my inspiration from the concept of Side Outputs in Google Dataflow:
>>
>> https://cloud.google.com/dataflow/model/par-do#emitting-to-side-outputs-in-your-dofn
>>
>> Regards
>> Sumit Chawla