Hi all,

I have a workflow with several steps in my program. Let's say these are
steps A, B, C, and D. Step B produces some temp files on each executor node.
How can I add another step E that consumes these files?

I understand the easiest choice is to copy all these temp files to a
shared location, and then step E can create another RDD from it and work on
that. But I am trying to avoid this copy. I was wondering whether there is
any way I can queue up these files for E as they are being generated on the
executors. Is there any possibility of creating a dummy RDD at the start of
the program and then pushing these files into that RDD from each executor?

I take my inspiration from the concept of Side Outputs in Google Dataflow:

https://cloud.google.com/dataflow/model/par-do#emitting-to-side-outputs-in-your-dofn
Regards
Sumit Chawla
