Hi all, I have a workflow with several steps in my program; let's call them steps A, B, C, and D. Step B produces some temp files on each executor node. How can I add another step E that consumes these files?
I understand the easiest option is to copy all these temp files to some shared location and then have step E create another RDD from them and work on that. But I am trying to avoid this copy. I was wondering if there is any way I can queue up these files for E as they are generated on the executors. Is there any possibility of creating a dummy RDD at the start of the program and then pushing these files into that RDD from each executor?

I take my inspiration from the concept of side outputs in Google Dataflow: https://cloud.google.com/dataflow/model/par-do#emitting-to-side-outputs-in-your-dofn

Regards
Sumit Chawla
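For context, one common way to emulate Dataflow-style side outputs on an RDD is to have step B emit tagged records (main output plus side output) from a single flatMap, cache the result, and then split it with two filters, rather than writing temp files on the executors at all. Below is a minimal pure-Python sketch of that pattern; all names (`MAIN`, `SIDE`, `step_b`, `run`) are hypothetical, and the Spark-equivalent calls are only indicated in comments.

```python
# Sketch of emulating Dataflow-style side outputs with tagged records.
# In Spark this would be rdd.flatMap(step_b).cache() followed by two
# filters; plain Python lists are used here so the pattern stands alone.
# All names below are illustrative, not a real Spark or Dataflow API.

MAIN, SIDE = "main", "side"

def step_b(record):
    """Emit the main result plus a tagged side output, instead of
    writing a temp file on the executor."""
    yield (MAIN, record * 2)            # normal output of step B
    yield (SIDE, f"temp-for-{record}")  # what would have gone to a temp file

def run(records):
    # Equivalent to: tagged = rdd.flatMap(step_b).cache()
    tagged = [out for r in records for out in step_b(r)]
    # Equivalent to two filter() calls on the cached tagged RDD,
    # so step B's work is not recomputed for each branch.
    main_out = [v for tag, v in tagged if tag == MAIN]
    side_out = [v for tag, v in tagged if tag == SIDE]
    return main_out, side_out

main_out, side_out = run([1, 2, 3])
print(main_out)  # [2, 4, 6]
print(side_out)  # ['temp-for-1', 'temp-for-2', 'temp-for-3']
```

The caching step matters: without it, each downstream filter would re-run step B's computation when the two branches are evaluated.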