Re: Output Side Effects for different chain of operations
I am already creating these files on the slave nodes. How can I create an RDD from these slaves?

Regards
Sumit Chawla

On Thu, Dec 15, 2016 at 11:42 AM, Reynold Xin wrote:
> You can just write some files out directly (and idempotently) in your
> map/mapPartitions functions. It is just a function in which you can run
> arbitrary code, after all.
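One way to get "an RDD from these files" is to have the step-B closure emit the paths it wrote as its output records, so the downstream step receives an RDD of paths. A minimal plain-Python sketch of that pattern (no Spark here; `step_b`, `step_e`, and the file names are made up for illustration, and the loop stands in for `mapPartitionsWithIndex`):

```python
import os
import tempfile

def step_b(partition_id, records, out_dir):
    """mapPartitions-style closure: write this partition's temp file
    and emit its path as the partition's only output record."""
    path = os.path.join(out_dir, f"step-b-{partition_id:05d}.txt")
    with open(path, "w") as f:
        f.writelines(f"{r}\n" for r in records)
    yield path

def step_e(paths):
    """Downstream step: consume the files named by the path records."""
    for p in paths:
        with open(p) as f:
            yield from (line.rstrip("\n") for line in f)

out_dir = tempfile.mkdtemp()
partitions = [["a", "b"], ["c"]]
# In Spark this loop would be rdd.mapPartitionsWithIndex(step_b); the
# resulting RDD of paths is what step E can be built from (e.g. by
# collecting the paths and calling sc.parallelize on them).
paths = [p for i, part in enumerate(partitions)
         for p in step_b(i, part, out_dir)]
records = list(step_e(paths))
```

Note the caveat: the paths are only meaningful on the node that wrote them, so step E's tasks would need to run where the files are (or the files need to be on storage visible to all executors).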
Re: Output Side Effects for different chain of operations
You can just write some files out directly (and idempotently) in your map/mapPartitions functions. It is just a function in which you can run arbitrary code, after all.

On Thu, Dec 15, 2016 at 11:33 AM, Chawla,Sumit wrote:
> Any suggestions on this one?
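A minimal sketch of the idempotent-write idea, in plain Python rather than Spark (the function below stands in for what a `mapPartitions` closure might do; the names and paths are made up). Writing to a temp file and then renaming keeps the write safe if Spark retries the task, since the rename is atomic on POSIX filesystems:

```python
import os
import tempfile

def write_partition_idempotently(partition_id, records, out_dir):
    """Stand-in for a mapPartitions closure: write one file per
    partition so that a retried task simply overwrites its file."""
    final_path = os.path.join(out_dir, f"part-{partition_id:05d}.txt")
    # Write to a temp file first, then rename into place: a retried
    # or concurrent attempt never leaves a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir)
    with os.fdopen(fd, "w") as f:
        for rec in records:
            f.write(f"{rec}\n")
    os.replace(tmp_path, final_path)
    # Yield the path so downstream code can see where the data went.
    yield final_path

out_dir = tempfile.mkdtemp()
paths = list(write_partition_idempotently(0, ["a", "b", "c"], out_dir))
```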
Re: Output Side Effects for different chain of operations
Any suggestions on this one?

Regards
Sumit Chawla

On Tue, Dec 13, 2016 at 8:31 AM, Chawla,Sumit wrote:
Output Side Effects for different chain of operations
Hi All

I have a workflow with different steps in my program. Let's say these are steps A, B, C, D. Step B produces some temp files on each executor node. How can I add another step E which consumes these files?

I understand the easiest choice is to copy all these temp files to a shared location, and then step E can create another RDD from them and work on that. But I am trying to avoid this copy. I was wondering if there is any way I can queue up these files for E as they are generated on the executors. Is there any possibility of creating a dummy RDD at the start of the program, and then pushing these files into this RDD from each executor?

I take my inspiration from the concept of Side Outputs in Google Dataflow:

https://cloud.google.com/dataflow/model/par-do#emitting-to-side-outputs-in-your-dofn

Regards
Sumit Chawla
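For reference, the copy-to-shared-location baseline described above could be sketched roughly like this (a plain-Python stand-in, not Spark; the directory names and file pattern are invented for illustration):

```python
import glob
import os
import shutil
import tempfile

# Step B on each executor writes local temp files; they are then copied
# into a shared directory visible to all nodes (the copy the question
# is trying to avoid).
shared = tempfile.mkdtemp()   # stands in for e.g. an NFS/HDFS path
local = tempfile.mkdtemp()
for name, text in [("b-0.txt", "x\n"), ("b-1.txt", "y\n")]:
    src = os.path.join(local, name)
    with open(src, "w") as f:
        f.write(text)
    shutil.copy(src, shared)

# Step E then builds its input by listing the shared location; in
# Spark this might be something like sc.textFile(shared + "/b-*.txt").
files = sorted(glob.glob(os.path.join(shared, "b-*.txt")))
```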