Contributions in this space are welcome. Reach out to [email protected] once you're ready.
On Fri, Aug 17, 2018 at 3:02 AM Pascal Gula <[email protected]> wrote:

> Hi Lukasz,
> thanks for the proposed solution. This was also one of the alternative
> implementations that I thought of.
> When you talk about launching a job from another job, do I understand
> correctly that this means doing a system call from another Python job and
> getting the result by some means (synchronously reading the output of the
> child jobs)?
> I'll first test this with the DirectRunner calling other DirectRunner(s),
> and afterwards do it on GCP with Dataflow.
> Regarding nested pipelines, I can provide support to build a demonstrator
> if I can get some support from the community.
> Thanks again and very best regards,
> Pascal
>
> On Thu, Aug 16, 2018 at 8:43 PM, Lukasz Cwik <[email protected]> wrote:
>
>> You can launch another Dataflow job from within an existing Dataflow job.
>> For all intents and purposes, Dataflow won't know that the jobs are
>> related in any way, so they will only be "nested" because your outer
>> pipeline knows about the inner pipeline.
>>
>> You should be able to do this for all runners (granted, you need to
>> propagate all runner/pipeline configuration through), and you should be
>> able to take a job from one runner and launch a job on a different runner
>> (you'll have to deal with the complexities of having two runners and
>> their dependencies somehow, though).
>>
>> There was some work investigating support for nested graphs within Apache
>> Beam, and for dynamic graph expansion during execution as a general
>> concept. This was to support use cases such as recursion and loops, but it
>> didn't progress much beyond the idea-generation phase.
>>
>> On Thu, Aug 16, 2018 at 9:47 AM Pascal Gula <[email protected]> wrote:
>>
>>> Hi Robin,
>>> this is unfortunate news, but I had already anticipated such an answer
>>> with an alternative implementation.
>>> It would, however, be interesting to support such a feature, since I am
>>> probably not the first person asking for this.
>>> Best regards,
>>> Pascal
>>>
>>> On Thu, Aug 16, 2018 at 6:20 PM, Robin Qiu <[email protected]> wrote:
>>>
>>>> Hi Pascal,
>>>>
>>>> As far as I know, you can't create a sub-pipeline within a DoFn, i.e.
>>>> nested pipelines are not supported.
>>>>
>>>> Best,
>>>> Robin
>>>>
>>>> On Thu, Aug 16, 2018 at 7:03 AM Pascal Gula <[email protected]> wrote:
>>>>
>>>>> As a bonus, here is a simplified diagram view of the use case:
>>>>>
>>>>> Cheers,
>>>>> Pascal
>>>>>
>>>>> On Thu, Aug 16, 2018 at 3:12 PM, Pascal Gula <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>> I am currently evaluating Apache Beam (later executing on Google
>>>>>> Dataflow), and for the first use case I am working on, I have a
>>>>>> design question, in case any of you have had a similar one.
>>>>>> Namely, we have a DB describing dashboard views, and for each view,
>>>>>> we would like to perform some aggregation transform.
>>>>>> My first approach would be to create a higher-level pipeline that
>>>>>> fetches all view configurations from our MongoDB (by the way, we
>>>>>> released a MongoDB IO connector here:
>>>>>> https://pypi.org/project/beam-extended/).
>>>>>> With this views PCollection, the idea is to have a ParDo with a DoFn
>>>>>> that creates a sub-pipeline to perform the aggregation on data from
>>>>>> our plant database, with a query derived from the view configuration.
>>>>>> Afterwards, the idea is to save, for the higher-level pipeline, some
>>>>>> performance/data metrics related to the execution of the array of
>>>>>> sub-pipelines.
>>>>>> The main question is: are nested pipelines supported by the runners?
>>>>>> I hope that my description was clear enough. I will work on a diagram
>>>>>> view meanwhile.
>>>>>> Very best regards,
>>>>>> Pascal
>>>>>>
>>>>>> --
>>>>>> Pascal Gula
>>>>>> Senior Data Engineer / Scientist
>>>>>> +49 (0)176 34232684
>>>>>> www.plantix.net <http://plantix.net/>
>>>>>> PEAT GmbH
>>>>>> Kastanienallee 4
>>>>>> 10435 Berlin // Germany
>>>>>> Download the App!
>>>>>> <https://play.google.com/store/apps/details?id=com.peat.GartenBank>
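[Editor's note] The "system call" variant Pascal describes (launching a child Python job from the parent and synchronously reading its output) can be sketched as below. This is a minimal stand-alone illustration, not code from the thread: the child job is an inlined script that just sums its input, standing in for a script that would build and run its own Beam pipeline; the `run_child_job` helper and the JSON-over-stdout protocol are hypothetical choices made for self-containment.

```python
import json
import subprocess
import sys

# Stand-in for a child job script: reads a JSON list of values from stdin,
# performs the "aggregation" (here just a sum), and prints the result as
# JSON on stdout. In practice this would be a separate file that builds and
# runs an independent Beam pipeline.
CHILD_JOB = """
import json, sys
values = json.load(sys.stdin)
print(json.dumps({"total": sum(values)}))
"""

def run_child_job(values):
    """Launch a child Python job and synchronously read its stdout."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_JOB],
        input=json.dumps(values),
        capture_output=True,
        text=True,
        check=True,  # raise if the child job fails
    )
    return json.loads(proc.stdout)

# One child job per view configuration, executed sequentially.
results = [run_child_job(v) for v in ([1, 2, 3], [10, 20])]
print(results)  # [{'total': 6}, {'total': 30}]
```

Because the parent blocks on each `subprocess.run`, the jobs are only "nested" from the parent's point of view, matching Lukasz's observation that the runner itself sees unrelated jobs.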
