I have invited you to the Slack channel.

2 million data points shouldn't be an issue.
Have you considered trying a simpler combiner, like Count, to see whether
the bottleneck is the combiner you are supplying?
Also, could you share the code for your resample_function?
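
To illustrate the Count idea, something along these lines keeps your
windowing but swaps in the stock count combiner (a rough sketch;
input_data stands in for your existing PCollection):

import apache_beam as beam
from datetime import timedelta
from apache_beam.transforms.window import FixedWindows

# Same 5-minute windows, but with the built-in count combiner in place
# of resample_function. If this finishes quickly on the full dataset,
# the slow part is your combiner rather than the windowing/grouping.
counts = (
    input_data
    | beam.WindowInto(FixedWindows(size=timedelta(minutes=5).total_seconds()))
    | beam.CombineGlobally(beam.combiners.CountCombineFn()).without_defaults()
)

One thing worth checking along the way: CombineGlobally over non-global
windows needs .without_defaults() (or .as_singleton_view()), so if your
real pipeline doesn't already call it, that's worth a look as well.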

On Mon, Aug 14, 2017 at 2:43 AM, Tristan Marechaux <[email protected]> wrote:

> Hi all,
>
> I wrote a Beam pipeline with the Python SDK that resamples a time series
> containing one data point per minute into a 5-minute time series.
>
> My pipeline looks like:
> input_data
> | WindowInto(FixedWindows(size=timedelta(minutes=5).total_seconds()))
> | CombineGlobally(resample_function)
>
> When I run it with the local or Dataflow runner on a small dataset, it
> works and does what I want.
>
> But when I try to run it on the Dataflow runner with a bigger dataset
> (1,700,000 data points timestamped over 15 years), it stays stuck for
> hours on the GroupByKey step of CombineGlobally.
>
> My question is: did I do something wrong with the design of my pipeline?
>
> PS: Can someone invite me to the Slack channel?
> --
>
> Tristan Marechaux
>
> Data Scientist | Walnut Algorithms
>
> Mobile: +33 627804399
>
> Email: [email protected]
>
> Web: www.walnutalgorithms.com
>
