I have invited you to the Slack channel. 2 million data points doesn't sound like it should be an issue. Have you considered trying a simpler combiner, like Count, to see whether the bottleneck is the combiner you are supplying? Also, could you share the code for what resample_function does?
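For example, here is a minimal diagnostic sketch, assuming one data point per minute and the same 5-minute fixed windows as in your pipeline. The input is synthetic stand-in data, and the built-in CountCombineFn replaces your combiner so the GroupByKey cost can be observed on its own (note that in the Python SDK, a global combine under non-global windowing needs .without_defaults()):

    import apache_beam as beam
    from apache_beam.transforms.combiners import CountCombineFn
    from apache_beam.transforms.window import FixedWindows, TimestampedValue

    with beam.Pipeline() as p:
        counts = (
            p
            # Synthetic stand-in for the real input: one value per minute.
            | 'Create' >> beam.Create(range(10000))
            | 'AddTimestamps' >> beam.Map(lambda i: TimestampedValue(i, i * 60))
            | 'Window5m' >> beam.WindowInto(FixedWindows(5 * 60))
            # Built-in counting combiner instead of the custom one;
            # without_defaults() is required for a global combine
            # under non-global windowing.
            | 'Count' >> beam.CombineGlobally(CountCombineFn()).without_defaults()
            | 'Print' >> beam.Map(print)
        )

If Count finishes quickly at full scale but your combiner does not, the combiner is the likely culprit; if both stall, the problem is more likely in the windowing/shuffle itself.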
On Mon, Aug 14, 2017 at 2:43 AM, Tristan Marechaux <[email protected]> wrote:

> Hi all,
>
> I wrote a Beam pipeline with the Python SDK that resamples a time series
> containing data points every minute into a 5-minute time series.
>
> My pipeline looks like:
>
>     input_data
>         | WindowInto(FixedWindows(size=timedelta(minutes=5).total_seconds()))
>         | CombineGlobally(resample_function)
>
> When I run it with the local or Dataflow runner on a small dataset, it
> works and does what I want.
>
> But when I try to run it on the Dataflow runner with a bigger dataset
> (1,700,000 data points timestamped over 15 years), it stays stuck for
> hours on the GroupByKey step of CombineGlobally.
>
> My question is: did I do something wrong with the design of my pipeline?
>
> PS: Can someone invite me to the Slack channel?
>
> --
> Tristan Marechaux
> Data Scientist | Walnut Algorithms
> Mobile: +33 627804399
> Email: [email protected]
> Web: www.walnutalgorithms.com
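For reference, here is roughly how I read the pipeline above in the Python SDK. ResampleFn is only a hypothetical skeleton standing in for your resample_function (it just averages each window), which is why seeing the real code would help:

    import apache_beam as beam
    from datetime import timedelta
    from apache_beam.transforms.window import FixedWindows, TimestampedValue

    class ResampleFn(beam.CombineFn):
        # Hypothetical placeholder for resample_function: averages the
        # values that fall inside each 5-minute window.
        def create_accumulator(self):
            return 0.0, 0  # (running sum, count)

        def add_input(self, accumulator, value):
            total, count = accumulator
            return total + value, count + 1

        def merge_accumulators(self, accumulators):
            totals, counts = zip(*accumulators)
            return sum(totals), sum(counts)

        def extract_output(self, accumulator):
            total, count = accumulator
            return total / count if count else float('nan')

    with beam.Pipeline() as p:
        resampled = (
            p
            # Tiny synthetic (timestamp, value) series, one point per minute.
            | 'Create' >> beam.Create([(i * 60, float(i)) for i in range(30)])
            | 'AddTimestamps' >> beam.Map(lambda tv: TimestampedValue(tv[1], tv[0]))
            | 'Window5m' >> beam.WindowInto(FixedWindows(timedelta(minutes=5).total_seconds()))
            | 'Resample' >> beam.CombineGlobally(ResampleFn()).without_defaults()
            | 'Print' >> beam.Map(print)
        )

The shape of the CombineFn matters here: Dataflow can do partial combining before the shuffle only when add_input and merge_accumulators are cheap, so a combiner that has to buffer an entire window's worth of data can make the GroupByKey stage appear stuck.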
