Context

Hi all, I have been using Apache Beam pipelines to generate columnar DB files
and store them in GCS. I have a data stream coming in from Kafka and use
fixed windows of 1 minute.

I want to transform all the data of a 1-minute window into a columnar DB file
(ORC in my case, though it could be Parquet or anything else), and I have
written a pipeline for this transformation.
Problem

I am experiencing general slowness. I suspect it is due to the GroupByKey
transform, as I have only one key. Is that step really necessary? If not,
what should be done instead? I read that Combine isn't very useful here,
since my pipeline isn't really aggregating the data but creating a merged
file. What I need is exactly one iterable of objects per window, which is
then transformed into an ORC file.
Pipeline Representation

input -> window -> GroupByKey (only 1 key) -> ParDo (to create the DB file)
-> IO (to write to GCS)
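For context, the grouping step conceptually collects every element of a
1-minute window under a single key, so that one iterable per window reaches
the ParDo that writes the ORC file. A plain-Python sketch of that behaviour
(not actual Beam code; the timestamps and payloads are made up for
illustration):

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # 1-minute fixed windows, as in the pipeline


def group_by_window(records):
    """Bucket (timestamp, payload) records into 1-minute windows.

    This mimics what window -> GroupByKey does with a single key:
    every record falling in the same window ends up in one iterable,
    which would then be handed to the ParDo that builds the ORC file.
    """
    windows = defaultdict(list)
    for ts, payload in records:
        window_start = ts - (ts % WINDOW_SECONDS)
        windows[window_start].append(payload)
    return dict(windows)


# Hypothetical records as (unix_timestamp, payload) pairs.
records = [(0, "a"), (10, "b"), (59, "c"), (60, "d"), (119, "e"), (120, "f")]
print(group_by_window(records))
# Records land in three windows: [0, 60), [60, 120), [120, 180)
```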
What I have tried

I have tried using the profiler and scaling horizontally/vertically. The
profiler showed more than 50% of the time going into the GroupByKey
operation. I believe the problem is a hot key, but I am unable to find a
solution for what should be done about it. When I removed the GroupByKey
operation, my pipeline kept up with the Kafka lag (i.e., the issue doesn't
seem to be at the Kafka end).
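For what it's worth, one pattern I came across while searching is key
salting: instead of one key per window, append a random shard index so the
grouping work spreads over several smaller groups, then write or merge the
per-shard results. I'm not sure it fits my case, but a minimal stdlib
illustration of the idea (shard count and record shape are hypothetical):

```python
import random
from collections import defaultdict

NUM_SHARDS = 8  # hypothetical shard count; would be tuned to worker parallelism


def salted_key(window_start, rng=random):
    """Replace the single hot key with (window, shard) so grouping
    yields up to NUM_SHARDS smaller groups per window instead of one
    huge one."""
    return (window_start, rng.randrange(NUM_SHARDS))


def group_with_salt(records, rng=random):
    """Group (window_start, payload) records by their salted key."""
    groups = defaultdict(list)
    for window_start, payload in records:
        groups[salted_key(window_start, rng)].append(payload)
    return groups


# Each window now yields several groups, each of which could be written
# as a separate ORC file (or merged in a cheap second step).
records = [(0, i) for i in range(1000)]
groups = group_with_salt(records)
assert all(key[0] == 0 for key in groups)          # all groups belong to window 0
assert sum(len(v) for v in groups.values()) == 1000  # no records lost
```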

Best
Anand Singh Kunwar
