Context

Hi all, I have been using Apache Beam pipelines to generate columnar DB files
and store them in GCS. I have a data stream coming in from Kafka and use
fixed windows of 1 minute.

I want to transform all the data of a 1-minute window into a columnar DB file
(ORC in my case, though it could be Parquet or anything else), and I have
written a pipeline for this transformation.
Problem

I am experiencing general slowness. I suspect it is due to the GroupByKey
transform, as I have only one key. Is that step really necessary? If not,
what should be done instead? I read that Combine isn't very useful here,
since my pipeline isn't really aggregating the data but creating a merged
file. What I need is exactly one iterable of objects per window, which is
then transformed into an ORC file.
Pipeline Representation

input -> window -> GroupByKey (only 1 key) -> ParDo (to create the DB file)
-> IO (to write to GCS)
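For context, the grouping step conceptually collects every element of a
1-minute window under a single key, so that one iterable per window reaches
the ParDo that writes the ORC file. A plain-Python sketch of that behaviour
(not actual Beam code; the timestamps and payloads are made up for
illustration):

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # 1-minute fixed windows, as in the pipeline


def group_by_window(records):
    """Bucket (timestamp, payload) records into 1-minute windows.

    This mimics what window -> GroupByKey does with a single key:
    every record falling in the same window ends up in one iterable,
    which would then be handed to the ParDo that builds the ORC file.
    """
    windows = defaultdict(list)
    for ts, payload in records:
        window_start = ts - (ts % WINDOW_SECONDS)
        windows[window_start].append(payload)
    return dict(windows)


# Hypothetical records as (unix_timestamp, payload) pairs.
records = [(0, "a"), (10, "b"), (59, "c"), (60, "d"), (119, "e"), (120, "f")]
print(group_by_window(records))
# Records land in three windows: [0, 60), [60, 120), [120, 180)
```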
What I have tried

I have tried using the profiler and scaling horizontally/vertically. The
profiler showed more than 50% of the time going into the GroupByKey
operation. I believe the problem is a hot key, but I am unable to find a
solution for what should be done about it. When I removed the GroupByKey
operation, my pipeline kept up with the Kafka lag (i.e., the issue doesn't
seem to be at the Kafka end).
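For what it's worth, one pattern I came across while searching is key
salting: instead of one key per window, append a random shard index so the
grouping work spreads over several smaller groups, then write or merge the
per-shard results. I'm not sure it fits my case, but a minimal stdlib
illustration of the idea (shard count and record shape are hypothetical):

```python
import random
from collections import defaultdict

NUM_SHARDS = 8  # hypothetical shard count; would be tuned to worker parallelism


def salted_key(window_start, rng=random):
    """Replace the single hot key with (window, shard) so grouping
    yields up to NUM_SHARDS smaller groups per window instead of one
    huge one."""
    return (window_start, rng.randrange(NUM_SHARDS))


def group_with_salt(records, rng=random):
    """Group (window_start, payload) records by their salted key."""
    groups = defaultdict(list)
    for window_start, payload in records:
        groups[salted_key(window_start, rng)].append(payload)
    return groups


# Each window now yields several groups, each of which could be written
# as a separate ORC file (or merged in a cheap second step).
records = [(0, i) for i in range(1000)]
groups = group_with_salt(records)
assert all(key[0] == 0 for key in groups)          # all groups belong to window 0
assert sum(len(v) for v in groups.values()) == 1000  # no records lost
```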

Best
Anand Singh Kunwar
