Hi all,

Context

I have been using Apache Beam pipelines to generate columnar DB files to store in GCS. I have a data stream coming in from Kafka, windowed into 1-minute windows. I want to transform all data of each 1-minute window into a single columnar DB file (ORC in my case; it could be Parquet or anything else), and I have written a pipeline for this transformation.

Problem

I am experiencing general slowness. I suspect it could be due to the group-by-key transformation, as I have only one key. Is there really a need for it? If not, what should be done instead? I read that Combine isn't very useful here, as my pipeline isn't really aggregating the data but creating a merged file. What I exactly need is an iterable list of objects per window, which will then be transformed into an ORC file.

Pipeline Representation

input -> window -> group by key (only 1 key) -> pardo (to create DB file) -> IO (to write to GCS)
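For reference, here is a minimal sketch of the shape of the pipeline (Java SDK; the broker address, topic name, element types, and the WriteWindowToOrcFn DoFn are simplified placeholders, not my actual code):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class OrcWindowPipeline {

  // Placeholder DoFn: builds one ORC file from a window's iterable and
  // uploads it to GCS (the actual ORC/GCS writing code is elided here).
  static class WriteWindowToOrcFn extends DoFn<KV<String, Iterable<String>>, Void> {
    @ProcessElement
    public void processElement(@Element KV<String, Iterable<String>> element) {
      // iterate element.getValue(), write rows into an ORC file, upload to GCS
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply("ReadFromKafka", KafkaIO.<String, String>read()
            .withBootstrapServers("broker:9092")            // placeholder
            .withTopic("events")                            // placeholder
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())                             // KV<String, String>
        .apply(Values.<String>create())
        // 1-minute fixed windows
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        // every element gets the same constant key -> a single (hot) key
        .apply("SingleKey", WithKeys.<String, String>of("all"))
        // the suspected bottleneck: each window's data shuffles to one worker
        .apply(GroupByKey.<String, String>create())
        // turn each window's Iterable<String> into one ORC file on GCS
        .apply("WriteOrcToGCS", ParDo.of(new WriteWindowToOrcFn()));

    p.run().waitUntilFinish();
  }
}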
What I have tried

I have tried using the profiler and scaling horizontally/vertically. With the profiler, I saw more than 50% of the time going into the group-by-key operation. I believe the problem is a hot key, but I am unable to find a solution for what should be done instead. When I removed the group-by-key operation, my pipeline kept up with the Kafka lag (i.e., the issue doesn't seem to be on the Kafka end).

Best,
Anand Singh Kunwar