Hi
The slowness is that consumption isn't keeping up with the rate of production
in Kafka. If I just consume messages from Kafka and do nothing (no group by
key), consumption keeps up. My watermark is one minute behind the Kafka
messages.

Best
Anand Singh Kunwar

On Fri, Mar 6, 2020, 22:47 Luke Cwik <lc...@google.com> wrote:

> Slowness how?
> Is the pipeline getting backed up so that the pipeline is falling behind
> compared to where the Kafka source is?
> Is the watermark associated with Kafka advancing?
>
> On Fri, Mar 6, 2020 at 5:39 AM Anand Singh Kunwar <anandkunwa...@gmail.com>
> wrote:
>
>> Hi
>> Context
>>
>> I have been using Apache Beam pipelines to generate columnar DB files to
>> store in GCS. I have a data stream coming in from Kafka and windows of 1
>> minute.
>>
>> I want to transform all the data of that 1-minute window into a columnar DB
>> file (ORC in my case; it could be Parquet or anything else), and I have
>> written a pipeline for this transformation.
>> Problem
>>
>> I am experiencing general slowness. I suspect it could be due to the
>> group-by-key transformation, as I have only one key. Is there really a need
>> to do that? If not, what should be done instead? I read that Combine isn't
>> very useful here, as my pipeline isn't really aggregating the data but
>> creating a merged file. What I need is an iterable of objects per window
>> that will be transformed into ORC files.
>> Pipeline Representation
>>
>> input -> window -> group by key (only 1 key) -> pardo (to create DB) ->
>>> IO (to write to GCS)
>>>
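To make the bottleneck concrete: a group-by-key with a single key funnels every element of the window into one (key, iterable) pair, so the downstream ParDo runs on a single worker. A toy plain-Python model of the shuffle semantics (illustrative only, not Beam code; all names are made up):

```python
from collections import defaultdict

def group_by_key(pairs):
    """Toy model of a GroupByKey shuffle: all values for a given key
    end up in one group, handled by a single downstream worker."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# With only one key, the whole window collapses into one giant group:
window = [("only-key", i) for i in range(5)]
grouped = group_by_key(window)
# grouped == {"only-key": [0, 1, 2, 3, 4]} -- no parallelism downstream
```

This is why removing the group-by-key restores throughput: without it, elements stay spread across workers.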
>> What I have tried
>>
I have tried using the profiler and scaling horizontally/vertically. The
profiler showed more than 50% of the time going into the group-by-key
operation. I believe the problem is hot keys, but I am unable to find out
what should be done. When I remove the group-by-key operation, my pipeline
keeps up with the Kafka lag (i.e., it doesn't seem to be an issue at the
Kafka end).
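One common mitigation for a hot key like this (not from the original thread; the shard count and names below are illustrative assumptions) is to salt the single key into N sub-keys before the group-by-key, so the shuffle fans out across N workers, each producing one smaller per-window part file. A plain-Python sketch of the key transformation:

```python
import hashlib

NUM_SHARDS = 8  # assumption: tune roughly to the number of workers

def shard_key(key, value, num_shards=NUM_SHARDS):
    """Replace the single hot key with a (key, shard) pair so the
    group-by-key shuffle spreads across num_shards workers instead
    of one. The shard is a deterministic hash of the value, so the
    same element always lands in the same shard on retries."""
    digest = hashlib.md5(repr(value).encode("utf-8")).hexdigest()
    return ((key, int(digest, 16) % num_shards), value)

# Each (key, shard) group then becomes one smaller ORC part file per
# window; the part files can be concatenated later or left as-is in GCS.
```

The trade-off is N output files per window instead of one merged file; if a single file is required, a cheap second merge step over the N parts is still far less skewed than grouping everything under one key.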
>>
>> Best
>> Anand Singh Kunwar
>>
>
