Re: Slowness / Lag in beam streaming pipeline in group by key stage

2020-03-06 Thread Anand Singh Kunwar
Hi
The slowness is that consumption doesn't keep up with the rate of production
in Kafka. If I just consume messages from Kafka and do nothing (no group by
key), consumption keeps up. My watermark is about one minute behind the Kafka
messages.

Best
Anand Singh Kunwar

On Fri, Mar 6, 2020, 22:47 Luke Cwik  wrote:

> Slowness how?
> Is the pipeline getting backed up so that the pipeline is falling behind
> compared to where the Kafka source is?
> Is the watermark associated with Kafka advancing?
>
> On Fri, Mar 6, 2020 at 5:39 AM Anand Singh Kunwar wrote:


Slowness / Lag in beam streaming pipeline in group by key stage

2020-03-06 Thread Anand Singh Kunwar
Hi all,
Context

I have been using Apache Beam pipelines to generate columnar DB files to
store in GCS. I have a data stream coming in from Kafka and use windows of
1 minute.

I want to transform all the data of that 1-minute window into a columnar DB
file (ORC in my case; it could be Parquet or anything else), and I have
written a pipeline for this transformation.
Problem

I am experiencing general slowness. I suspect it could be due to the group
by key transformation, as I have only one key. Is there really a need for
it? If not, what should be done instead? I read that Combine isn't very
useful here, as my pipeline isn't really aggregating the data but creating a
merged file. What I need is exactly one iterable of objects per window,
which will then be transformed into ORC files.
Pipeline Representation

input -> window -> group by key (only 1 key) -> pardo (to create DB) ->
IO (to write to GCS)
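To make the pipeline shape concrete, here is a minimal pure-Python sketch of what the window -> group by key stages are doing per window (Beam types, triggers, and IO are omitted; the record shape and timestamps are assumptions for illustration):

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # 1-minute fixed windows, as in the pipeline


def window_start(ts):
    """Map an event timestamp (in seconds) to the start of its 1-minute window."""
    return ts - (ts % WINDOW_SECONDS)


def group_by_window(records):
    """Simulate window -> group by key (single key) -> one iterable per window.

    `records` is an iterable of (timestamp_seconds, payload) pairs. Because
    there is only one grouping key, every record of a window lands in one
    group, which is what the downstream ParDo turns into one ORC file.
    """
    groups = defaultdict(list)
    for ts, payload in records:
        groups[window_start(ts)].append(payload)
    return dict(groups)


records = [(0, "a"), (59, "b"), (60, "c"), (125, "d")]
print(group_by_window(records))
# {0: ['a', 'b'], 60: ['c'], 120: ['d']}
```

The single key is what forces all records of a window through one group, and therefore through one worker, in the real pipeline.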
What I have tried

I have tried using a profiler and scaling horizontally/vertically. In the
profiler I saw more than 50% of the time going into the group by key
operation. I believe the problem is hot keys, but I am unable to find a
solution for what should be done. When I removed the group by key operation,
my pipeline kept up with the Kafka lag (i.e., it doesn't seem to be an issue
at the Kafka end).
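A common remedy for a single hot key is to shard it: append a random shard number to the key so the group-by fans out across workers, then merge the per-shard groups in a second step. Below is an illustrative pure-Python sketch of that two-stage idea (the `NUM_SHARDS` value, function names, and the `"all"` key are made up for the example; in a real Beam pipeline the equivalent would be a keyed fan-out before GroupByKey):

```python
import random
from collections import defaultdict

NUM_SHARDS = 8  # assumed fan-out; in practice, tune to the number of workers


def shard_key(key, rng=random):
    """Turn the single hot key into one of NUM_SHARDS composite keys."""
    return (key, rng.randrange(NUM_SHARDS))


def group_with_sharding(records, key="all"):
    """Two-stage group-by: shard the hot key, group, then merge the shards.

    Stage 1 spreads the single key over NUM_SHARDS composite keys, so the
    (simulated) group-by can parallelize. Stage 2 merges the per-shard
    groups back into one iterable per window, mirroring a second, much
    cheaper grouping step.
    """
    sharded = defaultdict(list)
    for payload in records:  # stage 1: group under (key, shard)
        sharded[shard_key(key)].append(payload)
    merged = []
    for (_key, _shard), group in sorted(sharded.items()):  # stage 2: merge
        merged.extend(group)
    return merged


data = list(range(100))
out = group_with_sharding(data)
assert sorted(out) == data  # same elements, just redistributed across shards
```

The trade-off is a second shuffle step, but each shard's group is small enough that no single worker becomes the bottleneck.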

Best
Anand Singh Kunwar