Hi Jason,

Thanks for the reply. I think you overcome the jumbled records by sorting
(around here
https://github.com/ubiquibit-inc/sensor-failure/blob/20cda6c245022fb99685c1eec0878e0da4cc0ced/src/main/scala/com/ubiquibit/buoy/jobs/StationInterrupts.scala#L82
)

I also found that we can sort the data within the batch using the Kafka
offset value that comes from the Kafka source.
I think that, theoretically, records cannot get jumbled between batches;
this can happen only within a batch, due to shuffling.
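
For what it's worth, here is a minimal sketch (plain Scala, with a
hypothetical Record type standing in for the Kafka source's row schema) of
what I mean by sorting within the batch on the offset before applying the
state updates:

```scala
// Hypothetical shape of one record for a key after groupByKey; the
// offset field comes straight from the Kafka source's schema.
case class Record(offset: Long, payload: String)

// Inside the state update function, the iterator for a key may arrive
// shuffled, so materialize it and sort by Kafka offset first.
def orderedUpdates(records: Iterator[Record]): Seq[Record] =
  records.toSeq.sortBy(_.offset)

// Example: a shuffled batch comes back in offset order.
val batch = Iterator(Record(3, "c"), Record(1, "a"), Record(2, "b"))
val sorted = orderedUpdates(batch)
```

Since offsets are monotonically increasing per Kafka partition, this
restores per-key order as long as each key lives in a single partition.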

I think it would be better if we could get rid of groupByKey for Kafka
sources altogether when the data is already partitioned the way we want at
the Kafka level.
Flink has an experimental feature with this functionality:
https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/experimental.html

Thanks,
Akila

On Tue, Apr 9, 2019 at 9:25 PM Jason Nerothin <jasonnerot...@gmail.com>
wrote:

> I had that identical problem. Here’s what I came up with:
>
> https://github.com/ubiquibit-inc/sensor-failure
>
>
> On Tue, Apr 9, 2019 at 04:37 Akila Wajirasena <akila.wajiras...@gmail.com>
> wrote:
>
>> Hi
>>
>> I have a Kafka topic which is already loaded with data. I use a stateful
>> structured streaming pipeline using flatMapGroupsWithState to consume the
>> data in Kafka in a streaming manner.
>>
>> However, when I set the shuffle partition count > 1, I get some
>> out-of-order messages in each of my GroupStates. Is this the expected
>> behavior, or is message ordering guaranteed when using
>> flatMapGroupsWithState with a Kafka source?
>>
>> This is my pipeline:
>>
>> Kafka => GroupByKey(key from Kafka schema) => flatMapGroupsWithState =>
>> parquet
>>
>> When I printed out the Kafka offset for each key inside my state update
>> function they are not in order. I am using spark 2.3.3.
>>
>> Thanks & Regards,
>> Akila
>>
>>
>>
>> --
> Thanks,
> Jason
>
