Hello,

I'm writing a workflow using Kafka Streams where an incoming stream needs
to be denormalized and joined with a few dimension tables.
The result will be written back to another Kafka topic. Fairly typical, I believe.

1) Can I do a broadcast join if my dimension table is small enough to be
held in each processor's local state?
Without that, the KStream will be shuffled through an internal topic.
If not, should I write a transformer that reads the table in init() and
caches it? I could then do a map-like transformation in transform(), with
a cache lookup instead of a join.


2) When joining a KStream with a KTable for bigger tables, how do I stall
reading the KStream until the KTable is completely materialized?
I know "completely read" is a very loose term in stream processing :-) The
KTable is going to be fairly static, and its current content needs to be
read completely before it can be used for joins.
The only option I see for synchronizing streams is TimestampExtractor. I
can't figure out how to use it because:

  i) The join between a KStream and a KTable is time agnostic and doesn't
really fit into a window operation.
  ii) TimestampExtractor is set as a stream config at the global level,
while the timestamp extraction logic will be specific to each stream.
      How does one write a generic extractor?
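One way point ii) could be worked around, sketched below without Kafka types: a single extractor registered globally that dispatches to per-topic extraction logic. Everything here is hypothetical illustration; a real TimestampExtractor would receive a ConsumerRecord, whereas this sketch takes the topic name and payload explicitly.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical sketch: one "generic" extractor that holds per-topic
// extraction functions, since the config slot is global but the
// timestamp logic differs per stream.
public class DispatchingExtractor {
    private final Map<String, ToLongFunction<String>> perTopic = new HashMap<>();
    private final ToLongFunction<String> fallback;

    public DispatchingExtractor(ToLongFunction<String> fallback) {
        this.fallback = fallback;
    }

    // Register topic-specific timestamp logic.
    public void register(String topic, ToLongFunction<String> fn) {
        perTopic.put(topic, fn);
    }

    // Dispatch on the record's source topic; unknown topics use
    // the fallback (e.g. a wall-clock or zero timestamp).
    public long extract(String topic, String payload) {
        return perTopic.getOrDefault(topic, fallback).applyAsLong(payload);
    }
}
```

This keeps a single class in the global config while still allowing stream-specific extraction, at the cost of the extractor needing to know every topic it may see.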

Thanks,
Srikanth
