Spark Streaming: routing by key without groupByKey

Lin Zhao Fri, 15 Jan 2016 09:49:03 -0800

I have a requirement to route a paired DStream through a series of map and 
flatMap operations such that entries with the same key go to the same thread 
within the same batch. The closest I can come up with is 
groupByKey().flatMap(_._2), but this cuts throughput by 50%.


When I think about it, groupByKey does more than I need. With groupByKey, the 
same thread sees all events with key Alice at one time, and only Alice. For my 
requirement, it's still OK if events for Bob and Charlie are interleaved with 
them. This seems to be a common routing requirement and shouldn't cause a 50% 
performance hit. Is there a way to construct the stream like this that I'm not 
aware of?

I have read 
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
but reduceByKey isn't a solution, since we are not doing any aggregation. Our 
stream is a chain of map and flatMap[withState] operations.
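
To make the two routing guarantees concrete, here is a minimal sketch. The 
first method is the groupByKey().flatMap(_._2) approach described above. The 
second is one possible alternative, assuming that hash-partitioning each 
batch's RDD is enough to satisfy "same key, same thread, interleaving 
allowed". The object name KeyRouting, both method names, and the 
numPartitions parameter are hypothetical and only for illustration. 
partitionBy still shuffles, so whether it recovers the lost throughput is 
untested here.

```scala
import scala.reflect.ClassTag

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper object; names are illustrative, not from the original message.
object KeyRouting {

  // Current approach: group every value of a key together, then flatten.
  // A task sees all events for one key contiguously, which is stronger
  // than the stated requirement.
  def viaGroupByKey[K: ClassTag, V: ClassTag](events: DStream[(K, V)]): DStream[V] =
    events.groupByKey().flatMap(_._2)

  // Possible alternative (an assumption, not a confirmed fix): hash-partition
  // each batch by key. All events with the same key land in the same
  // partition of that batch, but different keys may be interleaved and no
  // per-key collection is built.
  def viaPartitionBy[K: ClassTag, V: ClassTag](
      events: DStream[(K, V)],
      numPartitions: Int): DStream[(K, V)] =
    events.transform(rdd => rdd.partitionBy(new HashPartitioner(numPartitions)))
}
```

Since each partition of a batch is processed by a single task, a downstream 
mapPartitions on the result of viaPartitionBy would see every event for a 
given key in that batch within one task. A plain map or flatMap that does 
not repartition should keep the same placement.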
