We have a use case with a stream of events, where every event carries an ID,
its current state, and a timestamp:

…
111,ready,1532949947
111,offline,1532949955
111,ongoing,1532949955
111,offline,1532949973
333,offline,1532949981
333,ongoing,1532949987
…
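
For concreteness, a line like those above could be modelled in Scala roughly
as follows (the field names are my own guesses, not an established schema):

// hypothetical schema for one "id,state,timestamp" line
case class Event(id: String, state: String, ts: Long)
// e.g. "111,ongoing,1532949955" maps to Event("111", "ongoing", 1532949955L)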

We want to ask questions about the current state of the *whole dataset*,
from the beginning of time, such as:
  "how many items are now in the ongoing state?"

(bear in mind that there are more complicated questions as well, but all of
them ask about the _current_ state of the dataset over its entire history)

I haven't found any simple, performant way of doing it.

The approaches I've found so far are (rough sketches of both are below):
1. Using mapGroupsWithState: groupByKey on the ID and, for each key, always
keep the latest event by timestamp as the state
2. Using groupByKey on the ID and keeping only the event whose timestamp is
the latest

Neither method is good: the first involves keeping state, which means
checkpointing, memory overhead, etc., and the second involves shuffling and
sorting.
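
To make this concrete, here is a rough, untested sketch of approach 1 in
Scala with Structured Streaming. The socket source, the app name and the
Event case class are placeholders I made up for illustration, not our actual
pipeline:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Event case class repeated from above so this file compiles on its own
case class Event(id: String, state: String, ts: Long)

object LatestStatePerId {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("latest-state-sketch")
      .getOrCreate()
    import spark.implicits._

    // Placeholder source; in practice this would be Kafka or similar
    val events = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999)
      .load()
      .as[String]
      .map { line =>
        val Array(id, state, ts) = line.split(",")
        Event(id, state, ts.toLong)
      }

    // Approach 1: per ID, keep only the newest event seen so far as the state
    val latestPerId = events
      .groupByKey(_.id)
      .mapGroupsWithState[Event, Event](GroupStateTimeout.NoTimeout) {
        (id: String, batch: Iterator[Event], state: GroupState[Event]) =>
          val newest = (batch ++ state.getOption.iterator).maxBy(_.ts)
          state.update(newest)
          newest
      }

    // Console sink just for illustration; a real job would aggregate
    // latestPerId (e.g. count by state) to feed the dashboard
    latestPerId.writeStream
      .outputMode(OutputMode.Update())
      .format("console")
      .start()
      .awaitTermination()
  }
}

Approach 2 looks roughly like this, shown batch-style over the accumulated
data (reusing the same SparkSession, implicits and Event case class; the
input path is a placeholder):

// Approach 2: shuffle everything by ID, keep only the latest event per ID,
// then answer the question from that deduplicated snapshot
val allEvents = spark.read
  .schema("id STRING, state STRING, ts LONG")
  .csv("/path/to/accumulated/events")                  // placeholder path
  .as[Event]

val latest = allEvents
  .groupByKey(_.id)
  .reduceGroups((a, b) => if (a.ts >= b.ts) a else b)  // newest event wins
  .map(_._2)

val ongoingNow = latest.filter(_.state == "ongoing").count()

Both sketches show exactly the drawbacks above: the first carries per-key
state (checkpointing, memory), and the second reshuffles the whole dataset
every time the query runs.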

We will have a lot of such queries in order to populate a real-time
dashboard.

So, as a general question: what is the correct way to process this type of
data in Spark Streaming?



