Hi All,

I am using Spark 2.2.0 and have the below use case:

*Reading from Kafka using Spark Streaming and updating (not just inserting)
the records into a downstream database*

I understand that Spark reads messages from Kafka in the order of the
partition offsets, not in the order of the timestamps stored with the
records. So suppose there are two messages in Kafka with the same key,
where the message carrying the latest timestamp sits at the smaller
(earlier) offset and the message carrying the older timestamp sits at the
larger (later) offset. Since Spark reads offsets in increasing order, the
latest-timestamp message is processed first and the older one after it, so
the stale value ends up overwriting the fresh one in the DB.

If both messages fall into the same RDD, I can apply a reduce per key,
ignore the message with the older timestamp, and process only the latest
one. But I am not quite sure how to handle the case where these messages
fall into different RDDs of the stream. One approach I tried is to hit the
DB for each key, retrieve the timestamp stored there, and skip the incoming
record if it is older. But this is not efficient when handling millions of
messages, since a DB round trip per record is expensive.
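For reference, here is a minimal sketch of the per-batch reduce I mean,
assuming a direct stream of ConsumerRecord[String, String] (the stream and
function names are made up for illustration). The plain fields are pulled
out before the shuffle, since ConsumerRecord itself is not serializable:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.DStream

// `stream` is assumed to come from KafkaUtils.createDirectStream
def writeLatestPerKey(stream: DStream[ConsumerRecord[String, String]]): Unit =
  stream.foreachRDD { rdd =>
    val latestPerKey = rdd
      // extract plain fields first; ConsumerRecord is not serializable,
      // so it must not cross the shuffle in reduceByKey
      .map(r => (r.key, (r.timestamp, r.value)))
      // keep only the newest timestamp seen for each key in this batch
      .reduceByKey((a, b) => if (a._1 >= b._1) a else b)
    latestPerKey.foreachPartition { records =>
      records.foreach { case (key, (ts, value)) =>
        // upsert (key, value, ts) into the downstream DB here
      }
    }
  }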
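And on the DB side, one way to avoid the read-then-compare round trip is to
push the timestamp check into the write itself, so the database drops stale
records inside one batched statement. This is only a hypothetical sketch
continuing from latestPerKey above; it assumes a PostgreSQL sink and a
table events(key, value, ts) purely for illustration:

import java.sql.DriverManager

// Hypothetical conditional upsert: the WHERE clause makes the database
// ignore any record whose timestamp is older than the row already stored,
// so no separate read per key is needed. Assumes PostgreSQL and a table
//   events(key TEXT PRIMARY KEY, value TEXT, ts BIGINT)
latestPerKey.foreachPartition { records =>
  val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")
  val stmt = conn.prepareStatement(
    """INSERT INTO events (key, value, ts) VALUES (?, ?, ?)
      |ON CONFLICT (key) DO UPDATE
      |  SET value = EXCLUDED.value, ts = EXCLUDED.ts
      |  WHERE events.ts < EXCLUDED.ts""".stripMargin)
  try {
    records.foreach { case (key, (ts, value)) =>
      stmt.setString(1, key)
      stmt.setString(2, value)
      stmt.setLong(3, ts)
      stmt.addBatch()
    }
    stmt.executeBatch()  // one round trip for the whole partition
  } finally {
    stmt.close()
    conn.close()
  }
}

With a guard like this, an out-of-order record arriving in a later batch
simply becomes a no-op, whichever batch it lands in. But I don't know if
that is the idiomatic way to handle this.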

Is there a better way of solving this problem?

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
