Hi all, I am using Spark 2.2.0 and have the use case below:
*Reading from Kafka using Spark Streaming and updating (not just inserting) records in a downstream database.*

I understand that Spark does not read messages from Kafka in the order of the timestamps stored in the partitions; it reads them in the order of the partition offsets. Suppose there are two messages in Kafka with the same key, where the message carrying the later timestamp sits at the smaller (earlier) offset and the message carrying the older timestamp sits at a larger (later) offset. Because Spark reads offsets in increasing order, the later-timestamp message is processed first and the older one second, resulting in out-of-order ingestion into the DB.

If both messages fall into the same RDD, we can apply a reduce function that ignores the older-timestamp message and processes only the latest one. But I am not sure how to handle the case where these messages fall into different RDDs in the stream.

One approach I tried is to query the DB for the timestamp stored for that key, compare, and skip the update if the incoming timestamp is older. But this is not efficient when handling millions of messages, since a DB round trip per record is expensive.

Is there a better way of solving this problem?

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
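To make the within-batch case concrete, here is a minimal plain-Python sketch of the reduce I have in mind. In a real PySpark job this would run as `rdd.map(lambda r: (r["key"], r)).reduceByKey(latest)`; the record shape (dicts with `key`, `timestamp`, `value`) and the names `latest`/`dedup` are illustrative assumptions, not Spark APIs:

```python
# Within one batch/RDD: keep only the latest record per key.
# Record shape (key/timestamp/value dicts) is an assumption for illustration.

def latest(a, b):
    """Of two records with the same key, keep the one with the newer timestamp."""
    return a if a["timestamp"] >= b["timestamp"] else b

def dedup(records):
    """Plain-Python stand-in for reduceByKey(latest), for illustration only."""
    by_key = {}
    for r in records:
        k = r["key"]
        by_key[k] = latest(by_key[k], r) if k in by_key else r
    return by_key
```

The same `latest` function would be passed to `reduceByKey` unchanged; only the grouping is done by Spark instead of the dict.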
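One direction I am considering for the cross-batch case is keeping the newest timestamp seen per key in Spark's streaming state (`updateStateByKey` in the DStream API), so later batches can drop stale records without a DB round trip. A sketch of that idea, assuming this state layout: `update_newest` has the `(new_values, state)` shape that `updateStateByKey` expects, while `should_write` is a hypothetical helper, not a Spark API:

```python
# Cross-batch dedup idea: streaming state remembers, per key, the newest
# timestamp seen so far; records older than the state are dropped before
# the DB write. These are plain functions; in a real job update_newest
# would be passed to DStream.updateStateByKey.

def update_newest(new_values, state):
    """State update: remember the max timestamp seen so far for a key."""
    candidates = list(new_values)
    if state is not None:
        candidates.append(state)
    return max(candidates) if candidates else state

def should_write(ts, state):
    """Write a record only if it is at least as new as the stored state."""
    return state is None or ts >= state
```

The trade-off, as I understand it, is that state for every key lives in Spark (checkpointed), which avoids the per-record DB lookup but grows with the number of distinct keys unless state is expired.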