I have a use case and cannot figure out the Spark Streaming way to handle it.
I have two Kafka topics carrying two different types of events, A and B. Each element from topic A has a corresponding element in topic B. Unfortunately, the two elements can arrive hours apart. The aggregation is not deterministic: I do not have a common key, only a list of rules that select the best candidates among those that have already arrived (most commonly one or two elements). The other candidates should not be discarded, because they also have corresponding elements. Once two elements are aggregated, the result is published to a Kafka topic.

Is the Spark Streaming way to do this a group-by-key over the entire RDD, followed by our aggregation function? How would I remove the aggregated elements: by tagging them and filtering? Would that be costly? Similarly, how would I remove candidates older than two days? And if a worker fails for a short time and then comes back online, what would happen?

Thank you.
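
To make the intended logic concrete, here is a minimal sketch in plain Python (no Spark involved) of what one batch should do: expire old candidates, pick the best B for each A according to the rules, and keep everything unmatched for the next batch. It assumes each rule is a function that scores an (A, B) pair, or returns None to reject it. The names `Event`, `Rule`, `best_match` and `process_batch`, the fields of `Event`, and the scoring convention are all hypothetical and for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumption: candidates older than two days are dropped.
MAX_AGE_SECONDS = 2 * 24 * 3600


@dataclass
class Event:
    event_id: str     # hypothetical identifier
    timestamp: float  # arrival time, in seconds
    payload: dict     # whatever the rules need to inspect


# Hypothetical rule shape: a score for the pair, or None to reject it.
Rule = Callable[[Event, Event], Optional[float]]


def best_match(a: Event, pending_b: List[Event], rules: List[Rule]) -> Optional[Event]:
    """Return the B candidate with the highest total score, or None."""
    best, best_score = None, None
    for b in pending_b:
        scores = [rule(a, b) for rule in rules]
        if any(s is None for s in scores):
            continue  # at least one rule rejected this pair
        total = sum(scores)
        if best_score is None or total > best_score:
            best, best_score = b, total
    return best


def process_batch(
    pending_a: List[Event],
    pending_b: List[Event],
    rules: List[Rule],
    now: float,
) -> Tuple[List[Tuple[Event, Event]], List[Event], List[Event]]:
    """Match what can be matched; return (matched pairs, leftover A, leftover B)."""
    # Drop candidates older than the retention window.
    live_a = [e for e in pending_a if now - e.timestamp <= MAX_AGE_SECONDS]
    live_b = [e for e in pending_b if now - e.timestamp <= MAX_AGE_SECONDS]

    matched: List[Tuple[Event, Event]] = []
    leftover_a: List[Event] = []
    for a in live_a:
        b = best_match(a, live_b, rules)
        if b is None:
            leftover_a.append(a)  # keep waiting for its partner
        else:
            matched.append((a, b))
            live_b.remove(b)  # a matched B is no longer a candidate
    return matched, leftover_a, live_b
```

The matched pairs would be published to the output topic, and the two leftover lists are the state that has to survive between batches. My question is essentially how to express this state handling in Spark Streaming.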