Hello All. I have a newbie question.
We have a use case where a huge amount of data will be coming in streams, or micro-batches of streams, and we want to process these streams according to some business logic. We don't have to provide extremely low latency guarantees, but batch M/R would still be too slow.

The business logic is such that, at the time of emitting the data, we might have to hold on to some tuples until we get more information. This 'more' information will essentially arrive in future streams. You could say this is a kind of *word count* use case where we have to *aggregate and maintain state across batches of streams*. One thing different here is that we might have to *maintain the state or data for a day or two* until the rest of the data comes in, and only then can we complete our output.

1- Are such use cases supported in Spark and/or Spark Streaming?

2- Will we be able to persist partially aggregated data until the rest of the information comes in later? I stress *persistence* here because, given that the delay can span a day or two, we won't want to keep the partial data in memory for that long.

I know this can be done in Storm, but I am really interested in Spark because of its close integration with Hadoop. We might not even want to use Spark Streaming (which is more of a direct comparison with Storm/Trident), given that our application does not have to be real-time down to the split-second.

Feel free to direct me to any document or resource.

Thanks a lot.

Regards,
Shahab
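P.S. To make the pattern concrete, here is a minimal pure-Python sketch (not Spark code; the function names are made up) of the kind of stateful aggregation I mean: partial counts are folded in across micro-batches and held until later-arriving information marks a key as complete. In Spark Streaming terms this would presumably map onto something like updateStateByKey with checkpointing, which is exactly what I'm asking about.

```python
# Minimal sketch of "aggregate and maintain state across batches of streams".
# State lives across micro-batches; nothing is emitted for a key until the
# late-arriving information (possibly a day or two later) says it is complete.

def apply_batch(state, batch):
    """Fold one micro-batch of (key, count) tuples into the running state."""
    for key, count in batch:
        state[key] = state.get(key, 0) + count

def complete(state, finished_keys):
    """Late-arriving info tells us which keys are complete: emit and drop them."""
    out = {}
    for key in finished_keys:
        if key in state:
            out[key] = state.pop(key)
    return out

# Usage: two micro-batches arrive, then the completion signal for "a".
state = {}
apply_batch(state, [("a", 1), ("b", 2)])
apply_batch(state, [("a", 3)])
emitted = complete(state, ["a"])   # emits {"a": 4}; "b" stays pending
```

The open question is whether Spark can keep `state` durable (on disk, not just in memory) across that day-or-two window.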