Hello All.

I have a newbie question.

We have a use case where a huge amount of data will arrive as streams, or
micro-batches of streams, and we want to process these streams according to
some business logic. We don't have to provide extremely low latency
guarantees, but batch MapReduce would still be too slow.

Now, the business logic is such that at the time of emitting the data we
might have to hold on to some tuples until we get more information. This
additional information will essentially arrive in future streams.

You can say that this is a kind of *word count* use case where we have to
*aggregate and maintain state across batches of streams*. One difference
here is that we might have to *maintain the state or data for a day or two*
until the rest of the data comes in, and only then can we complete our
output.
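
To make the use case concrete, here is a minimal sketch of the kind of
thing I have in mind, assuming Spark Streaming's updateStateByKey (the
socket source, host/port, and checkpoint path are just placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))
    // updateStateByKey requires a checkpoint directory
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")

    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map((_, 1))

    // Carry a running count per word across micro-batches
    val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))

    val counts = pairs.updateStateByKey(updateFunc)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}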

1- Is such a use case supported in Spark and/or Spark Streaming?
2- Will we be able to persist partially aggregated data until the rest of
the information comes in later? I am stressing *persistence* here because,
given that the delay can span a day or two, we won't want to keep the
partial data in memory for that long. (A rough sketch of what I mean
follows below.)
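
For illustration, this is roughly the batch-oriented pattern I am
picturing, assuming we keep the running state as an RDD persisted with
MEMORY_AND_DISK and checkpoint it to truncate the lineage (the app name,
checkpoint path, and merge helper are all hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("PartialAggregates")
val sc = new SparkContext(conf)
sc.setCheckpointDir("hdfs:///tmp/agg-checkpoint")  // placeholder path

// Running state: partially aggregated (key, count) pairs, allowed to
// spill to disk instead of being pinned in memory
var partial: RDD[(String, Int)] = sc.emptyRDD[(String, Int)]
  .persist(StorageLevel.MEMORY_AND_DISK)

// Merge each incoming micro-batch into the running state
def merge(next: RDD[(String, Int)]): Unit = {
  val updated = partial.union(next)
    .reduceByKey(_ + _)
    .persist(StorageLevel.MEMORY_AND_DISK)
  updated.checkpoint()  // cut the lineage that grows batch by batch
  updated.count()       // materialize before dropping the old state
  partial.unpersist()
  partial = updated
}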

I know this can be done in Storm, but I am really interested in Spark
because of its close integration with Hadoop. We might not even want to use
Spark Streaming (which is the more direct comparison with Storm/Trident),
given that our application does not have to be real-time down to the split
second.

Feel free to direct me to any document or resource.

Thanks a lot.

Regards,
Shahab
