Spark Usecase

2014-06-04 Thread Shahab Yunus
Hello All.

I have a newbie question.

We have a use case where a huge amount of data will be arriving in streams, or
micro-batches of streams, and we want to process these streams according to
some business logic. We don't have to provide extremely low-latency
guarantees, but batch M/R would still be too slow.

Now the business logic is such that, at the time of emitting the data, we
might have to hold on to some tuples until we get more information. This
'more' information will essentially arrive in future streams.

You can say that this is a kind of *word count* use case where we have to
*aggregate and maintain state across batches of streams*. One difference here
is that we might have to *maintain the state or data for a day or two* until
the rest of the data comes in, and only then can we complete our output.

1- Is such a use case supported in Spark and/or Spark Streaming?
2- Will we be able to persist partially aggregated data until the rest of
the information arrives later? I mention *persistence* because, given that
the delay can span a day or two, we won't want to keep the partial data in
memory for that long.

I know this can be done in Storm, but I am really interested in Spark
because of its close integration with Hadoop. We might not even need
Spark Streaming (which is the more direct comparison with Storm/Trident),
given that our application does not have to respond in split seconds.

Feel free to direct me to any document or resource.

Thanks a lot.

Regards,
Shahab


Re: Spark Usecase

2014-06-04 Thread Krishna Sankar
Shahab,
   Interesting question. A couple of points (based on the information in
your e-mail):

   1. One can support the use case in Spark as a set of transformations on
   a work-in-progress (WIP) RDD over a span of time, with a final
   transformation outputting to a processed RDD.
      - Spark Streaming would be a good data ingestion mechanism - look at
      the system as a pipeline that spans a time window.
      - Depending on the cardinality, you would need a correlation id to
      transform the pipeline as you get more data (see the sketch after
      this list).
   2. Having said that, you do have to understand what value Spark
   provides, then design the topology to support that.
      - For example, you could potentially keep all the WIP in HBase and
      the final transformations in Spark RDDs.
      - Or maybe you keep all the WIP in Spark and the final processed
      records in HBase. There is nothing wrong with keeping WIP in Spark
      if response time for processing the incoming data is important.
   3. Naturally, start with a set of ideas, make a few assumptions, and do
   an e2e POC. That will clear up many of the questions and firm up the
   design.
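
To make point 1 concrete, here is a minimal Spark Streaming sketch (Scala) of
that kind of pipeline: tuples are keyed by a correlation id and merged into
per-key state with updateStateByKey, with checkpointing enabled so the partial
state is not purely in memory. The PartialRecord class, the socket source, the
completeness rule, and all paths are illustrative assumptions, not part of
your actual application.

```scala
// Minimal sketch only: assumes events arrive as "correlationId,payload" text
// lines and that a hypothetical PartialRecord accumulates tuples until a
// record is considered complete.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

case class PartialRecord(pieces: Seq[String], complete: Boolean)

object WipAggregationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wip-aggregation-sketch")
    // e.g. conf.setMaster("local[2]") for a quick local test
    val ssc = new StreamingContext(conf, Seconds(60))
    // Checkpointing is required for stateful DStream operations and keeps
    // the running state recoverable rather than memory-only.
    ssc.checkpoint("hdfs:///tmp/wip-checkpoints") // path is illustrative

    // Hypothetical source; in practice this could be Kafka, Flume, etc.
    val events = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(corrId, payload) = line.split(",", 2)
      (corrId, payload)
    }

    // Merge each batch's tuples into the running state for that correlation id.
    val wip = events.updateStateByKey[PartialRecord] {
      (newPayloads: Seq[String], state: Option[PartialRecord]) =>
        val merged = state.map(_.pieces).getOrElse(Seq.empty) ++ newPayloads
        // The completeness test is application-specific; placeholder rule here.
        Some(PartialRecord(merged, complete = merged.size >= 3))
    }

    // Completed records can be emitted (e.g. to HDFS or HBase); incomplete
    // ones simply remain in the state for later batches.
    wip.filter { case (_, rec) => rec.complete }
      .foreachRDD { (rdd, time) =>
        rdd.saveAsTextFile(s"hdfs:///tmp/completed-${time.milliseconds}")
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

For the day-or-two retention in point 2, the same foreachRDD hook could write
the incomplete state out to HBase on each batch instead of (or in addition to)
relying on Spark's checkpointed state.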

HTH.
Cheers
k/

