Shahab,
   Interesting question. A couple of points (based on the information in
your e-mail):

   1. One can support the use case in Spark as a set of transformations on
   a WIP RDD over a span of time, with the final transformation outputting to a
   processed RDD (a rough sketch follows below).
      - Spark Streaming would be a good data ingestion mechanism - look at
      the system as a pipeline that spans a time window.
      - Depending on the cardinality, you would need a correlation id to
      transform the pipeline as more data arrives.
   2. Having said that, you do have to understand what value Spark
   provides, & then design the topology to support that.
      - For example, you could potentially keep all the WIP in HBase & the
      final transformations in Spark RDDs.
      - Or maybe you keep all the WIP in Spark and the final processed
      records in HBase. There is nothing wrong with keeping WIP in Spark if
      the response time to process the incoming data set is important.
   3. Naturally, start with a set of ideas, make a few assumptions and do an
   e2e POC. That will clear up many of the questions and firm up the design.
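
To make point 1 concrete, here is a rough, untested sketch of what the
pipeline could look like with Spark Streaming's updateStateByKey (Spark 1.x
API): partial records are kept as state keyed by a correlation id, and the
state is checkpointed so it is not held only in memory. The socket source,
the "correlationId,value" line format, the completeness rule and the paths
are all placeholder assumptions on my part, not from your use case.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // pair DStream ops (Spark 1.x)

object WipPipelineSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wip-pipeline")
    val ssc  = new StreamingContext(conf, Seconds(60))
    // Checkpointing is required for updateStateByKey and persists the
    // accumulated WIP state, so it is not kept only in executor memory.
    ssc.checkpoint("hdfs:///checkpoints/wip-pipeline")  // placeholder path

    // Placeholder source: lines of "correlationId,value"; Kafka/Flume would
    // be the more realistic ingestion mechanism.
    val lines = ssc.socketTextStream("localhost", 9999)
    val events = lines.flatMap { line =>
      line.split(",", 2) match {
        case Array(id, value) => Seq((id, value))
        case _                => Seq.empty[(String, String)]  // ignore malformed lines
      }
    }

    // Accumulate values per correlation id across micro-batches (the WIP).
    val wip = events.updateStateByKey[Seq[String]] {
      (newValues: Seq[String], state: Option[Seq[String]]) =>
        Some(state.getOrElse(Seq.empty) ++ newValues)
    }

    // Emit only the records that have all their pieces; the rest stay as WIP.
    // "3 pieces" is a stand-in for whatever the business rule actually is.
    val completed = wip.filter { case (_, values) => values.size >= 3 }
    completed.saveAsTextFiles("hdfs:///out/processed")  // or write to HBase here

    ssc.start()
    ssc.awaitTermination()
  }
}

The same update function is also the place to drop a key (return None) once
it completes, or once it has been waiting longer than your day-or-two window.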

HTH.
Cheers
<k/>


On Wed, Jun 4, 2014 at 6:57 AM, Shahab Yunus <shahab.yu...@gmail.com> wrote:

> Hello All.
>
> I have a newbie question.
>
> We have a use case where a huge amount of data will be coming in as streams
> or micro-batches of streams, and we want to process these streams according
> to some business logic. We don't have to provide extremely low-latency
> guarantees, but batch M/R will still be too slow.
>
> Now the business logic is such that at the time of emitting the data, we
> might have to hold on to some tuples until we get more information. This
> 'more' information will essentially be arriving in future streams.
>
> You can say that this is a kind of *word count* use case where we have to
> *aggregate and maintain state across batches of streams*. One thing different
> here is that we might have to *maintain the state or data for a day or two*
> until the rest of the data comes in, and then we can complete our output.
>
> 1- Are such use cases supported in Spark and/or Spark Streaming?
> 2- Will we be able to persist partially aggregated data until the rest of
> the information comes in later? I mention *persistence* here because, given
> that the delay can span a day or two, we won't want to keep the partial data
> in memory for that long.
>
> I know this can be done in Storm, but I am really interested in Spark
> because of its close integration with Hadoop. We might not even want to use
> Spark Streaming (which is more of a direct comparison with Storm/Trident),
> given that our application does not have to be real-time to the split second.
>
> Feel free to direct me to any document or resource.
>
> Thanks a lot.
>
> Regards,
> Shahab
>
