My $0.02: If you are simply reading input records, running a model,
and outputting the result, then it's a simple "map-only" problem and
you're mostly looking for a process to baby-sit these operations. Lots
of things work -- Spark, M/R (+ Crunch), Hadoop Streaming, etc. I'd
choose whatever is simplest to integrate with the RDBMS and analytical
model; all could work.
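
To make that concrete, here's a rough Scala sketch of the map-only
shape with Spark: fetch the universe of IDs on the driver, then let
each partition open one JDBC connection, pull its inputs, run the
model, and write back. The connection string, table/column names, and
the scoreModel stub are all hypothetical placeholders for your schema
and model; treat it as a sketch, not tested code.

    import java.sql.DriverManager
    import org.apache.spark.{SparkConf, SparkContext}

    object ModelBatch {
      // Stand-in for the real analytical model.
      def scoreModel(input: Double): Double = input * 2.0

      def main(args: Array[String]): Unit = {
        val sc  = new SparkContext(new SparkConf().setAppName("model-batch"))
        val url = "jdbc:postgresql://dbhost/mydb"  // hypothetical

        // Fetch the universe of IDs once, on the driver.
        val ids: Seq[Long] = {
          val conn = DriverManager.getConnection(url)
          try {
            val rs  = conn.createStatement().executeQuery("SELECT id FROM identifiers")
            val buf = scala.collection.mutable.ArrayBuffer[Long]()
            while (rs.next()) buf += rs.getLong(1)
            buf
          } finally conn.close()
        }

        // Map-only: one connection per partition, not per record.
        sc.parallelize(ids, 64).foreachPartition { part =>
          val conn = DriverManager.getConnection(url)
          val in   = conn.prepareStatement("SELECT feature FROM inputs WHERE id = ?")
          val out  = conn.prepareStatement("INSERT INTO results (id, score) VALUES (?, ?)")
          try {
            part.foreach { id =>
              in.setLong(1, id)
              val rs = in.executeQuery()
              if (rs.next()) {
                out.setLong(1, id)
                out.setDouble(2, scoreModel(rs.getDouble(1)))
                out.executeUpdate()
              }
            }
          } finally conn.close()
        }
        sc.stop()
      }
    }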

Keep in mind that the failure recovery processes in these various
frameworks don't necessarily interact cleanly with your external
systems. For example, if a Spark worker dies while doing work on some
of your IDs, it will happily be restarted, but if your job inserts
results into another table, you may find the same results inserted
twice.
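
One common way around that is to make the write itself idempotent, so
a retried task replaces its earlier partial output instead of
appending to it. A sketch, assuming results are keyed by id, using a
delete-then-insert in one transaction (an upsert would do the same
job if your database has one); table and column names are the same
hypothetical ones as above:

    import java.sql.DriverManager

    def writeResultsIdempotently(url: String, results: Iterator[(Long, Double)]) {
      val conn = DriverManager.getConnection(url)
      conn.setAutoCommit(false)
      val del = conn.prepareStatement("DELETE FROM results WHERE id = ?")
      val ins = conn.prepareStatement("INSERT INTO results (id, score) VALUES (?, ?)")
      try {
        results.foreach { case (id, score) =>
          del.setLong(1, id); del.executeUpdate()  // wipe any earlier attempt
          ins.setLong(1, id); ins.setDouble(2, score); ins.executeUpdate()
        }
        conn.commit()  // all-or-nothing per partition
      } catch {
        case e: Exception => conn.rollback(); throw e
      } finally conn.close()
    }

    // usage: resultsRDD.foreachPartition(part => writeResultsIdempotently(url, part))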

On Tue, May 20, 2014 at 6:26 PM, pcutil <puneet.ma...@gmail.com> wrote:
> Hi -
>
> We have a use case for batch processing for which we are trying to figure
> out if Apache Spark would be a good fit or not.
>
> We have a universe of identifiers sitting in an RDBMS, for which we need to
> fetch input data from the RDBMS and then pass that input to analytical models
> that generate some output numbers and store them back to the database. This
> is one unit of work for us.
>
> So basically we are looking for a way to do this processing in parallel
> across the universe of identifiers that we have. All the data is in the
> RDBMS and is not sitting in a file system.
>
> Can we use Spark for this kind of work, and would it be a good fit?
>
> Thanks for your help.
>
