My $0.02: If you are simply reading input records, running a model, and outputting the result, then it's a simple "map-only" problem, and you're mostly looking for a framework to babysit those operations. Lots of things work -- Spark, M/R (+ Crunch), Hadoop Streaming, etc. I'd choose whatever is simplest to integrate with the RDBMS and your analytical model; any of them could work.
Keep in mind that the failure-recovery processes in these frameworks don't necessarily interact cleanly with your external systems. For example, if a Spark worker dies while doing work on some of your IDs, it will happily be restarted and redo the work, but if your job inserts results into another table, you may find it has inserted them twice.

On Tue, May 20, 2014 at 6:26 PM, pcutil <puneet.ma...@gmail.com> wrote:
> Hi -
>
> We have a use case for batch processing for which we are trying to figure
> out if Apache Spark would be a good fit or not.
>
> We have a universe of identifiers sitting in RDBMS for which we need to go
> get input data from RDBMS and then pass that input to analytical models that
> generate some output numbers and store it back to the database. This is one
> unit of work for us.
>
> So basically we are looking where we can do this processing in parallel for
> the universe of identifiers that we have. All the data is in RDBMS and is
> not sitting in file system.
>
> Can we use spark for this kind of work and would it be a good fit for that?
>
> Thanks for your help.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Evaluating-Spark-just-for-Cluster-Computing-tp6110.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
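P.S. One common mitigation for the duplicate-insert risk above is to make the write idempotent: key the output table on the identifier and upsert, so a re-executed task overwrites its earlier rows instead of adding new ones. A minimal sketch, using SQLite's `INSERT OR REPLACE` as a stand-in (most RDBMSs have an equivalent, e.g. `INSERT ... ON CONFLICT` or `MERGE`; table and column names are illustrative):

```python
import sqlite3

def write_results_idempotently(conn, results):
    """Write (id, score) pairs keyed by identifier.

    If a failed task is restarted and writes the same batch again,
    the existing rows are replaced rather than duplicated, because
    `id` is the primary key of the outputs table.
    """
    conn.executemany(
        "INSERT OR REPLACE INTO outputs (id, score) VALUES (?, ?)",
        results,
    )
    conn.commit()
```

Writing the same batch twice then leaves exactly one row per identifier, which is what makes the task safe to retry.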