Hi Danellis,

For point 1, Spark Streaming is something to look at.

For point 2, you can create a DAO for Cassandra in each stream-processing step. This may be a costly operation, but to process the data in real time you have to live with it.

Point 3 is covered in point 2 above.

Since you are starting fresh, I would suggest going with 2.0, as it has many features that previous releases lack, such as Datasets and structured querying of streams.

Thanks
Deepak
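To make point 2 a bit more concrete, here is a minimal sketch in plain Python (no Spark or Cassandra involved) of the "filter against current data" step. Everything in it is an assumption for illustration: `Product`, `ProductDao`, `process_batch` and `stock_counts` are hypothetical names, and a dict stands in for the Cassandra table. It creates one DAO per batch rather than per record, which is one possible way to spread the cost mentioned above; in Spark this would roughly correspond to doing the lookup inside a per-partition operation such as `mapPartitions`.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Product:
    product_id: str
    price: float
    in_stock: bool


class ProductDao:
    """Hypothetical stand-in for a Cassandra-backed DAO; a dict plays the table."""

    def __init__(self, table: Dict[str, Product]) -> None:
        self._table = table

    def get(self, product_id: str) -> Optional[Product]:
        return self._table.get(product_id)

    def save(self, product: Product) -> None:
        self._table[product.product_id] = product


# A logged change: (old value or None for a new product, new value).
Change = Tuple[Optional[Product], Product]


def process_batch(updates: Iterable[Product], table: Dict[str, Product]) -> List[Change]:
    """Keep only updates that differ from the stored state, and persist them."""
    dao = ProductDao(table)  # assumption: one DAO per batch, not per record
    changes: List[Change] = []
    for update in updates:
        current = dao.get(update.product_id)
        if current == update:
            continue  # nothing actually changed, so filter the update out
        dao.save(update)
        changes.append((current, update))
    return changes


def stock_counts(table: Dict[str, Product]) -> Tuple[int, int]:
    """Return (in_stock, out_of_stock) counted over the whole product set."""
    in_stock = sum(1 for p in table.values() if p.in_stock)
    return in_stock, len(table) - in_stock
```

For example, starting from a table that holds only `Product("a", 9.99, True)`, a batch with an identical update for "a", a stock change for "a", and a new product "b" should yield two logged changes, with the unchanged update filtered out.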
On Mon, Aug 8, 2016 at 11:52 AM, danellis <d...@danellis.me> wrote:

> Spark n00b here.
>
> Working with online retailers, I start with a list of their products in
> Cassandra (with prices, stock levels, descriptions, etc) and then receive an
> HTTP request every time one of them changes. For each change, I update the
> product in Cassandra and store the change with the old and new values.
>
> What I'd like to do is provide a dashboard with various metrics. Some of
> them are trivial, such as "last n changes". Others, like number of
> in-stock/out-of-stock products would be more complex to retrieve from
> Cassandra, because they're an aggregate of the whole product set.
>
> I'm thinking about streaming the changes into Spark (via RabbitMQ) to
> generate the data needed for the aggregate metrics, and either storing the
> results in Cassandra or publishing them back to RabbitMQ (depending on
> whether I have the dashboard poll or use a WebSocket).
>
> I have a few questions:
>
> 1) Does this seem like a good use case for Spark?
>
> 2) How much work is it appropriate for a transformation to do? For example,
> my API service currently checks the update against the current data and only
> publishes a change if they differ. That sounds to me like it could be a
> filter operation on a stream of all the updates, but it would require
> accessing data from Cassandra inside the filter transformation. Is that
> okay, or something to be avoided? The changes that make it through the
> filter would also have to be logged in Cassandra. Is that crossing concerns
> too much?
>
> 3) If I'm starting out with existing data, how do I take that into account
> when starting to do stream processing? Would I write something to take my
> logged changes from Cassandra and publish them to RabbitMQ before I start my
> real streaming? Seems like the switch-over might be tricky. (Note: I don't
> necessarily need to do this, depending on how things go.)
>
> 4) Is it a good idea to start with 2.0 now? I see there's an AMQP module
> with 2.0 support and the Cassandra one supports 2.0 with a little work.
>
> Thanks for any feedback.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-right-for-my-use-case-tp27491.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net