Spark n00b here. Working with online retailers, I start with a list of their products in Cassandra (with prices, stock levels, descriptions, etc.) and then receive an HTTP request every time one of them changes. For each change, I update the product in Cassandra and store the change along with the old and new values.
What I'd like to do is provide a dashboard with various metrics. Some of them are trivial, such as "last n changes". Others, like the number of in-stock/out-of-stock products, would be more complex to retrieve from Cassandra because they're an aggregate over the whole product set.

I'm thinking about streaming the changes into Spark (via RabbitMQ) to generate the data needed for the aggregate metrics, and either storing the results in Cassandra or publishing them back to RabbitMQ (depending on whether I have the dashboard poll or use a WebSocket).

I have a few questions:

1) Does this seem like a good use case for Spark?

2) How much work is it appropriate for a transformation to do? For example, my API service currently checks the update against the current data and only publishes a change if they differ. That sounds to me like it could be a filter operation on a stream of all the updates, but it would require accessing data from Cassandra inside the filter transformation. Is that okay, or something to be avoided? The changes that make it through the filter would also have to be logged in Cassandra. Is that crossing concerns too much?

3) If I'm starting out with existing data, how do I take that into account when starting to do stream processing? Would I write something to take my logged changes from Cassandra and publish them to RabbitMQ before I start my real streaming? Seems like the switch-over might be tricky. (Note: I don't necessarily need to do this, depending on how things go.)

4) Is it a good idea to start with Spark 2.0 now? I see there's an AMQP module with 2.0 support, and the Cassandra one supports 2.0 with a little work.

Thanks for any feedback.
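P.S. To make question 2 more concrete, here is a rough sketch of the logic I have in mind, written as plain Python rather than against any Spark API. Everything in it is hypothetical: the `products` dict stands in for the Cassandra table, the `stock` field name is an assumption, and `is_real_change` / `apply_updates` are made-up helper names. It only illustrates dropping no-op updates, logging old and new values, and keeping running in-stock/out-of-stock counts.

```python
from collections import Counter


def is_real_change(current, update):
    """Return True if the update changes at least one field it sets."""
    return any(current.get(key) != value for key, value in update.items())


def apply_updates(products, updates):
    """Apply updates in order, skipping no-ops.

    products: dict of product id -> field dict (stand-in for the Cassandra table)
    updates:  iterable of (product_id, changed_fields) pairs
    Returns (change_log, counts), where counts tracks in/out-of-stock totals.
    """
    # Initial aggregate, computed once from the existing product set.
    counts = Counter(
        "in_stock" if p["stock"] > 0 else "out_of_stock"
        for p in products.values()
    )
    change_log = []
    for product_id, update in updates:
        current = products[product_id]
        if not is_real_change(current, update):
            continue  # the "filter" step: nothing actually changed
        old_values = {key: current.get(key) for key in update}
        was_in_stock = current["stock"] > 0
        current.update(update)
        now_in_stock = current["stock"] > 0
        # Adjust the aggregate only when the stock status flips.
        if was_in_stock != now_in_stock:
            counts["in_stock"] += 1 if now_in_stock else -1
            counts["out_of_stock"] += -1 if now_in_stock else 1
        change_log.append({"id": product_id, "old": old_values, "new": dict(update)})
    return change_log, counts
```

In the streaming version, the lookup of `current` is the part that would need to read from Cassandra inside the transformation, which is what I'm unsure about.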