Hey Dev,
I have started my GSoC project here at Indiana University. I have chosen to investigate Spark rather than Storm or Flink for our distributed model, because Storm and Flink are generally better suited to live event streaming. We are analyzing the batch processing case first and may consider live streaming afterwards. Spark is the best fit for this because it supports batch processing through its core engine and live processing through the Spark Streaming library.

Over the past 4 days I configured the Spark standalone cluster manager to work with worker-node virtual machines on AWS EC2. Because AWS is a paid service, we have decided to switch to the JetStream/OpenStack API. For now, I am using Spark Standalone as the cluster manager between the core engine and the workers. In addition, I am investigating Mesos and YARN (via Hadoop) as possible future cluster managers for Airavata.

Any suggestions would be welcome.

Apoorv Palkar