[GSoC Plan of Attack] Choosing Apache Spark

Apoorv Palkar Wed, 17 May 2017 18:14:00 -0700

Hey Dev,


I have started my GSoC project here at Indiana University. I have chosen to 
investigate Spark rather than Storm or Flink for our distributed model, because 
Storm and Flink are generally better suited to live event streaming. We are 
analyzing the batch processing case first and may consider live streaming 
later. Spark fits this plan because it supports batch processing through its 
core engine and live processing through the Spark Streaming library. Over the 
past 4 days I configured the Spark standalone cluster manager to work with 
worker-node virtual machines on AWS EC2. Because AWS is a paid service, we have 
decided to switch to the JetStream/OpenStack API. For now, I am using Spark 
Standalone as the cluster manager between the core engine and the workers. In 
addition, I am investigating Mesos, or YARN via Hadoop, as possible cluster 
managers for Airavata in the future.
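
To illustrate the cluster manager choice, here is a minimal sketch of how the 
value passed to spark-submit's --master option differs between Spark 
Standalone, Mesos and YARN. The helper functions, host names and the job file 
name are hypothetical placeholders and not part of Airavata or Spark; only the 
URL schemes and default ports follow the Spark documentation.

```python
# Hypothetical helpers: map a cluster manager name to the master URL that
# spark-submit expects. Host names and file names below are placeholders.
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}


def master_url(manager, host=None, port=None):
    """Return the value for spark-submit's --master option."""
    manager = manager.lower()
    if manager == "yarn":
        # YARN finds the ResourceManager through the Hadoop configuration
        # (HADOOP_CONF_DIR / YARN_CONF_DIR), so no host is given here.
        return "yarn"
    if manager not in DEFAULT_PORTS:
        raise ValueError(f"unknown cluster manager: {manager}")
    if host is None:
        raise ValueError(f"{manager} needs a master host")
    scheme = "spark" if manager == "standalone" else "mesos"
    return f"{scheme}://{host}:{port or DEFAULT_PORTS[manager]}"


def submit_command(manager, app, host=None, port=None):
    """Build a spark-submit argument list for the given cluster manager."""
    return ["spark-submit", "--master", master_url(manager, host, port), app]


if __name__ == "__main__":
    for mgr, host in [("standalone", "master.example.org"),
                      ("mesos", "mesos.example.org"),
                      ("yarn", None)]:
        print(" ".join(submit_command(mgr, "batch_job.py", host)))
```

Under these assumptions the application itself stays the same across cluster 
managers; only the master URL changes, which is what would make a later move 
from Standalone to Mesos or YARN mostly a deployment change.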


Any suggestions would be welcome.


Apoorv Palkar
