Hi all,

First of all, I would like to thank everyone. As you know, I've been accepted to GSoC 2015 with my proposal for developing Spark Backend Support for Gora (GORA-386), and it is now time for midterm evaluations. I want to share the current progress of my project along with my midterm report.
During my GSoC period, I've blogged at my personal website (http://furkankamaci.com/) and created a fork from Apache Gora's master branch to work on: https://github.com/kamaci/gora

During the community bonding period, I read the Apache Gora documentation and source code to become more familiar with the project. I analyzed related projects, including Apache Flink and Apache Crunch, to learn how a Spark backend could be implemented for Apache Gora. I also picked up an issue from Jira (https://issues.apache.org/jira/browse/GORA-262) and fixed it.

During the coding period, since implementing this project requires a solid grounding in Apache Spark, I started by studying Spark's early papers: "Spark: Cluster Computing with Working Sets" (http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf) and "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" (https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf). I published two posts at my personal blog, about Spark and Cluster Computing (http://furkankamaci.com/spark-and-cluster-computing/) and Resilient Distributed Datasets (http://furkankamaci.com/resilient-distributed-datasets-rdds/). I've also followed the Apache Spark documentation and developed examples to analyze RDDs.

Next, I analyzed Apache Gora's GoraInputFormat class and Spark's newAPIHadoopRDD method, and implemented an example application that reads data from HBase. Apache Gora supports reading/writing data from/to Hadoop files, and Spark has a method for generating an RDD compatible with Hadoop files. So, I designed an architecture which creates a bridge between GoraInputFormat and RDDs, since both of them support Hadoop files. I've created a base class for the Apache Gora and Spark integration, named GoraSparkEngine. It has initialize methods that take a Spark context, a data store, and an optional Hadoop configuration, and return an RDD.
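To make the bridge idea concrete without pulling in Spark or Gora dependencies, here is a minimal, dependency-free Python sketch of the concept: an input-format-like object yields (key, value) records from a store, and an RDD-like wrapper exposes map/reduce-style operations over them. All names here (ToyInputFormat, ToyRDD, toy_gora_spark_engine) are hypothetical stand-ins for illustration only, not the actual Gora or Spark APIs.

```python
class ToyInputFormat:
    """Stands in for GoraInputFormat: iterates (key, value) records from a store."""

    def __init__(self, store):
        self.store = store  # a plain dict simulating an HBase-backed data store

    def records(self):
        yield from self.store.items()


class ToyRDD:
    """Stands in for a Spark RDD: a collection with map/reduce-style operations."""

    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        # Apply fn to every element, producing a new ToyRDD.
        return ToyRDD(fn(x) for x in self.data)

    def reduce_by_key(self, fn):
        # Combine values that share a key, like Spark's reduceByKey.
        acc = {}
        for k, v in self.data:
            acc[k] = fn(acc[k], v) if k in acc else v
        return ToyRDD(acc.items())

    def count(self):
        return len(self.data)

    def collect(self):
        return list(self.data)


def toy_gora_spark_engine(input_format):
    """Mirrors the GoraSparkEngine idea: bridge an input format into an RDD."""
    return ToyRDD(input_format.records())


# Usage: per-URL hit counts over fake web-access log records, in the spirit
# of the LogAnalytics example.
store = {
    1: "/index.html",
    2: "/about.html",
    3: "/index.html",
}
rdd = toy_gora_spark_engine(ToyInputFormat(store))
hits = rdd.map(lambda kv: (kv[1], 1)).reduce_by_key(lambda a, b: a + b)
print(rdd.count())        # 3 records in the "table"
print(sorted(hits.collect()))  # [('/about.html', 1), ('/index.html', 2)]
```

In the real integration, Spark's newAPIHadoopRDD plays the role of toy_gora_spark_engine: it returns a distributed RDD with Spark's partitioning and fault tolerance rather than a local list.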
After implementing a base for the GoraSpark engine, I've developed a new example aligned with LogAnalytics, named LogAnalyticsSpark. I've developed its map and reduce parts (except for writing results into the database), which do the same thing as LogAnalytics plus a bit more, e.g. printing the number of lines in tables. Once we get an RDD from the GoraSpark engine, we can operate on it just like any other RDD that was not created via Apache Gora. The whole code can be checked at the code base: https://github.com/kamaci/gora

Project progress is ahead of the proposed timeline so far. The GoraInputFormat-to-RDD transformation is done, and it has been shown that map, reduce, and other methods can work properly on such RDDs. Before the next steps, I am planning to design an overall architecture according to feedback from the community (there are some constraints to keep in mind when designing it, e.g. the configuration of a Spark context cannot be changed after the context has been initialized). Once the necessary functionality is implemented, examples, tests, and documentation will follow. After that, if I have extra time, I'm planning to run a performance benchmark covering Apache Gora with Hadoop MapReduce, plain Hadoop MapReduce, plain Apache Spark, and Apache Gora with Spark.

Special thanks to Lewis and Talat. I should also mention that it is a real privilege to be able to talk with your mentor face to face. Talat and I met many times, and he helped me a lot in understanding how Hadoop and Apache Gora work.

PS: I've attached my midterm report; my previous reports can be found here: https://cwiki.apache.org/confluence/display/GORA/Spark+Backend+Support+for+Gora+%28GORA-386%29+Reports

Kind Regards,
Furkan KAMACI

