Hi all,

First of all, I would like to thank everyone. As you know, I've been accepted to GSoC 2015 with my proposal for developing Spark Backend Support for Gora (GORA-386), and it is now time for midterm evaluations. I want to share the current progress of my project along with my midterm report.
During my GSoC period, I've blogged at my personal website (http://furkankamaci.com/) and created a fork from Apache Gora's master branch to work on: https://github.com/kamaci/gora

During the community bonding period, I read the Apache Gora documentation and source code to become more familiar with the project. I analyzed related projects, including Apache Flink and Apache Crunch, to learn how a Spark backend could be implemented for Apache Gora. I also picked up an issue from Jira (https://issues.apache.org/jira/browse/GORA-262) and fixed it.

During the coding period, since implementing this project requires a solid grounding in Apache Spark, I started by studying Spark's early papers: "Spark: Cluster Computing with Working Sets" (http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf) and "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" (https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf). I published two posts at my personal blog, about Spark and Cluster Computing (http://furkankamaci.com/spark-and-cluster-computing/) and Resilient Distributed Datasets (http://furkankamaci.com/resilient-distributed-datasets-rdds/). I've also followed the Apache Spark documentation and developed examples to analyze RDDs.

Next, I analyzed Apache Gora's GoraInputFormat class and Spark's newAPIHadoopRDD method, and implemented an example application that reads data from HBase. Apache Gora supports reading/writing data from/to Hadoop files, and Spark has a method for generating an RDD compatible with Hadoop files. So, I designed an architecture which creates a bridge between GoraInputFormat and RDDs, since both of them support Hadoop files. I've created a base class for the Apache Gora and Spark integration, named GoraSparkEngine. It has initialize methods that take a Spark context, a data store, and an optional Hadoop configuration, and return an RDD.
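To make the bridge idea concrete without pulling in Spark or Gora dependencies, here is a minimal, dependency-free Python sketch of the concept: an input-format-like object yields (key, value) records from a store, and an RDD-like wrapper exposes map/reduce-style operations over them. All names here (ToyInputFormat, ToyRDD, toy_gora_spark_engine) are hypothetical stand-ins for illustration only, not the actual Gora or Spark APIs.

```python
class ToyInputFormat:
    """Stands in for GoraInputFormat: iterates (key, value) records from a store."""

    def __init__(self, store):
        self.store = store  # a plain dict simulating an HBase-backed data store

    def records(self):
        yield from self.store.items()


class ToyRDD:
    """Stands in for a Spark RDD: a collection with map/reduce-style operations."""

    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        # Apply fn to every element, producing a new ToyRDD.
        return ToyRDD(fn(x) for x in self.data)

    def reduce_by_key(self, fn):
        # Combine values that share a key, like Spark's reduceByKey.
        acc = {}
        for k, v in self.data:
            acc[k] = fn(acc[k], v) if k in acc else v
        return ToyRDD(acc.items())

    def count(self):
        return len(self.data)

    def collect(self):
        return list(self.data)


def toy_gora_spark_engine(input_format):
    """Mirrors the GoraSparkEngine idea: bridge an input format into an RDD."""
    return ToyRDD(input_format.records())


# Usage: per-URL hit counts over fake web-access log records, in the spirit
# of the LogAnalytics example.
store = {
    1: "/index.html",
    2: "/about.html",
    3: "/index.html",
}
rdd = toy_gora_spark_engine(ToyInputFormat(store))
hits = rdd.map(lambda kv: (kv[1], 1)).reduce_by_key(lambda a, b: a + b)
print(rdd.count())        # 3 records in the "table"
print(sorted(hits.collect()))  # [('/about.html', 1), ('/index.html', 2)]
```

In the real integration, Spark's newAPIHadoopRDD plays the role of toy_gora_spark_engine: it returns a distributed RDD with Spark's partitioning and fault tolerance rather than a local list.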
After implementing a base for the GoraSpark engine, I've developed a new example aligned with LogAnalytics, named LogAnalyticsSpark. I've developed its map and reduce parts (except for writing results into the database), which do the same thing as LogAnalytics plus a bit more, e.g. printing the number of lines in tables. Once we get an RDD from the GoraSpark engine, we can operate on it just like any other RDD that was not created via Apache Gora. The whole code can be checked at the code base: https://github.com/kamaci/gora

Project progress is ahead of the proposed timeline so far. The GoraInputFormat-to-RDD transformation is done, and it has been shown that map, reduce, and other methods can work properly on such RDDs. Before the next steps, I am planning to design an overall architecture according to feedback from the community (there are some constraints to keep in mind when designing it, e.g. the configuration of a Spark context cannot be changed after the context has been initialized). Once the necessary functionality is implemented, examples, tests, and documentation will follow. After that, if I have extra time, I'm planning to run a performance benchmark covering Apache Gora with Hadoop MapReduce, plain Hadoop MapReduce, plain Apache Spark, and Apache Gora with Spark.

Special thanks to Lewis and Talat. I should also mention that it is a real privilege to be able to talk with your mentor face to face. Talat and I met many times, and he helped me a lot in understanding how Hadoop and Apache Gora work.

PS: I've attached my midterm report; my previous reports can be found here: https://cwiki.apache.org/confluence/display/GORA/Spark+Backend+Support+for+Gora+%28GORA-386%29+Reports

Kind Regards,
Furkan KAMACI

