Hi Lewis, I am talking in the context of GORA-418 and GORA-386, so we can say GSoC. I've talked with Talat about the design of that implementation. I just wanted to check whether any other projects have such a feature.
Here is what I have in mind for Apache Gora's Spark support: developing a layer which abstracts the functionality of Spark, Tez, etc. (GORA-418). There will be an implementation for each of them (and Spark will be one of them: GORA-386), i.e. you will write a word count example in Gora style, pick one of the implementations, and run it (just like storing data in Solr or Mongo via Gora).

When I checked Crunch I realized that: "*Every Crunch job begins with a Pipeline instance that manages the execution lifecycle of your data pipeline. As of the 0.9.0 release, there are three implementations of the Pipeline interface:*
*MRPipeline: Executes a pipeline as a series of MapReduce jobs that can run locally or on a Hadoop cluster.*
*MemPipeline: Executes a pipeline in-memory on the client.*
*SparkPipeline: Executes a pipeline by running a series of Apache Spark jobs, either locally or on a Hadoop cluster.*"

So, I am curious whether supporting Crunch may help us achieve what we want with Spark support in Gora. Actually, I am new to such projects; I want to learn what should be achieved with GORA-386 and not get lost in overthinking :) I see that you can use Gora for storing your data Gora-style and running jobs Gora-style, while keeping the flexibility to use either HDFS, Solr, MongoDB, etc. for storage, or MapReduce, Spark, Tez, etc. for execution.

PS: I know there is a similar issue in Apache Gora for Cascading support: https://issues.apache.org/jira/browse/GORA-112

Kind Regards,
Furkan KAMACI

On Sat, Mar 21, 2015 at 8:14 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

> Hi Furkan,
> In what context are we talking here?
> GSoC or just development?
> I am very keen to essentially work towards what we can release as Gora 1.0.
> Thank you Furkan
>
>
> On Saturday, March 21, 2015, Furkan KAMACI <furkankam...@gmail.com> wrote:
>
>> As you know, there is an issue for integrating Apache Spark and
>> Apache Gora [1]. Apache Spark is a popular project and, in contrast to
>> Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory
>> primitives provide performance up to 100 times faster for certain
>> applications [2]. There are also some alternatives to Apache Spark, e.g.
>> Apache Tez [3].
>>
>> When implementing an integration for Spark, an abstraction over such
>> projects should be considered as part of the architectural design, and
>> there is a related issue for it: [4].
>>
>> There is another Apache project, named Apache Crunch [5], which aims to
>> provide a framework for writing, testing, and running MapReduce pipelines.
>> Its goal is to make pipelines that are composed of many user-defined
>> functions simple to write, easy to test, and efficient to run. It is a
>> high-level tool for writing data pipelines, as opposed to developing
>> against the MapReduce, Spark, or Tez APIs directly [6].
>>
>> I would like to learn how Apache Crunch fits with creating a
>> multi-execution-engine design for Gora [4]. What kind of benefits can we
>> get by integrating Apache Gora and Apache Crunch, and what kind of gaps
>> might we still have compared to developing a custom engine for our
>> purpose?
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>> [1] https://issues.apache.org/jira/browse/GORA-386
>> [2] Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael;
>> Shenker, Scott; Stoica, Ion (June 2013).
>> [3] http://tez.apache.org/
>> [4] https://issues.apache.org/jira/browse/GORA-418
>> [5] https://crunch.apache.org/
>> [6] https://crunch.apache.org/user-guide.html#motivation
>
>
> --
> *Lewis*
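To make the GORA-418 idea concrete: the "abstraction layer with one implementation per engine" could look roughly like Crunch's Pipeline interface. Below is a minimal Java sketch under that assumption; all names here (GoraEngine, MemEngine, EngineSketch, wordCount) are hypothetical and are not part of the Gora or Crunch APIs. A real SparkEngine or TezEngine would implement the same interface, so the user's word count job would not change when the engine does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical engine-agnostic contract a Gora job would be written
 * against (analogous to Crunch's Pipeline interface).
 */
interface GoraEngine {
    Map<String, Integer> wordCount(List<String> lines);
}

/**
 * In-memory implementation, analogous to Crunch's MemPipeline.
 * A SparkEngine (GORA-386) or TezEngine would be siblings of this class.
 */
class MemEngine implements GoraEngine {
    @Override
    public Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                // Increment the count for this word, starting at 1.
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}

public class EngineSketch {
    public static void main(String[] args) {
        // The job code only sees GoraEngine; swapping in a SparkEngine
        // here would not change the word count logic below.
        GoraEngine engine = new MemEngine();
        Map<String, Integer> counts =
            engine.wordCount(Arrays.asList("gora spark", "gora tez"));
        System.out.println(counts.get("gora")); // prints 2
    }
}
```

The design question for GORA-418 is then whether Gora defines such an interface itself, or delegates to Crunch's existing Pipeline implementations and gains MRPipeline/MemPipeline/SparkPipeline support for free.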