W00t

On Fri, Mar 27, 2015 at 10:20 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
> Hi Henry,
>
> I've submitted a proposal for Spark backend support. The most important
> point is that the Gora input format will have an RDD transformation
> ability, so one will be able to use either Hadoop MapReduce or Spark.
>
> On Tue, Mar 24, 2015 at 11:43 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
>
>> Hi Furkan,
>>
>> Yes, you are right. In the code execution for Spark or Flink, Gora
>> should be part of the data ingest and storing.
>>
>> So, is the idea to make the data store in Spark access Gora instead
>> of the default store options?
>>
>> - Henry
>>
>> On Mon, Mar 23, 2015 at 11:34 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
>>
>>> Hi Henry,
>>>
>>> So, as far as I understand, instead of wrapping Apache Spark within
>>> Gora with full functionality, I have to wrap its functionality of
>>> storing and accessing data. I mean, one will use the Gora input/output
>>> format, and at the backend it will be mapped to an RDD and will be able
>>> to run MapReduce via Apache Spark etc. over Gora.
>>>
>>> Kind Regards,
>>> Furkan KAMACI
>>>
>>> On Mon, Mar 23, 2015 at 8:21 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
>>>
>>>> Integration with Gora will mostly be in the data ingest portion of
>>>> the flow.
>>>>
>>>> Distributed processing frameworks like Spark or Flink already support
>>>> the Hadoop input format as a data source, so Gora should be usable
>>>> directly with the Gora input format.
>>>>
>>>> The interesting portion is probably tighter integration, such as a
>>>> custom RDD or a custom Data Manager to store and get data from Gora
>>>> directly.
>>>>
>>>> - Henry
>>>>
>>>> On Sat, Mar 21, 2015 at 1:42 PM, Lewis John Mcgibbney
>>>> <lewis.mcgibb...@gmail.com> wrote:
>>>>
>>>>> Henry mentored Crunch through incubation... Maybe he can tell you
>>>>> more context.
>>>>> For me, Gora is essentially an extremely easy storage abstraction
>>>>> framework.
>>>>> I do not currently use the Query API, meaning that the analysis of
>>>>> data is delegated to the Gora data store.
>>>>> This is my current usage of the code base.
>>>>>
>>>>> On Saturday, March 21, 2015, Furkan KAMACI <furkankam...@gmail.com> wrote:
>>>>>
>>>>>> Hi Lewis,
>>>>>>
>>>>>> I am talking in the context of GORA-418 and GORA-386, we can say
>>>>>> GSoC. I've talked with Talat about the design of that
>>>>>> implementation. I just wanted to check whether any other projects
>>>>>> have such a feature.
>>>>>>
>>>>>> Here is what is in my mind for Apache Gora's Spark support:
>>>>>> developing a layer which abstracts the functionality of Spark, Tez,
>>>>>> etc. (GORA-418). There will be implementations for each of them
>>>>>> (and Spark will be one of them: GORA-386).
>>>>>>
>>>>>> I.e., you will write a word count example in Gora style, then pick
>>>>>> one of the implementations and run it (just as you store data at
>>>>>> Solr or Mongo via Gora).
>>>>>>
>>>>>> When I checked Crunch, I realized that:
>>>>>>
>>>>>> "Every Crunch job begins with a Pipeline instance that manages the
>>>>>> execution lifecycle of your data pipeline. As of the 0.9.0 release,
>>>>>> there are three implementations of the Pipeline interface:
>>>>>>
>>>>>> MRPipeline: Executes a pipeline as a series of MapReduce jobs that
>>>>>> can run locally or on a Hadoop cluster.
>>>>>> MemPipeline: Executes a pipeline in-memory on the client.
>>>>>> SparkPipeline: Executes a pipeline by running a series of Apache
>>>>>> Spark jobs, either locally or on a Hadoop cluster."
>>>>>>
>>>>>> So, I am curious whether supporting Crunch may give us what we want
>>>>>> with Spark support in Gora.
>>>>>> Actually, I am new to such projects; I want to learn what should
>>>>>> be achieved with GORA-386 and not get lost because of
>>>>>> overthinking :) I see that you can use Gora for storing your data
>>>>>> Gora-style and running jobs Gora-style, but with the flexibility of
>>>>>> using either HDFS, Solr, MongoDB, etc., or MapReduce, Spark, Tez,
>>>>>> etc.
>>>>>>
>>>>>> PS: I know there is a similar issue at Apache Gora for Cascading
>>>>>> support: https://issues.apache.org/jira/browse/GORA-112
>>>>>>
>>>>>> Kind Regards,
>>>>>> Furkan KAMACI
>>>>>>
>>>>>> On Sat, Mar 21, 2015 at 8:14 PM, Lewis John Mcgibbney
>>>>>> <lewis.mcgibb...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Furkan,
>>>>>>> In what context are we talking here? GSoC or just development?
>>>>>>> I am very keen to essentially work towards what we can release as
>>>>>>> Gora 1.0.
>>>>>>> Thank you Furkan
>>>>>>>
>>>>>>> On Saturday, March 21, 2015, Furkan KAMACI <furkankam...@gmail.com> wrote:
>>>>>>>
>>>>>>>> As you know, there is an issue for integrating Apache Spark and
>>>>>>>> Apache Gora [1]. Apache Spark is a popular project and, in
>>>>>>>> contrast to Hadoop's two-stage disk-based MapReduce paradigm,
>>>>>>>> Spark's in-memory primitives provide performance up to 100 times
>>>>>>>> faster for certain applications [2]. There are also some
>>>>>>>> alternatives to Apache Spark, e.g. Apache Tez [3].
>>>>>>>>
>>>>>>>> When implementing an integration for Spark, having an abstraction
>>>>>>>> over such projects should be considered as part of the
>>>>>>>> architectural design, and there is a related issue for it: [4].
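[Editor's note: the abstraction layer Furkan describes (GORA-418) can be sketched in a few lines: the word-count job is written once against an engine interface, and execution backends are swapped without touching the job. All class and method names below are hypothetical illustrations, not actual Gora APIs.]

```python
# Sketch of the GORA-418 idea: write a job once, run it on any engine.
# All names here are hypothetical, not part of the real Gora code base.
from abc import ABC, abstractmethod
from collections import Counter
from typing import Dict, Iterable


class ExecutionEngine(ABC):
    """Abstract engine; Spark, Tez, MapReduce... would each implement this."""

    @abstractmethod
    def word_count(self, records: Iterable[str]) -> Dict[str, int]:
        ...


class InMemoryEngine(ExecutionEngine):
    """Local stand-in for a real backend (cf. Crunch's MemPipeline)."""

    def word_count(self, records: Iterable[str]) -> Dict[str, int]:
        counts: Counter = Counter()
        for line in records:
            counts.update(line.split())  # count every whitespace-separated word
        return dict(counts)


def run_word_count(engine: ExecutionEngine, records: Iterable[str]) -> Dict[str, int]:
    # User code depends only on the abstraction; swapping in a
    # hypothetical SparkEngine backed by RDDs would need no job changes.
    return engine.word_count(records)
```

A user would call, e.g., `run_word_count(InMemoryEngine(), lines)` and later switch the engine argument to a Spark-backed implementation, which is exactly the "Gora-style job, pluggable runtime" flexibility described above.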
>>>>>>>> There is another Apache project, Apache Crunch [5], which aims to
>>>>>>>> provide a framework for writing, testing, and running MapReduce
>>>>>>>> pipelines. Its goal is to make pipelines that are composed of
>>>>>>>> many user-defined functions simple to write, easy to test, and
>>>>>>>> efficient to run. It is a high-level tool for writing data
>>>>>>>> pipelines, as opposed to developing against the MapReduce, Spark,
>>>>>>>> or Tez APIs directly [6].
>>>>>>>>
>>>>>>>> I would like to learn how Apache Crunch fits with creating a
>>>>>>>> multi-execution engine for Gora [4]. What kind of benefits can we
>>>>>>>> get by integrating Apache Gora and Apache Crunch, and what kind
>>>>>>>> of gaps might we still have compared with developing a custom
>>>>>>>> engine for our purpose?
>>>>>>>>
>>>>>>>> Kind Regards,
>>>>>>>> Furkan KAMACI
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/GORA-386
>>>>>>>> [2] Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael;
>>>>>>>> Shenker, Scott; Stoica, Ion (June 2013).
>>>>>>>> [3] http://tez.apache.org/
>>>>>>>> [4] https://issues.apache.org/jira/browse/GORA-418
>>>>>>>> [5] https://crunch.apache.org/
>>>>>>>> [6] https://crunch.apache.org/user-guide.html#motivation
>>>>>>>
>>>>>>> --
>>>>>>> Lewis
>>>>>
>>>>> --
>>>>> Lewis

--
*Lewis*
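[Editor's note: the Crunch excerpt quoted in the thread can be made concrete with a toy analogue: one pipeline definition (a flat-map step followed by group-and-combine) that an in-memory runner executes, while an MRPipeline- or SparkPipeline-style runner would execute the same definition on a cluster. Everything below is an illustrative sketch, not Crunch's actual API.]

```python
# Toy analogue of Crunch's Pipeline idea. The pipeline definition stays
# the same; only the runner (in-memory vs. MapReduce vs. Spark) changes.
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


class MemPipeline:
    """In-memory runner, loosely analogous to Crunch's MemPipeline."""

    def __init__(self, records: Iterable) -> None:
        self.records = list(records)

    def parallel_do(self, fn: Callable) -> "MemPipeline":
        # Apply fn to each record and flatten the results,
        # like a Crunch DoFn emitting key/value pairs.
        return MemPipeline(pair for rec in self.records for pair in fn(rec))

    def group_and_sum(self) -> Dict[str, int]:
        # Group by key and sum the values, in the spirit of
        # groupByKey followed by a combine step.
        totals: Dict[str, int] = defaultdict(int)
        for key, value in self.records:
            totals[key] += value
        return dict(totals)


def tokenize(line: str) -> List[Tuple[str, int]]:
    # Emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]
```

For example, `MemPipeline(lines).parallel_do(tokenize).group_and_sum()` runs the whole word count in-process; a cluster runner exposing the same two methods could execute the identical pipeline definition remotely, which is the property that makes Crunch interesting for a multi-engine Gora.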