W00t

On Fri, Mar 27, 2015 at 10:20 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
> Hi Henry,
>
> I've submitted a proposal for Spark backend support. The most important
> point is that the Gora input format will have an RDD transformation
> ability, so one will be able to use either Hadoop MapReduce or Spark.
>
> On Tue, Mar 24, 2015 at 11:43 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
>
>> Hi Furkan,
>>
>> Yes, you are right. In the code execution for Spark or Flink, Gora
>> should be part of the data ingest and storing.
>>
>> So, is the idea to make the data store in Spark access Gora instead
>> of the default store options?
>>
>> - Henry
>>
>> On Mon, Mar 23, 2015 at 11:34 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
>>
>>> Hi Henry,
>>>
>>> So, as far as I understand, instead of wrapping Apache Spark within
>>> Gora with full functionality, I have to wrap its functionality of
>>> storing and accessing data. I mean, one will use the Gora input/output
>>> format, and at the backend it will be mapped to an RDD and will be able
>>> to run MapReduce via Apache Spark etc. over Gora.
>>>
>>> Kind Regards,
>>> Furkan KAMACI
>>>
>>> On Mon, Mar 23, 2015 at 8:21 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
>>>
>>>> Integration with Gora will mostly be in the data ingest portion of
>>>> the flow.
>>>>
>>>> Distributed processing frameworks like Spark or Flink already support
>>>> the Hadoop input format as a data source, so Gora should be usable
>>>> directly with the Gora input format.
>>>>
>>>> The interesting portion is probably tighter integration, such as a
>>>> custom RDD or a custom Data Manager to store and get data from Gora
>>>> directly.
>>>>
>>>> - Henry
>>>>
>>>> On Sat, Mar 21, 2015 at 1:42 PM, Lewis John Mcgibbney
>>>> <lewis.mcgibb...@gmail.com> wrote:
>>>>
>>>>> Henry mentored Crunch through incubation... Maybe he can tell you
>>>>> more context.
>>>>> For me, Gora is essentially an extremely easy storage abstraction
>>>>> framework.
>>>>> I do not currently use the Query API, meaning that the analysis of
>>>>> data is delegated to the Gora data store.
>>>>> This is my current usage of the code base.
>>>>>
>>>>> On Saturday, March 21, 2015, Furkan KAMACI <furkankam...@gmail.com> wrote:
>>>>>
>>>>>> Hi Lewis,
>>>>>>
>>>>>> I am talking in the context of GORA-418 and GORA-386, we can say
>>>>>> GSoC. I've talked with Talat about the design of that
>>>>>> implementation. I just wanted to check whether any other projects
>>>>>> have such a feature.
>>>>>>
>>>>>> Here is what is in my mind for Apache Gora's Spark support:
>>>>>> developing a layer which abstracts the functionality of Spark, Tez,
>>>>>> etc. (GORA-418). There will be implementations for each of them
>>>>>> (and Spark will be one of them: GORA-386).
>>>>>>
>>>>>> I.e., you will write a word count example in Gora style, then pick
>>>>>> one of the implementations and run it (just as you store data at
>>>>>> Solr or Mongo via Gora).
>>>>>>
>>>>>> When I checked Crunch, I realized that:
>>>>>>
>>>>>> "Every Crunch job begins with a Pipeline instance that manages the
>>>>>> execution lifecycle of your data pipeline. As of the 0.9.0 release,
>>>>>> there are three implementations of the Pipeline interface:
>>>>>>
>>>>>> MRPipeline: Executes a pipeline as a series of MapReduce jobs that
>>>>>> can run locally or on a Hadoop cluster.
>>>>>> MemPipeline: Executes a pipeline in-memory on the client.
>>>>>> SparkPipeline: Executes a pipeline by running a series of Apache
>>>>>> Spark jobs, either locally or on a Hadoop cluster."
>>>>>>
>>>>>> So, I am curious whether supporting Crunch may give us what we want
>>>>>> with Spark support in Gora.
>>>>>> Actually, I am new to such projects; I want to learn what should
>>>>>> be achieved with GORA-386 and not get lost because of
>>>>>> overthinking :) I see that you can use Gora for storing your data
>>>>>> Gora-style and running jobs Gora-style, but with the flexibility of
>>>>>> using either HDFS, Solr, MongoDB, etc., or MapReduce, Spark, Tez,
>>>>>> etc.
>>>>>>
>>>>>> PS: I know there is a similar issue at Apache Gora for Cascading
>>>>>> support: https://issues.apache.org/jira/browse/GORA-112
>>>>>>
>>>>>> Kind Regards,
>>>>>> Furkan KAMACI
>>>>>>
>>>>>> On Sat, Mar 21, 2015 at 8:14 PM, Lewis John Mcgibbney
>>>>>> <lewis.mcgibb...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Furkan,
>>>>>>> In what context are we talking here? GSoC or just development?
>>>>>>> I am very keen to essentially work towards what we can release as
>>>>>>> Gora 1.0.
>>>>>>> Thank you Furkan
>>>>>>>
>>>>>>> On Saturday, March 21, 2015, Furkan KAMACI <furkankam...@gmail.com> wrote:
>>>>>>>
>>>>>>>> As you know, there is an issue for integrating Apache Spark and
>>>>>>>> Apache Gora [1]. Apache Spark is a popular project and, in
>>>>>>>> contrast to Hadoop's two-stage disk-based MapReduce paradigm,
>>>>>>>> Spark's in-memory primitives provide performance up to 100 times
>>>>>>>> faster for certain applications [2]. There are also some
>>>>>>>> alternatives to Apache Spark, e.g. Apache Tez [3].
>>>>>>>>
>>>>>>>> When implementing an integration for Spark, having an abstraction
>>>>>>>> over such projects should be considered as part of the
>>>>>>>> architectural design, and there is a related issue for it: [4].
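[Editor's note: the abstraction layer Furkan describes (GORA-418) can be sketched in a few lines: the word-count job is written once against an engine interface, and execution backends are swapped without touching the job. All class and method names below are hypothetical illustrations, not actual Gora APIs.]

```python
# Sketch of the GORA-418 idea: write a job once, run it on any engine.
# All names here are hypothetical, not part of the real Gora code base.
from abc import ABC, abstractmethod
from collections import Counter
from typing import Dict, Iterable


class ExecutionEngine(ABC):
    """Abstract engine; Spark, Tez, MapReduce... would each implement this."""

    @abstractmethod
    def word_count(self, records: Iterable[str]) -> Dict[str, int]:
        ...


class InMemoryEngine(ExecutionEngine):
    """Local stand-in for a real backend (cf. Crunch's MemPipeline)."""

    def word_count(self, records: Iterable[str]) -> Dict[str, int]:
        counts: Counter = Counter()
        for line in records:
            counts.update(line.split())  # count every whitespace-separated word
        return dict(counts)


def run_word_count(engine: ExecutionEngine, records: Iterable[str]) -> Dict[str, int]:
    # User code depends only on the abstraction; swapping in a
    # hypothetical SparkEngine backed by RDDs would need no job changes.
    return engine.word_count(records)
```

A user would call, e.g., `run_word_count(InMemoryEngine(), lines)` and later switch the engine argument to a Spark-backed implementation, which is exactly the "Gora-style job, pluggable runtime" flexibility described above.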
>>>>>>>> There is another Apache project, Apache Crunch [5], which aims to
>>>>>>>> provide a framework for writing, testing, and running MapReduce
>>>>>>>> pipelines. Its goal is to make pipelines that are composed of
>>>>>>>> many user-defined functions simple to write, easy to test, and
>>>>>>>> efficient to run. It is a high-level tool for writing data
>>>>>>>> pipelines, as opposed to developing against the MapReduce, Spark,
>>>>>>>> or Tez APIs directly [6].
>>>>>>>>
>>>>>>>> I would like to learn how Apache Crunch fits with creating a
>>>>>>>> multi-execution engine for Gora [4]. What kind of benefits can we
>>>>>>>> get by integrating Apache Gora and Apache Crunch, and what kind
>>>>>>>> of gaps might we still have compared with developing a custom
>>>>>>>> engine for our purpose?
>>>>>>>>
>>>>>>>> Kind Regards,
>>>>>>>> Furkan KAMACI
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/GORA-386
>>>>>>>> [2] Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael;
>>>>>>>> Shenker, Scott; Stoica, Ion (June 2013).
>>>>>>>> [3] http://tez.apache.org/
>>>>>>>> [4] https://issues.apache.org/jira/browse/GORA-418
>>>>>>>> [5] https://crunch.apache.org/
>>>>>>>> [6] https://crunch.apache.org/user-guide.html#motivation
>>>>>>>
>>>>>>> --
>>>>>>> Lewis
>>>>>
>>>>> --
>>>>> Lewis

--
*Lewis*
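[Editor's note: the Crunch excerpt quoted in the thread can be made concrete with a toy analogue: one pipeline definition (a flat-map step followed by group-and-combine) that an in-memory runner executes, while an MRPipeline- or SparkPipeline-style runner would execute the same definition on a cluster. Everything below is an illustrative sketch, not Crunch's actual API.]

```python
# Toy analogue of Crunch's Pipeline idea. The pipeline definition stays
# the same; only the runner (in-memory vs. MapReduce vs. Spark) changes.
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


class MemPipeline:
    """In-memory runner, loosely analogous to Crunch's MemPipeline."""

    def __init__(self, records: Iterable) -> None:
        self.records = list(records)

    def parallel_do(self, fn: Callable) -> "MemPipeline":
        # Apply fn to each record and flatten the results,
        # like a Crunch DoFn emitting key/value pairs.
        return MemPipeline(pair for rec in self.records for pair in fn(rec))

    def group_and_sum(self) -> Dict[str, int]:
        # Group by key and sum the values, in the spirit of
        # groupByKey followed by a combine step.
        totals: Dict[str, int] = defaultdict(int)
        for key, value in self.records:
            totals[key] += value
        return dict(totals)


def tokenize(line: str) -> List[Tuple[str, int]]:
    # Emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]
```

For example, `MemPipeline(lines).parallel_do(tokenize).group_and_sum()` runs the whole word count in-process; a cluster runner exposing the same two methods could execute the identical pipeline definition remotely, which is the property that makes Crunch interesting for a multi-engine Gora.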