Hi Lewis,

I am talking in the context of GORA-418 and GORA-386, so we can say GSoC.
I've talked with Talat about the design of that implementation. I just wanted
to check whether any other project already offers this kind of feature.

Here is what I have in mind for Spark support in Apache Gora: developing a
layer that abstracts the functionality of Spark, Tez, etc. (GORA-418). There
will be an implementation for each of them, and Spark will be one of them
(GORA-386).

For example, you would write a word count job in Gora style, pick one of the
implementations, and run it (just as you store data in Solr or Mongo via Gora).
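
A minimal sketch of that idea, with every name below (ExecutionEngine,
LocalEngine, WordCount, countByKey) being hypothetical and not part of any
existing Gora API: the job is written once against an engine interface, and
the engine is picked by the caller. LocalEngine is only an in-memory stand-in;
a Spark- or Tez-backed class would be assumed to implement the same interface.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical engine abstraction (an assumption for illustration, not a Gora API).
interface ExecutionEngine {
    // Maps each input record to a list of keys and counts how often each key occurs.
    <T> Map<String, Long> countByKey(List<T> records, Function<T, List<String>> toKeys);
}

// In-memory stand-in that runs in the client JVM. A Spark- or Tez-backed
// implementation would implement the same interface.
class LocalEngine implements ExecutionEngine {
    @Override
    public <T> Map<String, Long> countByKey(List<T> records, Function<T, List<String>> toKeys) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys for stable output
        for (T record : records) {
            for (String key : toKeys.apply(record)) {
                counts.merge(key, 1L, Long::sum);
            }
        }
        return counts;
    }
}

class WordCount {
    // The job depends only on the interface, so swapping engines does not change it.
    static Map<String, Long> run(ExecutionEngine engine, List<String> lines) {
        return engine.countByKey(lines,
                line -> Arrays.asList(line.toLowerCase().split("\\s+")));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("gora spark gora", "spark tez");
        System.out.println(run(new LocalEngine(), lines)); // prints {gora=2, spark=2, tez=1}
    }
}
```

Switching to another engine would then mean passing a different
ExecutionEngine to WordCount.run, with the job code left unchanged.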

When I checked Crunch, I found this:

"*Every Crunch job begins with a Pipeline instance that manages the
execution lifecycle of your data pipeline. As of the 0.9.0 release, there
are three implementations of the Pipeline interface:*

*MRPipeline: Executes a pipeline as a series of MapReduce jobs that can run
locally or on a Hadoop cluster.*
*MemPipeline: Executes a pipeline in-memory on the client.*
*SparkPipeline: Executes a pipeline by running a series of Apache Spark
jobs, either locally or on a Hadoop cluster.*"

So I am curious whether supporting Crunch would help us get what we want from
Spark support in Gora. Actually, I am new to such projects; I want to learn
what should be achieved with GORA-386 and not get lost because of
overthinking :) As I see it, you can use Gora to store your data in Gora
style and run jobs in Gora style, while keeping the flexibility to use
either HDFS, Solr, MongoDB, etc. or MapReduce, Spark, Tez, etc.

PS: I know there is a similar issue at Apache Gora for Cascading support:
https://issues.apache.org/jira/browse/GORA-112

Kind Regards,
Furkan KAMACI

On Sat, Mar 21, 2015 at 8:14 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Furkan,
> In what context are we talking here?
> GSoC or Just development?
> I am very keen to essentially work towards what we can release as Gora 1.0
> Thank you Furkan
>
>
> On Saturday, March 21, 2015, Furkan KAMACI <furkankam...@gmail.com> wrote:
>
>> As you know, there is an issue for integrating Apache Spark and
>> Apache Gora [1]. Apache Spark is a popular project and, in contrast to
>> Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory
>> primitives provide performance up to 100 times faster for certain
>> applications [2]. There are also some alternatives to Apache Spark, e.g.
>> Apache Tez [3].
>>
>> When implementing an integration for Spark, we should consider having an
>> abstraction over such projects as part of the architectural design,
>> and there is a related issue for it: [4].
>>
>> There is another Apache project, Apache Crunch [5], which aims to provide
>> a framework for writing, testing, and running MapReduce pipelines.
>> Its goal is to make pipelines that are composed of many user-defined
>> functions simple to write, easy to test, and efficient to run. It is a
>> high-level tool for writing data pipelines, as opposed to developing
>> against the MapReduce, Spark, or Tez APIs directly [6].
>>
>> I would like to learn how Apache Crunch fits with creating a
>> multi-execution engine for Gora [4]. What benefits would we get from
>> integrating Apache Gora with Apache Crunch, and what gaps would still
>> remain compared with developing a custom engine for our purpose?
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>> [1] https://issues.apache.org/jira/browse/GORA-386
>> [2] Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael;
>> Shenker, Scott; Stoica, Ion (June 2013).
>> [3] http://tez.apache.org/
>> [4] https://issues.apache.org/jira/browse/GORA-418
>> [5] https://crunch.apache.org/
>> [6] https://crunch.apache.org/user-guide.html#motivation
>>
>
>
> --
> *Lewis*
>
>
