For pluggable serialization, I think if there is no JIRA opened yet, I could open one as recommended by Lewis.
As for low-hanging fruit, I am currently not sure. Maybe we could add a Gora store manager to Spark to allow reading from and persisting to different NoSQL databases.

- Henry

On Wed, Jul 9, 2014 at 2:45 AM, Renato Marroquín Mogrovejo <renatoj.marroq...@gmail.com> wrote:
> 2014-07-09 11:10 GMT+02:00 Henry Saputra <henry.sapu...@gmail.com>:
>
>> Internally, Apache Spark can use Hadoop input formats for its
>> distributed data structure (a.k.a. RDD).
>> So, I guess we could still join the cool kids with Spark via our input
>> format implementation.
>
> Cool Henry! I didn't know we could use Hadoop input formats for
> Spark's RDD :)
>
>> However, I can think of other improvements that could be useful
>> (apologies to Lewis if I have hijacked his discussion):
>> 1. A pluggable serialization mechanism to allow others like Thrift or
>> Protocol Buffers instead of just Avro.
>
> Yes, we have been talking about this as well for quite some time. I think
> we have two options here: a) changing the way we hold objects in memory
> so that it is not only Avro, or b) keeping the Avro objects for in-memory
> processing and serializing using different formats (including the
> native/datastore format). I think both options should be doable at some
> point.
>
>> 2. Work directly with DAG frameworks like Spark or Flink (incubating)
>> to provide a client module that uses Gora via their abstractions,
>> i.e. RDD for Spark and DataSet for Flink.
>
> Yes! We have to continue integrating with other projects, especially with
> popular projects which could give Gora more visibility in the open source
> space.
> So what do you think is the "low hanging" fruit here, Henry? I mean, there
> is a lot to do, but we should start putting things into our roadmap so at
> least we know what we have to do.
>
> Renato M.
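To make Renato's option (b) above concrete — keep the in-memory objects as they are and make the wire format pluggable — a codec registry keyed by configuration is one common shape for this. The sketch below is only a toy Python illustration of that shape, not Gora code: the `Serializer` interface, the `CODECS` registry, and the two codecs (using stdlib `json` and `pickle` as stand-ins for Avro/Thrift/Protocol Buffers) are all hypothetical names invented for this example.

```python
import json
import pickle
from abc import ABC, abstractmethod

class Serializer(ABC):
    """Hypothetical pluggable codec interface; an Avro-, Thrift-, or
    Protocol Buffers-backed implementation would satisfy the same contract."""

    @abstractmethod
    def serialize(self, obj) -> bytes: ...

    @abstractmethod
    def deserialize(self, data: bytes): ...

class JsonSerializer(Serializer):
    def serialize(self, obj) -> bytes:
        return json.dumps(obj).encode("utf-8")

    def deserialize(self, data: bytes):
        return json.loads(data.decode("utf-8"))

class PickleSerializer(Serializer):
    def serialize(self, obj) -> bytes:
        return pickle.dumps(obj)

    def deserialize(self, data: bytes):
        return pickle.loads(data)

# A registry lets the datastore pick a codec by name from configuration;
# this lookup-by-name step is the "pluggable" part of the proposal.
CODECS = {"json": JsonSerializer(), "pickle": PickleSerializer()}

def roundtrip(codec_name: str, record: dict) -> dict:
    """Serialize and deserialize a record with the configured codec."""
    codec = CODECS[codec_name]
    return codec.deserialize(codec.serialize(record))
```

A record survives a round trip through either codec unchanged, and a new format plugs in by adding one entry to the registry rather than touching the in-memory model.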
>
>> - Henry
>>
>> On Mon, Jul 7, 2014 at 8:19 AM, Lewis John Mcgibbney
>> <lewis.mcgibb...@gmail.com> wrote:
>> > Hi Folks,
>> > Many people know the way that things are going with regards to in-memory
>> > computing being 'the' hot topic on the planet right now (outside of the
>> > World Cup).
>> > We have made good strides in Gora to get it to where it is as a top-level
>> > project. It has also become apparent to me that something we embrace very
>> > well is the notion of abstraction and flexibility in the way our modules
>> > are implemented via the DataStore API.
>> > One thing which is apparent to me, though, is that we may be restricting
>> > the project's scope and capabilities if we do not embrace new
>> > technologies within our development model.
>> > I am of course talking about embracing the Spark paradigm within Gora and
>> > abstracting ourselves away from the traditional MapReduce Input/Output
>> > Formats which we currently use.
>> > A colleague of mine was at Spark Summit last week in San Francisco and
>> > mentioned that there is ongoing work to move towards a connector-based
>> > approach for IO so that different datastores can be used within Spark
>> > SQL.
>> > The point I want to pose here is: where can we take advantage of this in
>> > an attempt to further grow the Gora community and improve the project?
>> > Thanks in advance for any thoughts, folks.
>> > Lewis
>> >
>> > --
>> > *Lewis*
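On Henry's point about Spark consuming Hadoop input formats: conceptually, Spark builds an RDD from an input format by asking it for splits and then iterating one record reader per split, with one RDD partition per split. The toy Python model below mimics that two-method contract only for illustration — the real interfaces are Java (`InputFormat.getSplits()` / `createRecordReader()`, consumed through `SparkContext.newAPIHadoopRDD`), and every name here (`ToySplit`, `ToyInputFormat`, `toy_hadoop_rdd`) is a simplified stand-in, not actual Spark or Gora code.

```python
from typing import Iterator, List, Tuple

Record = Tuple[str, int]

class ToySplit:
    """Stands in for Hadoop's InputSplit: one partition of the input."""
    def __init__(self, records: List[Record]):
        self.records = records

class ToyInputFormat:
    """Mimics the two-method contract Spark relies on: produce splits,
    then produce a record iterator for each split."""
    def __init__(self, data: List[Record], num_splits: int):
        self.data = data
        self.num_splits = num_splits

    def get_splits(self) -> List[ToySplit]:
        # Round-robin the records into num_splits partitions.
        buckets: List[List[Record]] = [[] for _ in range(self.num_splits)]
        for i, rec in enumerate(self.data):
            buckets[i % self.num_splits].append(rec)
        return [ToySplit(b) for b in buckets]

    def create_record_reader(self, split: ToySplit) -> Iterator[Record]:
        return iter(split.records)

def toy_hadoop_rdd(fmt: ToyInputFormat) -> List[Record]:
    """What an RDD built on an input format does conceptually: one
    partition per split, each backed by the format's record reader."""
    out: List[Record] = []
    for split in fmt.get_splits():
        out.extend(fmt.create_record_reader(split))
    return out
```

The takeaway for Gora is that a datastore only has to speak this splits-plus-reader contract once, and any engine that consumes Hadoop input formats (Spark among them) can partition and read it.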