Internally, Apache Spark can use Hadoop input formats to build its distributed data structure (the RDD). So I guess we could still join the cool kids and support Spark through our existing input format implementation.
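
To make that concrete, here is a rough, untested sketch of what it could look like from the Spark side, if I remember the signatures right. The Spark and Hadoop calls (newAPIHadoopRDD, GoraInputFormat.setInput) are the existing APIs; the Pageview bean is just the one from the Gora tutorial standing in for any compiled persistent class:

import org.apache.gora.mapreduce.GoraInputFormat;
import org.apache.gora.query.Query;
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.gora.tutorial.log.generated.Pageview;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class GoraRddExample {
  public static void main(String[] args) throws Exception {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("gora-rdd-sketch"));

    // Open a data store for the tutorial's Pageview bean; any
    // Gora-compiled persistent class would work the same way.
    Configuration conf = new Configuration();
    DataStore<Long, Pageview> store =
        DataStoreFactory.getDataStore(Long.class, Pageview.class, conf);

    // Serialize the query into the job configuration, the same way
    // GoraMapper wires up a MapReduce job today.
    Job job = Job.getInstance(conf);
    Query<Long, Pageview> query = store.newQuery();
    GoraInputFormat.setInput(job, query, false);

    // Spark can consume any Hadoop InputFormat, so GoraInputFormat
    // should work unchanged; each record becomes a (key, bean) pair.
    JavaPairRDD<Long, Pageview> pageviews = sc.newAPIHadoopRDD(
        job.getConfiguration(), GoraInputFormat.class,
        Long.class, Pageview.class);

    System.out.println("pageviews: " + pageviews.count());
    sc.stop();
  }
}
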
However, I can think of other improvements that could be useful (apologies to Lewis if I am hijacking his discussion):

1. A pluggable serialization mechanism, to allow alternatives like Thrift or Protocol Buffers instead of just Avro.
2. Working directly with DAG frameworks like Spark or Flink (incubating) by providing client modules that expose Gora through their own abstractions, i.e. RDD for Spark and DataSet for Flink (see the strawman after the quoted message below).

- Henry

On Mon, Jul 7, 2014 at 8:19 AM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Folks,
> Many people know the way that things are going with regards to in-memory
> computing being 'the' hot topic on the planet right now (outside of the
> World Cup).
> We have made good strides in Gora to get it to where it is as a top level
> project. It has also become apparent to me that something we embrace very
> well is the notion of abstraction and flexibility in the way our modules
> are implemented via the DataStore API.
> One thing which is apparent to me, though, is that we may be restricting
> the project scope and capabilities if we do not embrace new technologies
> within our development model.
> I am of course talking about embracing the Spark paradigm within Gora and
> abstracting ourselves away from the traditional MapReduce Input/Output
> Formats which we currently use.
> A colleague of mine was at Spark Summit last week in San Francisco and
> mentioned that there is ongoing work to move towards a connector-based
> approach for IO so that different datastores can be used within Spark SQL.
> The point I want to pose here is where can we take advantage of this in an
> attempt to further grow the Gora community and improve the project?
> Thanks in advance for any thoughts folks.
> Lewis
>
> --
> *Lewis*
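
P.S. To sketch what the client module in point 2 might look like on the Spark side (all names here are hypothetical, this is just a strawman, not a worked-out design):

import java.io.IOException;
import org.apache.gora.mapreduce.GoraInputFormat;
import org.apache.gora.persistency.impl.PersistentBase;
import org.apache.gora.store.DataStore;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Strawman for a gora-spark client module: hide the Hadoop plumbing
// and hand the caller an RDD keyed by the data store's key class.
public class GoraSparkEngine<K, V extends PersistentBase> {

  private final Class<K> keyClass;
  private final Class<V> valueClass;

  public GoraSparkEngine(Class<K> keyClass, Class<V> valueClass) {
    this.keyClass = keyClass;
    this.valueClass = valueClass;
  }

  // Builds an RDD over everything the given data store can query,
  // reusing GoraInputFormat underneath rather than writing new IO code.
  public JavaPairRDD<K, V> initialize(JavaSparkContext sc,
      Configuration conf, DataStore<K, V> store) throws IOException {
    Job job = Job.getInstance(conf);
    GoraInputFormat.setInput(job, store.newQuery(), false);
    return sc.newAPIHadoopRDD(job.getConfiguration(),
        GoraInputFormat.class, keyClass, valueClass);
  }
}

A user would then just call new GoraSparkEngine<>(Long.class, Pageview.class).initialize(sc, conf, store) and get a plain JavaPairRDD to work with. A Flink DataSet wrapper could presumably follow the same shape via Flink's Hadoop compatibility layer.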

