As you know, there is an open issue for integrating Apache Spark with Apache Gora [1]. Apache Spark is a popular project: in contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications [2]. There are also alternatives to Apache Spark, e.g. Apache Tez [3].
When implementing the Spark integration, we should consider introducing an abstraction over such execution engines as part of the architectural design; there is a related issue for this [4].

Another Apache project, Apache Crunch [5], aims to provide a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run. It is a high-level tool for writing data pipelines, as opposed to developing directly against the MapReduce, Spark, or Tez APIs [6].

I would like to learn how Apache Crunch fits with creating a multi execution engine for Gora [4]. What benefits would we get from integrating Apache Gora with Apache Crunch, and what gaps would still remain, compared with developing a custom engine for our purpose?

Kind Regards,
Furkan KAMACI

[1] https://issues.apache.org/jira/browse/GORA-386
[2] Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael; Shenker, Scott; Stoica, Ion (June 2013).
[3] http://tez.apache.org/
[4] https://issues.apache.org/jira/browse/GORA-418
[5] https://crunch.apache.org/
[6] https://crunch.apache.org/user-guide.html#motivation