Hmmm... a lot of duplicated work. Sorry I didn't get my stuff into a more usable form for you, but I wasn't aware that anybody was even interested in it. I've got some stuff that I want to rework a little, and I'm still thinking through the best way to integrate with the new reducers code in Clojure, but I haven't had the right combination of time and motivation to finish off what I started and document it. At any rate, we should work on merging the two efforts, since I don't see any need for duplicate APIs.
Taking a quick first pass at it, I wasn't able to get your code and examples to work, but I'm curious what your reasoning is for using serializable.fn and avoiding clojure.core/fn or #(). I'm not sure that is strictly necessary. For example, the following works just fine with my API:

(require 'spark.api.clojure.core)
(wrappers!) ; one of the pieces I want to re-work, but allows functions like
            ; map to work with either Clojure collections or RDDs
(set-spark-context! "local[4]" "cljspark")

(def rdd (parallelize [1 2 3 4]))

;; plain anonymous fn
(def mrdd1 (map #(+ 2 %) rdd))
(def result1 (collect mrdd1))

;; anonymous fn closing over a top-level def
(def offset1 4)
(def mrdd2 (map #(+ offset1 %) rdd))
(def result2 (collect mrdd2))

;; anonymous fn closing over a local binding
(def mrdd3 (map (let [offset2 5] #(+ offset2 %)) rdd))
(def result3 (collect mrdd3))

That will result in result1, result2, and result3 being [3 4 5 6], [5 6 7 8], and [6 7 8 9] respectively, without any need for serializable-fn. (A short sketch of why plain closures serialize is at the bottom of this message.)

On Tuesday, January 22, 2013 6:55:53 AM UTC-8, Marc Limotte wrote:

> A Clojure API for the Spark project. I am aware that there is another
> Clojure Spark wrapper project which looks very interesting; this project
> has similar goals. Like that project, it is not absolutely complete, but
> it does have some documentation and examples, and it is usable and should
> be easy enough to extend as needed. This is the result of about three
> weeks of work. It handles many of the initial problems, like serializing
> anonymous functions, converting back and forth between Scala Tuples and
> Clojure seqs, and converting RDDs to PairRDDs.
>
> The project is available here:
>
> https://github.com/TheClimateCorporation/clj-spark
>
> Thanks to The Climate Corporation for allowing me to release it. At
> Climate, we do the majority of our Big Data work with Cascalog (on top of
> Cascading). I was looking into Spark for some of the benefits that it
> provides. I suspect we will explore Shark next, and may work it into our
> processes for some of our more ad hoc/exploratory queries.
>
> I think it would be interesting to see a Cascading planner on top of
> Spark, which would enable Cascalog queries (mostly) for free. I suspect
> that might be a superior way of using Clojure on Spark.
>
> Marc Limotte
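To illustrate the serialization point above: compiled Clojure fns extend clojure.lang.AFunction, which implements java.io.Serializable, so an ordinary #() closure can survive a plain Java serialization round trip, provided the values it closes over are themselves serializable and the fn's class is visible to the deserializing JVM (true for AOT-compiled code; REPL-defined fns depend on Clojure's DynamicClassLoader being reachable). Here's a minimal standalone sketch using only JDK streams; round-trip is just an illustrative helper, not part of either API:

(import '[java.io ByteArrayOutputStream ByteArrayInputStream
                  ObjectOutputStream ObjectInputStream])

;; Write a value out with Java serialization and read it back,
;; roughly what happens when a task is shipped to a worker.
(defn round-trip [x]
  (let [baos (ByteArrayOutputStream.)]
    (with-open [oos (ObjectOutputStream. baos)]
      (.writeObject oos x))
    (with-open [ois (ObjectInputStream.
                      (ByteArrayInputStream. (.toByteArray baos)))]
      (.readObject ois))))

;; A closure over a local survives the round trip; the compiled fn
;; class carries the captured value (a Long here) as a field.
(def f (let [offset 4]
         (round-trip #(+ offset %))))

(f 1) ;=> 5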