Hi julia-users, I have the beginnings of a Spark interface for Julia up at https://github.com/jey/Spock.jl. The implementation follows the design used in Spark's Python and R bindings. So far it implements the core of the Spark RDD interface, but it does not yet cover additional spark-core features such as broadcast variables and accumulators. Adding interfaces to other Spark components such as spark-mllib would be a further significant undertaking, best revisited once the remainder of spark-core has been implemented.
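For a concrete sense of what the RDD interface involves, here is a minimal, hypothetical usage sketch from Julia; the names below (SparkContext, parallelize, map, reduce) are illustrative only and may not match Spock.jl's actual exports:

# Hypothetical usage sketch -- not Spock.jl's confirmed API.
using Spock

sc = SparkContext("local")       # connect to a local Spark master
rdd = parallelize(sc, 1:1000)    # distribute a Julia collection as an RDD
squares = map(x -> x^2, rdd)     # lazy transformation, nothing runs yet
total = reduce(+, squares)       # action: triggers evaluation on the workers
println(total)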
If anyone is interested in working on this, I would be more than happy to provide assistance and guidance.

-Jey

On Sun, Apr 5, 2015 at 1:59 AM, Viral Shah <vi...@mayin.org> wrote:
> It would be nice to co-ordinate these efforts under the JuliaParallel
> organization.
>
> -viral
>
> On Sunday, April 5, 2015 at 9:39:51 AM UTC+5:30, wil...@gmail.com wrote:
>>
>> Spark integration is a tricky thing. The Python and R bindings go to great
>> lengths to map language-specific functions into Spark JVM library calls. I
>> guess the same could be done with the JavaCall.jl package, in a manner
>> similar to SparkR. Look at slide 20 here:
>> http://spark-summit.org/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf
>>
>> Spark is a clever distributed data access paradigm which grew out of
>> Hadoop's slowness and limitations. I believe that Julia could provide a
>> competitive model for distributed data storage given Julia's parallel
>> computing approach. Right now, I am writing Julia bindings for Mesos. The
>> idea is to provide, through ClusterManager, access to any Mesos-supervised
>> distributed system and run Julia code in that environment. In conjunction
>> with DistributedArrays and DataFrames, this will create a powerful toolbox
>> for building distributed systems.
>>
>> After all, machine learning on the JVM, really?!
>>
>> On Saturday, April 4, 2015 at 11:21:35 AM UTC-4, Jeff Waller wrote:
>>>
>>> On Saturday, April 4, 2015 at 2:22:38 AM UTC-4, Viral Shah wrote:
>>>>
>>>> I am changing the subject of this thread from GSOC to Spark. I was just
>>>> looking around and found this:
>>>>
>>>> https://github.com/d9w/Spark.jl
>>>>
>>>
>>> Hey, wow, that's interesting. Is this an attempt to reimplement Spark or
>>> create a binding?
>>>
>>>> The real question is, with all the various systems out there, what is
>>>> the level of abstraction that Julia should work with. Julia's DataFrames
>>>> is one level of abstraction, which could also transparently map to CSV
>>>> files (rather than doing readtable), or a database table, or an HBase
>>>> table. Why would Spark users want Julia, and why would Julia users want
>>>> Spark? I guess if we can nail this down, the rest of the integration is
>>>> probably easy to figure out.
>>>>
>>>
>>> As a potential user, I will try to answer in a few parts.
>>>
>>> There are currently 3 official language bindings (Java, Scala, Python),
>>> some unofficial ones as well, and R in the works. One thing users would
>>> want is whatever the others get, but in the language they desire, with an
>>> abstraction similar to the other language bindings, so that examples in
>>> other languages can be readily translated to theirs.
>>>
>>> Whatever the abstraction turns out to be, there are at least 3 big things
>>> that Spark offers: simplification, speed, and lazy evaluation. The
>>> abstraction should not make those cumbersome.
>>>
>>> For me, the advantage of Julia is the syntax, the speed, and the
>>> connection to all of the Julia packages, and because of that, the
>>> community of Julia package authors. The advantage of Spark is the
>>> machinery of Spark, access to MLlib, and likewise the community of Spark
>>> users.
>>>
>>> How about an example? This is simply from the Spark examples -- good old
>>> K-means.
>>> This is assuming the Python binding, because Julia and Python are
>>> probably most alike. How would we expect this to look using Julia?
>>>
>>> from pyspark.mllib.clustering import KMeans
>>> from numpy import array
>>> from math import sqrt
>>>
>>> # Load and parse the data
>>> data = sc.textFile("data/mllib/kmeans_data.txt")
>>> parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
>>>
>>> # Build the model (cluster the data)
>>> clusters = KMeans.train(parsedData, 2, maxIterations=10,
>>>                         runs=10, initializationMode="random")
>>>
>>> # Evaluate clustering by computing Within Set Sum of Squared Errors
>>> def error(point):
>>>     center = clusters.centers[clusters.predict(point)]
>>>     return sqrt(sum([x**2 for x in (point - center)]))
>>>
>>> WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
>>> print("Within Set Sum of Squared Error = " + str(WSSSE))
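For comparison, here is a rough, hypothetical sketch of how the same example might look from Julia. The RDD helpers (text_file, map, reduce) and the kmeans_train/predict wrappers around spark-mllib are assumed names, not an existing Spock.jl interface; as noted above, spark-mllib is not wrapped yet.

# Hypothetical Julia translation of the PySpark K-means example above.
# All Spark-facing names here are illustrative, not an existing Spock.jl API.
using Spock

sc = SparkContext("local")   # analogous to PySpark's pre-existing sc

# Load and parse the data
data = text_file(sc, "data/mllib/kmeans_data.txt")
parsed = map(line -> [parse(Float64, x) for x in split(line, ' ')], data)

# Build the model (cluster the data)
clusters = kmeans_train(parsed, 2; max_iterations=10, runs=10,
                        initialization_mode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
function point_error(point)
    center = clusters.centers[predict(clusters, point)]
    return sqrt(sum((point .- center) .^ 2))
end

wssse = reduce(+, map(point_error, parsed))
println("Within Set Sum of Squared Error = ", wssse)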