Hi Jey, Spock.jl looks interesting.
Am I right in my understanding that this gives us something like a streaming model for calling Julia methods from within Spark? I haven't tried it yet and have only had a glance at the code, but is a JuliaRDD.java file missing there? It would be great if you could elaborate on the Spark-Julia interaction a little more, and perhaps give a simple example. I have also started a new thread at https://groups.google.com/forum/#!topic/julia-users/S_Qcn0tVxJg to continue the discussion.

Best,
Tanmay

On Tuesday, April 14, 2015 at 2:35:31 AM UTC+5:30, Jey Kottalam wrote:
>
> Hi julia-users,
>
> I have the beginnings of a Spark interface for Julia up at
> https://github.com/jey/Spock.jl. This implementation follows the design
> used in Spark's Python and R bindings. So far it has the core of the
> Spark RDD interface implemented, but does not yet implement the
> additional features of spark-core such as broadcast variables and
> accumulators. Adding interfaces to other Spark components such as
> spark-mllib would be an additional significant undertaking that should
> be revisited once the remainder of spark-core has been implemented.
>
> If anyone is interested in working on this, I would be more than happy
> to provide assistance and guidance.
>
> -Jey
>
> On Sun, Apr 5, 2015 at 1:59 AM, Viral Shah <vi...@mayin.org> wrote:
>
>> It would be nice to co-ordinate these efforts under the JuliaParallel
>> organization.
>>
>> -viral
>>
>> On Sunday, April 5, 2015 at 9:39:51 AM UTC+5:30, wil...@gmail.com wrote:
>>>
>>> Spark integration is a tricky thing. The Python and R bindings go to
>>> great lengths to map language-specific functions into Spark JVM
>>> library calls. I guess the same could be done with the JavaCall.jl
>>> package, in a manner similar to SparkR. Look at slide 20 from here:
>>> http://spark-summit.org/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf
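To make the "streaming model" question above concrete: in the PySpark design that Spock.jl is said to follow, the JVM driver pipes serialized partition data to an external worker process, which applies the user's function and streams results back (this is what PythonRDD does, and what a JuliaRDD counterpart would presumably do). A minimal sketch of that pipe protocol in plain Python; every name here is illustrative, not Spock.jl's actual API, and the real protocol uses a binary framing rather than JSON lines:

```python
import json
import subprocess
import sys

# Hypothetical worker: reads one JSON record per line from stdin, applies
# a map function, and writes results back on stdout. This mirrors the
# pipe-based protocol Spark's non-JVM bindings use per partition.
WORKER = r"""
import json, sys
for line in sys.stdin:
    x = json.loads(line)
    print(json.dumps(x * x))  # the "user function" shipped to the worker
"""

def pipe_partition(records):
    """Driver side: stream one partition through the external worker."""
    proc = subprocess.Popen([sys.executable, "-c", WORKER],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            text=True)
    out, _ = proc.communicate("\n".join(json.dumps(r) for r in records))
    return [json.loads(line) for line in out.splitlines()]

print(pipe_partition([1, 2, 3]))  # -> [1, 4, 9]
```

The key design point is that the driver never loads Julia (or Python) into the JVM; it only moves serialized bytes across a pipe, which is why a small JuliaRDD shim on the JVM side is needed.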
>>>
>>> Spark is a clever distributed data access paradigm that grew out of
>>> Hadoop's slowness and limitations. I believe that Julia could provide
>>> a competitive model for distributed data storage, given Julia's
>>> parallel computing approach. Right now, I am writing Julia bindings
>>> for Mesos. The idea is to provide, through ClusterManager, access to
>>> any Mesos-supervised distributed system and run Julia code in that
>>> environment. In conjunction with DistributedArrays and DataFrames,
>>> this will create a powerful toolbox for building distributed systems.
>>>
>>> After all, machine learning on the JVM, really?!
>>>
>>> On Saturday, April 4, 2015 at 11:21:35 AM UTC-4, Jeff Waller wrote:
>>>>
>>>> On Saturday, April 4, 2015 at 2:22:38 AM UTC-4, Viral Shah wrote:
>>>>>
>>>>> I am changing the subject of this thread from GSOC to Spark. I was
>>>>> just looking around and found this:
>>>>>
>>>>> https://github.com/d9w/Spark.jl
>>>>>
>>>> Hey, wow, that's interesting. Is this an attempt to reimplement Spark
>>>> or to create a binding?
>>>>
>>>>> The real question is, with all the various systems out there, what
>>>>> is the level of abstraction that Julia should work with. Julia's
>>>>> DataFrames is one level of abstraction, which could also
>>>>> transparently map to CSV files (rather than doing readtable), or a
>>>>> database table, or an HBase table. Why would Spark users want Julia,
>>>>> and why would Julia users want Spark? I guess if we can nail this
>>>>> down, the rest of the integration is probably easy to figure out.
>>>>
>>>> As a potential user, I will try to answer in a few parts.
>>>>
>>>> There are currently three official language bindings (Java, Scala,
>>>> Python), some unofficial ones as well, and R in the works. One thing
>>>> that users would want is whatever the others get, but in the language
>>>> they desire, with an abstraction similar to the other language
>>>> bindings, so that examples in other languages could be readily
>>>> translated to theirs.
>>>>
>>>> Whatever the abstraction turns out to be, there are at least three
>>>> big things that Spark offers: simplification, speed, and lazy
>>>> evaluation. The abstraction should not make these cumbersome.
>>>>
>>>> For me, the advantage of Julia is the syntax, the speed, and the
>>>> connection to all of the Julia packages, and because of that, the
>>>> community of Julia package authors. The advantage of Spark is the
>>>> machinery of Spark, access to MLlib, and likewise the community of
>>>> Spark users.
>>>>
>>>> How about an example? This is simply from the Spark examples: good
>>>> old K-means. This assumes the Python binding, because Julia and
>>>> Python are probably the most alike. How would we expect this to look
>>>> using Julia?
>>>>
>>>> from pyspark.mllib.clustering import KMeans
>>>> from numpy import array
>>>> from math import sqrt
>>>>
>>>> # Load and parse the data
>>>> data = sc.textFile("data/mllib/kmeans_data.txt")
>>>> parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
>>>>
>>>> # Build the model (cluster the data)
>>>> clusters = KMeans.train(parsedData, 2, maxIterations=10,
>>>>                         runs=10, initializationMode="random")
>>>>
>>>> # Evaluate clustering by computing Within Set Sum of Squared Errors
>>>> def error(point):
>>>>     center = clusters.centers[clusters.predict(point)]
>>>>     return sqrt(sum([x**2 for x in (point - center)]))
>>>>
>>>> WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
>>>> print("Within Set Sum of Squared Error = " + str(WSSSE))
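To see what the evaluation step in that example actually computes without needing a Spark cluster: the map/reduce over `error` collapses to a plain sum of distances to the nearest center. A small self-contained Python sketch; the dataset and centers below are made up, standing in for what `KMeans.train` would produce:

```python
from math import sqrt

# Hypothetical tiny dataset and pre-computed cluster centers
# (stand-ins for the trained model in the PySpark example above).
points = [(0.0, 0.0), (0.1, 0.1), (9.0, 9.0), (9.1, 9.1)]
centers = [(0.05, 0.05), (9.05, 9.05)]

def predict(point):
    """Index of the nearest center, as clusters.predict would return."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centers[i])))

def error(point):
    """Distance from a point to its assigned cluster center."""
    center = centers[predict(point)]
    return sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))

# The RDD map(...).reduce(+) collapses to a plain sum without Spark.
wssse = sum(error(p) for p in points)
print("Within Set Sum of Squared Error =", wssse)  # ~0.2828 for this toy data
```

A Julia binding would presumably keep the same shape: an RDD-like handle, a `map` taking a Julia closure, and a `reduce` action forcing evaluation.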