Hi julia-users,

I have the beginnings of a Spark interface for Julia up at
https://github.com/jey/Spock.jl. This implementation follows the design
used in Spark's Python and R bindings. So far it implements the core of the
Spark RDD interface, but not yet the additional spark-core features such as
broadcast variables and accumulators. Adding interfaces to other Spark
components such as spark-mllib would be a further significant undertaking,
best revisited once the remainder of spark-core has been implemented.
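
To give a flavor of the RDD interface, usage is intended to look roughly like
the sketch below (the function names here are illustrative and may not match
the current code exactly):

using Spock

# Connect to a local master; a cluster URL would work the same way.
sc = SparkContext("local")

# Build an RDD from a Julia collection, transform it lazily, and reduce.
rdd     = parallelize(sc, 1:1000)
doubled = map(x -> 2x, rdd)
println(reduce(+, doubled))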

If anyone is interested in working on this, I would be more than happy to
provide assistance and guidance.

-Jey



On Sun, Apr 5, 2015 at 1:59 AM, Viral Shah <vi...@mayin.org> wrote:

> It would be nice to co-ordinate these efforts under the JuliaParallel
> organization.
>
> -viral
>
>
> On Sunday, April 5, 2015 at 9:39:51 AM UTC+5:30, wil...@gmail.com wrote:
>>
>> Spark integration is a tricky thing. The Python and R bindings go to great
>> lengths to map language-specific functions onto Spark JVM library calls. I
>> guess the same could be done with the JavaCall.jl package, in a manner
>> similar to SparkR. Look at slide 20 here:
>> http://spark-summit.org/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf
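>>
>> To give a sense of what that JavaCall.jl mapping might look like, here is a
>> rough, untested sketch of calling into the JVM from Julia (the jar path is a
>> placeholder, and a real binding would target Spark's JavaSparkContext rather
>> than java.lang.Math):
>>
>> using JavaCall
>> JavaCall.init(["-Djava.class.path=/path/to/spark-assembly.jar"])
>>
>> # Import a JVM class and invoke a static method on it -- the same basic
>> # mechanism SparkR uses to drive Spark's Java API from R.
>> JMath = @jimport java.lang.Math
>> println(jcall(JMath, "sqrt", jdouble, (jdouble,), 2.0))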
>>
>> Spark is a clever distributed data access paradigm which grew out of
>> Hadoop's slowness and limitations. I believe that Julia could provide a
>> competitive model for distributed data storage, given Julia's parallel
>> computing approach. Right now, I am writing Julia bindings for Mesos. The
>> idea is to provide, through a ClusterManager, access to any Mesos-supervised
>> distributed system and run Julia code in that environment. In conjunction
>> with DistributedArrays and DataFrames, it would create a powerful toolbox
>> for building distributed systems.
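>>
>> As a tiny illustration of that combination (the Mesos-backed ClusterManager
>> does not exist yet, so plain addprocs stands in for it here):
>>
>> addprocs(4)                  # eventually: addprocs(MesosManager(...), n)
>> using DistributedArrays
>> d = distribute(rand(10^6))   # chunks live on the worker processes
>> println(sum(d))              # the reduction runs across the workers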
>>
>> After all, machine learning on the JVM, really?!
>>
>> On Saturday, April 4, 2015 at 11:21:35 AM UTC-4, Jeff Waller wrote:
>>>
>>>
>>>
>>> On Saturday, April 4, 2015 at 2:22:38 AM UTC-4, Viral Shah wrote:
>>>>
>>>> I am changing the subject of this thread from GSOC to Spark. I was just
>>>> looking around and found this:
>>>>
>>>> https://github.com/d9w/Spark.jl
>>>>
>>>
>>> Hey, wow, that's interesting, is this an attempt to reimplement Spark or
>>> create a binding?
>>>
>>>
>>>>
>>>>
>>>> The real question is: with all the various systems out there, what is the
>>>> level of abstraction that Julia should work with? Julia's DataFrames is one
>>>> level of abstraction, which could also transparently map to CSV files
>>>> (rather than doing readtable), or a database table, or an HBase table. Why
>>>> would Spark users want Julia, and why would Julia users want Spark? I guess
>>>> if we can nail this down, the rest of the integration is probably easy to
>>>> figure out.
>>>>
>>>
>>> As a potential user, I will try to answer in a few parts
>>>
>>> There are currently 3 official language bindings (Java, Scala, Python),
>>> some unofficial ones as well, and R in the works. One thing users would
>>> want is whatever the other bindings get, but in their language of choice,
>>> with an abstraction similar enough to the other bindings that examples in
>>> other languages can be readily translated to theirs.
>>>
>>> Whatever the abstraction turns out to be, there are at least 3 big things
>>> that Spark offers: simplification, speed, and lazy evaluation. The
>>> abstraction should not make those cumbersome.
>>>
>>> For me, the advantage of Julia is the syntax, the speed, and the connection
>>> to all of the Julia packages and, through them, the community of Julia
>>> package authors. The advantage of Spark is the machinery of Spark, access
>>> to MLlib, and likewise the community of Spark users.
>>>
>>> How about an example? This is straight from the Spark examples -- good old
>>> K-means, assuming the Python binding because Julia and Python are probably
>>> the most alike. How would we expect this to look in Julia?
>>>
>>> from pyspark.mllib.clustering import KMeans
>>> from numpy import array
>>> from math import sqrt
>>>
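>>> # 'sc' below is the SparkContext created automatically by the pyspark shell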
>>> # Load and parse the data
>>> data = sc.textFile("data/mllib/kmeans_data.txt")
>>> parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
>>>
>>> # Build the model (cluster the data)
>>> clusters = KMeans.train(parsedData, 2, maxIterations=10,
>>>         runs=10, initializationMode="random")
>>>
>>> # Evaluate clustering by computing Within Set Sum of Squared Errors
>>> def error(point):
>>>     center = clusters.centers[clusters.predict(point)]
>>>     return sqrt(sum([x**2 for x in (point - center)]))
>>>
>>> WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
>>> print("Within Set Sum of Squared Error = " + str(WSSSE))
>>>
