Hi Jey,

Spock.jl looks interesting. 

Am I right in understanding that this gives us something like a streaming
model for calling Julia methods from within Spark? I have yet to try it and
have only glanced at the code, but is a JuliaRDD.java file missing from the
repository?

It would be great if you could elaborate on the Spark-Julia interaction a
little more, and perhaps add a simple example.
I have also started a new thread here
https://groups.google.com/forum/#!topic/julia-users/S_Qcn0tVxJg to continue
the discussion.
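
In the meantime, here is my guess at the Julia worker side of such an
interaction, modeled purely on how PySpark's PythonRDD drives Python worker
processes -- none of this is taken from Spock.jl itself:

# Speculative sketch: the JVM launches a Julia worker process and streams
# one serialized partition over a pipe, reading the results back the same way.
function worker_loop(input::IO, output::IO, f::Function)
    while !eof(input)
        n = read(input, Int32)           # length-prefixed record
        record = read(input, n)          # raw bytes for one element
        result = f(record)               # user's closure (bytes in, bytes out)
        write(output, Int32(length(result)))
        write(output, result)
    end
    flush(output)
end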

Best,
Tanmay

On Tuesday, April 14, 2015 at 2:35:31 AM UTC+5:30, Jey Kottalam wrote:
>
> Hi julia-users,
>
> I have the beginnings of a Spark interface for Julia up at 
> https://github.com/jey/Spock.jl. This implementation follows the design 
> used in Spark's Python and R bindings. It so far has the core of the Spark 
> RDD interface implemented, but does not yet implement the additional 
> features of spark-core such as broadcast variables and accumulators. Adding 
> interfaces to other Spark components such as spark-mllib would be an 
> additional significant undertaking that should be revisited once the 
> remainder of spark-core has been implemented.
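>
> To give a flavor, here is a rough usage sketch -- the names below are
> illustrative of the PySpark-style design rather than the exact Spock.jl
> API, which may still change:
>
> using Spock
>
> sc = SparkContext()               # connect to the JVM driver
> rdd = parallelize(sc, 1:100)      # distribute a Julia collection
> doubled = map(rdd, x -> 2x)       # transformation (lazy)
> total = reduce(doubled, +)        # action: triggers the Spark job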
>
> If anyone is interested in working on this, I would be more than happy to 
> provide assistance and guidance.
>
> -Jey
>
>
>
> On Sun, Apr 5, 2015 at 1:59 AM, Viral Shah <vi...@mayin.org> wrote:
>
>
>> It would be nice to co-ordinate these efforts under the JuliaParallel 
>> organization.
>>
>> -viral
>>
>>
>> On Sunday, April 5, 2015 at 9:39:51 AM UTC+5:30, wil...@gmail.com wrote:
>>>
>>> Spark integration is a tricky thing. The Python and R bindings go to
>>> great lengths to map language-specific functions into Spark JVM library
>>> calls. I guess the same could be done with the JavaCall.jl package, in a
>>> manner similar to SparkR. Look at slide 20 from here:
>>> http://spark-summit.org/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf
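>>>
>>> For instance, JavaCall.jl can already invoke JVM classes directly. The
>>> java.lang.Math call below is straight from the JavaCall.jl
>>> documentation; the Spark-specific lines are untested speculation about
>>> how wrapping could start:
>>>
>>> using JavaCall
>>> JavaCall.init(["-Djava.class.path=/path/to/spark-assembly.jar"])
>>>
>>> JMath = @jimport java.lang.Math
>>> jcall(JMath, "sin", jdouble, (jdouble,), pi/2)        # returns 1.0
>>>
>>> # Speculative: the same mechanism pointed at Spark's Java API.
>>> JSparkConf = @jimport org.apache.spark.SparkConf
>>> conf = JSparkConf(())                                 # new SparkConf()
>>> conf = jcall(conf, "setAppName", JSparkConf, (JString,), "julia-demo")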
>>>
>>> Spark is a clever distributed data access paradigm which grew out of
>>> Hadoop's slowness and limitations. I believe that Julia could provide a
>>> competitive model for distributed data storage, given Julia's parallel
>>> computing approach. Right now, I am writing Julia bindings for Mesos.
>>> The idea is to provide, through a ClusterManager, access to any
>>> Mesos-supervised distributed system and to run Julia code in that
>>> environment. In conjunction with DistributedArrays and DataFrames, it
>>> will create a powerful toolbox for building distributed systems.
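>>>
>>> The ClusterManager surface itself is small. A skeleton of what the
>>> Mesos-backed manager might look like (the actual Mesos scheduling calls
>>> are elided; this only shows the shape of the interface):
>>>
>>> using Distributed
>>>
>>> struct MesosManager <: ClusterManager
>>>     ntasks::Int
>>> end
>>>
>>> function Distributed.launch(m::MesosManager, params::Dict,
>>>                             launched::Array, c::Condition)
>>>     # Ask Mesos for m.ntasks containers, start `julia --worker` in
>>>     # each, push one WorkerConfig per task onto `launched`, notify(c).
>>> end
>>>
>>> function Distributed.manage(m::MesosManager, id::Integer,
>>>                             config::WorkerConfig, op::Symbol)
>>>     # React to :register / :interrupt / :deregister as needed.
>>> end
>>>
>>> # addprocs(MesosManager(8)) would then bring up workers under Mesos.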
>>>
>>> After all, machine learning on the JVM, really?!
>>>
>>> On Saturday, April 4, 2015 at 11:21:35 AM UTC-4, Jeff Waller wrote:
>>>>
>>>>
>>>>
>>>> On Saturday, April 4, 2015 at 2:22:38 AM UTC-4, Viral Shah wrote:
>>>>>
>>>>> I am changing the subject of this thread from GSOC to Spark. I was 
>>>>> just looking around and found this:
>>>>>
>>>>> https://github.com/d9w/Spark.jl
>>>>>
>>>>
>>>> Hey, wow, that's interesting. Is this an attempt to reimplement Spark
>>>> or to create a binding?
>>>>
>>>>> The real question is with all the various systems out there, what is
>>>>> the level of abstraction that julia should work with. Julia's
>>>>> DataFrames is one level of abstraction, which could also transparently
>>>>> map to csv files (rather than doing readtable), or a database table,
>>>>> or an HBase table. Why would Spark users want Julia, and why would
>>>>> Julia users want Spark? I guess if we can nail this down - the rest of
>>>>> the integration is probably easy to figure out.
>>>>>
>>>>
>>>> As a potential user, I will try to answer in a few parts.
>>>>
>>>> There are currently 3 official language bindings (Java, Scala,
>>>> Python), some unofficial ones as well, and R in the works. One thing
>>>> that users would want is whatever the others get, but in the language
>>>> they desire, with an abstraction similar to the other language
>>>> bindings, so that examples in other languages can be readily
>>>> translated to theirs.
>>>>
>>>> Whatever the abstraction turns out to be, there are at least 3 big
>>>> things that Spark offers: simplification, speed, and lazy evaluation.
>>>> The abstraction should not make these cumbersome.
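>>>>
>>>> To illustrate the lazy-evaluation point with a toy Julia analogue (not
>>>> any real Spark binding): transformations only compose a plan, and the
>>>> action is what finally does the work.
>>>>
>>>> struct LazyRDD
>>>>     compute::Function            # () -> the materialized data
>>>> end
>>>>
>>>> lazy_map(r::LazyRDD, f) = LazyRDD(() -> map(f, r.compute()))
>>>> collect_action(r::LazyRDD) = r.compute()   # the only eager step
>>>>
>>>> r  = LazyRDD(() -> 1:1000)
>>>> r2 = lazy_map(r, x -> x^2)       # nothing computed yet
>>>> collect_action(r2)               # all the work happens here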
>>>>
>>>> For me, the advantage of Julia is the syntax, the speed, and the
>>>> connection to all of the Julia packages, and through that to the
>>>> community of Julia package authors. The advantage of Spark is the
>>>> machinery of Spark, access to MLlib, and likewise the community of
>>>> Spark users.
>>>>
>>>> How about an example? This one is simply from the Spark examples --
>>>> good old K-means. It assumes the Python binding, since Julia and
>>>> Python are probably the most alike. How would we expect this to look
>>>> in Julia?
>>>>
>>>> from pyspark.mllib.clustering import KMeans
>>>> from numpy import array
>>>> from math import sqrt
>>>>
>>>> # Load and parse the data
>>>> data = sc.textFile("data/mllib/kmeans_data.txt")
>>>> parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
>>>>
>>>> # Build the model (cluster the data)
>>>> clusters = KMeans.train(parsedData, 2, maxIterations=10,
>>>>                         runs=10, initializationMode="random")
>>>>
>>>> # Evaluate clustering by computing Within Set Sum of Squared Errors
>>>> def error(point):
>>>>     center = clusters.centers[clusters.predict(point)]
>>>>     return sqrt(sum([x**2 for x in (point - center)]))
>>>>
>>>> WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
>>>> print("Within Set Sum of Squared Error = " + str(WSSSE))
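>>>>
>>>> And my own guess at a Julia rendering, assuming a PySpark-like binding
>>>> -- every Spark-facing name below (SparkContext, text_file, rdd_map,
>>>> rdd_reduce, KMeans.train, predict) is invented for the comparison:
>>>>
>>>> sc = SparkContext()
>>>> data = text_file(sc, "data/mllib/kmeans_data.txt")
>>>> parsed = rdd_map(data, line -> parse.(Float64, split(line)))
>>>>
>>>> # Build the model (cluster the data)
>>>> clusters = KMeans.train(parsed, 2, max_iterations=10,
>>>>                         runs=10, initialization_mode="random")
>>>>
>>>> # Evaluate clustering by computing Within Set Sum of Squared Errors
>>>> function point_error(point)
>>>>     center = clusters.centers[predict(clusters, point) + 1]  # 1-based
>>>>     return sqrt(sum((point .- center) .^ 2))
>>>> end
>>>>
>>>> wssse = rdd_reduce(rdd_map(parsed, point_error), +)
>>>> println("Within Set Sum of Squared Error = $wssse")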
>>>>
