On Saturday, April 4, 2015 at 2:22:38 AM UTC-4, Viral Shah wrote:
>
> I am changing the subject of this thread from GSOC to Spark. I was just 
> looking around and found this:
>
> https://github.com/d9w/Spark.jl 
>

Hey, wow, that's interesting. Is this an attempt to reimplement Spark or to create a binding?
 

>
> The real question is with all the various systems out there, what is the 
> level of abstraction that julia should work with. Julia's DataFrames is one 
> level of abstraction, which could also transparently map to csv files 
> (rather than doing readtable), or a database table, or an HBase table. Why 
> would Spark users want Julia, and why would Julia users want Spark? I guess 
> if we can nail this down - the rest of the integration is probably easy to 
> figure out.
>
 
As a potential user, I'll try to answer in a few parts.

There are currently three official language bindings (Java, Scala, Python), some 
unofficial ones as well, and R in the works. One thing users would want is 
whatever the other bindings get, but in the language they desire, with an 
abstraction similar enough to the other bindings that examples written in other 
languages can be readily translated to theirs.

Whatever the abstraction turns out to be, there are at least three big things 
that Spark offers: simplification, speed, and lazy evaluation. The abstraction 
should not make any of those cumbersome.
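To make the lazy-evaluation point concrete, here is a minimal, Spark-free Python sketch (the `LazySeq` class and its names are made up for illustration, not part of any Spark API): transformations like `map` only record a plan, and nothing executes until an action like `reduce` forces it.

```python
from functools import reduce as fold

class LazySeq:
    """Toy stand-in for an RDD: map() is lazy, reduce() forces evaluation."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = ops  # recorded transformations, not yet applied

    def map(self, f):
        # Lazy: just record the function; no work happens here.
        return LazySeq(self._data, self._ops + (f,))

    def reduce(self, f):
        # Action: run every element through the recorded pipeline, then fold.
        def apply_all(x):
            for op in self._ops:
                x = op(x)
            return x
        return fold(f, (apply_all(x) for x in self._data))

nums = LazySeq([1, 2, 3, 4])
pipeline = nums.map(lambda x: x * x)        # nothing computed yet
total = pipeline.reduce(lambda a, b: a + b) # evaluation happens here
print(total)  # 30
```

Whatever Julia abstraction is chosen would need a similarly cheap way to build up the plan without touching the data.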

For me, the advantage of Julia is the syntax, the speed, and the connection to 
all of the Julia packages, and through them the community of Julia package 
authors. The advantage of Spark is the machinery of Spark, access to MLlib, and 
likewise the community of Spark users.

How about an example? This is straight from the Spark examples -- good old 
K-means. I'm assuming the Python binding, since Julia and Python are probably 
the most alike. How would we expect this to look in Julia?

from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt

# Load and parse the data (`sc` is the SparkContext provided by the pyspark shell)
data = sc.textFile("data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
        runs=10, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
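For comparison, here is what that pipeline computes in plain Python with no Spark at all -- the data points and cluster centers below are made up for illustration (in the real example the centers come from `KMeans.train`). It isolates the per-point `error` and the reduce-to-WSSSE step that any Julia abstraction would need to express.

```python
from math import sqrt

# Toy stand-in for data/mllib/kmeans_data.txt
lines = ["0.0 0.0", "0.1 0.1", "9.0 9.0", "9.1 9.1"]
parsed = [tuple(float(x) for x in line.split(" ")) for line in lines]

# Pretend these centers came from a trained model
centers = [(0.05, 0.05), (9.05, 9.05)]

def predict(point):
    # Index of the nearest center (by squared Euclidean distance)
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centers[i])))

def error(point):
    # Distance from a point to its nearest center
    center = centers[predict(point)]
    return sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))

# The map-then-reduce from the Spark example, as a plain sum
wssse = sum(error(p) for p in parsed)
print("Within Set Sum of Squared Error = %.4f" % wssse)  # 0.2828
```

A Julia version would presumably look much the same at this level -- the interesting question is what replaces `sc.textFile` and the RDD `map`/`reduce` calls.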



