I was able to run simple examples as well.

Which version of Spark? Did you use the most recent commit, or build from
branch-1.0?

Some background: I tried to build both on Amazon EC2, but the master kept
disconnecting from the client and executors failed after connecting, so I
tried to just use one machine with a lot of RAM. I can set up a cluster on
the released 0.9.1, but I need the sparse vector representation because my
data is very sparse. Is there any way I can access a version of 1.0 that
doesn't have to be compiled and is proven to work on EC2?

My code:

import numpy
from numpy import array, dot, shape
from pyspark import SparkContext
from math import exp, log
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint, LinearModel

def isSpace(line):
    # True for empty or whitespace-only lines
    return not line.strip()

sizeOfDict = 2357815

def parsePoint(line):
    values = line.split('\t')
    feat = values[1].split(' ')
    features = {}
    for f in feat:
        f = f.split(':')
        if len(f) > 1:
            # SparseVector expects integer indices and float values
            features[int(f[0])] = float(f[1])
    return LabeledPoint(float(values[0]),
                        SparseVector(sizeOfDict, features))

data = sc.textFile(".../data.txt", 6)
# I had an extra new line between each line
empty = data.filter(lambda x: not isSpace(x))
points = empty.map(parsePoint)
model = NaiveBayes.train(points)
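For what it's worth, the parsing logic can be sanity-checked without a Spark
cluster by returning a plain dict instead of the MLlib types. This is just a
sketch assuming the input format implied by parsePoint: each line is
"label<TAB>index:value index:value ..." (parse_features is a hypothetical
helper name, not part of the code above):

# Standalone check of the parsePoint logic, no Spark required.
# Assumes lines look like "label<TAB>index:value index:value ...".
def parse_features(line):
    label_str, feat_str = line.split('\t')
    features = {}
    for f in feat_str.split(' '):
        parts = f.split(':')
        if len(parts) > 1:
            # same int/float conversion SparseVector needs
            features[int(parts[0])] = float(parts[1])
    return float(label_str), features

label, feats = parse_features("1\t3:0.5 17:2.0")
print(label, feats)  # 1.0 {3: 0.5, 17: 2.0}

If this prints the expected label and feature dict for a few sample lines
from data.txt, the failure is more likely in the cluster setup than in the
parsing.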



On Thu, Apr 24, 2014 at 6:55 PM, Xiangrui Meng <men...@gmail.com> wrote:

> I tried locally with the example described in the latest guide:
> http://54.82.157.211:4000/mllib-naive-bayes.html , and it worked fine.
> Do you mind sharing the code you used? -Xiangrui
>
> On Thu, Apr 24, 2014 at 1:57 PM, John King <usedforprinting...@gmail.com>
> wrote:
> > Yes, I got it running for large RDD (~7 million lines) and mapping. Just
> > received this error when trying to classify.
> >
> >
> > On Thu, Apr 24, 2014 at 4:32 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >>
> >> Is your Spark cluster running? Try to start with generating simple
> >> RDDs and counting. -Xiangrui
> >>
> >> On Thu, Apr 24, 2014 at 11:38 AM, John King
> >> <usedforprinting...@gmail.com> wrote:
> >> > I receive this error:
> >> >
> >> > Traceback (most recent call last):
> >> >
> >> >   File "<stdin>", line 1, in <module>
> >> >
> >> >   File
> >> > "/home/ubuntu/spark-1.0.0-rc2/python/pyspark/mllib/classification.py",
> >> > line
> >> > 178, in train
> >> >
> >> >     ans = sc._jvm.PythonMLLibAPI().trainNaiveBayes(dataBytes._jrdd,
> >> > lambda_)
> >> >
> >> >   File
> >> >
> >> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> >> > line 535, in __call__
> >> >
> >> >   File
> >> >
> >> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> >> > line 368, in send_command
> >> >
> >> >   File
> >> >
> >> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> >> > line 361, in send_command
> >> >
> >> >   File
> >> >
> >> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> >> > line 317, in _get_connection
> >> >
> >> >   File
> >> >
> >> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> >> > line 324, in _create_connection
> >> >
> >> >   File
> >> >
> >> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> >> > line 431, in start
> >> >
> >> > py4j.protocol.Py4JNetworkError: An error occurred while trying to
> >> > connect to
> >> > the Java server
> >
> >
>
