Re: Running out of memory Naive Bayes

2014-04-27 Thread John King
I'm already using the SparseVector class.

~200 labels
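
For scale: if the trainer aggregates one dense array of per-feature counts
per label (an assumption about the MLlib implementation at the time, not
something confirmed in this thread), that alone comes to roughly

    200 labels x 2,357,815 features x 8 bytes/double ~= 3.8 GB

for a single aggregation buffer, no matter how sparse the individual input
vectors are. That would be consistent with running out of memory even on
30 GB of RAM.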


On Sun, Apr 27, 2014 at 12:26 AM, Xiangrui Meng  wrote:

> How many labels does your dataset have? -Xiangrui
>
> On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai  wrote:
> > Which version of mllib are you using? For Spark 1.0, mllib will
> > support sparse feature vectors, which will improve performance a lot
> > when computing the distance between points and centroids.
> >
> > Sincerely,
> >
> > DB Tsai
> > ---
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> >
> > On Sat, Apr 26, 2014 at 5:49 AM, John King 
> wrote:
> >> I'm just wondering: are the SparseVector calculations really taking the
> >> sparsity into account, or just converting to dense?
> >>
> >>
> >> On Fri, Apr 25, 2014 at 10:06 PM, John King <
> usedforprinting...@gmail.com>
> >> wrote:
> >>>
> >>> I've been trying to use the Naive Bayes classifier. Each example in the
> >>> dataset has about 2 million features, only about 20-50 of which are
> >>> non-zero, so the vectors are very sparse. I keep running out of memory
> >>> though, even for about 1000 examples on 30 GB of RAM, while the entire
> >>> dataset is 4 million examples. I would also like to note that I'm using
> >>> the sparse vector class.
> >>
> >>
>


Re: Running out of memory Naive Bayes

2014-04-26 Thread John King
I'm just wondering: are the SparseVector calculations really taking the
sparsity into account, or just converting to dense?
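
On the storage side, at least, the representation really is sparse. A
minimal sketch (this assumes the Spark 1.0 MLlib API, using Vectors.sparse
from org.apache.spark.mllib.linalg; the specific numbers are made up for
illustration):

import org.apache.spark.mllib.linalg.Vectors

// A 2,357,815-dimensional vector with three non-zeros stores only the two
// three-element arrays below (indices and values), not 2,357,815 doubles.
val v = Vectors.sparse(2357815, Array(5, 42, 2000000), Array(1.0, 2.0, 3.0))
println(v.size)  // prints 2357815

Whether each algorithm's inner loops exploit that layout is a separate,
per-implementation question, which is what the question above is really
asking.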


On Fri, Apr 25, 2014 at 10:06 PM, John King wrote:

> I've been trying to use the Naive Bayes classifier. Each example in the
> dataset has about 2 million features, only about 20-50 of which are
> non-zero, so the vectors are very sparse. I keep running out of memory
> though, even for about 1000 examples on 30 GB of RAM, while the entire
> dataset is 4 million examples. I would also like to note that I'm using
> the sparse vector class.
>


Running out of memory Naive Bayes

2014-04-25 Thread John King
I've been trying to use the Naive Bayes classifier. Each example in the
dataset has about 2 million features, only about 20-50 of which are
non-zero, so the vectors are very sparse. I keep running out of memory
though, even for about 1000 examples on 30 GB of RAM, while the entire
dataset is 4 million examples. I would also like to note that I'm using
the sparse vector class.


Re: Spark mllib throwing error

2014-04-24 Thread John King
It just displayed this error and stopped on its own. Do the lines of code
mentioned in the error have anything to do with it?


On Thu, Apr 24, 2014 at 7:54 PM, Xiangrui Meng  wrote:

> I don't see anything wrong with your code. Could you do points.count()
> to see how many training examples you have? Also, make sure you don't
> have negative feature values. The error message you sent did not say
> NaiveBayes went wrong, but the Spark shell was killed. -Xiangrui
>
> On Thu, Apr 24, 2014 at 4:05 PM, John King 
> wrote:
> > In the other thread, I had an issue with Python. For this one, I tried
> > switching to Scala. The code is:
> >
> > import org.apache.spark.mllib.regression.LabeledPoint
> > import org.apache.spark.mllib.linalg.SparseVector
> > import org.apache.spark.mllib.classification.NaiveBayes
> > import scala.collection.mutable.ArrayBuffer
> >
> > // Despite its name, this returns true for lines that still have content
> > // once trailing whitespace is stripped.
> > def isEmpty(a: String): Boolean = a != null &&
> >   !a.replaceAll("""(?m)\s+$""", "").isEmpty()
> >
> > def parsePoint(a: String): LabeledPoint = {
> >   val values = a.split('\t')
> >   // features are space-separated "index:value" pairs
> >   val feat = values(1).split(' ')
> >   val indices = ArrayBuffer.empty[Int]
> >   val featValues = ArrayBuffer.empty[Double]
> >   for (f <- feat) {
> >     val q = f.split(':')
> >     if (q.length == 2) {
> >       indices += q(0).toInt
> >       featValues += q(1).toDouble
> >     }
> >   }
> >   val vector = new SparseVector(2357815, indices.toArray, featValues.toArray)
> >   LabeledPoint(values(0).toDouble, vector)
> > }
> >
> > val data = sc.textFile("data.txt")
> > val empty = data.filter(isEmpty)
> > val points = empty.map(parsePoint)
> > points.cache()
> > val model = new NaiveBayes().run(points)
> >
> >
> >
> > On Thu, Apr 24, 2014 at 6:57 PM, Xiangrui Meng  wrote:
> >>
> >> Do you mind sharing more code and error messages? The information you
> >> provided is too little to identify the problem. -Xiangrui
> >>
> >> On Thu, Apr 24, 2014 at 1:55 PM, John King <
> usedforprinting...@gmail.com>
> >> wrote:
> >> > Last command was:
> >> >
> >> > val model = new NaiveBayes().run(points)
> >> >
> >> >
> >> >
> >> > On Thu, Apr 24, 2014 at 4:27 PM, Xiangrui Meng 
> wrote:
> >> >>
> >> >> Could you share the command you used and more of the error message?
> >> >> Also, is it an MLlib specific problem? -Xiangrui
> >> >>
> >> >> On Thu, Apr 24, 2014 at 11:49 AM, John King
> >> >>  wrote:
> >> >> > ./spark-shell: line 153: 17654 Killed
> >> >> > $FWDIR/bin/spark-class org.apache.spark.repl.Main "$@"
> >> >> >
> >> >> >
> >> >> > Any ideas?
> >> >
> >> >
> >
> >
>


Re: Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread John King
Also, when will the official 1.0 be released?


On Thu, Apr 24, 2014 at 7:04 PM, John King wrote:

> I was able to run simple examples as well.
>
> Which version of Spark? Did you use the most recent commit or from
> branch-1.0?
>
> Some background: I tried to build both on Amazon EC2, but the master kept
> disconnecting from the client, and executors failed after connecting. So I
> tried to just use one machine with a lot of RAM. I can set up a cluster on
> the released 0.9.1, but I need the sparse vector representations, as my
> data is very sparse. Is there any way I can access a version of 1.0 that
> doesn't have to be compiled and is proven to work on EC2?
>
> My code:
>
> import numpy
> from numpy import array, dot, shape
> from pyspark import SparkContext
> from math import exp, log
> from pyspark.mllib.classification import NaiveBayes
> from pyspark.mllib.linalg import SparseVector
> from pyspark.mllib.regression import LabeledPoint, LinearModel
>
> def isSpace(line):
>     # True for blank or whitespace-only lines
>     return line.isspace() or not line.strip()
>
> sizeOfDict = 2357815
>
> def parsePoint(line):
>     values = line.split('\t')
>     # features are space-separated "index:value" pairs
>     feat = values[1].split(' ')
>     features = {}
>     for f in feat:
>         f = f.split(':')
>         if len(f) > 1:
>             # SparseVector expects int indices and float values
>             features[int(f[0])] = float(f[1])
>     return LabeledPoint(float(values[0]), SparseVector(sizeOfDict, features))
>
> data = sc.textFile(".../data.txt", 6)
> # I had an extra new line between each line
> empty = data.filter(lambda x: not isSpace(x))
> points = empty.map(parsePoint)
> model = NaiveBayes.train(points)
>
>
>
> On Thu, Apr 24, 2014 at 6:55 PM, Xiangrui Meng  wrote:
>
>> I tried locally with the example described in the latest guide:
>> http://54.82.157.211:4000/mllib-naive-bayes.html , and it worked fine.
>> Do you mind sharing the code you used? -Xiangrui
>>
>> On Thu, Apr 24, 2014 at 1:57 PM, John King 
>> wrote:
>> > Yes, I got it running for a large RDD (~7 million lines) and mapping
>> > over it. I just received this error when trying to classify.
>> >
>> >
>> > On Thu, Apr 24, 2014 at 4:32 PM, Xiangrui Meng 
>> wrote:
>> >>
>> >> Is your Spark cluster running? Try to start with generating simple
>> >> RDDs and counting. -Xiangrui
>> >>
>> >> On Thu, Apr 24, 2014 at 11:38 AM, John King
>> >>  wrote:
>> >> > I receive this error:
>> >> >
>> >> > Traceback (most recent call last):
>> >> >
>> >> >   File "", line 1, in 
>> >> >
>> >> >   File
>> >> >
>> "/home/ubuntu/spark-1.0.0-rc2/python/pyspark/mllib/classification.py",
>> >> > line
>> >> > 178, in train
>> >> >
>> >> > ans = sc._jvm.PythonMLLibAPI().trainNaiveBayes(dataBytes._jrdd,
>> >> > lambda_)
>> >> >
>> >> >   File
>> >> >
>> >> >
>> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
>> >> > line 535, in __call__
>> >> >
>> >> >   File
>> >> >
>> >> >
>> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
>> >> > line 368, in send_command
>> >> >
>> >> >   File
>> >> >
>> >> >
>> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
>> >> > line 361, in send_command
>> >> >
>> >> >   File
>> >> >
>> >> >
>> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
>> >> > line 317, in _get_connection
>> >> >
>> >> >   File
>> >> >
>> >> >
>> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
>> >> > line 324, in _create_connection
>> >> >
>> >> >   File
>> >> >
>> >> >
>> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
>> >> > line 431, in start
>> >> >
>> >> > py4j.protocol.Py4JNetworkError: An error occurred while trying to
>> >> > connect to
>> >> > the Java server
>> >
>> >
>>
>
>


Re: Spark mllib throwing error

2014-04-24 Thread John King
In the other thread, I had an issue with Python. For this one, I tried
switching to Scala. The code is:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.mllib.classification.NaiveBayes
import scala.collection.mutable.ArrayBuffer

// Despite its name, this returns true for lines that still have content
// once trailing whitespace is stripped.
def isEmpty(a: String): Boolean = a != null &&
  !a.replaceAll("""(?m)\s+$""", "").isEmpty()

def parsePoint(a: String): LabeledPoint = {
  val values = a.split('\t')
  // features are space-separated "index:value" pairs
  val feat = values(1).split(' ')
  val indices = ArrayBuffer.empty[Int]
  val featValues = ArrayBuffer.empty[Double]
  for (f <- feat) {
    val q = f.split(':')
    if (q.length == 2) {
      indices += q(0).toInt
      featValues += q(1).toDouble
    }
  }
  val vector = new SparseVector(2357815, indices.toArray, featValues.toArray)
  LabeledPoint(values(0).toDouble, vector)
}

val data = sc.textFile("data.txt")
val empty = data.filter(isEmpty)
val points = empty.map(parsePoint)
points.cache()
val model = new NaiveBayes().run(points)
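
Before training, a couple of sanity checks along the lines of what Xiangrui
suggests later in this thread (a sketch only; it assumes points is the RDD
built above, with SparseVector features):

// How many examples survived the filter and parse steps?
println(points.count())
// How many distinct labels are there?
println(points.map(_.label).distinct().count())
// NaiveBayes requires non-negative feature values; inspect only the stored
// (non-zero) entries to avoid densifying the 2,357,815-dimensional vectors.
val numNegative = points.filter(
  _.features.asInstanceOf[SparseVector].values.exists(_ < 0)).count()
println(numNegative)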


On Thu, Apr 24, 2014 at 6:57 PM, Xiangrui Meng  wrote:

> Do you mind sharing more code and error messages? The information you
> provided is too little to identify the problem. -Xiangrui
>
> On Thu, Apr 24, 2014 at 1:55 PM, John King 
> wrote:
> > Last command was:
> >
> > val model = new NaiveBayes().run(points)
> >
> >
> >
> > On Thu, Apr 24, 2014 at 4:27 PM, Xiangrui Meng  wrote:
> >>
> >> Could you share the command you used and more of the error message?
> >> Also, is it an MLlib specific problem? -Xiangrui
> >>
> >> On Thu, Apr 24, 2014 at 11:49 AM, John King
> >>  wrote:
> >> > ./spark-shell: line 153: 17654 Killed
> >> > $FWDIR/bin/spark-class org.apache.spark.repl.Main "$@"
> >> >
> >> >
> >> > Any ideas?
> >
> >
>


Re: Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread John King
I was able to run simple examples as well.

Which version of Spark? Did you use the most recent commit or from
branch-1.0?

Some background: I tried to build both on Amazon EC2, but the master kept
disconnecting from the client, and executors failed after connecting. So I
tried to just use one machine with a lot of RAM. I can set up a cluster on
the released 0.9.1, but I need the sparse vector representations, as my
data is very sparse. Is there any way I can access a version of 1.0 that
doesn't have to be compiled and is proven to work on EC2?

My code:

import numpy
from numpy import array, dot, shape
from pyspark import SparkContext
from math import exp, log
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint, LinearModel

def isSpace(line):
    # True for blank or whitespace-only lines
    return line.isspace() or not line.strip()

sizeOfDict = 2357815

def parsePoint(line):
    values = line.split('\t')
    # features are space-separated "index:value" pairs
    feat = values[1].split(' ')
    features = {}
    for f in feat:
        f = f.split(':')
        if len(f) > 1:
            # SparseVector expects int indices and float values
            features[int(f[0])] = float(f[1])
    return LabeledPoint(float(values[0]), SparseVector(sizeOfDict, features))

data = sc.textFile(".../data.txt", 6)
# I had an extra new line between each line
empty = data.filter(lambda x: not isSpace(x))
points = empty.map(parsePoint)
model = NaiveBayes.train(points)



On Thu, Apr 24, 2014 at 6:55 PM, Xiangrui Meng  wrote:

> I tried locally with the example described in the latest guide:
> http://54.82.157.211:4000/mllib-naive-bayes.html , and it worked fine.
> Do you mind sharing the code you used? -Xiangrui
>
> On Thu, Apr 24, 2014 at 1:57 PM, John King 
> wrote:
> > Yes, I got it running for a large RDD (~7 million lines) and mapping
> > over it. I just received this error when trying to classify.
> >
> >
> > On Thu, Apr 24, 2014 at 4:32 PM, Xiangrui Meng  wrote:
> >>
> >> Is your Spark cluster running? Try to start with generating simple
> >> RDDs and counting. -Xiangrui
> >>
> >> On Thu, Apr 24, 2014 at 11:38 AM, John King
> >>  wrote:
> >> > I receive this error:
> >> >
> >> > Traceback (most recent call last):
> >> >
> >> >   File "", line 1, in 
> >> >
> >> >   File
> >> > "/home/ubuntu/spark-1.0.0-rc2/python/pyspark/mllib/classification.py",
> >> > line
> >> > 178, in train
> >> >
> >> > ans = sc._jvm.PythonMLLibAPI().trainNaiveBayes(dataBytes._jrdd,
> >> > lambda_)
> >> >
> >> >   File
> >> >
> >> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> >> > line 535, in __call__
> >> >
> >> >   File
> >> >
> >> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> >> > line 368, in send_command
> >> >
> >> >   File
> >> >
> >> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> >> > line 361, in send_command
> >> >
> >> >   File
> >> >
> >> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> >> > line 317, in _get_connection
> >> >
> >> >   File
> >> >
> >> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> >> > line 324, in _create_connection
> >> >
> >> >   File
> >> >
> >> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> >> > line 431, in start
> >> >
> >> > py4j.protocol.Py4JNetworkError: An error occurred while trying to
> >> > connect to
> >> > the Java server
> >
> >
>


Re: Deploying a python code on a spark EC2 cluster

2014-04-24 Thread John King
This happens to me when using the EC2 scripts for the recent v1.0.0-rc2
release. The master connects and then disconnects immediately, eventually
saying "Master disconnected from cluster".


On Thu, Apr 24, 2014 at 4:01 PM, Matei Zaharia wrote:

> Did you launch this using our EC2 scripts (
> http://spark.apache.org/docs/latest/ec2-scripts.html) or did you manually
> set up the daemons? My guess is that their hostnames are not being resolved
> properly on all nodes, so executor processes can’t connect back to your
> driver app. This error message indicates that:
>
> 14/04/24 09:00:49 WARN util.Utils: Your hostname, spark-node resolves to a
> loopback address: 127.0.0.1; using 10.74.149.251 instead (on interface
> eth0)
> 14/04/24 09:00:49 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind
> to
> another address
>
> If you launch with your EC2 scripts, or don’t manually change the
> hostnames, this should not happen.
>
> Matei
>
> On Apr 24, 2014, at 11:36 AM, John King 
> wrote:
>
> Same problem.
>
>
> On Thu, Apr 24, 2014 at 10:54 AM, Shubhabrata wrote:
>
>> Moreover, it seems all the workers are registered and have sufficient
>> memory (2.7 GB, whereas I have asked for 512 MB). The UI also shows the
>> jobs are running on the slaves. But on the terminal it is still the same
>> error: "Initial job has not accepted any resources; check your cluster UI
>> to ensure that workers are registered and have sufficient memory"
>>
>> Please see the screenshot. Thanks
>>
>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n4761/33.png>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Deploying-a-python-code-on-a-spark-EC2-cluster-tp4758p4761.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
>


Re: Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread John King
Yes, I got it running for a large RDD (~7 million lines) and mapping over
it. I just received this error when trying to classify.


On Thu, Apr 24, 2014 at 4:32 PM, Xiangrui Meng  wrote:

> Is your Spark cluster running? Try to start with generating simple
> RDDs and counting. -Xiangrui
>
> On Thu, Apr 24, 2014 at 11:38 AM, John King
>  wrote:
> > I receive this error:
> >
> > Traceback (most recent call last):
> >
> >   File "", line 1, in 
> >
> >   File
> > "/home/ubuntu/spark-1.0.0-rc2/python/pyspark/mllib/classification.py",
> line
> > 178, in train
> >
> > ans = sc._jvm.PythonMLLibAPI().trainNaiveBayes(dataBytes._jrdd,
> lambda_)
> >
> >   File
> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> > line 535, in __call__
> >
> >   File
> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> > line 368, in send_command
> >
> >   File
> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> > line 361, in send_command
> >
> >   File
> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> > line 317, in _get_connection
> >
> >   File
> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> > line 324, in _create_connection
> >
> >   File
> >
> "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
> > line 431, in start
> >
> > py4j.protocol.Py4JNetworkError: An error occurred while trying to
> connect to
> > the Java server
>


Re: Spark mllib throwing error

2014-04-24 Thread John King
Last command was:

val model = new NaiveBayes().run(points)


On Thu, Apr 24, 2014 at 4:27 PM, Xiangrui Meng  wrote:

> Could you share the command you used and more of the error message?
> Also, is it an MLlib specific problem? -Xiangrui
>
> On Thu, Apr 24, 2014 at 11:49 AM, John King
>  wrote:
> > ./spark-shell: line 153: 17654 Killed
> > $FWDIR/bin/spark-class org.apache.spark.repl.Main "$@"
> >
> >
> > Any ideas?
>


Spark mllib throwing error

2014-04-24 Thread John King
./spark-shell: line 153: 17654 Killed
$FWDIR/bin/spark-class org.apache.spark.repl.Main "$@"


Any ideas?


Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread John King
I receive this error:

Traceback (most recent call last):

  File "", line 1, in 

  File
"/home/ubuntu/spark-1.0.0-rc2/python/pyspark/mllib/classification.py", line
178, in train

ans = sc._jvm.PythonMLLibAPI().trainNaiveBayes(dataBytes._jrdd, lambda_)

  File
"/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
line 535, in __call__

  File
"/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
line 368, in send_command

  File
"/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
line 361, in send_command

  File
"/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
line 317, in _get_connection

  File
"/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
line 324, in _create_connection

  File
"/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py",
line 431, in start

py4j.protocol.Py4JNetworkError: An error occurred while trying to connect
to the Java server


Re: Deploying a python code on a spark EC2 cluster

2014-04-24 Thread John King
Same problem.


On Thu, Apr 24, 2014 at 10:54 AM, Shubhabrata  wrote:

> Moreover, it seems all the workers are registered and have sufficient
> memory (2.7 GB, whereas I have asked for 512 MB). The UI also shows the
> jobs are running on the slaves. But on the terminal it is still the same
> error: "Initial job has not accepted any resources; check your cluster UI
> to ensure that workers are registered and have sufficient memory"
>
> Please see the screenshot. Thanks
>
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n4761/33.png>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Deploying-a-python-code-on-a-spark-EC2-cluster-tp4758p4761.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>