Re: evaluating classification accuracy

2014-07-30 Thread SK
I am using 1.0.1 and running locally (I am not providing any master URL),
but zip() still does not produce the correct count, as mentioned above, so
I am not sure the issue has been fixed in 1.0.1. However, instead of zip,
I am now using the code that Sean mentioned and am getting the correct
count, so the issue is resolved.
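Sean's exact code is not quoted in this excerpt, so the following is only a guess at the approach: the usual pattern that avoids zip is to compute each prediction next to its label in a single map over the test data, so the pairing can never drift out of alignment. A minimal self-contained sketch (LabeledPoint and model here are simplified stand-ins, not the real MLlib classes):

```scala
// Simplified stand-ins for MLlib's LabeledPoint and a trained model
// (hypothetical; used only to make the sketch self-contained).
case class LabeledPoint(label: Double, features: Array[Double])

object model {
  // Toy decision rule standing in for a trained classifier.
  def predict(features: Array[Double]): Double =
    if (features.sum > 0) 1.0 else 0.0
}

val test = Seq(
  LabeledPoint(1.0, Array(0.5, 0.5)),
  LabeledPoint(0.0, Array(-1.0, 0.2)),
  LabeledPoint(1.0, Array(2.0, -0.5))
)

// Pair each prediction with its label in one pass over the data,
// instead of zipping two separately computed collections.
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
println(predictionAndLabel.size) // always equals test.size
```

Because each (prediction, label) pair is built from the same record, the result is guaranteed to have exactly one pair per test example.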

thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/evaluating-classification-accuracy-tp10822p10980.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


evaluating classification accuracy

2014-07-28 Thread SK
Hi,

In order to evaluate the ML classification accuracy, I am zipping the
predictions and the test labels as follows and then comparing the pairs in
predictionAndLabel:

val prediction = model.predict(test.map(_.features))
val predictionAndLabel = prediction.zip(test.map(_.label))


However, I am finding that predictionAndLabel.count() has fewer elements
than test.count(). For example, my test set has 43 elements, but
predictionAndLabel has only 38 pairs. I have tried other samples and I
always get fewer elements after zipping.

Does zipping the two RDDs cause any compression? Or is this because of the
distributed nature of the algorithm (even though I am running it in local
mode on a single machine)? In order to get the correct accuracy, I need the
above comparison to be done on the entire test data by a single node (my
data is quite small). How can I ensure that?
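Once the pairs line up, the accuracy itself is a single filter-and-count. A minimal sketch, with hard-coded (prediction, label) pairs standing in for predictionAndLabel (the values are invented for illustration):

```scala
// Hard-coded (prediction, label) pairs standing in for predictionAndLabel.
val predictionAndLabel = Seq((1.0, 1.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0))

// Accuracy = fraction of pairs where the prediction matches the label.
val correct = predictionAndLabel.count { case (pred, label) => pred == label }
val accuracy = correct.toDouble / predictionAndLabel.size
println(accuracy) // 0.75
```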

thanks 








Re: evaluating classification accuracy

2014-07-28 Thread Xiangrui Meng
Are you using 1.0.0? There was a bug, which was fixed in 1.0.1 and
master. If you don't want to switch to 1.0.1 or master, try to cache
and count test first. -Xiangrui
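To illustrate why caching first can matter, here is a plain-Scala sketch (not Spark): a lazily re-evaluated collection stands in for an uncached RDD whose lineage may be recomputed, possibly differently, on each action:

```scala
// A lazy "dataset" whose filter behaves differently on every traversal,
// mimicking an uncached RDD that is recomputed on each action.
var calls = 0
def uncached = (1 to 10).view.filter { _ => calls += 1; calls % 4 != 0 }

val pass1 = uncached.toList // first traversal keeps 8 elements
val pass2 = uncached.toList // second traversal keeps 7 different elements
println(pass1.size) // 8
println(pass2.size) // 7 -- the two traversals disagree, like a bad zip

// Materializing once -- the analogue of cache() + count() -- freezes the
// data, so every later use sees the same elements.
val frozen = uncached.toList
val zipped = frozen.zip(frozen.map(_ * 2.0))
println(zipped.size == frozen.size) // true: the counts now line up
```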
