Hello,

I am a newbie to Spark MLlib and ran into a curious case while following the
instructions on the page below.

http://spark.apache.org/docs/latest/mllib-naive-bayes.html

I ran a test program on my local machine using some data.

import org.apache.spark.{SparkConf, SparkContext}

val spConfig = new SparkConf().setMaster("local").setAppName("SparkNaiveBayes")
val sc = new SparkContext(spConfig)

The test data is shown below. It has three labeled categories (0.0, 1.0 and
2.0) that I wanted to predict.

 1  LabeledPoint(0.0, [4.9,3.0,1.4,0.2])
 2  LabeledPoint(0.0, [4.6,3.4,1.4,0.3])
 3  LabeledPoint(0.0, [5.7,4.4,1.5,0.4])
 4  LabeledPoint(0.0, [5.2,3.4,1.4,0.2])
 5  LabeledPoint(0.0, [4.7,3.2,1.6,0.2])
 6  LabeledPoint(0.0, [4.8,3.1,1.6,0.2])
 7  LabeledPoint(0.0, [5.1,3.8,1.9,0.4])
 8  LabeledPoint(0.0, [4.8,3.0,1.4,0.3])
 9  LabeledPoint(0.0, [5.0,3.3,1.4,0.2])
10  LabeledPoint(1.0, [6.6,2.9,4.6,1.3])
11  LabeledPoint(1.0, [5.2,2.7,3.9,1.4])
12  LabeledPoint(1.0, [5.6,2.5,3.9,1.1])
13  LabeledPoint(1.0, [6.4,2.9,4.3,1.3])
14  LabeledPoint(1.0, [6.6,3.0,4.4,1.4])
15  LabeledPoint(1.0, [6.0,2.7,5.1,1.6])
16  LabeledPoint(1.0, [5.5,2.6,4.4,1.2])
17  LabeledPoint(1.0, [5.8,2.6,4.0,1.2])
18  LabeledPoint(1.0, [5.7,2.9,4.2,1.3])
19  LabeledPoint(1.0, [5.7,2.8,4.1,1.3])
20  LabeledPoint(2.0, [6.3,2.9,5.6,1.8])
21  LabeledPoint(2.0, [6.5,3.0,5.8,2.2])
22  LabeledPoint(2.0, [6.5,3.0,5.5,1.8])
23  LabeledPoint(2.0, [6.7,3.3,5.7,2.1])
24  LabeledPoint(2.0, [7.4,2.8,6.1,1.9])
25  LabeledPoint(2.0, [6.3,3.4,5.6,2.4])
26  LabeledPoint(2.0, [6.0,3.0,4.8,1.8])
27  LabeledPoint(2.0, [6.8,3.2,5.9,2.3])

The predictions from NaiveBayes are below. Compared with the test data, only
two predictions (#11 and #15) differ from the actual labels.

 1  0.0
 2  0.0
 3  0.0
 4  0.0
 5  0.0
 6  0.0
 7  0.0
 8  0.0
 9  0.0
10  1.0
11  2.0
12  1.0
13  1.0
14  1.0
15  2.0
16  1.0
17  1.0
18  1.0
19  1.0
20  2.0
21  2.0
22  2.0
23  2.0
24  2.0
25  2.0
26  2.0
27  2.0
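
As a cross-check on the comparison above, here is a minimal plain-Scala sketch (no Spark needed) that re-enters the 27 labels and 27 predictions listed in this message and reports which rows differ. The object and value names (PredictionCheck, mismatches, accuracy) are only illustrative and are not part of the original test program.

```scala
object PredictionCheck {
  // Labels of the 27 test points, in the order listed in this message:
  // rows 1-9 are 0.0, rows 10-19 are 1.0, rows 20-27 are 2.0.
  val labels: Seq[Double] =
    Seq.fill(9)(0.0) ++ Seq.fill(10)(1.0) ++ Seq.fill(8)(2.0)

  // Predictions in the same row order, copied from the listing above.
  val predictions: Seq[Double] =
    Seq.fill(9)(0.0) ++
      Seq(1.0, 2.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0) ++
      Seq.fill(8)(2.0)

  // 1-based row numbers where the prediction differs from the label.
  val mismatches: Seq[Int] =
    labels.zip(predictions).zipWithIndex.collect {
      case ((label, predicted), i) if label != predicted => i + 1
    }

  // Fraction of rows predicted correctly.
  val accuracy: Double =
    (labels.size - mismatches.size).toDouble / labels.size

  def main(args: Array[String]): Unit = {
    println(s"rows that differ: ${mismatches.mkString(", ")}")
    println(f"accuracy: $accuracy%.4f")
  }
}
```

Run as-is, it reports rows 11 and 15 as the only differences, i.e. 25 of 27 predictions match.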

After combining the test RDD and the prediction RDD with zip, I got the following.

 1  (0.0,0.0)
 2  (0.0,0.0)
 3  (0.0,0.0)
 4  (0.0,0.0)
 5  (0.0,0.0)
 6  (0.0,0.0)
 7  (0.0,0.0)
 8  (0.0,0.0)
 9  (0.0,1.0)
10  (0.0,1.0)
11  (0.0,1.0)
12  (1.0,1.0)
13  (1.0,1.0)
14  (2.0,1.0)
15  (1.0,1.0)
16  (1.0,2.0)
17  (1.0,2.0)
18  (1.0,2.0)
19  (1.0,2.0)
20  (2.0,2.0)
21  (2.0,2.0)
22  (2.0,2.0)
23  (2.0,2.0)
24  (2.0,2.0)
25  (2.0,2.0)
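
For comparison, the MLlib guide linked above builds its label/prediction pairs with a single map over the test RDD instead of zipping two separately computed RDDs. Below is a minimal sketch of that pattern. It assumes a Spark 1.x MLlib setup, trains on the same handful of points it predicts (only to keep the example self-contained), and uses hypothetical names (PairBySingleMap, pairs). It is offered as a point of comparison, not as a diagnosis of the missing pairs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object PairBySingleMap {
  // Returns one (label, prediction) pair per input point.
  def pairs(): Array[(Double, Double)] = {
    val conf = new SparkConf().setMaster("local").setAppName("SparkNaiveBayes")
    val sc = new SparkContext(conf)
    try {
      // Six of the points from this message (rows 1, 2, 10, 11, 20, 21).
      // Assumption: they serve as both training and test data here.
      val data = sc.parallelize(Seq(
        LabeledPoint(0.0, Vectors.dense(4.9, 3.0, 1.4, 0.2)),
        LabeledPoint(0.0, Vectors.dense(4.6, 3.4, 1.4, 0.3)),
        LabeledPoint(1.0, Vectors.dense(6.6, 2.9, 4.6, 1.3)),
        LabeledPoint(1.0, Vectors.dense(5.2, 2.7, 3.9, 1.4)),
        LabeledPoint(2.0, Vectors.dense(6.3, 2.9, 5.6, 1.8)),
        LabeledPoint(2.0, Vectors.dense(6.5, 3.0, 5.8, 2.2))
      ))

      val model = NaiveBayes.train(data, lambda = 1.0)

      // Label and prediction come from the same element, so every input
      // point yields exactly one pair and the two values cannot drift apart.
      data.map(p => (p.label, model.predict(p.features))).collect()
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    pairs().zipWithIndex.foreach { case (pair, i) => println(s"${i + 1}  $pair") }
}
```

Because each pair is produced from one element of one RDD, the number of pairs always equals the number of input points. RDD.zip, by contrast, is documented as assuming both RDDs have the same number of partitions and the same number of elements in each partition.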

I expected 27 pairs, but only 25 came back, so two results were lost.
Could someone please point out what I missed here?

Regards,
xj
