I tried my test case with Spark 1.0.1 and saw the same result (27 pairs
become 25 pairs after zip).

Could someone please check it?
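
In the meantime, one way to sidestep zip might be to compute each pair in a
single map over the test RDD, so the label and the prediction always come
from the same record. This is only a minimal sketch and not a diagnosis of
the missing pairs: the object name PairWithoutZip and the helper
predictionAndLabel are made up for illustration, the (prediction, label)
order is an assumption, and the three sample rows are #1, #10 and #20 from
the data quoted below.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of Spark or the original program.
object PairWithoutZip {
  // Build (prediction, label) from each record in one pass, so no zip
  // between two separately computed RDDs is needed.
  def predictionAndLabel(model: NaiveBayesModel,
                         test: RDD[LabeledPoint]): RDD[(Double, Double)] =
    test.map(p => (model.predict(p.features), p.label))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("PairWithoutZip")
    val sc = new SparkContext(conf)

    // Rows #1, #10 and #20 from the data in this thread, one per category.
    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(4.9, 3.0, 1.4, 0.2)),
      LabeledPoint(1.0, Vectors.dense(6.6, 2.9, 4.6, 1.3)),
      LabeledPoint(2.0, Vectors.dense(6.3, 2.9, 5.6, 1.8))
    )).cache()

    // Train with smoothing parameter lambda = 1.0 (an arbitrary choice here).
    val model = NaiveBayes.train(data, 1.0)

    // map is one-to-one, so the pair count should match the record count.
    val pairs = predictionAndLabel(model, data)
    println(s"records: ${data.count()}, pairs: ${pairs.count()}")

    sc.stop()
  }
}
```

Because map emits exactly one output per input record, the number of pairs
should equal the number of test records regardless of how the test RDD was
produced.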

Regards,
xj

On Thu, Jul 3, 2014 at 2:31 PM, Xiangrui Meng <men...@gmail.com> wrote:

> This is due to a bug in sampling, which was fixed in 1.0.1 and latest
> master. See https://github.com/apache/spark/pull/1234 . -Xiangrui
>
> On Wed, Jul 2, 2014 at 8:23 PM, x <wasedax...@gmail.com> wrote:
> > Hello,
> >
> > I am a newbie to Spark MLlib and ran into a curious case when following the
> > instructions on the page below.
> >
> > http://spark.apache.org/docs/latest/mllib-naive-bayes.html
> >
> > I ran a test program on my local machine using some data.
> >
> > val spConfig = (new SparkConf).setMaster("local").setAppName("SparkNaiveBayes")
> > val sc = new SparkContext(spConfig)
> >
> > The test data was as follows, and there were three labeled categories I
> > wanted to predict.
> >
> >  1  LabeledPoint(0.0, [4.9,3.0,1.4,0.2])
> >  2  LabeledPoint(0.0, [4.6,3.4,1.4,0.3])
> >  3  LabeledPoint(0.0, [5.7,4.4,1.5,0.4])
> >  4  LabeledPoint(0.0, [5.2,3.4,1.4,0.2])
> >  5  LabeledPoint(0.0, [4.7,3.2,1.6,0.2])
> >  6  LabeledPoint(0.0, [4.8,3.1,1.6,0.2])
> >  7  LabeledPoint(0.0, [5.1,3.8,1.9,0.4])
> >  8  LabeledPoint(0.0, [4.8,3.0,1.4,0.3])
> >  9  LabeledPoint(0.0, [5.0,3.3,1.4,0.2])
> > 10  LabeledPoint(1.0, [6.6,2.9,4.6,1.3])
> > 11  LabeledPoint(1.0, [5.2,2.7,3.9,1.4])
> > 12  LabeledPoint(1.0, [5.6,2.5,3.9,1.1])
> > 13  LabeledPoint(1.0, [6.4,2.9,4.3,1.3])
> > 14  LabeledPoint(1.0, [6.6,3.0,4.4,1.4])
> > 15  LabeledPoint(1.0, [6.0,2.7,5.1,1.6])
> > 16  LabeledPoint(1.0, [5.5,2.6,4.4,1.2])
> > 17  LabeledPoint(1.0, [5.8,2.6,4.0,1.2])
> > 18  LabeledPoint(1.0, [5.7,2.9,4.2,1.3])
> > 19  LabeledPoint(1.0, [5.7,2.8,4.1,1.3])
> > 20  LabeledPoint(2.0, [6.3,2.9,5.6,1.8])
> > 21  LabeledPoint(2.0, [6.5,3.0,5.8,2.2])
> > 22  LabeledPoint(2.0, [6.5,3.0,5.5,1.8])
> > 23  LabeledPoint(2.0, [6.7,3.3,5.7,2.1])
> > 24  LabeledPoint(2.0, [7.4,2.8,6.1,1.9])
> > 25  LabeledPoint(2.0, [6.3,3.4,5.6,2.4])
> > 26  LabeledPoint(2.0, [6.0,3.0,4.8,1.8])
> > 27  LabeledPoint(2.0, [6.8,3.2,5.9,2.3])
> >
> > The predicted result via NaiveBayes is below. Compared to the test data,
> > only two predicted results (#11 and #15) were different.
> >
> >  1  0.0
> >  2  0.0
> >  3  0.0
> >  4  0.0
> >  5  0.0
> >  6  0.0
> >  7  0.0
> >  8  0.0
> >  9  0.0
> > 10  1.0
> > 11  2.0
> > 12  1.0
> > 13  1.0
> > 14  1.0
> > 15  2.0
> > 16  1.0
> > 17  1.0
> > 18  1.0
> > 19  1.0
> > 20  2.0
> > 21  2.0
> > 22  2.0
> > 23  2.0
> > 24  2.0
> > 25  2.0
> > 26  2.0
> > 27  2.0
> >
> > After pairing the test RDD and the predicted RDD via zip, I got this.
> >
> >  1  (0.0,0.0)
> >  2  (0.0,0.0)
> >  3  (0.0,0.0)
> >  4  (0.0,0.0)
> >  5  (0.0,0.0)
> >  6  (0.0,0.0)
> >  7  (0.0,0.0)
> >  8  (0.0,0.0)
> >  9  (0.0,1.0)
> > 10  (0.0,1.0)
> > 11  (0.0,1.0)
> > 12  (1.0,1.0)
> > 13  (1.0,1.0)
> > 14  (2.0,1.0)
> > 15  (1.0,1.0)
> > 16  (1.0,2.0)
> > 17  (1.0,2.0)
> > 18  (1.0,2.0)
> > 19  (1.0,2.0)
> > 20  (2.0,2.0)
> > 21  (2.0,2.0)
> > 22  (2.0,2.0)
> > 23  (2.0,2.0)
> > 24  (2.0,2.0)
> > 25  (2.0,2.0)
> >
> > I expected 27 pairs, but two results were lost.
> > Could someone please point out what I missed here?
> >
> > Regards,
> > xj
>
