Re: One question about RDD.zip function when trying Naive Bayes

x Fri, 11 Jul 2014 22:32:21 -0700

I tried my test case with Spark 1.0.1 and saw the same result (27 pairs
become 25 pairs after zip).
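
For comparison, here is a minimal sketch of one way to pair each label with
its prediction without zip, by mapping over the test RDD once. The object
name and the `model` and `test` parameters are assumptions for illustration
and are not taken from the test program; the pair order (label, prediction)
is also just a choice made here.

```scala
import org.apache.spark.mllib.classification.NaiveBayesModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of the original test program.
object PairLabelsWithPredictions {
  // Build (label, prediction) from each record in a single map,
  // so no zip between two separately computed RDDs is needed.
  def labelAndPrediction(
      model: NaiveBayesModel,
      test: RDD[LabeledPoint]): RDD[(Double, Double)] =
    test.map(p => (p.label, model.predict(p.features)))
}
```

Each pair here is built from a single record, so a label cannot be matched
with another record's prediction.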


Could someone please check it?

Regards,
xj

On Thu, Jul 3, 2014 at 2:31 PM, Xiangrui Meng <[email protected]> wrote:

> This is due to a bug in sampling, which was fixed in 1.0.1 and latest
> master. See https://github.com/apache/spark/pull/1234 . -Xiangrui
>
> On Wed, Jul 2, 2014 at 8:23 PM, x <[email protected]> wrote:
> > Hello,
> >
> > I am a newbie to Spark MLlib and ran into a curious case when following the
> > instructions at the page below.
> >
> > http://spark.apache.org/docs/latest/mllib-naive-bayes.html
> >
> > I ran a test program on my local machine using some data.
> >
> > import org.apache.spark.{SparkConf, SparkContext}
> >
> > val spConfig = new SparkConf().setMaster("local").setAppName("SparkNaiveBayes")
> > val sc = new SparkContext(spConfig)
> >
> > The test data was as follows, and there were three labeled categories I
> > wanted to predict.
> >
> >  1  LabeledPoint(0.0, [4.9,3.0,1.4,0.2])
> >  2  LabeledPoint(0.0, [4.6,3.4,1.4,0.3])
> >  3  LabeledPoint(0.0, [5.7,4.4,1.5,0.4])
> >  4  LabeledPoint(0.0, [5.2,3.4,1.4,0.2])
> >  5  LabeledPoint(0.0, [4.7,3.2,1.6,0.2])
> >  6  LabeledPoint(0.0, [4.8,3.1,1.6,0.2])
> >  7  LabeledPoint(0.0, [5.1,3.8,1.9,0.4])
> >  8  LabeledPoint(0.0, [4.8,3.0,1.4,0.3])
> >  9  LabeledPoint(0.0, [5.0,3.3,1.4,0.2])
> > 10  LabeledPoint(1.0, [6.6,2.9,4.6,1.3])
> > 11  LabeledPoint(1.0, [5.2,2.7,3.9,1.4])
> > 12  LabeledPoint(1.0, [5.6,2.5,3.9,1.1])
> > 13  LabeledPoint(1.0, [6.4,2.9,4.3,1.3])
> > 14  LabeledPoint(1.0, [6.6,3.0,4.4,1.4])
> > 15  LabeledPoint(1.0, [6.0,2.7,5.1,1.6])
> > 16  LabeledPoint(1.0, [5.5,2.6,4.4,1.2])
> > 17  LabeledPoint(1.0, [5.8,2.6,4.0,1.2])
> > 18  LabeledPoint(1.0, [5.7,2.9,4.2,1.3])
> > 19  LabeledPoint(1.0, [5.7,2.8,4.1,1.3])
> > 20  LabeledPoint(2.0, [6.3,2.9,5.6,1.8])
> > 21  LabeledPoint(2.0, [6.5,3.0,5.8,2.2])
> > 22  LabeledPoint(2.0, [6.5,3.0,5.5,1.8])
> > 23  LabeledPoint(2.0, [6.7,3.3,5.7,2.1])
> > 24  LabeledPoint(2.0, [7.4,2.8,6.1,1.9])
> > 25  LabeledPoint(2.0, [6.3,3.4,5.6,2.4])
> > 26  LabeledPoint(2.0, [6.0,3.0,4.8,1.8])
> > 27  LabeledPoint(2.0, [6.8,3.2,5.9,2.3])
> >
> > The predicted result via NaiveBayes is below. Compared to the test data,
> > only two predicted results (#11 and #15) were different.
> >
> >  1  0.0
> >  2  0.0
> >  3  0.0
> >  4  0.0
> >  5  0.0
> >  6  0.0
> >  7  0.0
> >  8  0.0
> >  9  0.0
> > 10  1.0
> > 11  2.0
> > 12  1.0
> > 13  1.0
> > 14  1.0
> > 15  2.0
> > 16  1.0
> > 17  1.0
> > 18  1.0
> > 19  1.0
> > 20  2.0
> > 21  2.0
> > 22  2.0
> > 23  2.0
> > 24  2.0
> > 25  2.0
> > 26  2.0
> > 27  2.0
> >
> > After pairing the test RDD and the predicted RDD via zip, I got this.
> >
> >  1  (0.0,0.0)
> >  2  (0.0,0.0)
> >  3  (0.0,0.0)
> >  4  (0.0,0.0)
> >  5  (0.0,0.0)
> >  6  (0.0,0.0)
> >  7  (0.0,0.0)
> >  8  (0.0,0.0)
> >  9  (0.0,1.0)
> > 10  (0.0,1.0)
> > 11  (0.0,1.0)
> > 12  (1.0,1.0)
> > 13  (1.0,1.0)
> > 14  (2.0,1.0)
> > 15  (1.0,1.0)
> > 16  (1.0,2.0)
> > 17  (1.0,2.0)
> > 18  (1.0,2.0)
> > 19  (1.0,2.0)
> > 20  (2.0,2.0)
> > 21  (2.0,2.0)
> > 22  (2.0,2.0)
> > 23  (2.0,2.0)
> > 24  (2.0,2.0)
> > 25  (2.0,2.0)
> >
> > I expected 27 pairs, but two results were lost.
> > Could someone please point out what I missed here?
> >
> > Regards,
> > xj
>
