That's interesting. I would try
case class Mykey(uname: String)
case class Mykey(uname: String, c1: Char)
case class Mykey(uname: String, lo: String, f1: Char, f2: Char, f3: Char, f4: Char, f5: Char, f6: String)

In that order. It seems like there is some issue with equality between the keys.

On Mon, Jan 4, 2016 at 5:05 PM Arun Luthra <arun.lut...@gmail.com> wrote:

> If I simplify the key to a String column with values lo1, lo2, lo3, lo4,
> it works correctly.
>
> On Mon, Jan 4, 2016 at 4:49 PM, Daniel Imberman <daniel.imber...@gmail.com> wrote:
>
>> Could you try simplifying the key and seeing if that makes any
>> difference? Make it just a String or an Int so we can rule out any
>> issues with object equality.
>>
>> On Mon, Jan 4, 2016 at 4:42 PM Arun Luthra <arun.lut...@gmail.com> wrote:
>>
>>> Spark 1.5.0
>>>
>>> data:
>>>
>>> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
>>>
>>> spark-shell:
>>>
>>> spark-shell \
>>>   --num-executors 2 \
>>>   --driver-memory 1g \
>>>   --executor-memory 10g \
>>>   --executor-cores 8 \
>>>   --master yarn-client
>>>
>>> case class Mykey(uname: String, lo: String, f1: Char, f2: Char,
>>>                  f3: Char, f4: Char, f5: Char, f6: String)
>>> case class Myvalue(count1: Long, count2: Long, num: Double)
>>>
>>> val myrdd = sc.textFile("/user/al733a/mydata.txt").map { line =>
>>>   val spl = line.split("\\|", -1)
>>>   val k = spl(0).split(",")
>>>   val v = spl(1).split(",")
>>>   (Mykey(k(0), k(1), k(2)(0), k(3)(0), k(4)(0), k(5)(0), k(6)(0), k(7)),
>>>    Myvalue(v(0).toLong, v(1).toLong, v(2).toDouble))
>>> }
>>>
>>> myrdd.groupByKey().map { case (mykey, val_iterable) => (mykey, 1) }
>>>   .collect().foreach(println)
>>>
>>> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
>>>
>>> You can see that each key is repeated 2 times, but each key should
>>> only appear once.
>>>
>>> Arun
>>>
>>> On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Can you give a bit more information?
>>>>
>>>> - The release of Spark you're using
>>>> - A minimal dataset that shows the problem
>>>>
>>>> Cheers
>>>>
>>>> On Mon, Jan 4, 2016 at 3:55 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
>>>>
>>>>> I tried groupByKey and noticed that it did not group all values
>>>>> into the same group.
>>>>>
>>>>> In my test dataset (a pair RDD) I have 16 records with only 4
>>>>> distinct keys, so I expected 4 records in the groupByKey result,
>>>>> but instead there were 8. Each of the 4 distinct keys appears 2
>>>>> times.
>>>>>
>>>>> Is this the expected behavior? I need to be able to get ALL values
>>>>> associated with each key grouped into a SINGLE record. Is it
>>>>> possible?
>>>>>
>>>>> Arun
>>>>>
>>>>> p.s. reduceByKey will not be sufficient for me
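
[Editor's note, not part of the original thread] The "issue with equality between keys" suggested above can be checked directly: groupByKey puts two values in the same group only when their keys agree on both equals and hashCode. A minimal local sketch of that check (plain Scala, no cluster needed; the Mykey definition and field values are copied from the thread):

```scala
// Repeat the thread's key class locally.
case class Mykey(uname: String, lo: String, f1: Char, f2: Char, f3: Char,
                 f4: Char, f5: Char, f6: String)

// Two keys built from the same field values, as in the posted data.
val a = Mykey("p1", "lo1", '8', '0', '4', '0', '5', "20150901")
val b = Mykey("p1", "lo1", '8', '0', '4', '0', '5', "20150901")

// For a compiled case class, both checks print true. If the same checks
// fail on keys collected back from the RDD, key equality is the culprit.
println(a == b)
println(a.hashCode == b.hashCode)
```

Worth noting: case classes defined inside the spark-shell REPL have historically had exactly this kind of equality problem (see SPARK-2620, "case class cannot be used as key for reduce"), and the usual workaround is to define the key class in a compiled jar added to the shell's classpath rather than in the REPL itself.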