That's interesting.

I would try

case class Mykey(uname:String)
case class Mykey(uname:String, c1:Char)
case class Mykey(uname:String, lo:String, f1:Char, f2:Char, f3:Char,
f4:Char, f5:Char, f6:String)

In that order, to see at which point the grouping breaks. It looks like there is an issue with equality between keys.
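One way to separate a data problem from a key-equality problem is to check the full key outside Spark first. The following is only a rough sketch: `KeyEqualityCheck`, `parseKey`, `lines`, and `distinctKeyCount` are names I made up for illustration, and the sample lines are rebuilt from the data in the original post. It parses the same line twice and compares the two keys, then groups all 16 keys with a plain Scala `groupBy`.

```scala
// Same key definition as in the original post.
case class Mykey(uname: String, lo: String, f1: Char, f2: Char, f3: Char,
                 f4: Char, f5: Char, f6: String)

// Hypothetical helper object, not part of the original code.
object KeyEqualityCheck {
  // Parse the key part (left of '|') of one input line.
  def parseKey(line: String): Mykey = {
    val k = line.split("\\|", -1)(0).split(",")
    Mykey(k(0), k(1), k(2)(0), k(3)(0), k(4)(0), k(5)(0), k(6)(0), k(7))
  }

  // 16 lines with 4 distinct keys, mirroring the posted dataset.
  val lines: Seq[String] =
    Seq.fill(4)(Seq("lo1", "lo2", "lo3", "lo4")).flatten
      .map(lo => s"p1,$lo,8,0,4,0,5,20150901|50000,10000,1.0")

  // Number of groups produced by a local (non-Spark) groupBy.
  def distinctKeyCount: Int = lines.map(parseKey).groupBy(identity).size

  def main(args: Array[String]): Unit = {
    val a = parseKey(lines.head)
    val b = parseKey(lines.head)
    println(s"equal: ${a == b}, same hashCode: ${a.hashCode == b.hashCode}")
    println(s"distinct keys: $distinctKeyCount")
  }
}
```

Compiled with scalac and run as a normal program, the two keys should compare equal and the local grouping should give 4 groups. If that holds but the shell still shows 8, that would point at how the class is defined in spark-shell rather than at the data.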

On Mon, Jan 4, 2016 at 5:05 PM Arun Luthra <> wrote:

> If I simplify the key to String column with values lo1, lo2, lo3, lo4, it
> works correctly.
> On Mon, Jan 4, 2016 at 4:49 PM, Daniel Imberman <> wrote:
>> Could you try simplifying the key and seeing if that makes any
>> difference? Make it just a String or an Int so we can rule out any issues
>> with object equality.
>> On Mon, Jan 4, 2016 at 4:42 PM Arun Luthra <> wrote:
>>> Spark 1.5.0
>>> data:
>>> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
>>> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
>>> spark-shell:
>>> spark-shell \
>>>     --num-executors 2 \
>>>     --driver-memory 1g \
>>>     --executor-memory 10g \
>>>     --executor-cores 8 \
>>>     --master yarn-client
>>> case class Mykey(uname:String, lo:String, f1:Char, f2:Char, f3:Char,
>>> f4:Char, f5:Char, f6:String)
>>> case class Myvalue(count1:Long, count2:Long, num:Double)
>>> val myrdd = sc.textFile("/user/al733a/mydata.txt").map { case line => {
>>>     val spl = line.split("\\|", -1)
>>>     val k = spl(0).split(",")
>>>     val v = spl(1).split(",")
>>>     (Mykey(k(0), k(1), k(2)(0).toChar, k(3)(0).toChar, k(4)(0).toChar,
>>> k(5)(0).toChar, k(6)(0).toChar, k(7)),
>>>      Myvalue(v(0).toLong, v(1).toLong, v(2).toDouble)
>>>     )
>>> }}
>>> myrdd.groupByKey().map { case (mykey, val_iterable) => (mykey, 1)
>>> }.collect().foreach(println)
>>> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
>>> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
>>> You can see that each key is repeated 2 times but each key should only
>>> appear once.
>>> Arun
>>> On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu <> wrote:
>>>> Can you give a bit more information?
>>>> Release of Spark you're using
>>>> Minimal dataset that shows the problem
>>>> Cheers
>>>> On Mon, Jan 4, 2016 at 3:55 PM, Arun Luthra <> wrote:
>>>>> I tried groupByKey and noticed that it did not group all values into
>>>>> the same group.
>>>>> In my test dataset (a pair RDD) I have 16 records with only 4
>>>>> distinct keys, so I expected 4 records in the groupByKey result,
>>>>> but instead there were 8. Each of the 4 distinct keys appears 2
>>>>> times.
>>>>> Is this the expected behavior? I need to be able to get ALL values
>>>>> associated with each key grouped into a SINGLE record. Is it possible?
>>>>> Arun
>>>>> p.s. reduceByKey will not be sufficient for me
