Interesting, my gut instinct is the same as Sean's.  I'd suggest debugging
this in plain old Scala first, without involving Spark.  Even just in the
Scala shell, create one of your Array[T], then try calling .toSet and
.distinct on it.  If those don't give the same result, then it's got
nothing to do with Spark.  If it's still different even after you make
hashCode() consistent with equals(), then you might have more luck asking
on the scala-user list:
https://groups.google.com/forum/#!forum/scala-user
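A minimal REPL sketch of that experiment might look like this (`Item` is a made-up stand-in for your T, with equals() overridden but hashCode() left at the default):

```scala
// Hypothetical class standing in for T: equals() is overridden, but
// hashCode() is left at the default (identity-based), so two equal
// instances will usually land in different hash buckets.
class Item(val id: Int) {
  override def equals(other: Any): Boolean = other match {
    case that: Item => this.id == that.id
    case _          => false
  }
}

val arr = Array(new Item(1), new Item(1), new Item(2))

// Both distinct and toSet use hashing internally, so with an
// inconsistent hashCode neither is guaranteed to collapse the
// duplicates; compare the two counts yourself in the shell.
println(arr.distinct.length)
println(arr.toSet.size)
```

If the two counts disagree in plain Scala, that already tells you Spark isn't involved.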

If it works fine in plain scala, but not in spark, then it would be worth
bringing up here again for us to look into.
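For comparison, once hashCode() is consistent with equals() — e.g. via a case class, which derives both from its fields — distinct and toSet should agree (again, `Item` is just an illustrative stand-in for T):

```scala
// Case class: equals and hashCode are both derived from the fields,
// so hashing-based deduplication actually sees the duplicates.
case class Item(id: Int)

val arr = Array(Item(1), Item(1), Item(2))
println(arr.distinct.length) // 2
println(arr.toSet.size)      // 2
```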

On Tue, Apr 7, 2015 at 4:41 PM, Anny Chen <anny9...@gmail.com> wrote:

> Hi Sean,
>
> I didn't override hashCode. But the problem is that Array[T].toSet
> works while Array[T].distinct doesn't. If it were because I didn't
> override hashCode, then toSet shouldn't work either, right? I also tried
> using Array[T].distinct outside the RDD, and there it works fine,
> returning the same result as Array[T].toSet.
>
> Thanks!
> Anny
>
> On Tue, Apr 7, 2015 at 2:31 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Did you override hashCode too?
>> On Apr 7, 2015 2:39 PM, "anny9699" <anny9...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a question about Array[T].distinct on a custom class T. My data
>>> is an RDD[(String, Array[T])], where T is a class I wrote myself.
>>> There are some duplicates in each Array[T] that I want to remove. I
>>> overrode the equals() method in T and used
>>>
>>> val dataNoDuplicates = dataDuplicates.map { case (id, arr) =>
>>>   (id, arr.distinct)
>>> }
>>>
>>> to remove the duplicates inside the RDD. However, this didn't work,
>>> which I confirmed with a further test:
>>>
>>> val dataNoDuplicates = dataDuplicates.map { case (id, arr) =>
>>>   val uniqArr = arr.distinct
>>>   if (uniqArr.length > 1) println(uniqArr.head == uniqArr.last)
>>>   (id, uniqArr)
>>> }
>>>
>>> From the worker stdout I can see that it always prints "true", i.e.
>>> equal elements survive distinct. I then tried removing duplicates with
>>> Array[T].toSet instead of Array[T].distinct, and that works!
>>>
>>> Could anybody explain why Array[T].toSet and Array[T].distinct behave
>>> differently here? And why is Array[T].distinct not working?
>>>
>>> Thanks a lot!
>>> Anny
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Array-T-distinct-doesn-t-work-inside-RDD-tp22412.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>
