I suppose it depends a lot on how distinct and toSet are implemented.
In general, both work only when hashCode and equals are defined
consistently with each other. When that isn't the case, the result
isn't defined; it might happen to work in some cases. This could well
explain why you see different results. Why not implement hashCode() to
see if that's the solution? Certainly, in general, you must do this for
correctness.
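
To make that concrete, here is a minimal sketch of a class that overrides
both methods consistently. Point and its fields are hypothetical stand-ins
for your class T, and the hash formula is just one reasonable choice, not
anything Spark or Scala requires.

```scala
// Hypothetical class standing in for the poster's T.
class Point(val x: Int, val y: Int) {
  // Two points are equal when both coordinates match.
  override def equals(other: Any): Boolean = other match {
    case p: Point => x == p.x && y == p.y
    case _        => false
  }

  // Must agree with equals: equal objects return the same hash code.
  override def hashCode(): Int = 31 * x + y
}

object DistinctDemo {
  def main(args: Array[String]): Unit = {
    val arr = Array(new Point(1, 2), new Point(1, 2), new Point(3, 4))
    // With equals and hashCode in agreement, both calls see two unique points.
    println(arr.distinct.length)
    println(arr.toSet.size)
  }
}
```

If T is only a data holder, declaring it as a case class is a simpler
option, since the compiler then generates matching equals and hashCode.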

On Tue, Apr 7, 2015 at 5:41 PM, Anny Chen <anny9...@gmail.com> wrote:
> Hi Sean,
>
> I didn't override hashCode. But the problem is that Array[T].toSet works
> while Array[T].distinct doesn't. If it is because I didn't override hashCode,
> then toSet shouldn't work either, right? I also tried using
> Array[T].distinct outside the RDD, and it works fine there, returning
> the same result as Array[T].toSet.
>
> Thanks!
> Anny
>
> On Tue, Apr 7, 2015 at 2:31 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Did you override hashCode too?
>>
>> On Apr 7, 2015 2:39 PM, "anny9699" <anny9...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I have a question about Array[T].distinct on a custom class T. My data
>>> is like an RDD[(String, Array[T])], in which T is a class I wrote myself.
>>> There are some duplicates in each Array[T], so I want to remove them. I
>>> overrode the equals() method in T and used
>>>
>>> val dataNoDuplicates = dataDuplicates.map { case (id, arr) => (id, arr.distinct) }
>>>
>>> to remove duplicates inside the RDD. However, this doesn't work, as I
>>> found by running some further tests using
>>>
>>> val dataNoDuplicates = dataDuplicates.map { case (id, arr) =>
>>>   val uniqArr = arr.distinct
>>>   if (uniqArr.length > 1) println(uniqArr.head == uniqArr.last)
>>>   (id, uniqArr)
>>> }
>>>
>>> From the worker stdout I could see that it always prints "true", so
>>> equal elements are still present after distinct. I then tried removing
>>> duplicates by using Array[T].toSet instead of Array[T].distinct, and
>>> that works!
>>>
>>> Could anybody explain why Array[T].toSet and Array[T].distinct behave
>>> differently here? And why is Array[T].distinct not working?
>>>
>>> Thanks a lot!
>>> Anny
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Array-T-distinct-doesn-t-work-inside-RDD-tp22412.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>
