[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282323#comment-14282323 ]

Frank Rosner commented on SPARK-2620:
-------------------------------------

The issue is caused by the fact that pattern matching of case classes does not 
work in the Spark shell.

Reducing by key (like grouping, the distinct operator, etc.) relies on equality 
checking of objects. Case classes implement equality checking by using pattern 
matching to perform the type check and conversion. When I implement a custom 
equals method for the case class that checks the type with {{isInstanceOf}} and 
casts with {{asInstanceOf}}, {{==}} works, and so do distinct and key-based 
operations.
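The workaround described above can be sketched as follows (a minimal, hypothetical class {{P}}; the hand-written {{equals}}/{{hashCode}} replace the synthesized pattern-matching equality):

```scala
// Sketch of the workaround: a case class whose equality is written by hand
// with isInstanceOf/asInstanceOf instead of the compiler-synthesized
// pattern-matching equals. P is a hypothetical example class.
case class P(name: String) {
  // Type check via isInstanceOf, cast via asInstanceOf -- no pattern matching.
  override def equals(other: Any): Boolean =
    other.isInstanceOf[P] && other.asInstanceOf[P].name == name

  // Keep hashCode consistent with the custom equals so hash-based shuffles
  // (reduceByKey, groupByKey, distinct) put equal keys in the same partition.
  override def hashCode: Int = name.hashCode
}
```

With this definition, {{P("bob") == P("bob")}} holds even when the class is defined in the REPL, so key-based operations group records as expected.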

Does it make sense to break this down and rework this issue to cover the actual 
problem rather than a symptom?

Are there any plans to fix this issue? It requires extra effort when working 
with case classes in the REPL and, in particular, blocks rapid data exploration 
and analytics. I will try to dig into the code and see whether I can find a 
solution, but there seems to have been an ongoing discussion for quite a while 
now.

> case class cannot be used as key for reduce
> -------------------------------------------
>
>                 Key: SPARK-2620
>                 URL: https://issues.apache.org/jira/browse/SPARK-2620
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell
>    Affects Versions: 1.0.0, 1.1.0
>         Environment: reproduced on spark-shell local[4]
>            Reporter: Gerard Maas
>            Assignee: Tobias Schlatter
>            Priority: Critical
>              Labels: case-class, core
>
> Using a case class as a key doesn't seem to work properly on Spark 1.0.0
> A minimal example:
> case class P(name:String)
> val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
> sc.parallelize(ps).map(x=> (x,1)).reduceByKey((x,y) => x+y).collect
> [Spark shell local mode] res: Array[(P, Int)] = Array((P(bob),1), 
> (P(bob),1), (P(alice),1), (P(charly),1))
> In contrast to the expected behavior, which should be equivalent to:
> sc.parallelize(ps).map(x=> (x.name,1)).reduceByKey((x,y) => x+y).collect
> Array[(String, Int)] = Array((charly,1), (alice,1), (bob,2))
> groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
