Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12157#discussion_r58572140
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -53,14 +53,22 @@ import org.apache.spark.util.{NextIterator, 
SerializableConfiguration, ShutdownH
     /**
      * A Spark split class that wraps around a Hadoop InputSplit.
      */
    -private[spark] class HadoopPartition(rddId: Int, idx: Int, s: InputSplit)
    +private[spark] class HadoopPartition(rddId: Int, override val index: Int, 
s: InputSplit)
       extends Partition {
     
       val inputSplit = new SerializableWritable[InputSplit](s)
     
    -  override def hashCode(): Int = 41 * (41 + rddId) + idx
    +  override def hashCode(): Int = 41 * (41 + rddId) + index
     
    -  override val index: Int = idx
    +  def canEqual(other: Any): Boolean = other.isInstanceOf[HadoopPartition]
    +
    +  override def equals(other: Any): Boolean = other match {
    +    case that: HadoopPartition =>
    +      super.equals(that) &&
    +        (that canEqual this) &&
    +        index == that.index
    --- End diff --
    
    It becomes a field if you use it outside the constructor, or at least it should. Yes, it should stay private.
    
    This caused me to think about the partition changes a bit more. Defining 
`hashCode` without `equals` is technically correct: by default no two distinct 
objects are equal, so they can't violate the contract that equal objects have 
the same hash code, no matter what the hash code is.
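    To make that concrete, here is a minimal sketch (the class name is hypothetical, not Spark's actual `HadoopPartition`) showing that overriding `hashCode` while keeping default reference equality cannot break the contract:

    ```scala
    // Hypothetical stand-in for a partition class. hashCode is overridden,
    // but equals is intentionally NOT, so equality stays reference-based.
    class FakePartition(val rddId: Int, val index: Int) {
      override def hashCode(): Int = 41 * (41 + rddId) + index
    }

    object ContractCheck extends App {
      val a = new FakePartition(1, 0)
      val b = new FakePartition(1, 0)
      // Same hash code, but the two instances are still not equal.
      // The contract only requires the converse direction (equal objects
      // must share a hash code), which holds trivially here.
      assert(a.hashCode == b.hashCode)
      assert(a != b) // reference equality: distinct objects are unequal
      assert(a == a)
    }
    ```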
    
    Here it's not clear that two partitions with the same index are 
semantically equivalent, since they can come from different RDDs. So the RDD's 
ID matters, but maybe it's not right to implement a notion of equality here at 
all. It does raise the question: when do partitions actually get hashed, and 
why is a non-default implementation important there? It could be vestigial. 
Probably best not to add `equals` though, and, unless we know for sure the 
hash code isn't important, to leave `hashCode` as is.
    
    So maybe we end up leaving the partition classes alone and weakening the 
condition to require `hashCode` where `equals` exists, but not vice versa?
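    As an illustration of the risk (a hypothetical sketch, not the code in this PR): if `equals` compared only the index while `hashCode` kept mixing in the RDD ID, partitions of different RDDs would compare equal yet hash differently, which does violate the contract:

    ```scala
    // Hypothetical: an index-only equals conflates partitions of different
    // RDDs, while the rddId-based hashCode still distinguishes them.
    class IndexOnlyPartition(val rddId: Int, val index: Int) {
      override def hashCode(): Int = 41 * (41 + rddId) + index
      override def equals(other: Any): Boolean = other match {
        case that: IndexOnlyPartition => index == that.index
        case _ => false
      }
    }

    object ContractViolation extends App {
      val p0 = new IndexOnlyPartition(rddId = 1, index = 0)
      val p1 = new IndexOnlyPartition(rddId = 2, index = 0)
      assert(p0 == p1)                 // "equal" despite different RDDs...
      assert(p0.hashCode != p1.hashCode) // ...but with different hash codes:
                                         // the equals/hashCode contract is broken
    }
    ```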

