[ https://issues.apache.org/jira/browse/SPARK-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-9098.
------------------------------
    Resolution: Duplicate
Target Version/s:   (was: 1.6.0)

I agree, I think this is a subset of the broader fix/issue in SPARK-9793

> Inconsistent Dense Vectors hashing between PySpark and Scala
> ------------------------------------------------------------
>
>                 Key: SPARK-9098
>                 URL: https://issues.apache.org/jira/browse/SPARK-9098
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib, PySpark
>    Affects Versions: 1.3.1, 1.4.0
>            Reporter: Maciej Szymkiewicz
>            Priority: Minor
>
> When using Scala it is possible to group an RDD using DenseVector as a key:
> {code}
> import org.apache.spark.mllib.linalg.Vectors
>
> val rdd = sc.parallelize(
>   (Vectors.dense(1, 2, 3), 10) :: (Vectors.dense(1, 2, 3), 20) :: Nil)
>
> rdd.groupByKey.count
> {code}
> returns 1 as expected.
>
> In PySpark, {{DenseVector.__hash__}} seems to be inherited from {{object}}
> and based on the memory address:
> {code}
> from pyspark.mllib.linalg import DenseVector
>
> rdd = sc.parallelize(
>     [(DenseVector([1, 2, 3]), 10), (DenseVector([1, 2, 3]), 20)])
>
> rdd.groupByKey().count()
> {code}
> returns 2.
>
> Since the underlying {{numpy.ndarray}} can be used to mutate a DenseVector,
> hashing doesn't look meaningful at all:
> {code}
> >>> dv = DenseVector([1, 2, 3])
> >>> hdv1 = hash(dv)
> >>> dv.array[0] = 3.0
> >>> hdv2 = hash(dv)
> >>> hdv1 == hdv2
> True
> >>> dv == DenseVector([1, 2, 3])
> False
> {code}
> In my opinion the best approach would be to enforce immutability and provide
> meaningful hashing. An alternative is to make {{DenseVector}} unhashable,
> the same as {{numpy.ndarray}}.
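The immutability-plus-content-hashing approach proposed above can be sketched in plain Python without pyspark. {{FrozenVector}} below is a hypothetical illustration, not a pyspark class; it stores its values in a tuple so they cannot be mutated, and derives the hash from the contents, which makes equal vectors usable as grouping keys the way Scala's DenseVector is:

```python
# Hypothetical content-hashed, immutable vector key (not part of pyspark).
# Equal contents => equal hash, so dictionary/groupByKey-style grouping
# treats equal vectors as one key.
class FrozenVector:
    __slots__ = ("_values",)

    def __init__(self, values):
        # A tuple is immutable, so the hash can never go stale.
        self._values = tuple(float(v) for v in values)

    def __eq__(self, other):
        return isinstance(other, FrozenVector) and self._values == other._values

    def __hash__(self):
        # Hash derived from contents, consistent with __eq__.
        return hash(self._values)

    def __repr__(self):
        return f"FrozenVector({list(self._values)})"


a = FrozenVector([1, 2, 3])
b = FrozenVector([1, 2, 3])
assert a == b and hash(a) == hash(b)

# Local stand-in for groupByKey: both entries land under one key.
groups = {}
for key, value in [(a, 10), (b, 20)]:
    groups.setdefault(key, []).append(value)
assert len(groups) == 1
```

This mirrors the Scala behavior in the first snippet: two vectors with the same contents group into a single key rather than two.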
> Source:
> http://stackoverflow.com/questions/31449412/how-to-groupbykey-a-rdd-with-densevector-as-key-in-spark/31451752

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)