Hi Richard,
One possible cause is that the rows of your matrix are SparseVectors
whose entries aren't sorted by index. Is that the case? SparseVectors
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala>
need their indices sorted in ascending order.
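
If the index and value arrays are built by hand, something along these
lines could sort them before the vector is constructed. This is only a
sketch: SparseRow, sortByIndex, and isStrictlyIncreasing are hypothetical
names made up for illustration, not part of MLlib. The only MLlib call it
assumes is Vectors.sparse(size, indices, values), shown in the trailing
comment.

```scala
// Sketch only: SparseRow and its helpers are hypothetical, not MLlib API.
object SparseRow {

  /** Sort (index, value) pairs by index and split them into parallel arrays. */
  def sortByIndex(entries: Seq[(Int, Double)]): (Array[Int], Array[Double]) = {
    val sorted = entries.sortBy(_._1)
    (sorted.map(_._1).toArray, sorted.map(_._2).toArray)
  }

  /** True when the indices are strictly increasing (sorted, no duplicates). */
  def isStrictlyIncreasing(indices: Array[Int]): Boolean =
    indices.sliding(2).forall {
      case Array(a, b) => a < b
      case _           => true // zero or one index is trivially sorted
    }
}

// Usage, e.g. in spark-shell (the size 10 is an arbitrary example value):
//   val (indices, values) = SparseRow.sortByIndex(Seq((7, 0.5), (2, 1.0), (4, 3.0)))
//   val row = org.apache.spark.mllib.linalg.Vectors.sparse(10, indices, values)
```

Running isStrictlyIncreasing over the index arrays of the existing rows
would be one quick way to check whether unsorted indices are present in
the input at all.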
Reza

On Sat, May 9, 2015 at 8:51 AM, Richard Bolkey <rbol...@gmail.com> wrote:

> Hi Reza,
>
> After a bit of digging, I had my previous issue a little bit wrong. We're
> not getting duplicate (i,j) entries, but we are getting transposed entries
> (i,j) and (j,i) with potentially different scores. We assumed the output
> would be a triangular matrix. Still, let me know if that's expected. A
> transposed entry occurs for about 5% of our output entries.
>
> scala> matrix.entries.filter(x => (x.i,x.j) == (22769,539029)).collect()
> res23: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
> Array(MatrixEntry(22769,539029,0.00453050595770095))
>
> scala> matrix.entries.filter(x => (x.i,x.j) == (539029,22769)).collect()
> res24: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
> Array(MatrixEntry(539029,22769,0.002265252978850475))
>
> I saved a subset of vectors to object files that replicates the issue.
> It's about 300 MB. Should I try to whittle that down some more? What would
> be the best way to get that to you?
>
> Many thanks,
> Rick
>
> On Thu, May 7, 2015 at 8:58 PM, Reza Zadeh <r...@databricks.com> wrote:
>
>> This shouldn't be happening. Do you have an example to reproduce it?
>>
>> On Thu, May 7, 2015 at 4:17 PM, rbolkey <rbol...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a question regarding one of the oddities we encountered while
>>> running mllib's column similarities operation. When we examine the
>>> output, we find duplicate matrix entries (the same i,j). Sometimes the
>>> entries have the same value/similarity score, but they're frequently
>>> different too.
>>>
>>> Is this a known issue? An artifact of the probabilistic nature of the
>>> output? Which output score should we trust (lower vs higher one when
>>> different)? We're using a threshold of 0.3, and running Spark 1.3.1 on a
>>> 10 node cluster.
>>>
>>> Thanks
>>> Rick
>>>
>>>
>>
>
