Re: Duplicate entries in output of mllib column similarities

Reza Zadeh Sat, 09 May 2015 15:17:34 -0700

Hi Richard,
One possible cause is that the rows of your matrix are SparseVectors whose
entries aren't sorted by index. Is that the case? SparseVectors
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala>
need sorted indices.
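
If the indices do turn out to be unsorted, a minimal sketch of reordering them before building the vectors might look like the following. It is plain Scala so it runs without Spark on the classpath; the object `SortedSparse` and its method names are made up for illustration and are not part of MLlib.

```scala
object SortedSparse {
  // Reorder parallel (indices, values) arrays so that the indices are ascending,
  // keeping each value attached to its original index.
  def sortByIndex(indices: Array[Int], values: Array[Double]): (Array[Int], Array[Double]) = {
    require(indices.length == values.length, "indices and values must have the same length")
    val sorted = indices.zip(values).sortBy(_._1)
    (sorted.map(_._1), sorted.map(_._2))
  }

  // True when the indices are strictly increasing (sorted, with no repeats).
  // Arrays with fewer than two elements are trivially fine.
  def isStrictlyIncreasing(indices: Array[Int]): Boolean =
    indices.sliding(2).forall(w => w.length < 2 || w(0) < w(1))
}
```

The sorted arrays could then be passed to `Vectors.sparse(size, indices, values)`, and `isStrictlyIncreasing` could be used as a quick check on vectors you have already built.
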
Reza


On Sat, May 9, 2015 at 8:51 AM, Richard Bolkey <rbol...@gmail.com> wrote:

> Hi Reza,
>
> After a bit of digging, it turns out I described the issue slightly wrong.
> We're not getting duplicate (i,j) entries; we're getting transposed entries
> (i,j) and (j,i), with potentially different scores. We assumed the output
> would be a triangular matrix. Let me know if this is expected. A
> transposed entry occurs for about 5% of our output entries.
>
> scala> matrix.entries.filter(x => (x.i,x.j) == (22769,539029)).collect()
> res23: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
> Array(MatrixEntry(22769,539029,0.00453050595770095))
>
> scala> matrix.entries.filter(x => (x.i,x.j) == (539029,22769)).collect()
> res24: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
> Array(MatrixEntry(539029,22769,0.002265252978850475))
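
To count how many index pairs are affected, rather than checking them one at a time, the entries could be grouped by their unordered index pair. The sketch below is only an illustration in plain Scala: `Entry` is a hypothetical stand-in for `MatrixEntry` so that it runs without Spark, and `TransposedPairs` is a made-up name.

```scala
// Hypothetical stand-in for org.apache.spark.mllib.linalg.distributed.MatrixEntry,
// so that the sketch runs without Spark on the classpath.
case class Entry(i: Long, j: Long, value: Double)

object TransposedPairs {
  // Key each entry by its unordered index pair (smaller index first), then keep
  // only the groups with more than one entry. Such a group holds both (i, j)
  // and (j, i), or exact duplicates of the same (i, j).
  def find(entries: Seq[Entry]): Map[(Long, Long), Seq[Entry]] =
    entries
      .groupBy(e => (math.min(e.i, e.j), math.max(e.i, e.j)))
      .filter { case (_, group) => group.size > 1 }
}
```

On the real data the same key function could presumably be applied to `matrix.entries` with `map` followed by `groupByKey`, keeping the groups that have more than one element.
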
>
> I saved a subset of vectors to object files that replicates the issue.
> It's about 300 MB. Should I try to whittle that down some more? What would
> be the best way to get that to you?
>
> Many thanks,
> Rick
>
> On Thu, May 7, 2015 at 8:58 PM, Reza Zadeh <r...@databricks.com> wrote:
>
>> This shouldn't be happening. Do you have an example to reproduce it?
>>
>> On Thu, May 7, 2015 at 4:17 PM, rbolkey <rbol...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a question regarding one of the oddities we encountered while
>>> running mllib's column similarities operation. When we examine the
>>> output, we find duplicate matrix entries (the same i,j). Sometimes the
>>> entries have the same value/similarity score, but they're frequently
>>> different too.
>>>
>>> Is this a known issue? An artifact of the probabilistic nature of the
>>> output? Which output score should we trust (lower vs higher one when
>>> different)? We're using a threshold of 0.3, and running Spark 1.3.1 on a
>>> 10 node cluster.
>>>
>>> Thanks
>>> Rick
>>>
>>
>
