Re: Duplicate entries in output of mllib column similarities

Reza Zadeh Tue, 12 May 2015 10:53:27 -0700

Great!

Reza

On Tue, May 12, 2015 at 7:42 AM, Richard Bolkey <rbol...@gmail.com> wrote:


> Hi Reza,
>
> That was the fix we needed. After sorting, the transposed entries are gone!
>
> Thanks a bunch,
> rick
>
> On Sat, May 9, 2015 at 5:17 PM, Reza Zadeh <r...@databricks.com> wrote:
>
>> Hi Richard,
>> One reason that could be happening is that the rows of your matrix are
>> using SparseVectors, but the entries in your vectors aren't sorted by
>> index. Is that the case? Sparse Vectors
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala>
>> need sorted indices.
>> Reza
>>
>> On Sat, May 9, 2015 at 8:51 AM, Richard Bolkey <rbol...@gmail.com> wrote:
>>
>>> Hi Reza,
>>>
>>> After a bit of digging, I had my previous issue a little bit wrong.
>>> We're not getting duplicate (i,j) entries, but we are getting transposed
>>> entries (i,j) and (j,i) with potentially different scores. We assumed the
>>> output would be a triangular matrix. Still, let me know if that's expected.
>>> A transposed entry occurs for about 5% of our output entries.
>>>
>>> scala> matrix.entries.filter(x => (x.i,x.j) == (22769,539029)).collect()
>>> res23: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
>>> Array(MatrixEntry(22769,539029,0.00453050595770095))
>>>
>>> scala> matrix.entries.filter(x => (x.i,x.j) == (539029,22769)).collect()
>>> res24: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
>>> Array(MatrixEntry(539029,22769,0.002265252978850475))
>>>
>>> I saved a subset of vectors to object files that replicates the issue.
>>> It's about 300 MB. Should I try to whittle that down some more? What would
>>> be the best way to get that to you?
>>>
>>> Many thanks,
>>> Rick
>>>
>>> On Thu, May 7, 2015 at 8:58 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>
>>>> This shouldn't be happening. Do you have an example to reproduce it?
>>>>
>>>> On Thu, May 7, 2015 at 4:17 PM, rbolkey <rbol...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a question regarding one of the oddities we encountered while
>>>>> running mllib's column similarities operation. When we examine the
>>>>> output, we find duplicate matrix entries (the same i,j). Sometimes the
>>>>> entries have the same value/similarity score, but they're frequently
>>>>> different too.
>>>>>
>>>>> Is this a known issue? An artifact of the probabilistic nature of the
>>>>> output? Which output score should we trust (lower vs higher one when
>>>>> different)? We're using a threshold of 0.3, and running Spark 1.3.1 on
>>>>> a 10 node cluster.
>>>>>
>>>>> Thanks
>>>>> Rick
>>>>>
>>>>
>>>
>>
>
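The fix discussed in this thread (sorting sparse vector entries by index before building the rows) can be sketched roughly as follows. This is a minimal sketch, not code from the thread: `sortByIndex` is a hypothetical helper name, and it assumes each row is held as parallel index and value arrays before the vector is constructed.

```scala
// Hypothetical helper (not part of MLlib): reorder parallel index/value
// arrays so that the indices are ascending, keeping each value paired
// with its original index.
def sortByIndex(indices: Array[Int], values: Array[Double]): (Array[Int], Array[Double]) = {
  require(indices.length == values.length, "indices and values must have the same length")
  // Pair each index with its value, then sort the pairs by index.
  val sorted = indices.zip(values).sortBy(_._1)
  (sorted.map(_._1), sorted.map(_._2))
}

// Example: indices arrive out of order.
val (idx, vals) = sortByIndex(Array(5, 1, 3), Array(0.5, 0.1, 0.3))
```

In a Spark job, the sorted arrays could then be passed to `Vectors.sparse(size, idx, vals)` from `org.apache.spark.mllib.linalg` when building each row of the matrix.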

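The symptom Rick describes, both (i,j) and (j,i) appearing in the output, can be checked for by grouping entries on the unordered index pair. The sketch below is Spark-free and illustrative only: `Entry` is an assumed stand-in for `MatrixEntry`, `conflictingPairs` is a hypothetical name, and the sample scores are truncated from the values quoted above.

```scala
// Assumed stand-in for org.apache.spark.mllib.linalg.distributed.MatrixEntry,
// so the sketch runs without Spark.
case class Entry(i: Long, j: Long, value: Double)

// Group entries by the unordered pair {i, j}. Any group with more than one
// entry contains a duplicate or a transposed pair.
def conflictingPairs(entries: Seq[Entry]): Map[(Long, Long), Seq[Entry]] =
  entries
    .groupBy(e => (math.min(e.i, e.j), math.max(e.i, e.j)))
    .filter { case (_, group) => group.size > 1 }

// Sample data: one transposed pair and one ordinary entry.
val sample = Seq(
  Entry(22769L, 539029L, 0.0045),
  Entry(539029L, 22769L, 0.0023),
  Entry(1L, 2L, 0.5)
)
val conflicts = conflictingPairs(sample)
```

On a real `CoordinateMatrix`, the same keying idea could presumably be applied to `matrix.entries` with the usual pair-RDD operations, though that variant is not shown or tested here.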