Great!
Reza

On Tue, May 12, 2015 at 7:42 AM, Richard Bolkey <rbol...@gmail.com> wrote:
> Hi Reza,
>
> That was the fix we needed. After sorting, the transposed entries are gone!
>
> Thanks a bunch,
> rick
>
> On Sat, May 9, 2015 at 5:17 PM, Reza Zadeh <r...@databricks.com> wrote:
>
>> Hi Richard,
>> One reason that could be happening is that the rows of your matrix are
>> using SparseVectors, but the entries in your vectors aren't sorted by
>> index. Is that the case? Sparse Vectors
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala>
>> need sorted indices.
>> Reza
>>
>> On Sat, May 9, 2015 at 8:51 AM, Richard Bolkey <rbol...@gmail.com> wrote:
>>
>>> Hi Reza,
>>>
>>> After a bit of digging, I found I had my previous issue a little bit
>>> wrong. We're not getting duplicate (i,j) entries, but we are getting
>>> transposed entries (i,j) and (j,i) with potentially different scores.
>>> We assumed the output would be a triangular matrix. Still, let me know
>>> if that's expected. A transposed entry occurs for about 5% of our
>>> output entries.
>>>
>>> scala> matrix.entries.filter(x => (x.i,x.j) == (22769,539029)).collect()
>>> res23: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
>>> Array(MatrixEntry(22769,539029,0.00453050595770095))
>>>
>>> scala> matrix.entries.filter(x => (x.i,x.j) == (539029,22769)).collect()
>>> res24: Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] =
>>> Array(MatrixEntry(539029,22769,0.002265252978850475))
>>>
>>> I saved a subset of vectors to object files that replicates the issue.
>>> It's about 300 MB. Should I try to whittle that down some more? What
>>> would be the best way to get that to you?
>>>
>>> Many thanks,
>>> Rick
>>>
>>> On Thu, May 7, 2015 at 8:58 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>
>>>> This shouldn't be happening. Do you have an example to reproduce it?
>>>>
>>>> On Thu, May 7, 2015 at 4:17 PM, rbolkey <rbol...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a question regarding one of the oddities we encountered while
>>>>> running mllib's column similarities operation. When we examine the
>>>>> output, we find duplicate matrix entries (the same i,j). Sometimes the
>>>>> entries have the same value/similarity score, but they're frequently
>>>>> different too.
>>>>>
>>>>> Is this a known issue? An artifact of the probabilistic nature of the
>>>>> output? Which output score should we trust (lower vs higher one when
>>>>> different)? We're using a threshold of 0.3, and running Spark 1.3.1 on
>>>>> a 10-node cluster.
>>>>>
>>>>> Thanks
>>>>> Rick
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Duplicate-entries-in-output-of-mllib-column-similarities-tp22807.html
>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
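
For anyone reading this thread with the same symptom: the fix described above was to sort each sparse row's entries by index before building the vector. The thread does not show the code that was actually used, so the following is only a minimal sketch in plain Scala. `sortByIndex` and `isStrictlyIncreasing` are hypothetical helpers (not part of MLlib), and the sample row is made up. The sorted arrays are what one would then pass to MLlib's `Vectors.sparse(size, indices, values)`.

```scala
// Hypothetical helper (not part of MLlib): reorder parallel index/value
// arrays so that the indices are ascending, as sparse vectors require.
def sortByIndex(indices: Array[Int], values: Array[Double]): (Array[Int], Array[Double]) = {
  require(indices.length == values.length, "indices and values must have the same length")
  val sorted = indices.zip(values).sortBy(_._1)
  (sorted.map(_._1), sorted.map(_._2))
}

// Hypothetical sanity check: true when indices are sorted with no duplicates.
def isStrictlyIncreasing(indices: Array[Int]): Boolean =
  indices.zip(indices.drop(1)).forall { case (a, b) => a < b }

// Made-up row whose entries arrive out of index order.
val rawIndices = Array(7, 2, 5)
val rawValues  = Array(0.7, 0.2, 0.5)

val (indices, values) = sortByIndex(rawIndices, rawValues)

// With Spark on the classpath, the sorted arrays would then be used as:
//   import org.apache.spark.mllib.linalg.Vectors
//   val row = Vectors.sparse(10, indices, values)
```

Running a check like `isStrictlyIncreasing` over the rows before building the matrix is one cheap way to confirm whether unsorted indices are present at all.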