Hi,

I would like your input on how to avoid RDD shuffling in a join after a
distributed matrix operation in Spark.

The following is what my app looks like:

1. Created a dense matrix as input to calculate the cosine similarity between
columns


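    import org.apache.spark.mllib.linalg.Vectors

    // Parse each line into a dense vector; each vector is one row of the matrix.
    val rowMarixIn = sc.textFile("input.csv").map { line =>
      val values = line.split(" ").map(_.toDouble)
      Vectors.dense(values)
    }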

2. Extracted a set of entries from the coordinate matrix after the cosine
similarity calculation

        import org.apache.spark.mllib.linalg.distributed.RowMatrix

        val coMatrix = new RowMatrix(rowMarixIn)
        val similarRows = coMatrix.columnSimilarities()

        // Extract entries over a specific threshold, keyed by column index.
        val rowIndices = similarRows.entries
          .filter(entry => entry.value > someThreshold)
          .map(entry => (entry.j, entry.value))

3. We have another RDD, rdd2(key, Val2)

I just want to join the two RDDs, rowIndices(key, Val) and rdd2(key, Val2):

    val joinedRDD = rowIndices.join(rdd2)

It is evident that this will result in a shuffle.
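
For reference, one option I am aware of is to give both RDDs the same
partitioner before the join. Below is a minimal sketch of that idea, not my
actual app: it assumes local mode, made-up sample data, and an arbitrary
partition count of 8, and the names CoPartitionedJoinSketch and
coPartitionedJoin are hypothetical.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CoPartitionedJoinSketch {

  // Hypothetical helper: partition both sides with one shared partitioner, then join.
  def coPartitionedJoin(
      left: RDD[(Long, Double)],
      right: RDD[(Long, String)],
      numPartitions: Int): RDD[(Long, (Double, String))] = {
    val partitioner = new HashPartitioner(numPartitions)

    // partitionBy shuffles each RDD once; persist keeps that layout for reuse.
    val leftPartitioned = left.partitionBy(partitioner).persist()
    val rightPartitioned = right.partitionBy(partitioner).persist()

    // Both sides already use this partitioner, so the join needs no further shuffle.
    leftPartitioned.join(rightPartitioned, partitioner)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("co-partitioned-join").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Made-up stand-ins for rowIndices(key, Val) and rdd2(key, Val2).
    val rowIndices = sc.parallelize(Seq((0L, 0.91), (2L, 0.87), (3L, 0.95)))
    val rdd2 = sc.parallelize(Seq((0L, "a"), (1L, "b"), (2L, "c")))

    // 8 partitions is an arbitrary value chosen for illustration.
    val joinedRDD = coPartitionedJoin(rowIndices, rdd2, 8)
    joinedRDD.collect().foreach(println)

    sc.stop()
  }
}
```

As far as I understand, partitionBy still shuffles each RDD once, so this
would only pay off if the partitioned RDDs are persisted and reused across
several joins.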

What are the best practices to follow in order to avoid the shuffle?
Any suggestions on a better approach to handling a RowMatrix calculation and
using the result afterwards would be much appreciated.


