Hi,
I'd like your input on how to avoid an RDD shuffle in a join that follows a
distributed matrix operation in Spark.

The following is roughly what my app looks like:

1. Create a dense matrix as input for calculating the cosine similarity
between columns


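    import org.apache.spark.mllib.linalg.Vectors

    // Parse each space-separated line into a dense vector (one matrix row)
    val rowMarixIn = sc.textFile("input.csv").map { line =>
      val values = line.split(" ").map(_.toDouble)
      Vectors.dense(values)
    }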

2. Extract the set of entries from the CoordinateMatrix after the cosine
similarity calculation

        import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

        val coMatrix = new RowMatrix(rowMarixIn)
        val similarRows = coMatrix.columnSimilarities()

        // Extract entries above a specific threshold as (col, sim) pairs
        val rowIndices = similarRows.entries.collect {
          case MatrixEntry(row, col, sim) if sim > someThreshold => (col, sim)
        }

3. We have another RDD, rdd2(key, Val2)

I just want to join the two RDDs, rowIndices(key, Val) and rdd2(key, Val2):

    val joinedRDD = rowIndices.join(rdd2)

It's evident that this will result in a shuffle.
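To make the concern concrete, here is a minimal sketch of the join in isolation. The sample data, the object name JoinShuffleSketch, the local master, and the partition count of 4 are all made up for illustration and are not from my real app. As far as I understand, when one side already has a partitioner and the join uses that same partitioner, only the other side needs to be shuffled, but please correct me if that is wrong.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object JoinShuffleSketch {
  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs standalone
    val conf = new SparkConf().setAppName("join-shuffle-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Stand-in for rowIndices: (column index, similarity) pairs, values made up
    val rowIndices = sc.parallelize(Seq((0L, 0.91), (2L, 0.87), (5L, 0.95)))
    // Stand-in for rdd2(key, Val2), values made up
    val rdd2 = sc.parallelize(Seq((0L, "a"), (2L, "b"), (7L, "c")))

    // Neither RDD carries a partitioner here, so the plain join
    // has to shuffle both sides; the lineage shows the shuffle stages
    println(rowIndices.partitioner)
    println(rdd2.partitioner)
    println(rowIndices.join(rdd2).toDebugString)

    // Partition rdd2 once by key and cache it, then join with the same
    // partitioner so that rdd2 does not need to be shuffled again
    val partitioner = new HashPartitioner(4) // partition count is arbitrary
    val rdd2Partitioned = rdd2.partitionBy(partitioner).cache()
    val joinedRDD = rowIndices.join(rdd2Partitioned, partitioner)
    println(joinedRDD.toDebugString)

    sc.stop()
  }
}
```

Even in this form rowIndices is still shuffled, since the entries coming out of columnSimilarities() are not partitioned by key, which is the part I would like to avoid.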

What are the best practices to follow in order to avoid the shuffle?
Any suggestions on a better approach to handling a RowMatrix calculation and
using its result afterwards would be much appreciated.

Thanks,
Tharindu
