Hi,

I would like your input on how to avoid an RDD shuffle in a join that follows a distributed matrix operation in Spark.

The following is what my app looks like:

1. Create a dense matrix as the input for calculating the cosine similarity between columns:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

    // one dense vector per input line
    val rowMatrixIn = sc.textFile("input.csv").map { line =>
      val values = line.split(" ").map(_.toDouble)
      Vectors.dense(values)
    }

2. Extract a set of entries from the coordinate matrix after the cosine calculation:

    val coMatrix = new RowMatrix(rowMatrixIn)
    val similarRows = coMatrix.columnSimilarities()
    // extract entries over a specific threshold
    val rowIndices = similarRows.entries.collect {
      case MatrixEntry(row, col, sim) if sim > someThreshold => (col, sim)
    }

3. We have another RDD, rdd2(key, Val2), and just want to join the two RDDs, rowIndices(key, Val) and rdd2(key, Val2):

    val joinedRDD = rowIndices.join(rdd2)

It is evident that this will result in a shuffle.

What are the best practices to follow in order to avoid the shuffle? Any suggestions on a better approach to handling a RowMatrix calculation and using its result afterwards would be much appreciated.

Thanks,
Tharindu
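
P.S. To make the question concrete: one option I have seen described is to partition both RDDs with the same partitioner before joining, so that the join itself can reuse that layout instead of shuffling again. Below is a minimal sketch under assumptions, not something I have validated on the real data. The helper coPartitionedJoin, the partition count, the String value type for rdd2, and the toy data are all hypothetical stand-ins. Note that each partitionBy call is itself a shuffle, so this would presumably only pay off if the partitioned RDDs are reused.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CoPartitionedJoin {

  // Hypothetical helper: partition both sides with the same HashPartitioner
  // and pass that partitioner to join, so the join step has narrow dependencies.
  def coPartitionedJoin(
      left: RDD[(Long, Double)],
      right: RDD[(Long, String)],
      numPartitions: Int): RDD[(Long, (Double, String))] = {
    val partitioner = new HashPartitioner(numPartitions)
    // Each partitionBy is one shuffle; the join below reuses the resulting layout.
    val leftPart = left.partitionBy(partitioner)
    // Cache the side that is expected to be joined repeatedly (assumption).
    val rightPart = right.partitionBy(partitioner).cache()
    leftPart.join(rightPart, partitioner)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for illustration only.
    val conf = new SparkConf().setAppName("co-partitioned-join").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Toy stand-ins for rowIndices(key, Val) and rdd2(key, Val2).
      val rowIndices = sc.parallelize(Seq((1L, 0.9), (2L, 0.8)))
      val rdd2 = sc.parallelize(Seq((1L, "a"), (3L, "c")))
      val joined = coPartitionedJoin(rowIndices, rdd2, numPartitions = 8)
      joined.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

If rdd2 is small enough to fit in memory on each executor, broadcasting it and doing a map-side lookup might be another way to avoid the shuffle, but I have not tried either approach.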