I've been trying to achieve the same objective, and came up with approaches
similar to your methods 1 and 2. Method 2 is the slowest for me because of the
massive amount of data being shuffled at each matrix-operation stage. Method 3
is new to me, so I can't comment much.
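Whatever the distribution strategy, the per-row computation is the same: normalize the query vector and each candidate vector, take dot products, and sort descending. Below is a minimal NumPy sketch of that core step (function and variable names are hypothetical, not from Spark's API); in Spark this is the logic you would wrap in a UDF or apply after collecting the vectors, assuming dense vectors with nonzero norms.

```python
import numpy as np

def top_k_cosine(query, matrix, k=3):
    """Return indices and scores of the k rows of `matrix`
    most cosine-similar to `query`."""
    # Normalize the query and every candidate row to unit length,
    # so the dot product equals the cosine similarity.
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = m @ q
    # argsort is ascending; reverse it and keep the first k.
    idx = np.argsort(sims)[::-1][:k]
    return idx, sims[idx]

# Hypothetical example data: three 2-d vectors.
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.0])
idx, scores = top_k_cosine(query, vecs, k=2)
# idx[0] is 0: the vector parallel to the query scores 1.0.
```

Since the normalization of the candidate vectors does not depend on the query, it can be done once up front and cached, which is what makes repeated lookups cheap.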
I ended up using an approach
There are several ways I can compute the cosine similarities between a Spark ML
vector and each ML vector in a Spark DataFrame column, then sort for the
highest results. However, I can't come up with a method that is faster than
replacing the `/data/` in a Spark ML Word2Vec model, then using