Github user srowen commented on the issue: https://github.com/apache/spark/pull/22893

So the pull request right now doesn't reflect what you tested; you tested the version pasted above. You're saying that the optimization just never helps the dense-dense case, and that sqdist is faster than a dot product. That doesn't make sense mathematically, since it should involve more arithmetic, but stranger things have happened.

Still, I don't follow your test code here. You parallelize one vector, map it, and collect it: why use Spark at all? And it's the same vector over and over, and it's not a big vector. Your sparse vectors aren't very sparse. How about more representative input: larger vectors (hundreds of elements, probably), sparser sparse vectors, and a large set of different inputs? I also don't see where the precision bound is changed here.

This may be a good change, but I'm just not yet convinced by the test methodology, and the result still doesn't make much intuitive sense.
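For illustration, a benchmark along the lines suggested above could be sketched in plain Scala without Spark, comparing the direct sum-of-squared-differences distance against the dot-product identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a . b) over many distinct random pairs. This is only a sketch of the methodology being asked for; the object and method names (`SqDistBench`, `sqdistDirect`, `sqdistViaDot`) are hypothetical and not from the PR:

```scala
import scala.util.Random

object SqDistBench {
  // Direct squared Euclidean distance: sum of squared element-wise differences.
  def sqdistDirect(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    s
  }

  // Plain dot product of two equal-length dense vectors.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  // Squared distance via the identity ||a||^2 + ||b||^2 - 2(a . b).
  def sqdistViaDot(a: Array[Double], b: Array[Double]): Double =
    dot(a, a) + dot(b, b) - 2.0 * dot(a, b)

  def main(args: Array[String]): Unit = {
    val rnd    = new Random(42)
    val n      = 500  // "hundreds of elements", not a tiny vector
    val trials = 1000 // many distinct input pairs, not one vector repeated
    var maxErr = 0.0
    for (_ <- 0 until trials) {
      val a = Array.fill(n)(rnd.nextDouble())
      val b = Array.fill(n)(rnd.nextDouble())
      maxErr = math.max(maxErr, math.abs(sqdistDirect(a, b) - sqdistViaDot(a, b)))
    }
    println(f"max abs difference over $trials pairs: $maxErr%.3e")
  }
}
```

Timing each variant over fresh, representative inputs like these (and separately over genuinely sparse vectors) would give a more convincing comparison than collecting the same small vector through Spark.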