I'm working on implementing LSH (locality-sensitive hashing) on Spark. I
started with the implementation provided by SoundCloud:
https://github.com/soundcloud/cosine-lsh-join-spark/blob/master/src/main/scala/com/soundcloud/lsh/Lsh.scala
When I check the WebUI, I see that after the call to sortBy, the number of
partitions of the RDD decreases from 30 to 2. I also verified this by
checking rdd.partitions.size.
As far as I can see from the code of the RDD class
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala),
the default number of output partitions should equal the number of
partitions in the parent RDD, which in this case should be 30. Even when I
set the number of partitions explicitly, the problem still occurs.
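
For reference, this is roughly what the explicit version looks like (the
name hashedRdd and the element type are placeholders, not the actual
identifiers in Lsh.scala):

import org.apache.spark.rdd.RDD

// hashedRdd stands in for the intermediate RDD inside Lsh.scala;
// in my job it has 30 partitions going into the sort.
def sortExplicitly(hashedRdd: RDD[(Long, String)]): RDD[(Long, String)] = {
  // Spark 1.6: sortBy(f, ascending = true, numPartitions = this.partitions.length)
  val sorted = hashedRdd.sortBy(_._1, ascending = true, numPartitions = 30)
  println(s"partitions after sortBy: ${sorted.partitions.size}") // I still see 2 here
  sorted
}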
However, when I try the following simple code, it works as I expect:
val d = Seq(1, 2, 5, 6, 3, 4, 2)
val data = sc.parallelize(d, 5)
val sortedData = data.sortBy(x => x)
println(sortedData.partitions.size) // prints 5, as expected
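
In case it helps, this is the probe I've been using to see where the count
changes, shown here on the simple data RDD above (in my real job I apply
the same probe to the intermediate RDD in Lsh.scala). toDebugString
prefixes each RDD in the lineage with its partition count in parentheses:

println(s"before sortBy: ${data.partitions.size}")       // 5
val sortedAgain = data.sortBy(identity)
println(s"after sortBy: ${sortedAgain.partitions.size}") // 5 here, 2 in the LSH job
println(sortedAgain.toDebugString) // each line starts with (n) = partition count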

I'm using Spark 1.6.1.
Thank you for your help.
Screenshot from the WebUI: http://apache-spark-user-list.1001560.n3.nabble.com/file/n26819/54.png




