Inconsistent RDD Sample size

2014-05-21 Thread glxc
I have a graph and am trying to take a random sample of vertices without
replacement, using the RDD.sample() method.

verts are the vertices in the graph:

  val verts = graph.vertices

and executing this multiple times in a row 

  verts.sample(false, 1.toDouble / verts.count.toDouble,
    System.currentTimeMillis).count

yields a different count nearly every time (albeit within a small percentage
of the target).

Why does this happen? I looked at PartitionwiseSampledRDD but can't figure it
out.

Also, is there another method or technique that yields the same result each
time? My understanding is that grabbing random indices may not be the best
use of the RDD model.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Inconsistent-RDD-Sample-size-tp6197.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Inconsistent RDD Sample size

2014-05-21 Thread Xiangrui Meng
RDD.sample() doesn't guarantee an exact sample size. If you fix the random
seed, it will return the same result every time. -Xiangrui
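
To illustrate the point above, here is a minimal sketch in plain Scala (no
Spark required). It assumes that sampling without replacement keeps each
element independently with probability equal to the fraction, which is how a
Bernoulli sampler behaves. The names BernoulliSampleDemo and bernoulliSample
are made up for this example and are not part of Spark.

```scala
import scala.util.Random

object BernoulliSampleDemo {
  // Hypothetical helper: keep each element independently with
  // probability `fraction`, using a generator seeded with `seed`.
  def bernoulliSample[T](items: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    items.filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    val items = 1 to 100000
    val fraction = 0.01 // expected sample size is 1000, but it is not guaranteed

    // Different seeds: the counts cluster around 1000 but generally differ.
    val counts = (1L to 5L).map(seed => bernoulliSample(items, fraction, seed).size)
    println(s"counts with different seeds: $counts")

    // Same seed: the same elements are kept, so the count is the same too.
    val a = bernoulliSample(items, fraction, 42L)
    val b = bernoulliSample(items, fraction, 42L)
    println(s"same seed gives same sample: ${a == b}")
  }
}
```

In the original snippet the seed is System.currentTimeMillis, so every call
uses a different seed and the count moves around the target. Passing a
constant seed should make repeated calls on the same RDD agree. If an exact
number of elements is needed, RDD.takeSample(withReplacement, num, seed) is
worth a look: it returns an array of num elements to the driver, so it only
suits samples small enough to fit there.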
