Hi, I am using Spark SQL on the 1.2.1 branch. The problem comes from the following 4-line code:

    val t1: SchemaRDD = hiveContext hql "select * from product where is_new = 0"
    val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05)
    tb1.registerTempTable("t1_tmp")
    (hiveContext sql "select count(*) from t1_tmp where is_new = 1") collect foreach println

We know that t1 contains only rows whose "is_new" field is zero. After sampling 5% of the rows of t1, the sampled table should still contain only rows where "is_new" = 0.

However, line 4 returns a count of about 5, apparently by chance. That means the sampled table contains some rows where "is_new" = 1, which is not logically possible.

I am not sure SchemaRDD.sample is working correctly. Any ideas?

Hao

View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-sample-problem-tp20741.html
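
P.S. To make the expectation concrete, here is a minimal plain-Scala sketch of the invariant I am relying on. It does not use Spark at all, and it is not meant to describe how SchemaRDD.sample is implemented. The names `Product`, `SampleInvariant`, the generated data, and the seed are made up for illustration. The point is only that sampling without replacement can drop rows but should never produce a row that was not in its input.

```scala
import scala.util.Random

// Hypothetical stand-in for a row of the `product` table; only `isNew` matters here.
case class Product(id: Int, isNew: Int)

object SampleInvariant {
  // Bernoulli sampling without replacement: keep each row independently with
  // probability `fraction`. Rows are only dropped, never created or modified.
  def sample[A](rows: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    val rng = new Random(seed)
    rows.filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    // Made-up data: half of the rows have isNew = 0, half have isNew = 1.
    val product = (1 to 10000).map(i => Product(i, i % 2))

    // Same steps as in the Spark code: filter first, then take a 5% sample.
    val t1 = product.filter(_.isNew == 0)
    val tb1 = sample(t1, fraction = 0.05, seed = 42L)

    // Every sampled row comes from t1, so this count is always 0.
    println(tb1.count(_.isNew == 1))
  }
}
```

With a sampler like this, the last line prints 0 whatever the seed or fraction, which is the behaviour I expected from line 4 of my Spark code.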