SchemaRDD.sample problem

Hao Ren Wed, 17 Dec 2014 02:31:12 -0800

Hi,

I am using Spark SQL on the 1.2.1 branch. The problem comes from the following
four lines of code:


val t1: SchemaRDD = hiveContext.hql("select * from product where is_new = 0")
val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05)
tb1.registerTempTable("t1_tmp")
hiveContext.sql("select count(*) from t1_tmp where is_new = 1").collect().foreach(println)

We know that *t1* contains only rows whose "is_new" field is zero.
After sampling 5% of the rows of t1, the sampled table should still
contain only rows where "is_new" = 0. However, line 4 returns a count of
around 5, depending on the random sample. That means the sampled table
contains some rows where "is_new" = 1, which should not be logically possible.
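
To make the expectation concrete, here is a minimal sketch in plain Scala (no Spark involved) of the invariant I expect to hold. The names SampleInvariant, Product and sample are hypothetical and only for illustration; the sample helper is a simple Bernoulli sampler that I assume approximates sampling without replacement, not Spark's actual implementation.

```scala
import scala.util.Random

object SampleInvariant {
  // Hypothetical stand-in for a row of the `product` table.
  final case class Product(id: Int, isNew: Int)

  // Bernoulli sampling without replacement: keep each row
  // independently with probability `fraction`.
  def sample[A](rows: Seq[A], fraction: Double, rng: Random): Seq[A] =
    rows.filter(_ => rng.nextDouble() < fraction)

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L)

    // Made-up data: every third product is flagged as new.
    val all = (1 to 10000).map(i => Product(i, if (i % 3 == 0) 1 else 0))

    // Same shape as the query above: filter first, then sample 5%.
    val t1 = all.filter(_.isNew == 0)
    val tb1 = sample(t1, 0.05, rng)

    // A sample of a filtered collection is a subset of it,
    // so no row with isNew == 1 can appear.
    val violations = tb1.count(_.isNew == 1)
    println(s"violations = $violations") // prints "violations = 0"
  }
}
```

Since the sample can only pick rows that are already in t1, the count of rows with is_new = 1 has to be zero whatever the seed or the fraction.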

I am not sure SchemaRDD.sample is working correctly.

Any ideas?

Hao



