Re: SchemaRDD.sample problem

madhu phatak Thu, 18 Dec 2014 21:51:07 -0800

Hi,
Can you clean up the code a little? It's hard to read what's going
on. You can use pastebin or a gist to post the code.
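
Something like the layout below would be easier to follow. It is only a sketch of the four quoted lines, one statement per line. The context setup and the object name SampleRepro are assumptions (they are not in the quoted message), and it needs a Hive table named product and a Spark 1.2 build with Hive support to run.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SchemaRDD
import org.apache.spark.sql.hive.HiveContext

// Hypothetical wrapper object; only the four numbered statements come from the quoted message.
object SampleRepro {
  def main(args: Array[String]): Unit = {
    // Assumption: context setup is not shown in the quoted message.
    // The master URL is expected to be supplied by spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("SchemaRDD-sample-repro"))
    val hiveContext = new HiveContext(sc)

    // 1. Select only the rows where is_new = 0.
    val t1: SchemaRDD = hiveContext.hql("select * from product where is_new = 0")

    // 2. Take a 5% sample without replacement.
    val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05)

    // 3. Register the sample as a temporary table.
    tb1.registerTempTable("t1_tmp")

    // 4. Count sampled rows where is_new = 1; the expected count is 0.
    hiveContext.sql("select count(*) from t1_tmp where is_new = 1").collect().foreach(println)

    sc.stop()
  }
}
```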


On Wed, Dec 17, 2014 at 3:58 PM, Hao Ren <inv...@gmail.com> wrote:
>
> Hi,
>
> I am using SparkSQL on 1.2.1 branch. The problem comes from the following
> four lines of code:
>
> val t1: SchemaRDD = hiveContext hql "select * from product where is_new = 0"
> val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05)
> tb1.registerTempTable("t1_tmp")
> (hiveContext sql "select count(*) from t1_tmp where is_new = 1") collect foreach println
>
> We know that *t1* contains only rows whose "is_new" field is zero.
> After sampling 5% of the rows of t1, the sampled table should still
> contain only rows where "is_new" = 0. However, line 4 gives a number of
> about 5, by chance. That means the sampled table contains some rows
> where "is_new = 1", which is not logically possible.
>
> I am not sure SchemaRDD.sample is working correctly.
>
> Any ideas?
>
> Hao
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-sample-problem-tp20741.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

-- 
Regards,
Madhukara Phatak
http://www.madhukaraphatak.com
