Re: SchemaRDD.sample problem
Here is a more cleaned-up version, which can be run in ./sbt/sbt hive/console to easily reproduce this issue:

    sql("SELECT * FROM src WHERE key % 2 = 0")
      .sample(withReplacement = false, fraction = 0.05)
      .registerTempTable("sampled")
    println(table("sampled").queryExecution)

    val query = sql("SELECT * FROM sampled WHERE key % 2 = 1")
    println(query.queryExecution)

    // Should print `true`
    println((1 to 10).map(_ => query.collect().isEmpty).reduce(_ && _))

The first query plan is:

    == Parsed Logical Plan ==
    'Subquery sampled
     'Sample 0.05, false, 7800929008570881071
      'Project [*]
       'Filter (('key % 2) = 0)
        'UnresolvedRelation None, src, None

    == Analyzed Logical Plan ==
    Sample 0.05, false, 7800929008570881071
     Project [key#12,value#13]
      Filter ((key#12 % 2) = 0)
       MetastoreRelation default, src, None

    == Optimized Logical Plan ==
    Sample 0.05, false, 7800929008570881071
     Filter ((key#12 % 2) = 0)
      MetastoreRelation default, src, None

    == Physical Plan ==
    Sample 0.05, false, 7800929008570881071
     Filter ((key#12 % 2) = 0)
      HiveTableScan [key#12,value#13], (MetastoreRelation default, src, None), None

The second query plan is:

    == Parsed Logical Plan ==
    'Project [*]
     'Filter (('key % 2) = 1)
      'UnresolvedRelation None, sampled, None

    == Analyzed Logical Plan ==
    Project [key#14,value#15]
     Filter ((key#14 % 2) = 1)
      Sample 0.05, false, 7800929008570881071
       Project [key#14,value#15]
        Filter ((key#14 % 2) = 0)
         MetastoreRelation default, src, None

    == Optimized Logical Plan ==
    Filter ((key#14 % 2) = 1)
     Sample 0.05, false, 7800929008570881071
      Filter ((key#14 % 2) = 0)
       MetastoreRelation default, src, None

    == Physical Plan ==
    Filter ((key#14 % 2) = 1)
     Sample 0.05, false, 7800929008570881071
      Filter ((key#14 % 2) = 0)
       HiveTableScan [key#14,value#15], (MetastoreRelation default, src, None), None

Notice that when the fraction is less than 0.4, GapSamplingIterator is used to do the sampling. I guess it has something to do with the underlying mutable row objects used in HiveTableScan, but I haven't found a clue yet.
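To make the suspected failure mode concrete, here is a minimal Spark-free sketch (my own illustrative code; `Buf` and the iterators are hypothetical, not Spark APIs). It mimics a scan that reuses one mutable row object, overwritten in place: a sampler that keeps references without copying ends up holding aliases of whatever the buffer contained last, while a sampler that copies sees the right values.

```scala
object MutableRowSketch {
  // Stand-in for a mutable row that a scan reuses for every record.
  final class Buf { var key: Int = 0 }

  def main(args: Array[String]): Unit = {
    val shared = new Buf
    // "Scan": yields the SAME object, mutated in place, for keys 0..9.
    def scan: Iterator[Buf] =
      (0 until 10).iterator.map { k => shared.key = k; shared }

    // Broken "sample": keeps references to every other element without copying.
    val sampledRefs = scan.zipWithIndex.collect { case (b, i) if i % 2 == 0 => b }.toList
    // All kept references alias the shared buffer, which now holds the last key.
    assert(sampledRefs.forall(_.key == 9))

    // Fixed "sample": copy the row before keeping it, as a correct sampler must.
    val sampledCopies = scan.zipWithIndex.collect { case (b, i) if i % 2 == 0 =>
      val c = new Buf; c.key = b.key; c
    }.toList
    assert(sampledCopies.map(_.key) == List(0, 2, 4, 6, 8))
  }
}
```

If HiveTableScan's mutable rows reach the sampler this way, the sampled rows could report values from entirely different records, which would explain rows that the earlier Filter should have removed.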
On 12/24/14 12:39 AM, Hao Ren wrote:
> One observation is that: if the fraction is big, say 50% - 80%, sampling is good and everything runs as expected. But if the fraction is small, for example 5%, the sampled data contains wrong rows which should have been filtered out.
>
> The workaround is materializing t1 first:
>
>     t1.cache
>     t1.count
>
> These operations make sure that t1 is materialized correctly so that the following sample will work. This approach is tested and works fine. But I still don't know why SchemaRDD.sample causes the problem when the fraction is small. Any help is appreciated.
>
> Hao
Re: SchemaRDD.sample problem
One observation is that: if the fraction is big, say 50% - 80%, sampling is good and everything runs as expected. But if the fraction is small, for example 5%, the sampled data contains wrong rows which should have been filtered out.

The workaround is materializing t1 first:

    t1.cache
    t1.count

These operations make sure that t1 is materialized correctly so that the following sample will work. This approach is tested and works fine. But I still don't know why SchemaRDD.sample causes the problem when the fraction is small. Any help is appreciated.

Hao

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-sample-problem-tp20741p20835.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
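Under the mutable-row hypothesis raised elsewhere in the thread, this workaround would help because caching materializes t1 into stable, independently stored rows, so the subsequent sample no longer reads through a reused buffer. A Spark-free sketch of that idea (all names here are illustrative; "cache" is simulated by copying rows into a collection, which is an assumption about why cache + count works, not a statement of Spark internals):

```scala
import scala.util.Random

object CacheThenSample {
  // Stand-in for a mutable row reused by the scan.
  final class Row(var isNew: Int)

  // "Scan": 50 rows with is_new = 0 followed by 50 with is_new = 1,
  // all delivered through ONE shared mutable object.
  def scan(shared: Row): Iterator[Row] =
    Iterator.tabulate(100) { i => shared.isNew = if (i < 50) 0 else 1; shared }

  def main(args: Array[String]): Unit = {
    val rng = new Random(7)
    val shared = new Row(0)

    // Sampling references straight off the reused buffer: every kept row
    // aliases `shared`, which ends up holding the LAST value written (1).
    val direct = scan(shared).filter(_ => rng.nextDouble() < 0.2).toList
    assert(direct.forall(_.isNew == 1))

    // "Cache" first: copy each row into a materialized collection
    // (the analogue of t1.cache + t1.count), then sample the copies.
    val cached = scan(shared).map(r => new Row(r.isNew)).toList
    val sampled = cached.filter(_ => rng.nextDouble() < 0.2)
    // Copied rows keep the value they had when scanned.
    assert(cached.take(50).forall(_.isNew == 0))
    assert(cached.drop(50).forall(_.isNew == 1))
    assert(sampled.forall(r => r.isNew == 0 || r.isNew == 1))
  }
}
```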
Re: SchemaRDD.sample problem
update: t1 is good. After collecting t1, I find that all rows are OK (is_new = 0). It is only after sampling that there are some rows where is_new = 1, which should have been filtered out by the WHERE clause.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-sample-problem-tp20741p20833.html
Re: SchemaRDD.sample problem
Hi,

Can you clean up the code a little bit? It's hard to read what's going on. You can use pastebin or gist to share the code.

On Wed, Dec 17, 2014 at 3:58 PM, Hao Ren wrote:
>
> Hi,
>
> I am using SparkSQL on the 1.2.1 branch. The problem comes from the
> following 4-line code:
>
>     val t1: SchemaRDD = hiveContext hql "select * from product where is_new = 0"
>     val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05)
>     tb1.registerTempTable("t1_tmp")
>     (hiveContext sql "select count(*) from t1_tmp where is_new = 1") collect foreach println
>
> We know that t1 contains only rows whose "is_new" field is zero.
> After sampling t1 by taking 5% of its rows, the sampled table should
> always contain only rows where "is_new" = 0. However, line 4 gives a
> number around 5 by chance. That means there are some rows where
> "is_new" = 1 in the sampled table, which should not be logically possible.
>
> I am not sure SchemaRDD.sample is doing its work well.
>
> Any idea?
>
> Hao

--
Regards,
Madhukara Phatak
http://www.madhukaraphatak.com
SchemaRDD.sample problem
Hi,

I am using SparkSQL on the 1.2.1 branch. The problem comes from the following 4-line code:

    val t1: SchemaRDD = hiveContext hql "select * from product where is_new = 0"
    val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05)
    tb1.registerTempTable("t1_tmp")
    (hiveContext sql "select count(*) from t1_tmp where is_new = 1") collect foreach println

We know that t1 contains only rows whose "is_new" field is zero. After sampling t1 by taking 5% of its rows, the sampled table should always contain only rows where "is_new" = 0. However, line 4 gives a number around 5 by chance. That means there are some rows where "is_new" = 1 in the sampled table, which should not be logically possible.

I am not sure SchemaRDD.sample is doing its work well.

Any idea?

Hao

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-sample-problem-tp20741.html
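For reference, sample(withReplacement = false, fraction = 0.05) performs Bernoulli sampling: each row is kept independently with probability 0.05. A minimal sketch of those semantics (my own code, not Spark's implementation) shows why the observed behavior is a bug: a correct sampler may change how many rows survive, but never the values inside them, so no is_new = 1 row should ever appear.

```scala
import scala.util.Random

object BernoulliSample {
  // Bernoulli (without-replacement) sampling: keep each element
  // independently with probability `fraction`.
  def sample[A](xs: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    val rng = new Random(seed)
    xs.filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    val t1 = Seq.fill(10000)(0)              // every row has is_new = 0
    val tb1 = sample(t1, fraction = 0.05, seed = 42L)
    assert(tb1.forall(_ == 0))               // values are preserved, only the count shrinks
    assert(tb1.size > 300 && tb1.size < 700) // roughly 5% of 10000 rows survive
  }
}
```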