Re: SchemaRDD.sample problem

2014-12-23 Thread Cheng Lian
Here is a cleaned-up version, which can be run in |./sbt/sbt hive/console|
to reproduce the issue easily:


|sql("SELECT * FROM src WHERE key % 2 = 0").
  sample(withReplacement = false, fraction = 0.05).
  registerTempTable("sampled")

println(table("sampled").queryExecution)

val query = sql("SELECT * FROM sampled WHERE key % 2 = 1")
println(query.queryExecution)

// Should print `true'
println((1 to 10).map(_ => query.collect().isEmpty).reduce(_ && _))
|

The first query plan is:

|== Parsed Logical Plan ==
'Subquery sampled
 'Sample 0.05, false, 7800929008570881071
  'Project [*]
   'Filter (('key % 2) = 0)
    'UnresolvedRelation None, src, None

== Analyzed Logical Plan ==
Sample 0.05, false, 7800929008570881071
 Project [key#12,value#13]
  Filter ((key#12 % 2) = 0)
   MetastoreRelation default, src, None

== Optimized Logical Plan ==
Sample 0.05, false, 7800929008570881071
 Filter ((key#12 % 2) = 0)
  MetastoreRelation default, src, None

== Physical Plan ==
Sample 0.05, false, 7800929008570881071
 Filter ((key#12 % 2) = 0)
  HiveTableScan [key#12,value#13], (MetastoreRelation default, src, None), None
|

The second query plan is:

|== Parsed Logical Plan ==
'Project [*]
 'Filter (('key % 2) = 1)
  'UnresolvedRelation None, sampled, None

== Analyzed Logical Plan ==
Project [key#14,value#15]
 Filter ((key#14 % 2) = 1)
  Sample 0.05, false, 7800929008570881071
   Project [key#14,value#15]
    Filter ((key#14 % 2) = 0)
     MetastoreRelation default, src, None

== Optimized Logical Plan ==
Filter ((key#14 % 2) = 1)
 Sample 0.05, false, 7800929008570881071
  Filter ((key#14 % 2) = 0)
   MetastoreRelation default, src, None

== Physical Plan ==
Filter ((key#14 % 2) = 1)
 Sample 0.05, false, 7800929008570881071
  Filter ((key#14 % 2) = 0)
   HiveTableScan [key#14,value#15], (MetastoreRelation default, src, None), None
|

Notice that when the fraction is less than 0.4, a |GapSamplingIterator| is
used to do the sampling. I suspect this has something to do with the
mutable row objects reused by |HiveTableScan|, but I haven't found a
definitive clue yet.
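The hunch about reused mutable rows can be illustrated without Spark at all. The sketch below is hypothetical plain Scala (no Spark classes): `MutableRow` stands in for the single row buffer a scan reuses per partition, and a lazily filtered iterator stands in for any operator that keeps row references across `next()` calls.

```scala
// Hypothetical plain-Scala sketch of the suspected mechanism: a scan
// that overwrites one shared mutable buffer per row, and a consumer
// that keeps references to that buffer without copying.
case class MutableRow(var key: Int)

object RowReuseBug {
  def main(args: Array[String]): Unit = {
    val shared = MutableRow(0)
    // Simulates a table scan that reuses the same row object.
    val scan: Iterator[MutableRow] =
      (1 to 10).iterator.map { i => shared.key = i; shared }

    // Keep references to rows that pass a filter, without copying them:
    val kept = scan.filter(_.key % 2 == 0).toList
    // Every kept reference now shows the last value written (10), so the
    // "filtered" result appears to contain rows the filter rejected.
    println(kept.map(_.key))  // List(10, 10, 10, 10, 10)
  }
}
```

If the sampler buffers row references this way while the scan reuses its buffer, sampled output can contain values that never passed the upstream filter.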


On 12/24/14 12:39 AM, Hao Ren wrote:


One observation: if the fraction is big, say 50%–80%, sampling is good
and everything runs as expected. But if the fraction is small, for
example 5%, the sampled data contains wrong rows which should have been
filtered out.
The workaround is materializing t1 first:
t1.cache
t1.count

These operations ensure that t1 is fully materialized, so that the
subsequent sample works correctly.

This approach is tested and works fine. But I still don't know why
SchemaRDD.sample causes this problem when the fraction is small.

Any help is appreciated.

Hao






Re: SchemaRDD.sample problem

2014-12-23 Thread Hao Ren
One observation: if the fraction is big, say 50%–80%, sampling is good
and everything runs as expected. But if the fraction is small, for
example 5%, the sampled data contains wrong rows which should have been
filtered out.

The workaround is materializing t1 first:
t1.cache
t1.count

These operations ensure that t1 is fully materialized, so that the
subsequent sample works correctly.

This approach is tested and works fine. But I still don't know why
SchemaRDD.sample causes this problem when the fraction is small.

Any help is appreciated.

Hao




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-sample-problem-tp20741p20835.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SchemaRDD.sample problem

2014-12-23 Thread Hao Ren
Update:

t1 is good. After collecting t1, I find that all rows are OK (is_new = 0).
Just after sampling, there are some rows where is_new = 1, which should
have been filtered out by the WHERE clause.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-sample-problem-tp20741p20833.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: SchemaRDD.sample problem

2014-12-18 Thread madhu phatak
Hi,
Can you clean up the code a little bit? It's hard to read what's going
on. You could use pastebin or gist to share the code.

On Wed, Dec 17, 2014 at 3:58 PM, Hao Ren  wrote:
>
> Hi,
>
> I am using SparkSQL on the 1.2.1 branch. The problem comes from the
> following 4-line snippet:
>
> *val t1: SchemaRDD = hiveContext hql "select * from product where is_new = 0"
> val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05)
> tb1.registerTempTable("t1_tmp")
> (hiveContext sql "select count(*) from t1_tmp where is_new = 1") collect foreach println*
>
> We know that *t1* contains only rows whose "is_new" field is zero.
> After sampling t1 by taking 5% of its rows, the sampled table should
> always contain only rows where "is_new" = 0. However, line 4 prints a
> number around 5 by chance. That means there are some rows where
> "is_new" = 1 in the sampled table, which should be logically impossible.
>
> I am not sure SchemaRDD.sample is doing its work correctly.
>
> Any ideas?
>
> Hao
>
>
>

-- 
Regards,
Madhukara Phatak
http://www.madhukaraphatak.com


SchemaRDD.sample problem

2014-12-17 Thread Hao Ren
Hi,

I am using SparkSQL on the 1.2.1 branch. The problem comes from the
following 4-line snippet:

*val t1: SchemaRDD = hiveContext hql "select * from product where is_new = 0"
val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05)
tb1.registerTempTable("t1_tmp")
(hiveContext sql "select count(*) from t1_tmp where is_new = 1") collect foreach println*

We know that *t1* contains only rows whose "is_new" field is zero.
After sampling t1 by taking 5% of its rows, the sampled table should
always contain only rows where "is_new" = 0. However, line 4 prints a
number around 5 by chance. That means there are some rows where
"is_new" = 1 in the sampled table, which should be logically impossible.

I am not sure SchemaRDD.sample is doing its work correctly.

Any ideas?
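The violated invariant can be stated as a quick sanity check. The sketch below is plain Scala with made-up data, not Spark: sampling rows that already passed a filter can never reintroduce rows the filter removed, whatever the fraction is.

```scala
import scala.util.Random

// Plain-Scala model of the 4-line snippet above, using immutable rows
// represented as Maps and a seeded RNG for the 5% sample.
object SampleInvariant {
  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    val product = (1 to 1000).map(i => Map("is_new" -> (i % 2)))
    val t1  = product.filter(_("is_new") == 0)          // WHERE is_new = 0
    val tb1 = t1.filter(_ => rng.nextDouble() < 0.05)   // ~5% sample, no replacement
    // Equivalent of: SELECT count(*) FROM t1_tmp WHERE is_new = 1
    println(tb1.count(_("is_new") == 1))  // 0
  }
}
```

With immutable rows the count is always zero; a non-zero count, as reported in the thread, points at the rows themselves being mutated after the filter ran.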

Hao



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-sample-problem-tp20741.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
