Here is a cleaner version that can be used in |./sbt/sbt hive/console| to reproduce this issue easily:

|sql("SELECT * FROM src WHERE key % 2 = 0").
  sample(withReplacement =false, fraction =0.05).
  registerTempTable("sampled")

println(table("sampled").queryExecution)

val  query  =  sql("SELECT * FROM sampled WHERE key % 2 = 1")
println(query.queryExecution)

// Should print `true'
println((1  to10).map(_ => query.collect().isEmpty).reduce(_ && _))
|

The first query plan is:

|== Parsed Logical Plan ==
'Subquery sampled
 'Sample 0.05, false, 7800929008570881071
  'Project [*]
   'Filter (('key % 2) = 0)
    'UnresolvedRelation None, src, None

== Analyzed Logical Plan ==
Sample 0.05, false, 7800929008570881071
 Project [key#12,value#13]
  Filter ((key#12 % 2) = 0)
   MetastoreRelation default, src, None

== Optimized Logical Plan ==
Sample 0.05, false, 7800929008570881071
 Filter ((key#12 % 2) = 0)
  MetastoreRelation default, src, None

== Physical Plan ==
Sample 0.05, false, 7800929008570881071
 Filter ((key#12 % 2) = 0)
  HiveTableScan [key#12,value#13], (MetastoreRelation default, src, None), None
|

The second query plan is:

|== Parsed Logical Plan ==
'Project [*]
 'Filter (('key % 2) = 1)
  'UnresolvedRelation None, sampled, None

== Analyzed Logical Plan ==
Project [key#14,value#15]
 Filter ((key#14 % 2) = 1)
  Sample 0.05, false, 7800929008570881071
   Project [key#14,value#15]
    Filter ((key#14 % 2) = 0)
     MetastoreRelation default, src, None

== Optimized Logical Plan ==
Filter ((key#14 % 2) = 1)
 Sample 0.05, false, 7800929008570881071
  Filter ((key#14 % 2) = 0)
   MetastoreRelation default, src, None

== Physical Plan ==
Filter ((key#14 % 2) = 1)
 Sample 0.05, false, 7800929008570881071
  Filter ((key#14 % 2) = 0)
   HiveTableScan [key#14,value#15], (MetastoreRelation default, src, None), None
|

Notice that when the fraction is less than 0.4, a |GapSamplingIterator| is used to do the sampling. I suspect this has something to do with the underlying mutable row objects used in |HiveTableScan|, but I haven't found the exact cause yet.
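If that is indeed what is happening, a deliberately simplified model would look like the following (plain Scala, no Spark; |FakeMutableRow| and the 5% filter are stand-ins of my own, not Spark internals): a scan that reuses one mutable buffer per element, combined with a sampler that keeps references instead of copies, ends up with a corrupted sample.

|// Plain Scala sketch, not Spark code: one reused mutable buffer plus a
// reference-keeping sampler.
import scala.util.Random

final class FakeMutableRow(var key: Int)

object MutableRowSamplingDemo extends App {
  val buffer = new FakeMutableRow(0)   // the single reused buffer

  // Mimics a scan of the rows satisfying WHERE key % 2 = 0
  def scan: Iterator[FakeMutableRow] =
    Iterator.range(0, 100).map { i => buffer.key = 2 * i; buffer }

  // Keep ~5% of the elements *by reference*, roughly what a
  // gap-sampling-style iterator does when it skips ahead
  val rng = new Random(42)
  val sampled = scan.filter(_ => rng.nextDouble() < 0.05).toVector

  // Every kept reference points at the same buffer, which now holds
  // whatever was written last, so the sample no longer matches the data
  println(sampled.map(_.key).distinct) // a single stale value, e.g. Vector(198)
}
|

If this model is accurate, copying each row before sampling, or materializing the table first as in Hao's workaround quoted below, would hide the problem, which matches the observed behavior.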

On 12/24/14 12:39 AM, Hao Ren wrote:

One observation is that if the fraction is large, say 50%-80%, sampling works
as expected. But if the fraction is small, for example 5%, the sampled data
contains wrong rows that should have been filtered out.

The workaround is to materialize t1 first:

t1.cache
t1.count

These operations ensure that t1 is materialized correctly, so that the
following sample works.

This approach is tested and works fine, but I still don't know why
SchemaRDD.sample causes the problem when the fraction is small.

Any help is appreciated.

Hao
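For reference, Hao's workaround maps onto the snippet at the top of this message roughly as follows (an untested sketch; here the |sampled| table plays the role of |t1|):

|val sampled = sql("SELECT * FROM src WHERE key % 2 = 0").
  sample(withReplacement = false, fraction = 0.05)
sampled.registerTempTable("sampled")

// Materialize the sampled SchemaRDD before querying it again, so that
// later queries read cached rows instead of re-running the sampler over
// the reused mutable rows produced by HiveTableScan.
sampled.cache()
sampled.count()

// Should now print `true`
println(sql("SELECT * FROM sampled WHERE key % 2 = 1").collect().isEmpty)
|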



