deshanxiao created SPARK-31140:
----------------------------------

             Summary: Support Quick sample in RDD
                 Key: SPARK-31140
                 URL: https://issues.apache.org/jira/browse/SPARK-31140
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: deshanxiao


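RDD.sample works like a *filter* over the raw data: every record of each 
parent partition is read and only the sampled records are kept. If the raw 
data is very large, reading all of it takes too much time. We could instead 
filter the raw partitions, reading only a subset of them, to speed up 
sampling. The current compute method of PartitionwiseSampledRDD passes the 
full iterator of the parent partition to the sampler: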


{code:java}
  override def compute(splitIn: Partition, context: TaskContext): Iterator[U] = {
    val split = splitIn.asInstanceOf[PartitionwiseSampledRDDPartition]
    val thisSampler = sampler.clone
    thisSampler.setSeed(split.seed)
    thisSampler.sample(firstParent[T].iterator(split.prev, context))
  }
{code}
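
One way to picture the proposal is the minimal sketch below. It is not the patch for this issue: the names QuickSample, choosePartitions, quickSample and partitionFraction are hypothetical and only illustrate the idea. It assumes Spark is on the classpath and uses the existing PartitionPruningRDD developer API to skip whole partitions before RDD.sample runs on the ones that remain.

```scala
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}
import scala.util.Random

// Hypothetical helper object, for illustration only.
object QuickSample {

  // Keep each partition index independently with probability partitionFraction.
  def choosePartitions(numPartitions: Int, partitionFraction: Double, seed: Long): Set[Int] = {
    val rng = new Random(seed)
    (0 until numPartitions).filter(_ => rng.nextDouble() < partitionFraction).toSet
  }

  // Prune partitions first, then sample records inside the kept partitions so
  // that the expected overall fraction stays close to `fraction`.
  def quickSample[T](rdd: RDD[T], fraction: Double, partitionFraction: Double, seed: Long): RDD[T] = {
    require(partitionFraction > 0.0 && partitionFraction <= 1.0, "partitionFraction must be in (0, 1]")
    require(fraction >= 0.0 && fraction <= partitionFraction, "fraction must be in [0, partitionFraction]")
    val keep = choosePartitions(rdd.getNumPartitions, partitionFraction, seed)
    // Pruned partitions are never scheduled, so their raw data is never read.
    val pruned = PartitionPruningRDD.create(rdd, (i: Int) => keep.contains(i))
    pruned.sample(withReplacement = false, fraction / partitionFraction, seed)
  }
}
```

For example, QuickSample.quickSample(rdd, 0.01, 0.1, 42L) would read roughly one partition in ten and keep about 10% of the records in each. This trades uniformity for speed: if records are not distributed evenly across partitions, the result is a cluster sample, not a uniform sample of the whole RDD.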




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
