[jira] [Commented] (SPARK-31140) Support Quick sample in RDD

deshanxiao (Jira) Sun, 15 Mar 2020 19:45:21 -0700


    [ https://issues.apache.org/jira/browse/SPARK-31140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059908#comment-17059908 ]


deshanxiao commented on SPARK-31140:
------------------------------------

[~viirya] Thanks for your comment! It means that we can override 
*getPartitions* to filter the partitions directly, so that the partitions 
that are filtered out are never read. If we have 200 partitions, 
samplePartition will return 20 partitions when the ratio is 0.1.
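
To make the arithmetic concrete, here is a minimal sketch of the index 
selection only. It is plain Scala with no Spark dependency, and the names 
PartitionSampling and samplePartitionIndices are hypothetical; they are not 
part of Spark or of any patch attached to this issue. It assumes the kept 
partitions are chosen uniformly at random from a seed.

```scala
import scala.util.Random

// Hypothetical helper, not a Spark API: chooses which partition indices to keep.
object PartitionSampling {

  /** Returns round(numPartitions * ratio) distinct partition indices,
    * chosen uniformly at random from the given seed, in ascending order.
    */
  def samplePartitionIndices(numPartitions: Int, ratio: Double, seed: Long): Seq[Int] = {
    require(numPartitions >= 0, "numPartitions must be non-negative")
    require(ratio >= 0.0 && ratio <= 1.0, "ratio must be in [0, 1]")
    // Number of partitions to keep, e.g. 200 partitions at ratio 0.1 keeps 20.
    val keep = math.round(numPartitions * ratio).toInt
    // Shuffle all indices with a seeded generator so the choice is reproducible.
    new Random(seed).shuffle((0 until numPartitions).toList).take(keep).sorted
  }
}
```

A getPartitions override could then return only the parent partitions whose 
index is in this set. Spark's existing developer API 
PartitionPruningRDD.create(rdd, partitionFilterFunc) already prunes 
partitions by index in this way. Note that sampling whole partitions is only 
comparable to row-level sampling when rows are spread across partitions 
without regard to their content.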

> Support Quick sample in RDD
> ---------------------------
>
>                 Key: SPARK-31140
>                 URL: https://issues.apache.org/jira/browse/SPARK-31140
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: deshanxiao
>            Priority: Minor
>
> RDD.sample uses *filter* to pick the data we need. This means that if the 
> raw data is very large, we must spend a long time reading all of it. We can 
> filter the raw partitions instead to speed up sampling.
> {code:java}
>   override def compute(splitIn: Partition, context: TaskContext): Iterator[U] = {
>     val split = splitIn.asInstanceOf[PartitionwiseSampledRDDPartition]
>     val thisSampler = sampler.clone
>     thisSampler.setSeed(split.seed)
>     thisSampler.sample(firstParent[T].iterator(split.prev, context))
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
