[jira] [Commented] (SPARK-31140) Support Quick sample in RDD

Hyukjin Kwon (Jira) Sun, 22 Mar 2020 23:09:46 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-31140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064557#comment-17064557
 ]


Hyukjin Kwon commented on SPARK-31140:
--------------------------------------

This seems very easy to work around. Also, given that the RDD API is almost 
frozen now, I don't think it's worthwhile to add it.
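
A minimal sketch of one possible workaround follows. The comment does not say
which workaround is meant, so this is an assumption. It prunes partitions with
the existing PartitionPruningRDD developer API and then applies the ordinary
RDD.sample to the partitions that remain. The helper name quickSample and its
parameters partitionFraction and rowFraction are hypothetical and not part of
Spark.

```scala
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}
import scala.util.Random

object QuickSampleSketch {
  // Hypothetical helper, not a Spark API: keep a random subset of the
  // partitions, then run the regular RDD.sample on the survivors only.
  def quickSample[T](
      rdd: RDD[T],
      partitionFraction: Double, // chance that each partition is kept
      rowFraction: Double,       // fraction passed on to RDD.sample
      seed: Long): RDD[T] = {
    val rng = new Random(seed)
    // Choose the surviving partition indices on the driver.
    val kept: Set[Int] =
      (0 until rdd.getNumPartitions).filter(_ => rng.nextDouble() < partitionFraction).toSet
    // No tasks are launched for pruned partitions, so their data is never read.
    val pruned = PartitionPruningRDD.create(rdd, (i: Int) => kept.contains(i))
    pruned.sample(withReplacement = false, fraction = rowFraction, seed = seed)
  }
}
```

For example, QuickSampleSketch.quickSample(rdd, 0.1, 0.1, 42L) reads roughly a
tenth of the partitions and keeps about a tenth of their rows. Note that this
is a two-stage (cluster) sample, so it is only comparable to a plain
RDD.sample when rows are not correlated with the partition they live in.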

> Support Quick sample in RDD
> ---------------------------
>
>                 Key: SPARK-31140
>                 URL: https://issues.apache.org/jira/browse/SPARK-31140
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.1.0
>            Reporter: deshanxiao
>            Priority: Minor
>
> RDD.sample uses a *filter* to pick the data we need. This means that if the 
> raw data is very large, we spend too much time reading all of it. We could 
> filter the raw partitions first to speed up sampling.
> {code:java}
>   override def compute(splitIn: Partition, context: TaskContext): Iterator[U] = {
>     val split = splitIn.asInstanceOf[PartitionwiseSampledRDDPartition]
>     val thisSampler = sampler.clone
>     thisSampler.setSeed(split.seed)
>     thisSampler.sample(firstParent[T].iterator(split.prev, context))
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
