[ https://issues.apache.org/jira/browse/SPARK-31140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059908#comment-17059908 ]
deshanxiao commented on SPARK-31140:
------------------------------------

[~viirya] Thanks for your comment! I mean that we can override *getPartitions* to filter the partitions directly. For example, if we have 200 partitions and the ratio is 0.1, samplePartition will return 20 partitions.

> Support Quick sample in RDD
> ---------------------------
>
>                 Key: SPARK-31140
>                 URL: https://issues.apache.org/jira/browse/SPARK-31140
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: deshanxiao
>            Priority: Minor
>
> RDD.sample uses a *filter*-style function to pick the data we need. This means
> that if the raw data is very large, we must spend a long time reading all of it.
> We can filter the raw partitions instead to speed up sampling.
> {code:java}
> override def compute(splitIn: Partition, context: TaskContext): Iterator[U] = {
>   val split = splitIn.asInstanceOf[PartitionwiseSampledRDDPartition]
>   val thisSampler = sampler.clone
>   thisSampler.setSeed(split.seed)
>   thisSampler.sample(firstParent[T].iterator(split.prev, context))
> }
> {code}
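
To make the 200-partition example concrete, here is a minimal sketch of the selection step only, written in plain Scala with no Spark dependency. The object name PartitionSampling and the function samplePartitionIndices are hypothetical and do not appear in the issue or in Spark. Rounding the count to the nearest integer and choosing partitions by a seeded shuffle are also assumptions, not something the proposal specifies.

```scala
import scala.util.Random

// Hypothetical helper, not part of Spark: chooses which partition
// indices a partition-level sample would keep.
object PartitionSampling {

  /** Returns round(numPartitions * fraction) distinct indices in [0, numPartitions).
    * The same seed always yields the same set, so a recomputed task
    * would see the same partitions.
    */
  def samplePartitionIndices(numPartitions: Int, fraction: Double, seed: Long): Set[Int] = {
    require(numPartitions >= 0, "numPartitions must be non-negative")
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    // Assumption: round to the nearest whole partition count.
    val keep = math.round(numPartitions * fraction).toInt
    // Seeded shuffle, then take the first `keep` indices.
    new Random(seed).shuffle((0 until numPartitions).toList).take(keep).toSet
  }
}
```

With Spark on the classpath, the resulting set could presumably be passed as the filter function to the existing developer API PartitionPruningRDD.create(rdd, kept.contains), so that unselected partitions are never read. Note that dropping whole partitions is not statistically equivalent to the per-element sampling in the compute method quoted above: if records are clustered by partition, the result can be biased.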