[ https://issues.apache.org/jira/browse/SPARK-31140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064557#comment-17064557 ]
Hyukjin Kwon commented on SPARK-31140:
--------------------------------------

This seems very simple to work around. Also, given that the RDD API is almost frozen now, I don't think it's worthwhile adding it.

> Support Quick sample in RDD
> ---------------------------
>
>                 Key: SPARK-31140
>                 URL: https://issues.apache.org/jira/browse/SPARK-31140
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.1.0
>            Reporter: deshanxiao
>            Priority: Minor
>
> RDD.sample uses *filter* to pick out the records it needs. This means that if
> the raw data is very large, we spend a long time reading all of it. We could
> filter the raw partitions instead to speed up sampling.
> {code:java}
> override def compute(splitIn: Partition, context: TaskContext): Iterator[U] = {
>   val split = splitIn.asInstanceOf[PartitionwiseSampledRDDPartition]
>   // Clone the sampler so this partition uses its own instance and seed.
>   val thisSampler = sampler.clone
>   thisSampler.setSeed(split.seed)
>   // The sampler walks the parent partition's full iterator.
>   thisSampler.sample(firstParent[T].iterator(split.prev, context))
> }
> {code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
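
For illustration, one possible shape of the workaround mentioned in the comment is sketched below. It is not taken from the ticket: `QuickSample`, `quickSample`, `partitionFraction`, and `rowFraction` are hypothetical names. The sketch assumes it is acceptable to pick a random subset of partitions first and then run the ordinary `RDD.sample` on what remains, using `PartitionPruningRDD` (marked as a developer API in Spark) so that the discarded partitions are never scheduled.

```scala
import scala.util.Random

import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

// Hypothetical helper, not part of Spark.
object QuickSample {

  def quickSample[T](
      rdd: RDD[T],
      partitionFraction: Double, // share of partitions to keep, in (0, 1]
      rowFraction: Double,       // per-row sampling fraction, in [0, 1]
      seed: Long): RDD[T] = {
    require(partitionFraction > 0.0 && partitionFraction <= 1.0,
      "partitionFraction must be in (0, 1]")
    require(rowFraction >= 0.0 && rowFraction <= 1.0,
      "rowFraction must be in [0, 1]")

    val numPartitions = rdd.getNumPartitions
    val numToKeep = math.max(1, math.ceil(numPartitions * partitionFraction).toInt)

    // Choose the partition indices to keep on the driver, reproducibly.
    val kept: Set[Int] =
      new Random(seed).shuffle((0 until numPartitions).toList).take(numToKeep).toSet

    // Partitions outside `kept` are dropped from the RDD, so no task runs for them.
    val pruned = PartitionPruningRDD.create(rdd, (i: Int) => kept.contains(i))

    // Ordinary Bernoulli sampling over the remaining partitions.
    pruned.sample(withReplacement = false, fraction = rowFraction, seed = seed)
  }
}
```

A call such as `QuickSample.quickSample(rdd, 0.1, 0.5, 42L)` would return roughly 5% of the rows while touching about 10% of the partitions. Note that this is cluster sampling rather than uniform sampling: if records are not distributed homogeneously across partitions, the result can be biased compared with plain `RDD.sample`.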