[ https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670092#comment-15670092 ]
Sean Owen commented on SPARK-18463:
-----------------------------------

I don't understand what this is proposing. The example you cite shows no sampling. You can't sample, then zip, two RDDs because they won't sample the same pairs.

> I think it's necessary to have an overloaded method of sample
> -------------------------------------------------------------
>
>                 Key: SPARK-18463
>                 URL: https://issues.apache.org/jira/browse/SPARK-18463
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Jianfei Wang
>
> Currently, in this situation:
>     rdd3 = rdd1.zip(rdd2).sample()
> if we could sample the two RDDs directly, such as
>     sample(rdd1, rdd2)
> we could reduce memory usage.
> There are some use cases in Spark MLlib, such as in GradientBoostedTrees:
>     while (m < numIterations && !doneLearning) {
>       // Update data with pseudo-residuals (residual error)
>       val data = predError.zip(input).map { case ((pred, _), point) =>
>         LabeledPoint(-loss.gradient(pred, point.label), point.features)
>       }
>       val dt = new DecisionTreeRegressor().setSeed(seed + m)
>       val model = dt.train(data, treeStrategy)
> When we use data to train the model, we take a sample,
> so we could implement a method sample(rdd1, rdd2) to reduce memory usage in such
> cases.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
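
To illustrate the point about sampled pairs, here is a minimal sketch in plain Scala collections rather than Spark RDDs. The object `PairedSampling` and its methods are hypothetical names invented for this example and are not part of Spark. It assumes Bernoulli sampling and that the two collections are sampled independently with different seeds. Under those assumptions, sampling each side and then zipping pairs up elements that never belonged together, while zipping first and then sampling keeps every surviving pair intact.

```scala
import scala.util.Random

// Hypothetical helper for illustration only; not a Spark API.
object PairedSampling {

  // Bernoulli sampling: keep each element independently with probability `fraction`.
  def bernoulli[A](xs: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    val rng = new Random(seed)
    xs.filter(_ => rng.nextDouble() < fraction)
  }

  // Zip first, then sample: each keep/drop decision applies to a whole pair,
  // so the surviving pairs are exactly the original pairs.
  def zipThenSample[A, B](as: Seq[A], bs: Seq[B],
                          fraction: Double, seed: Long): Seq[(A, B)] = {
    require(as.length == bs.length, "inputs must have the same length")
    bernoulli(as.zip(bs), fraction, seed)
  }

  // Sample each side independently (different seeds), then zip, and count
  // how many resulting pairs do not match the original pairing.
  def misalignedAfterIndependentSampling(n: Int, fraction: Double,
                                         seedA: Long, seedB: Long): Int = {
    val ids    = (0 until n).toVector
    val labels = ids.map(i => s"label-$i")
    val a = bernoulli(ids, fraction, seedA)
    val b = bernoulli(labels, fraction, seedB)
    a.zip(b).count { case (i, label) => label != s"label-$i" }
  }
}
```

Calling `PairedSampling.misalignedAfterIndependentSampling(1000, 0.5, 1L, 2L)` counts the broken pairs for one choice of seeds, whereas every pair returned by `zipThenSample` is one of the original pairs by construction. In Spark itself, RDD transformations are lazy, so `rdd1.zip(rdd2).sample(...)` describes a pipeline and does not by itself require holding the full zipped data set in memory.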