[jira] [Commented] (SPARK-18463) I think it's necessary to have an overridden method of sample
[ https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670521#comment-15670521 ] Jianfei Wang commented on SPARK-18463:

Thank you very much; there was some misunderstanding on my part about this case.

> I think it's necessary to have an overridden method of sample
>
> Key: SPARK-18463
> URL: https://issues.apache.org/jira/browse/SPARK-18463
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Jianfei Wang
>
> Currently, in this situation:
>
>     rdd3 = rdd1.zip(rdd2).sample()
>
> if we could take the sample on the two RDDs directly, such as sample(rdd1, rdd2), we could reduce the memory usage.
> There are some use cases in Spark MLlib, such as in GradientBoostedTrees:
>
>     while (m < numIterations && !doneLearning) {
>       // Update data with pseudo-residuals (residual errors)
>       val data = predError.zip(input).map { case ((pred, _), point) =>
>         LabeledPoint(-loss.gradient(pred, point.label), point.features)
>       }
>       val dt = new DecisionTreeRegressor().setSeed(seed + m)
>       val model = dt.train(data, treeStrategy)
>
> When we use data to train the model, a sample is taken,
> so we could implement a method sample(rdd1, rdd2) to reduce the memory usage in such cases.
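For reference, a minimal self-contained Scala sketch of the zip-then-sample pattern the description refers to. It is not the actual GradientBoostedTrees code: the RDD contents and the subsampleRate value are invented for illustration, and only standard RDD API calls (parallelize, zip, sample, collect) are assumed.

    import org.apache.spark.{SparkConf, SparkContext}

    object ZipSampleSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("zip-sample-sketch"))

        // Stand-ins for the MLlib RDDs: predError holds (prediction, error) pairs,
        // input holds (label, feature) pairs. The values are made up.
        val predError = sc.parallelize(Seq((0.1, 0.9), (0.4, 0.6), (0.7, 0.3), (0.2, 0.8)), 2)
        val input     = sc.parallelize(Seq((1.0, 10.0), (0.0, 20.0), (1.0, 30.0), (0.0, 40.0)), 2)

        // zip keeps element i of predError paired with element i of input,
        // and sample then draws a subset of those pairs. Both are lazy transformations.
        val subsampleRate = 0.5
        val sampledPairs = predError
          .zip(input)
          .sample(withReplacement = false, fraction = subsampleRate, seed = 42L)

        // Only this action triggers computation; nothing above is stored or cached.
        sampledPairs.collect().foreach(println)

        sc.stop()
      }
    }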
[jira] [Commented] (SPARK-18463) I think it's necessary to have an overridden method of sample
[ https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670513#comment-15670513 ] Sean Owen commented on SPARK-18463:

Yes, zip doesn't use any particular memory. It's a transformation.
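A small sketch of this point, assuming nothing beyond the standard RDD API and using invented data: defining zip + sample only builds a lineage of transformations, and nothing is computed, stored, or cached until an action such as count() runs.

    import org.apache.spark.{SparkConf, SparkContext}

    object LazyZipSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("lazy-zip-sketch"))

        val rdd1 = sc.parallelize(1 to 100000, 4)
        val rdd2 = sc.parallelize(1 to 100000, 4)

        // Defining the zipped + sampled RDD allocates no storage for the pairs;
        // it only records the lineage of transformations.
        val sampled = rdd1.zip(rdd2)
          .sample(withReplacement = false, fraction = 0.1, seed = 1L)

        // The lineage can be inspected before anything has run; no data is cached.
        println(sampled.toDebugString)

        // Pairs are produced and sampled on the fly only when an action runs.
        println(sampled.count())

        sc.stop()
      }
    }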
[jira] [Commented] (SPARK-18463) I think it's necessary to have an overridden method of sample
[ https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670504#comment-15670504 ] Jianfei Wang commented on SPARK-18463:

OK, so what you mean is that rdd1.zip(rdd2).sample() won't use extra memory to store rdd1.zip(rdd2)? Thank you.
[jira] [Commented] (SPARK-18463) I think it's necessary to have an overridden method of sample
[ https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670229#comment-15670229 ] Sean Owen commented on SPARK-18463:

Sampling the pairs is exactly what zip + sample would do. There's no memory implication per se; it's a transformation only. I'm going to close this if that's the misunderstanding here.
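To make the memory point concrete, a sketch with invented data: the zipped pairs occupy storage only if they are explicitly persisted, while the plain zip + sample chain streams each pair through the sampler as the action consumes it.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object ZipMemorySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("zip-memory-sketch"))

        val rdd1 = sc.parallelize(1 to 10000, 4)
        val rdd2 = sc.parallelize(10001 to 20000, 4)

        // No storage is consumed for the zipped pairs here: each pair is produced,
        // passed through the sampler, and discarded as the action consumes it.
        val sampled = rdd1.zip(rdd2).sample(withReplacement = false, fraction = 0.2, seed = 7L)
        println(sampled.count())

        // Memory would only be used for the zipped data if we asked for it explicitly:
        val cached = rdd1.zip(rdd2).persist(StorageLevel.MEMORY_ONLY)
        println(cached.count()) // this action materializes and stores the pairs
        cached.unpersist()

        sc.stop()
      }
    }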
[jira] [Commented] (SPARK-18463) I think it's necessary to have an overridden method of sample
[ https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670222#comment-15670222 ] Jianfei Wang commented on SPARK-18463:

So maybe we can implement a sample that samples the same pairs from the two RDDs. If so, we could do the sampling first, instead of doing rdd1.zip(rdd2) first and then sampling to train the model. If we had such a method, we could reduce the memory usage.
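For concreteness, one way such a helper could be written today on top of the existing API. The name sampleZipped is hypothetical and not part of Spark; functionally it is just zip followed by sample, which already samples matching pairs without materializing the intermediate zipped RDD.

    import org.apache.spark.rdd.RDD

    import scala.reflect.ClassTag

    object PairedSampling {
      // Hypothetical helper (sampleZipped is not part of Spark's API): it simply
      // composes the existing zip and sample transformations, so the i-th element
      // of rdd1 stays paired with the i-th element of rdd2 while the pairs are sampled.
      def sampleZipped[A: ClassTag, B: ClassTag](
          rdd1: RDD[A],
          rdd2: RDD[B],
          fraction: Double,
          seed: Long): RDD[(A, B)] = {
        // zip requires both RDDs to have the same number of partitions and the same
        // number of elements per partition; both calls below are lazy transformations.
        rdd1.zip(rdd2).sample(withReplacement = false, fraction = fraction, seed = seed)
      }
    }

A caller would use it as sampleZipped(predError, input, 0.5, seed), which is equivalent to the zip(...).sample(...) chain already discussed in this thread.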
[jira] [Commented] (SPARK-18463) I think it's necessary to have an overridden method of sample
[ https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670092#comment-15670092 ] Sean Owen commented on SPARK-18463:

I don't understand what this is proposing. The example you cite shows no sampling. You can't sample, then zip, two RDDs, because they won't sample the same pairs.
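A short sketch of why sample-then-zip does not work in general, using invented data and seeds: independently sampled RDDs give no guarantee that the same positions are kept in both, and zip requires the same number of elements in every partition, which independent sampling will usually break.

    import org.apache.spark.{SparkConf, SparkContext}

    object SampleThenZipPitfall {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("sample-then-zip"))

        val rdd1 = sc.parallelize(1 to 1000, 4)
        val rdd2 = sc.parallelize(1001 to 2000, 4)

        // Sampling each RDD independently gives no guarantee that the same ordinal
        // positions survive in both, so the original pairing is lost.
        val s1 = rdd1.sample(withReplacement = false, fraction = 0.5, seed = 11L)
        val s2 = rdd2.sample(withReplacement = false, fraction = 0.5, seed = 22L)

        // zip also requires the same number of elements in every partition, which
        // independent sampling will typically violate, so this usually fails at runtime.
        try {
          s1.zip(s2).collect()
        } catch {
          case e: Exception => println(s"sample-then-zip failed: ${e.getMessage}")
        }

        // The safe formulation: zip first, then sample the pairs together.
        val pairs = rdd1.zip(rdd2).sample(withReplacement = false, fraction = 0.5, seed = 11L)
        pairs.take(5).foreach(println)

        sc.stop()
      }
    }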