[jira] [Commented] (SPARK-18463) I think it's necessary to have an overrided method of smaple

2016-11-16 Thread Jianfei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670521#comment-15670521
 ] 

Jianfei Wang commented on SPARK-18463:
--

Thank you very much. I had misunderstood this case.

> I think it's necessary to have an overrided method of smaple
> 
>
> Key: SPARK-18463
> URL: https://issues.apache.org/jira/browse/SPARK-18463
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Jianfei Wang
>
> Currently, in this situation:
>   rdd3 = rdd1.zip(rdd2).sample()
> If we could sample the two RDDs directly, for example with
> sample(rdd1, rdd2), we could reduce memory usage.
> There are some use cases in Spark MLlib, such as in GradientBoostedTrees:
>   while (m < numIterations && !doneLearning) {
>     // Update data with pseudo-residuals (residual error)
>     val data = predError.zip(input).map { case ((pred, _), point) =>
>       LabeledPoint(-loss.gradient(pred, point.label), point.features)
>     }
>     val dt = new DecisionTreeRegressor().setSeed(seed + m)
>     val model = dt.train(data, treeStrategy)
> When we use data to train the model, we take a sample.
> So we could implement a method sample(rdd1, rdd2) to reduce memory usage in
> such cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18463) I think it's necessary to have an overrided method of smaple

2016-11-16 Thread Jianfei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670522#comment-15670522
 ] 

Jianfei Wang commented on SPARK-18463:
--

Thank you very much. I had misunderstood this case.




[jira] [Commented] (SPARK-18463) I think it's necessary to have an overrided method of smaple

2016-11-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670513#comment-15670513
 ] 

Sean Owen commented on SPARK-18463:
---

Yes, zip doesn't use any extra memory of its own. It's a transformation.




[jira] [Commented] (SPARK-18463) I think it's necessary to have an overrided method of smaple

2016-11-16 Thread Jianfei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670504#comment-15670504
 ] 

Jianfei Wang commented on SPARK-18463:
--

OK, so you mean that rdd1.zip(rdd2).sample() won't use extra memory to store
rdd1.zip(rdd2)? Thank you.




[jira] [Commented] (SPARK-18463) I think it's necessary to have an overrided method of smaple

2016-11-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670229#comment-15670229
 ] 

Sean Owen commented on SPARK-18463:
---

Sampling the pairs is exactly what zip + sample would do. There's no memory 
implication per se. It's a transformation only. I'm going to close this if 
that's the misunderstanding here.
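
To illustrate the point about zip + sample being a transformation only, here is a minimal sketch in plain Scala. It uses standard-library iterators as a stand-in for the per-partition iterators of an RDD; it is not Spark code, and the names `LazyZipSample`, `sample`, and `run` are hypothetical. It assumes a simple Bernoulli sampler and only shows that pairs are built one at a time as the sample consumes them, rather than being stored up front.

```scala
import scala.util.Random

object LazyZipSample {
  // Keeps each element with probability `fraction` (Bernoulli sampling).
  // Assumption: a simplified stand-in, not Spark's own sampler.
  def sample[T](it: Iterator[T], fraction: Double, seed: Long): Iterator[T] = {
    val rng = new Random(seed)
    it.filter(_ => rng.nextDouble() < fraction)
  }

  // Returns (pairs built before consumption, pairs built in total, sampled pairs).
  def run(n: Int, fraction: Double, seed: Long): (Int, Int, List[(Int, Int)]) = {
    var pairsBuilt = 0
    val left  = Iterator.range(0, n)
    val right = Iterator.range(0, n).map(_ * 10)

    // zip and sample only wrap the underlying iterators; no pair exists yet.
    val zipped  = left.zip(right).map { pair => pairsBuilt += 1; pair }
    val sampled = sample(zipped, fraction, seed)
    val before  = pairsBuilt

    // Consuming the result streams the pairs through the sampler one by one.
    val kept = sampled.toList
    (before, pairsBuilt, kept)
  }

  def main(args: Array[String]): Unit = {
    val (before, total, kept) = run(n = 1000, fraction = 0.1, seed = 42L)
    println(s"built before consuming: $before, streamed: $total, kept: ${kept.size}")
  }
}
```

Only the sampled pairs are retained at the end; every rejected pair is dropped as soon as the sampler has looked at it.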




[jira] [Commented] (SPARK-18463) I think it's necessary to have an overrided method of smaple

2016-11-16 Thread Jianfei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670222#comment-15670222
 ] 

Jianfei Wang commented on SPARK-18463:
--

So maybe we can implement a sample method that samples the same pairs from
the two RDDs. If so, we can sample first, instead of doing rdd1.zip(rdd2)
first and then sampling to train the model. With such a method, we could
reduce memory usage.




[jira] [Commented] (SPARK-18463) I think it's necessary to have an overrided method of smaple

2016-11-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670092#comment-15670092
 ] 

Sean Owen commented on SPARK-18463:
---

I don't understand what this is proposing. The example you cite shows no 
sampling. You can't sample, then zip, two RDDs because they won't sample the 
same pairs.
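
The pairing problem can be sketched with plain Scala collections. This is not Spark code; `SampleThenZip`, `sample`, and `countMismatches` are hypothetical names, and the sampler is a simplified Bernoulli sampler assumed for illustration. Both collections carry the same row index, so a pair is correct only when its two halves are equal.

```scala
import scala.util.Random

object SampleThenZip {
  // Keeps each element with probability `fraction` (Bernoulli sampling).
  // Assumption: a simplified stand-in, not Spark's own sampler.
  def sample[T](xs: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    xs.filter(_ => rng.nextDouble() < fraction)
  }

  // Counts pairs whose two halves did not come from the same original row.
  def countMismatches(pairs: Seq[(Int, Int)]): Int =
    pairs.count { case (left, right) => left != right }

  def main(args: Array[String]): Unit = {
    val rowIds  = 0 until 1000 // row index carried by the first collection
    val sameIds = 0 until 1000 // same row index carried by the second one

    // Zip first, then sample: every surviving pair is an original pair.
    val zipThenSample = sample(rowIds.zip(sameIds), 0.1, seed = 1L)

    // Sample each side independently, then zip: positions no longer line up.
    val sampleThenZip =
      sample(rowIds, 0.1, seed = 1L).zip(sample(sameIds, 0.1, seed = 2L))

    println(s"zip then sample: ${countMismatches(zipThenSample)} mismatched pairs")
    println(s"sample then zip: ${countMismatches(sampleThenZip)} mismatched pairs")
  }
}
```

Zipping before sampling keeps every pair intact by construction, while two independent samples will almost always keep different rows, so the zipped result pairs unrelated elements.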
