[jira] [Commented] (SPARK-8547) xgboost exploration

2015-11-02 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14986288#comment-14986288
 ] 

Meihua Wu commented on SPARK-8547:
--

I have created a Spark Package to implement the XGBoost algorithm. 
https://github.com/rotationsymmetry/SparkXGBoost/

In the README, you can find what has been implemented, as well as features on 
the roadmap. 

Thank you for testing. Looking forward to your feedback or suggestions.

> xgboost exploration
> ---
>
> Key: SPARK-8547
> URL: https://issues.apache.org/jira/browse/SPARK-8547
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>
> There has been quite a bit of excitement around xgboost: 
> [https://github.com/dmlc/xgboost]
> It improves the parallelism of boosting by mixing boosting and bagging (where 
> bagging makes the algorithm more parallel).
> It would be worth exploring implementing this within MLlib (probably as a new 
> algorithm).






Re: Spark Implementation of XGBoost

2015-10-27 Thread Meihua Wu
Hi DB Tsai,

Thank you again for your insightful comments!

1) I agree the sorting method you suggested is a very efficient way to
handle unordered categorical variables in binary classification and
regression. I propose we add a Spark ML Transformer to do the sorting
and encoding, bringing the benefit to many tree-based methods. How
about I open a JIRA for this?
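
To make the idea concrete, here is a minimal, self-contained sketch
(plain Scala, toy data; all names are mine, not from any package) of
ranking the categories of one feature by mean label so an ordinary
ordered split search still finds the optimal binary partition:

// Toy sketch of the sort-and-encode idea described above.
object CategorySortSketch extends App {
  // (category, label) pairs for a single categorical feature
  val data = Seq(("a", 1.0), ("b", 0.0), ("a", 1.0), ("c", 0.0), ("b", 1.0))

  // Mean label per category
  val meanByCat: Map[String, Double] =
    data.groupBy(_._1).map { case (cat, rows) =>
      cat -> rows.map(_._2).sum / rows.size
    }

  // Sort categories by mean label; the position is the ordered encoding
  val encoding: Map[String, Int] =
    meanByCat.toSeq.sortBy(_._2).map(_._1).zipWithIndex.toMap

  println(encoding) // Map(c -> 0, b -> 1, a -> 2)
}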

2) For L2/L1 regularization vs. learning rate (I use this name instead
of shrinkage to avoid confusion), I have the following observations:

Suppose G and H are the sums (over the data assigned to a leaf node) of
the 1st and 2nd derivatives of the loss evaluated at f_m, respectively.
Then for this leaf node,

* With a learning rate eta: f_{m+1} = f_m - (G/H) * eta

* With an L2 regularization coefficient lambda: f_{m+1} = f_m - G/(H + lambda)

If H > 0 (convex loss), both approaches lead to "shrinkage":

* For the learning-rate approach, the percentage of shrinkage is
uniform across leaf nodes.

* For L2 regularization, the percentage of shrinkage adapts to the
number of instances assigned to a leaf node: more instances =>
larger G and H => less shrinkage. This behavior is intuitive to me: if
the value estimated for a node is based on a large amount of data, the
value should be reliable and less shrinkage is needed.

I suppose we could have something similar for L1.

I am not aware of theoretical results establishing which method is
better; it likely depends on the data at hand. Implementing the
learning rate is on my radar for version 0.2. I should be able to add
it in a week or so. I will send you a note once it is done.
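
As a small numeric illustration of the two updates above (plain Scala,
names assumed):

// G, H: sums of the 1st/2nd derivatives of the loss over a leaf's data.
def updateWithLearningRate(g: Double, h: Double, eta: Double): Double =
  -(g / h) * eta // every leaf shrinks by the same factor eta

def updateWithL2(g: Double, h: Double, lambda: Double): Double =
  -g / (h + lambda) // shrinkage fades as H grows with more instances

// A sparse leaf shrinks a lot; a well-populated leaf barely shrinks:
val sparseLeaf = updateWithL2(g = 2.0, h = 4.0, lambda = 4.0)     // -0.25 vs -0.5 unregularized
val denseLeaf  = updateWithL2(g = 200.0, h = 400.0, lambda = 4.0) // about -0.495 vs -0.5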

Thanks,

Meihua

On Tue, Oct 27, 2015 at 1:02 AM, DB Tsai <dbt...@dbtsai.com> wrote:
> Hi Meihua,
>
> For categorical features, the ordinal issue can be solved by trying
> all 2^(q-1) - 1 possible partitions of q values into two
> groups. However, it's computationally expensive. In Hastie's book,
> section 9.2.4, the trees can be trained by sorting on the residuals and
> treating the values as if they were ordered. It can be proven that this
> gives the optimal solution. I have a proof that this works for learning
> regression trees through variance reduction.
>
> I'm also interested in understanding how the L1 and L2 regularization
> within the boosting works (and if it helps with overfitting more than
> shrinkage).
>
> Thanks.
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu <rotationsymmetr...@gmail.com> 
> wrote:
>> Hi DB Tsai,
>>
>> Thank you very much for your interest and comment.
>>
>> 1) Feature sub-sampling is per-node, like random forest.
>>
>> 2) The current code heavily exploits the tree structure to speed up
>> learning (such as processing multiple learning nodes in one pass over
>> the training data). So a generic GBM is likely to be a different
>> codebase. Do you have any good references on efficient GBMs? I am more
>> than happy to look into them.
>>
>> 3) The algorithm accepts training data as a DataFrame with the
>> featureCol indexed by VectorIndexer. You can specify which variables are
>> categorical in the VectorIndexer. Please note that currently all
>> categorical variables are treated as ordered. If you want some
>> categorical variables treated as unordered, you can pass the data through
>> OneHotEncoder before the VectorIndexer. I do have a plan to handle
>> unordered categorical variables using the approach in RF in Spark ML
>> (please see the roadmap in the README.md).
>>
>> Thanks,
>>
>> Meihua
>>
>>
>>
>> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
>>> you think you can implement a generic GBM and have it merged as part of
>>> the Spark codebase?
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> --
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>>
>>>
>>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
>>> <rotationsymmetr...@gmail.com> wrote:
>>>> Hi Spark User/Dev,
>>>>
>>>> Inspired by the success of XGBoost, I have created a Spark package for
>>>> gradient boosted trees with 2nd-order approximation of arbitrary
>>>> user-defined loss functions.
>>>>
>>>> https://github.com/rotationsymmetry/SparkXGBoost
>>>>
>>>> Currently linear (normal) regression, binary classification, and Poisson
>>>> regression are supported.

Re: Spark Implementation of XGBoost

2015-10-26 Thread Meihua Wu
Hi YiZhi,

Thank you for mentioning the JIRA. I will add a note to it.

Meihua

On Mon, Oct 26, 2015 at 6:16 PM, YiZhi Liu <javeli...@gmail.com> wrote:
> There's an xgboost exploration JIRA, SPARK-8547. Could that be a good starting point?
>
> 2015-10-27 7:07 GMT+08:00 DB Tsai <dbt...@dbtsai.com>:
>> Also, does it support categorical features?
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>>
>> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
>>> you think you can implement a generic GBM and have it merged as part of
>>> the Spark codebase?
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ------
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>>
>>>
>>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
>>> <rotationsymmetr...@gmail.com> wrote:
>>>> Hi Spark User/Dev,
>>>>
>>>> Inspired by the success of XGBoost, I have created a Spark package for
>>>> gradient boosted trees with 2nd-order approximation of arbitrary
>>>> user-defined loss functions.
>>>>
>>>> https://github.com/rotationsymmetry/SparkXGBoost
>>>>
>>>> Currently linear (normal) regression, binary classification, and Poisson
>>>> regression are supported. You can extend it with other loss functions as
>>>> well.
>>>>
>>>> L1, L2, bagging, and feature sub-sampling are also employed to avoid
>>>> overfitting.
>>>>
>>>> Thank you for testing. I am looking forward to your comments and
>>>> suggestions. Bugs or improvements can be reported through GitHub.
>>>>
>>>> Many thanks!
>>>>
>>>> Meihua
>>>>
>>
>
>
>
> --
> Yizhi Liu
> Senior Software Engineer / Data Mining
> www.mvad.com, Shanghai, China




Re: Spark Implementation of XGBoost

2015-10-26 Thread Meihua Wu
Hi DB Tsai,

Thank you very much for your interest and comment.

1) Feature sub-sampling is per-node, like random forest.

2) The current code heavily exploits the tree structure to speed up
learning (such as processing multiple learning nodes in one pass over
the training data). So a generic GBM is likely to be a different
codebase. Do you have any good references on efficient GBMs? I am more
than happy to look into them.

3) The algorithm accepts training data as a DataFrame with the
featureCol indexed by VectorIndexer. You can specify which variables are
categorical in the VectorIndexer. Please note that currently all
categorical variables are treated as ordered. If you want some
categorical variables treated as unordered, you can pass the data through
OneHotEncoder before the VectorIndexer. I do have a plan to handle
unordered categorical variables using the approach in RF in Spark ML
(please see the roadmap in the README.md).
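
For illustration, a minimal sketch of that preprocessing in the Spark
ML Pipeline API (column names are hypothetical):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler, VectorIndexer}

// One-hot encode the unordered categorical column first...
val encoder = new OneHotEncoder()
  .setInputCol("colorIndex") // hypothetical string-indexed column
  .setOutputCol("colorVec")

// ...assemble everything into a single feature vector...
val assembler = new VectorAssembler()
  .setInputCols(Array("colorVec", "size", "weight")) // hypothetical columns
  .setOutputCol("rawFeatures")

// ...then let VectorIndexer flag low-cardinality features as categorical.
val indexer = new VectorIndexer()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setMaxCategories(8)

val pipeline = new Pipeline().setStages(Array(encoder, assembler, indexer))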

Thanks,

Meihua



On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai <dbt...@dbtsai.com> wrote:
> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
> you think you can implement a generic GBM and have it merged as part of
> the Spark codebase?
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
> <rotationsymmetr...@gmail.com> wrote:
>> Hi Spark User/Dev,
>>
>> Inspired by the success of XGBoost, I have created a Spark package for
>> gradient boosted trees with 2nd-order approximation of arbitrary
>> user-defined loss functions.
>>
>> https://github.com/rotationsymmetry/SparkXGBoost
>>
>> Currently linear (normal) regression, binary classification, and Poisson
>> regression are supported. You can extend it with other loss functions as
>> well.
>>
>> L1, L2, bagging, and feature sub-sampling are also employed to avoid overfitting.
>>
>> Thank you for testing. I am looking forward to your comments and
>> suggestions. Bugs or improvements can be reported through GitHub.
>>
>> Many thanks!
>>
>> Meihua
>>




Spark Implementation of XGBoost

2015-10-26 Thread Meihua Wu
Hi Spark User/Dev,

Inspired by the success of XGBoost, I have created a Spark package for
gradient boosted trees with 2nd-order approximation of arbitrary
user-defined loss functions.

https://github.com/rotationsymmetry/SparkXGBoost

Currently linear (normal) regression, binary classification, and Poisson
regression are supported. You can extend it with other loss functions as
well.

L1, L2, bagging, and feature sub-sampling are also employed to avoid overfitting.

Thank you for testing. I am looking forward to your comments and
suggestions. Bugs or improvements can be reported through GitHub.

Many thanks!

Meihua






Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread Meihua Wu
please add "setFitIntercept(false)" to your LinearRegression.

LinearRegression by default includes an intercept in the model, e.g.
label = intercept + features dot weight

To get the result you want, you need to force the intercept to be zero.

Just curious, are you trying to solve systems of linear equations? If
so, you can probably try breeze.
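
For reference, a minimal sketch of both suggestions (Spark ML for the
regression, Breeze for an exact solve; the 4x4 system is the one from
the thread below):

import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setFitIntercept(false) // force intercept to zero: label = features dot weights
  .setRegParam(0.0)

// If the real goal is solving a linear system Ax = b exactly, Breeze
// does it directly:
import breeze.linalg.{DenseMatrix, DenseVector}

val A = DenseMatrix((1.0, 2.0, 3.0, 4.0),
                    (0.0, 2.0, 3.0, 4.0),
                    (0.0, 0.0, 3.0, 4.0),
                    (0.0, 0.0, 0.0, 4.0))
val b = DenseVector(30.0, 29.0, 25.0, 16.0)
val x = A \ b // x = DenseVector(1.0, 2.0, 3.0, 4.0)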



On Sun, Oct 25, 2015 at 9:10 PM, Zhiliang Zhu
 wrote:
>
>
>
> On Monday, October 26, 2015 11:26 AM, Zhiliang Zhu
>  wrote:
>
>
> Hi DB Tsai,
>
> Thanks very much for your kind help. I  get it now.
>
> I am sorry that there is another issue: the weight/coefficient result is
> perfect when A is a triangular matrix; however, when A is not triangular
> (but transformed from a triangular matrix, and still invertible), the result
> is not as good and is difficult to improve by resetting the parameters.
> Would you help comment on that...
>
> List<LabeledPoint> localTraining = Lists.newArrayList(
>   new LabeledPoint(30.0, Vectors.dense(1.0, 2.0, 3.0, 4.0)),
>   new LabeledPoint(29.0, Vectors.dense(0.0, 2.0, 3.0, 4.0)),
>   new LabeledPoint(25.0, Vectors.dense(0.0, 0.0, 3.0, 4.0)),
>   new LabeledPoint(-3.0, Vectors.dense(0.0, 0.0, -1.0, 0.0)));
> ...
> LinearRegression lr = new LinearRegression()
>   .setMaxIter(2)
>   .setRegParam(0)
>   .setElasticNetParam(0);
> 
>
> --
>
> It seems that no matter how I reset the parameters for lr, the outputs
> for x3 and x4 are always nearly the same.
> Is there some way to make the result a little better...
>
>
> --
>
> x3 and x4 do not improve; the output is:
> Final w:
> [0.999477672867,1.999748740578,3.500112393734,3.50011239377]
>
> Thank you,
> Zhiliang
>
>
>
> On Monday, October 26, 2015 10:25 AM, DB Tsai  wrote:
>
>
> Column 4 is always constant, so it has no predictive power, resulting in a zero weight.
>
> On Sunday, October 25, 2015, Zhiliang Zhu  wrote:
>
> Hi DB Tsai,
>
> Thanks very much for your kind reply help.
>
> As for your comment, I just modified and tested the key part of the code:
>
>  LinearRegression lr = new LinearRegression()
>    .setMaxIter(1)
>    .setRegParam(0)
>    .setElasticNetParam(0);  // the number could be reset
>
>  final LinearRegressionModel model = lr.fit(training);
>
> Now the output is much more reasonable; however, x4 is always 0 no matter
> how I reset those parameters in lr. Would you help with how to properly
> set the parameters ...
>
> Final w: [1.00127825909,1.99979185054,2.3307136,0.0]
>
> Thank you,
> Zhiliang
>
>
>
>
> On Monday, October 26, 2015 5:14 AM, DB Tsai  wrote:
>
>
> LinearRegressionWithSGD is not stable. Please use the linear regression in
> the ML package instead.
> http://spark.apache.org/docs/latest/ml-linear-methods.html
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Sun, Oct 25, 2015 at 10:14 AM, Zhiliang Zhu
>  wrote:
>> Dear All,
>>
>> I have a program, shown below, which confuses me greatly. It is a
>> multi-dimensional linear regression model; the weight/coefficient is
>> always perfect when the dimension is smaller than 4, but otherwise it is
>> wrong all the time.
>> Or should LinearRegressionWithSGD be replaced by another method?
>>
>> public class JavaLinearRegression {
>>  public static void main(String[] args) {
>>SparkConf conf = new SparkConf().setAppName("Linear Regression
>> Example");
>>JavaSparkContext sc = new JavaSparkContext(conf);
>>SQLContext jsql = new SQLContext(sc);
>>
>>// Ax = b; x = [1, 2, 3, 4] is the unique weight vector solving the system
>>// x1 + 2 * x2 + 3 * x3 + 4 * x4 = y is the underlying linear model
>>List<LabeledPoint> localTraining = Lists.newArrayList(
>>new LabeledPoint(30.0, Vectors.dense(1.0, 2.0, 3.0, 4.0)),
>>new LabeledPoint(29.0, Vectors.dense(0.0, 2.0, 3.0, 4.0)),
>>new LabeledPoint(25.0, Vectors.dense(0.0, 0.0, 3.0, 4.0)),
>>new LabeledPoint(16.0, Vectors.dense(0.0, 0.0, 0.0, 4.0)));
>>
>>JavaRDD<LabeledPoint> training = sc.parallelize(localTraining).cache();
>>
>>// Building the model
>>int numIterations = 1000; //the number could be reset large
>>final LinearRegressionModel model =
>> LinearRegressionWithSGD.train(JavaRDD.toRDD(training), numIterations);
>>
>>

Flaky Jenkins tests?

2015-10-12 Thread Meihua Wu
Hi Spark Devs,

I recently encountered several cases where Jenkins failed tests that
should be unrelated to my patch. For example, I made a patch
to the Spark ML Scala API but some Scala RDD tests failed due to timeout,
or the java_gateway in PySpark failed. Just wondering if these are
isolated cases?

Thanks,




Re: Flaky Jenkins tests?

2015-10-12 Thread Meihua Wu
Hi Ted,

Thanks for the info. I checked but did not find the failures, though.

In my cases, I have seen

1) spilling in ExternalAppendOnlyMapSuite failed due to timeout.
[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43531/console]

2) pySpark failure
[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43553/console]

Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 316, in _get_connection
IndexError: pop from an empty deque



On Mon, Oct 12, 2015 at 1:36 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> You can go to:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN
>
> and see if the test failure(s) you encountered appeared there.
>
> FYI
>
> On Mon, Oct 12, 2015 at 1:24 PM, Meihua Wu <rotationsymmetr...@gmail.com>
> wrote:
>>
>> Hi Spark Devs,
>>
>> I recently encountered several cases where Jenkins failed tests that
>> should be unrelated to my patch. For example, I made a patch
>> to the Spark ML Scala API but some Scala RDD tests failed due to timeout,
>> or the java_gateway in PySpark failed. Just wondering if these are
>> isolated cases?
>>
>> Thanks,
>>



[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2015-10-05 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942971#comment-14942971
 ] 

Meihua Wu commented on SPARK-7129:
--

Currently I am not aware of a straightforward way to impose the weak-learner 
restriction using the type system. Let's keep discussing. 

> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.






[jira] [Commented] (SPARK-9478) Add class weights to Random Forest

2015-10-04 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942754#comment-14942754
 ] 

Meihua Wu commented on SPARK-9478:
--

[~pcrenshaw] Are you working on this? If not, I can send a PR based on 
[~josephkb]'s suggestions. 

> Add class weights to Random Forest
> --
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class 
> weights. Class weights are important when there is imbalanced training data 
> or the evaluation metric of a classifier is imbalanced (e.g. true positive 
> rate at some false positive threshold). 






[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2015-09-26 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909512#comment-14909512
 ] 

Meihua Wu commented on SPARK-7129:
--

[~sethah] Thank you for your comments! I have updated the design doc to 
(hopefully) address your concerns.

* I agree with you that we will have separate classes for the algorithms. 
Different algorithms will have different type requirements for the base learner, 
so they cannot be combined into a single class. However, I think we need to 
keep the relevant methods and parameters consistent across the boosting 
algorithm classes. For example, they all use `setBaseLearner` to specify the 
base learner.

* I corrected the design doc to clarify this: `setBaseLearner` should take an 
instance of the type `Classifier[FeatureType, Learner, LearnerModel] with 
HasWeightCol`. This type requirement ensures the base learner will make use of the 
weight data in the estimation. At the moment, `setBaseLearner` will take 
`LogisticRegression` for `AdaBoostClassifier` and `LinearRegression` for 
`AdaBoostRegression`. Did I answer your question?

* Sure, I will revise `SAMMEClassifier` to `AdaBoostClassifier`.

* `setNumberOfBaseLearners` sets the number of boosting iterations. 

Finally, the current proposal only supports a single base learner. This is the same 
as the AdaBoost algorithm in scikit-learn. Adding support for multiple base learners 
could be our next step.
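
To sketch the type constraint concretely (purely illustrative,
self-contained stand-ins rather than Spark's actual traits):

{code}
// Stand-in traits; Spark's real Classifier/HasWeightCol differ in detail.
trait HasWeightCol { def setWeightCol(col: String): this.type }
trait Classifier[F, L <: Classifier[F, L, M], M]

// The base learner must be a classifier *and* support instance weights,
// which the compound type bound enforces at compile time.
class AdaBoostClassifier[F, L <: Classifier[F, L, M] with HasWeightCol, M] {
  private var base: Option[L] = None
  private var rounds: Int = 10
  def setBaseLearner(learner: L): this.type = { base = Some(learner); this }
  def setNumberOfBaseLearners(k: Int): this.type = { rounds = k; this } // boosting rounds
}
{code}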



> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.






[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2015-09-25 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908395#comment-14908395
 ] 

Meihua Wu commented on SPARK-7129:
--

[~josephkb] [~sethah]
I have compiled a design doc for AdaBoost: 
https://docs.google.com/document/d/1Neo5_6po9ap7dZuT3fwT6ptJa_XvkUUdRgCqB51lcy4/edit#heading=h.d4mq6f37je6x

Thank you very much for reviewing it. I am looking forward to your comments.

> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.






[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2015-09-19 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14877275#comment-14877275
 ] 

Meihua Wu commented on SPARK-7129:
--

[~josephkb] As weighting has been added to logistic regression and linear 
regression recently, I think we are in a good position to work on the boosting 
algorithms. Are there any plans to have it for 1.6? If so, I would like to work 
on this. Thanks!

> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.






[jira] [Comment Edited] (SPARK-10706) Add java wrapper for random vector rdd

2015-09-19 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14877220#comment-14877220
 ] 

Meihua Wu edited comment on SPARK-10706 at 9/19/15 6:48 PM:


[~mengxr] I notice the Scala API `randomVectorRDD` has a DeveloperApi 
annotation. I am checking whether there is a reason not to expose the Java wrapper. 
If not, I will submit a PR to resolve this JIRA. Thanks. 


was (Author: meihuawu):
I will work on this.

> Add java wrapper for random vector rdd
> --
>
> Key: SPARK-10706
> URL: https://issues.apache.org/jira/browse/SPARK-10706
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, MLlib
>Reporter: holdenk
>
> Similar to SPARK-3136 also wrap the random vector API to make it callable 
> easily from Java.






[jira] [Commented] (SPARK-10706) Add java wrapper for random vector rdd

2015-09-19 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14877220#comment-14877220
 ] 

Meihua Wu commented on SPARK-10706:
---

I will work on this.

> Add java wrapper for random vector rdd
> --
>
> Key: SPARK-10706
> URL: https://issues.apache.org/jira/browse/SPARK-10706
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, MLlib
>Reporter: holdenk
>
> Similar to SPARK-3136 also wrap the random vector API to make it callable 
> easily from Java.






[jira] [Commented] (SPARK-9834) Normal equation solver and summary statistics for ordinary least squares

2015-09-02 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727568#comment-14727568
 ] 

Meihua Wu commented on SPARK-9834:
--

I would like to work on this. 

> Normal equation solver and summary statistics for ordinary least squares
> 
>
> Key: SPARK-9834
> URL: https://issues.apache.org/jira/browse/SPARK-9834
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Add a normal equation solver for ordinary least squares with not many features. 
> The approach requires one pass to collect AtA and Atb, then solves the problem 
> on the driver. It works well when the problem is not very ill-conditioned and 
> does not have many columns. It also provides R-like summary statistics.
> We can hide this implementation under LinearRegression. It is triggered when 
> there are no more than, e.g., 4096 features.
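
A minimal sketch of the one-pass idea (names assumed; Breeze for the local solve):

{code}
import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.rdd.RDD

// One pass over the data accumulates AtA and Atb; the k-by-k system is
// then solved locally on the driver.
def solveOLS(data: RDD[(Double, Array[Double])], k: Int): DenseVector[Double] = {
  val (ata, atb) = data.treeAggregate(
    (DenseMatrix.zeros[Double](k, k), DenseVector.zeros[Double](k)))(
    seqOp = { case ((m, v), (label, features)) =>
      val x = DenseVector(features)
      (m + x * x.t, v + x * label) // accumulate A^T A and A^T b
    },
    combOp = { case ((m1, v1), (m2, v2)) => (m1 + m2, v1 + v2) })
  ata \ atb // solve the normal equations (A^T A) w = A^T b
}
{code}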






[jira] [Commented] (SPARK-9834) Normal equation solver and summary statistics for ordinary least squares

2015-09-02 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728065#comment-14728065
 ] 

Meihua Wu commented on SPARK-9834:
--

[~mengxr] Sure. Just let me know if there is anything I can help.

> Normal equation solver and summary statistics for ordinary least squares
> 
>
> Key: SPARK-9834
> URL: https://issues.apache.org/jira/browse/SPARK-9834
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Add a normal equation solver for ordinary least squares with not many features. 
> The approach requires one pass to collect AtA and Atb, then solves the problem 
> on the driver. It works well when the problem is not very ill-conditioned and 
> does not have many columns. It also provides R-like summary statistics.
> We can hide this implementation under LinearRegression. It is triggered when 
> there are no more than, e.g., 4096 features.






[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-09-01 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725935#comment-14725935
 ] 

Meihua Wu commented on SPARK-8518:
--

For the reference implementation, I recommend we consider this R function: 
https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html 



> Log-linear models for survival analysis
> ---
>
> Key: SPARK-8518
> URL: https://issues.apache.org/jira/browse/SPARK-8518
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>Priority: Critical
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> We want to add basic log-linear models for survival analysis. The 
> implementation should match the result from R's survival package 
> (http://cran.r-project.org/web/packages/survival/index.html).
> Design doc from [~yanboliang]: 
> https://docs.google.com/document/d/1fLtB0sqg2HlfqdrJlNHPhpfXO0Zb2_avZrxiVoPEs0E/pub






[jira] [Commented] (SPARK-9642) LinearRegression should supported weighted data

2015-08-30 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721801#comment-14721801
 ] 

Meihua Wu commented on SPARK-9642:
--

[~sethah] Thank you for your help. I worked on this and have a draft version, 
which makes use of a few components in the PR for a similar issue for logistic 
regression (https://issues.apache.org/jira/browse/SPARK-7685). I am planning to 
send a PR after issue SPARK-7685 is resolved. 

 LinearRegression should supported weighted data
 ---

 Key: SPARK-9642
 URL: https://issues.apache.org/jira/browse/SPARK-9642
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Meihua Wu
  Labels: 1.6

 In many modeling applications, data points are not necessarily sampled with 
 equal probabilities. Linear regression should support weighting to account 
 for over- or under-sampling. 






[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-08-19 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702676#comment-14702676
 ] 

Meihua Wu commented on SPARK-8518:
--

[~mengxr] [~yanbo] Either way works for me. 

In R and some Python survival packages, it is called "event".
In SAS, it is called "censored".

 Log-linear models for survival analysis
 ---

 Key: SPARK-8518
 URL: https://issues.apache.org/jira/browse/SPARK-8518
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Xiangrui Meng
Assignee: Yanbo Liang
   Original Estimate: 168h
  Remaining Estimate: 168h

 We want to add basic log-linear models for survival analysis. The 
 implementation should match the result from R's survival package 
 (http://cran.r-project.org/web/packages/survival/index.html).
 Design doc from [~yanboliang]: 
 https://docs.google.com/document/d/1fLtB0sqg2HlfqdrJlNHPhpfXO0Zb2_avZrxiVoPEs0E/pub






[jira] [Commented] (SPARK-9245) DistributedLDAModel predict top topic per doc-term instance

2015-08-18 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702155#comment-14702155
 ] 

Meihua Wu commented on SPARK-9245:
--

[~josephkb] Thank you for clarifying my question. I was originally planning to 
work on this but I got too busy last week. I could help review it, though.

 DistributedLDAModel predict top topic per doc-term instance
 ---

 Key: SPARK-9245
 URL: https://issues.apache.org/jira/browse/SPARK-9245
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
   Original Estimate: 48h
  Remaining Estimate: 48h

 For each (document, term) pair, return top topic.  Note that instances of 
 (doc, term) pairs within a document (a.k.a. tokens) are exchangeable, so we 
 should provide an estimate per document-term, rather than per token.
 Synopsis for DistributedLDAModel:
 {code}
 /** @return RDD of (doc ID, vector of top topic index for each term) */
 def topTopicAssignments: RDD[(Long, Vector)]
 {code}
 Note that using Vector will let us have a sparse encoding which is 
 Java-friendly.






[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-08-17 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700302#comment-14700302
 ] 

Meihua Wu commented on SPARK-8518:
--

[~yanbo] Thank you very much for the update!

The loss function and gradient are different for events and censored records. So we 
will need a column in the data frame to indicate whether an individual 
record is an event or censored. I suppose we will need to define a Param for 
eventCol using codegen and mix it into AFTRegressionParams. 
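
An illustrative shape for such a Param (hand-written sketch; Spark's shared 
params are generated by SharedParamsCodeGen):

{code}
import org.apache.spark.ml.param.{Param, Params}

trait HasEventCol extends Params {
  final val eventCol: Param[String] = new Param[String](
    this, "eventCol", "column marking each record as event (1.0) or censored (0.0)")
  final def getEventCol: String = $(eventCol)
}
{code}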

cc [~mengxr]

 Log-linear models for survival analysis
 ---

 Key: SPARK-8518
 URL: https://issues.apache.org/jira/browse/SPARK-8518
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Xiangrui Meng
Assignee: Yanbo Liang
   Original Estimate: 168h
  Remaining Estimate: 168h

 We want to add basic log-linear models for survival analysis. The 
 implementation should match the result from R's survival package 
 (http://cran.r-project.org/web/packages/survival/index.html).
 Design doc from [~yanboliang]: 
 https://docs.google.com/document/d/1fLtB0sqg2HlfqdrJlNHPhpfXO0Zb2_avZrxiVoPEs0E/pub






[jira] [Commented] (SPARK-8520) Improve GLM's scalability on number of features

2015-08-17 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700357#comment-14700357
 ] 

Meihua Wu commented on SPARK-8520:
--

For 1, how about migrating to treeReduce and treeAggregate? 
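
For example (a sketch, assuming the per-partition gradients are Breeze vectors):

{code}
import breeze.linalg.DenseVector
import org.apache.spark.rdd.RDD

// treeAggregate combines partial gradients through a multi-level tree
// instead of funneling every partition's vector straight to the driver.
def gradientSum(grads: RDD[DenseVector[Double]], dim: Int): DenseVector[Double] =
  grads.treeAggregate(DenseVector.zeros[Double](dim))(
    seqOp = (acc, g) => acc + g,
    combOp = (a, b) => a + b,
    depth = 3) // deeper tree => smaller fan-in at each aggregation level
{code}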

 Improve GLM's scalability on number of features
 ---

 Key: SPARK-8520
 URL: https://issues.apache.org/jira/browse/SPARK-8520
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical
  Labels: advanced

 MLlib's GLM implementation uses the driver to collect gradient updates. When 
 there are many features (20 million), the driver becomes the performance 
 bottleneck. In practice, it is common to see problems with a large feature 
 dimension, resulting from hashing or other feature transformations. So it is 
 important to improve MLlib's scalability in the number of features.
 There are a couple of possible solutions:
 1. still use the driver to collect updates, but reduce the amount of data it 
 collects at each iteration.
 2. apply 2D partitioning to the training data and store the model 
 coefficients distributively (e.g., vector-free l-bfgs)
 3. parameter server
 4. ...






Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Meihua Wu
I think in the SGD algorithm, the mini-batch sample is drawn without
replacement. So with fraction=1, all the rows will be sampled
exactly once to form the mini-batch, reducing to the
deterministic/classical case.
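
A quick illustration (assuming sc is a SparkContext):

val data = sc.parallelize(1 to 10)
// Sampling without replacement with fraction 1.0 keeps every row, so the
// "mini batch" is the full dataset and the update is deterministic.
val miniBatch = data.sample(withReplacement = false, fraction = 1.0, seed = 42)
println(miniBatch.count()) // 10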

On Fri, Aug 7, 2015 at 9:05 AM, Feynman Liang fli...@databricks.com wrote:
 Sounds reasonable to me, feel free to create a JIRA (and PR if you're up for
 it) so we can see what others think!

 On Fri, Aug 7, 2015 at 1:45 AM, Gerald Loeffler
 gerald.loeff...@googlemail.com wrote:

 hi,

 if new LinearRegressionWithSGD() uses a miniBatchFraction of 1.0,
 doesn’t that make it a deterministic/classical gradient descent rather
 than a SGD?

 Specifically, miniBatchFraction=1.0 means the entire data set, i.e.
 all rows. In the spirit of SGD, shouldn’t the default be the fraction
 that results in exactly one row of the data set?

 thank you
 gerald

 --
 Gerald Loeffler
 mailto:gerald.loeff...@googlemail.com
 http://www.gerald-loeffler.net




Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Meihua Wu
Feynman, thanks for clarifying.

If we default miniBatchFraction = (1 / numInstances), then we will
only hit one row in every iteration of SGD, regardless of the number of
partitions and executors. In other words, the parallelism provided by
the RDD is lost in this approach. I think this is something we need to
consider for the default value of miniBatchFraction.

On Fri, Aug 7, 2015 at 11:24 AM, Feynman Liang fli...@databricks.com wrote:
 Yep, I think that's what Gerald is saying and they are proposing to default
 miniBatchFraction = (1 / numInstances). Is that correct?

 On Fri, Aug 7, 2015 at 11:16 AM, Meihua Wu rotationsymmetr...@gmail.com
 wrote:

 I think in the SGD algorithm, the mini-batch sample is drawn without
 replacement. So with fraction=1, all the rows will be sampled
 exactly once to form the mini-batch, reducing to the
 deterministic/classical case.

 On Fri, Aug 7, 2015 at 9:05 AM, Feynman Liang fli...@databricks.com
 wrote:
  Sounds reasonable to me, feel free to create a JIRA (and PR if you're up
  for
  it) so we can see what others think!
 
  On Fri, Aug 7, 2015 at 1:45 AM, Gerald Loeffler
  gerald.loeff...@googlemail.com wrote:
 
  hi,
 
  if new LinearRegressionWithSGD() uses a miniBatchFraction of 1.0,
  doesn’t that make it a deterministic/classical gradient descent rather
  than a SGD?
 
  Specifically, miniBatchFraction=1.0 means the entire data set, i.e.
  all rows. In the spirit of SGD, shouldn’t the default be the fraction
  that results in exactly one row of the data set?
 
  thank you
  gerald
 
  --
  Gerald Loeffler
  mailto:gerald.loeff...@googlemail.com
  http://www.gerald-loeffler.net
 



[jira] [Created] (SPARK-9642) LinearRegression should supported weighted data

2015-08-05 Thread Meihua Wu (JIRA)
Meihua Wu created SPARK-9642:


 Summary: LinearRegression should supported weighted data
 Key: SPARK-9642
 URL: https://issues.apache.org/jira/browse/SPARK-9642
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Meihua Wu


In every modeling application, data points are not necessarily sampled with 
equal probabilities. Linear regression should support weighting to account 
for over- or under-sampling. 







[jira] [Updated] (SPARK-9642) LinearRegression should supported weighted data

2015-08-05 Thread Meihua Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meihua Wu updated SPARK-9642:
-
Description: 
In many modeling applications, data points are not necessarily sampled with 
equal probabilities. Linear regression should support weighting to account 
for over- or under-sampling. 


  was:
In every modeling application, data points are not necessarily sampled with 
equal probabilities. Linear regression should support weighting to account 
for over- or under-sampling. 



 LinearRegression should supported weighted data
 ---

 Key: SPARK-9642
 URL: https://issues.apache.org/jira/browse/SPARK-9642
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Meihua Wu

 In many modeling applications, data points are not necessarily sampled with 
 equal probabilities. Linear regression should support weighting to account 
 for over- or under-sampling. 






How to help for 1.5 release?

2015-08-04 Thread Meihua Wu
I think the team is preparing for the 1.5 release. Is there anything I can
help with, such as QA, testing, etc.?

Thanks,

MW


Re: Does RDD.cartesian involve shuffling?

2015-08-04 Thread Meihua Wu
Thanks, Richard!

I basically have two RDDs, A and B, and I need to compute a value for
every pair (a, b) with a in A and b in B. My first thought is
cartesian, but that involves an expensive shuffle.

Any alternatives? How about I convert B to an array and broadcast it
to every node (assuming B is small enough to fit)?
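
A minimal sketch of that broadcast approach (toy stand-ins for A and B;
assuming sc is a SparkContext):

// Collect the small RDD, broadcast it, and form all pairs map-side on A.
// Unlike cartesian, this shuffles nothing.
val a = sc.parallelize(Seq(1.0, 2.0, 3.0)) // stand-in for RDD A
val bLocal = Array(10.0, 20.0)             // B, collected to the driver
val bBc = sc.broadcast(bLocal)

val pairs = a.flatMap(x => bBc.value.map(y => ((x, y), x * y)))
pairs.collect().foreach(println)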



On Tue, Aug 4, 2015 at 8:23 AM, Richard Marscher
rmarsc...@localytics.com wrote:
 Yes it does, in fact it's probably going to be one of the more expensive
 shuffles you could trigger.

 On Mon, Aug 3, 2015 at 12:56 PM, Meihua Wu rotationsymmetr...@gmail.com
 wrote:

 Does RDD.cartesian involve shuffling?

 Thanks!





 --
 Richard Marscher
 Software Engineer
 Localytics
 Localytics.com | Our Blog | Twitter | Facebook | LinkedIn




Does RDD.cartesian involve shuffling?

2015-08-03 Thread Meihua Wu
Does RDD.cartesian involve shuffling?

Thanks!




[jira] [Commented] (SPARK-9245) DistributedLDAModel predict top topic per doc-term instance

2015-08-02 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651319#comment-14651319
 ] 

Meihua Wu commented on SPARK-9245:
--

[~josephkb]: I would like to confirm (using the notation of Asuncion et al. 2009): for doc `j` 
and term `w`, find the topic `k` such that gamma_{wjk} is maximized?



 DistributedLDAModel predict top topic per doc-term instance
 ---

 Key: SPARK-9245
 URL: https://issues.apache.org/jira/browse/SPARK-9245
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
   Original Estimate: 48h
  Remaining Estimate: 48h

 For each (document, term) pair, return top topic.  Note that instances of 
 (doc, term) pairs within a document (a.k.a. tokens) are exchangeable, so we 
 should provide an estimate per document-term, rather than per token.
 Synopsis for DistributedLDAModel:
 {code}
 /** @return RDD of (doc ID, vector of top topic index for each term) */
 def topTopicAssignments: RDD[(Long, Vector)]
 {code}
 Note that using Vector will let us have a sparse encoding which is 
 Java-friendly.
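 A hypothetical usage sketch against this synopsis ({{model}} is assumed to be 
 a trained DistributedLDAModel):
 {code}
 val assignments = model.topTopicAssignments  // RDD[(Long, Vector)]
 assignments.take(3).foreach { case (docId, topTopics) =>
   println(s"doc $docId: top topic index per term = $topTopics")
 }
 {code}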



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9530) ScalaDoc should not indicate LDAModel.describeTopics and DistributedLDAModel.topDocumentsPerTopic as approximate.

2015-08-01 Thread Meihua Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meihua Wu updated SPARK-9530:
-
Summary: ScalaDoc should not indicate LDAModel.describeTopics and 
DistributedLDAModel.topDocumentsPerTopic as approximate.  (was: ScalaDoc should 
not indicate LDAModel.descripeTopic and 
DistributedLDAModel.topDocumentsPerTopic as approximate.)

 ScalaDoc should not indicate LDAModel.describeTopics and 
 DistributedLDAModel.topDocumentsPerTopic as approximate.
 -

 Key: SPARK-9530
 URL: https://issues.apache.org/jira/browse/SPARK-9530
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1
Reporter: Meihua Wu
Priority: Minor

 Currently the ScalaDoc for LDAModel.descripeTopic and 
 DistributedLDAModel.topDocumentsPerTopic suggests that these methods are  
 approximate. However, both methods are actually precise and there is no need 
 to increase maxTermsPerTopic or maxDocumentsPerTopic to get a more precise 
 set of top terms. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9530) ScalaDoc should not indicate LDAModel.descripeTopic and DistributedLDAModel.topDocumentsPerTopic as approximate.

2015-08-01 Thread Meihua Wu (JIRA)
Meihua Wu created SPARK-9530:


 Summary: ScalaDoc should not indicate LDAModel.descripeTopic and 
DistributedLDAModel.topDocumentsPerTopic as approximate.
 Key: SPARK-9530
 URL: https://issues.apache.org/jira/browse/SPARK-9530
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Affects Versions: 1.4.1, 1.4.0, 1.3.1, 1.3.0
Reporter: Meihua Wu
Priority: Minor


Currently the ScalaDoc for LDAModel.descripeTopic and 
DistributedLDAModel.topDocumentsPerTopic suggests that these methods are  
approximate. However, both methods are actually precise and there is no need to 
increase maxTermsPerTopic or maxDocumentsPerTopic to get a more precise set of 
top terms. 






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9530) ScalaDoc should not indicate LDAModel.describeTopics and DistributedLDAModel.topDocumentsPerTopic as approximate.

2015-08-01 Thread Meihua Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meihua Wu updated SPARK-9530:
-
Description: 
Currently the ScalaDoc for LDAModel.describeTopics and 
DistributedLDAModel.topDocumentsPerTopic suggests that these methods are  
approximate. However, both methods are actually precise and there is no need to 
increase maxTermsPerTopic or maxDocumentsPerTopic to get a more precise set of 
top terms. 




  was:
Currently the ScalaDoc for LDAModel.descripeTopic and 
DistributedLDAModel.topDocumentsPerTopic suggests that these methods are  
approximate. However, both methods are actually precise and there is no need to 
increase maxTermsPerTopic or maxDocumentsPerTopic to get a more precise set of 
top terms. 





 ScalaDoc should not indicate LDAModel.describeTopics and 
 DistributedLDAModel.topDocumentsPerTopic as approximate.
 -

 Key: SPARK-9530
 URL: https://issues.apache.org/jira/browse/SPARK-9530
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1
Reporter: Meihua Wu
Priority: Minor

 Currently the ScalaDoc for LDAModel.describeTopics and 
 DistributedLDAModel.topDocumentsPerTopic suggests that these methods are  
 approximate. However, both methods are actually precise and there is no need 
 to increase maxTermsPerTopic or maxDocumentsPerTopic to get a more precise 
 set of top terms. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9246) DistributedLDAModel predict top docs per topic

2015-07-29 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646312#comment-14646312
 ] 

Meihua Wu commented on SPARK-9246:
--

Cool, I see. I will keep this thread updated on the progress.

I have a question: is topDocumentsPerTopic exact, or approximate like 
describeTopics (whose ScalaDoc says it may not return exactly the top-weighted 
terms for each topic, and that one should increase maxTermsPerTopic for a more 
precise set of top terms)?

 DistributedLDAModel predict top docs per topic
 --

 Key: SPARK-9246
 URL: https://issues.apache.org/jira/browse/SPARK-9246
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
   Original Estimate: 72h
  Remaining Estimate: 72h

 For each topic, return top documents based on topicDistributions.
 Synopsis:
 {code}
 /**
  * @param maxDocuments  Max docs to return for each topic
  * @return Array over topics of (sorted top docs, corresponding doc-topic 
 weights)
  */
 def topDocumentsPerTopic(maxDocuments: Int): Array[(Array[Long], 
 Array[Double])]
 {code}
 Note: We will need to make sure that the above return value format is 
 Java-friendly.
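 A hypothetical usage sketch against this synopsis ({{model}} is assumed to be 
 a trained DistributedLDAModel):
 {code}
 val topDocs = model.topDocumentsPerTopic(5)
 topDocs.zipWithIndex.foreach { case ((docIds, weights), topic) =>
   println(s"topic $topic: " + docIds.zip(weights).mkString(", "))
 }
 {code}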



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9246) DistributedLDAModel predict top docs per topic

2015-07-29 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646570#comment-14646570
 ] 

Meihua Wu commented on SPARK-9246:
--

Got it. Thanks!

 DistributedLDAModel predict top docs per topic
 --

 Key: SPARK-9246
 URL: https://issues.apache.org/jira/browse/SPARK-9246
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
   Original Estimate: 72h
  Remaining Estimate: 72h

 For each topic, return top documents based on topicDistributions.
 Synopsis:
 {code}
 /**
  * @param maxDocuments  Max docs to return for each topic
  * @return Array over topics of (sorted top docs, corresponding doc-topic 
 weights)
  */
 def topDocumentsPerTopic(maxDocuments: Int): Array[(Array[Long], 
 Array[Double])]
 {code}
 Note: We will need to make sure that the above return value format is 
 Java-friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9246) DistributedLDAModel predict top docs per topic

2015-07-29 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646571#comment-14646571
 ] 

Meihua Wu commented on SPARK-9246:
--

Got it. Thanks!

 DistributedLDAModel predict top docs per topic
 --

 Key: SPARK-9246
 URL: https://issues.apache.org/jira/browse/SPARK-9246
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
   Original Estimate: 72h
  Remaining Estimate: 72h

 For each topic, return top documents based on topicDistributions.
 Synopsis:
 {code}
 /**
  * @param maxDocuments  Max docs to return for each topic
  * @return Array over topics of (sorted top docs, corresponding doc-topic 
 weights)
  */
 def topDocumentsPerTopic(maxDocuments: Int): Array[(Array[Long], 
 Array[Double])]
 {code}
 Note: We will need to make sure that the above return value format is 
 Java-friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-9246) DistributedLDAModel predict top docs per topic

2015-07-29 Thread Meihua Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Meihua Wu updated SPARK-9246:
-
Comment: was deleted

(was: Got it. Thanks!)

 DistributedLDAModel predict top docs per topic
 --

 Key: SPARK-9246
 URL: https://issues.apache.org/jira/browse/SPARK-9246
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
   Original Estimate: 72h
  Remaining Estimate: 72h

 For each topic, return top documents based on topicDistributions.
 Synopsis:
 {code}
 /**
  * @param maxDocuments  Max docs to return for each topic
  * @return Array over topics of (sorted top docs, corresponding doc-topic 
 weights)
  */
 def topDocumentsPerTopic(maxDocuments: Int): Array[(Array[Long], 
 Array[Double])]
 {code}
 Note: We will need to make sure that the above return value format is 
 Java-friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9246) DistributedLDAModel predict top docs per topic

2015-07-29 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647096#comment-14647096
 ] 

Meihua Wu commented on SPARK-9246:
--

I have submitted a PR including the code, ScalaDoc and unit tests. 

If it passes review, I will remove the WIP tag for Jenkins build/test.

 DistributedLDAModel predict top docs per topic
 --

 Key: SPARK-9246
 URL: https://issues.apache.org/jira/browse/SPARK-9246
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
   Original Estimate: 72h
  Remaining Estimate: 72h

 For each topic, return top documents based on topicDistributions.
 Synopsis:
 {code}
 /**
  * @param maxDocuments  Max docs to return for each topic
  * @return Array over topics of (sorted top docs, corresponding doc-topic 
 weights)
  */
 def topDocumentsPerTopic(maxDocuments: Int): Array[(Array[Long], 
 Array[Double])]
 {code}
 Note: We will need to make sure that the above return value format is 
 Java-friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9246) DistributedLDAModel predict top docs per topic

2015-07-28 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14645201#comment-14645201
 ] 

Meihua Wu commented on SPARK-9246:
--

[~josephkb] I would like to work on this. 

 DistributedLDAModel predict top docs per topic
 --

 Key: SPARK-9246
 URL: https://issues.apache.org/jira/browse/SPARK-9246
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
   Original Estimate: 72h
  Remaining Estimate: 72h

 For each topic, return top documents based on topicDistributions.
 Synopsis:
 {code}
 /**
  * @param maxDocuments  Max docs to return for each topic
  * @return Array over topics of (sorted top docs, corresponding doc-topic 
 weights)
  */
 def topDocumentsPerTopic(maxDocuments: Int): Array[(Array[Long], 
 Array[Double])]
 {code}
 Note: We will need to make sure that the above return value format is 
 Java-friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Rebase and Squash Commits to Revise PR?

2015-07-28 Thread Meihua Wu
Thanks Sean. Very helpful!
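
For my own notes, the workflow Sean describes might look like this (a
sketch; `upstream` and `my-branch` are placeholder names):

    git fetch upstream
    git rebase upstream/master      # only needed if the branch conflicts with master
    git rebase -i upstream/master   # optional: squash commits interactively
    git push --force-with-lease origin my-branch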

On Tue, Jul 28, 2015 at 1:49 PM, Sean Owen so...@cloudera.com wrote:
 You only need to rebase if your branch/PR now conflicts with master.
 You don't need to squash, since the merge script will do that for you
 in the end. You can squash commits and force-push if you think it
 would help clean up your intent, but often it's clearer to leave the
 review and commit history of your branch, since the review comments go
 along with it.

 On Tue, Jul 28, 2015 at 9:46 PM, Meihua Wu rotationsymmetr...@gmail.com 
 wrote:
 I am planning to update my PR to incorporate comments from reviewers.
 Do I need to rebase/squash the commits into a single one?

 Thanks!

 -MW

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Rebase and Squash Commits to Revise PR?

2015-07-28 Thread Meihua Wu
I am planning to update my PR to incorporate comments from reviewers.
Do I need to rebase/squash the commits into a single one?

Thanks!

-MW

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Commented] (SPARK-9225) LDASuite needs unit tests for empty documents

2015-07-23 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14638992#comment-14638992
 ] 

Meihua Wu commented on SPARK-9225:
--

working on this.


 LDASuite needs unit tests for empty documents
 -

 Key: SPARK-9225
 URL: https://issues.apache.org/jira/browse/SPARK-9225
 Project: Spark
  Issue Type: Test
  Components: MLlib
Reporter: Feynman Liang
Priority: Minor
  Labels: starter

 We need to add a unit test to {{LDASuite}} which checks that empty documents 
 are handled appropriately without crashing. This would require defining an 
 empty corpus within {{LDASuite}} and adding tests for the available LDA 
 optimizers (currently EM and Online). Note that only {{SparseVector}}s can be 
 empty.
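 A hypothetical sketch of such an empty document (only a {{SparseVector}} can 
 represent it):
 {code}
 import org.apache.spark.mllib.linalg.Vectors
 // an empty document: no active terms over a vocabulary of size 10
 val emptyDoc = (0L, Vectors.sparse(10, Array.empty[Int], Array.empty[Double]))
 {code}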



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-07-21 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14634621#comment-14634621
 ] 

Meihua Wu commented on SPARK-8518:
--

[~mengxr] [~yanboliang]
Sounds like a plan. We could start with something simple.

I agree that the Cox PH model is a semi-parametric model. It is not easy to 
implement efficiently in Spark: to determine the contribution of a particular 
row in the RDD to the objective function, you need to reference other rows in 
the RDD, which effectively breaks the parallelism. 

The log-linear survival models are often called Accelerated Failure Time (AFT) 
models (https://en.wikipedia.org/wiki/Accelerated_failure_time_model). For AFT, 
there are again two flavors: parametric vs non-parametric. For the parametric 
flavor, the commonly used models are based on the Weibull / exponential 
distribution. Under these models, each row in the RDD contributes to the 
objective function independently, so the computation is easily parallelizable. 
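
For concreteness, the per-row independence can be read off the standard 
censored-data log-likelihood (delta_i = 1 for an observed event; f and S are 
the density and survival functions of the assumed distribution):
\[
\ell(\beta, \sigma) = \sum_i \bigl[ \delta_i \log f(t_i \mid \beta, \sigma)
  + (1 - \delta_i) \log S(t_i \mid \beta, \sigma) \bigr]
\]
Each row i contributes one summand, so the gradient is a plain sum over the RDD.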



 Log-linear models for survival analysis
 ---

 Key: SPARK-8518
 URL: https://issues.apache.org/jira/browse/SPARK-8518
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Xiangrui Meng
Assignee: Yanbo Liang
   Original Estimate: 168h
  Remaining Estimate: 168h

 We want to add basic log-linear models for survival analysis. The 
 implementation should match the result from R's survival package 
 (http://cran.r-project.org/web/packages/survival/index.html).
 Design doc from [~yanboliang]: 
 https://docs.google.com/document/d/1fLtB0sqg2HlfqdrJlNHPhpfXO0Zb2_avZrxiVoPEs0E/pub



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-07-19 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632965#comment-14632965
 ] 

Meihua Wu commented on SPARK-8518:
--

Hi [~mengxr] and [~yanboliang], 

For the log-linear model for censored survival data, I believe the most 
commonly used and most easily parallelized methods are based on the 
exponential/Weibull distribution of the survival time. The algorithm optimizes 
the log likelihood, so I think we could start with stochastic gradient descent 
for large-scale data. May I chime in and contribute to this jira as well?




 Log-linear models for survival analysis
 ---

 Key: SPARK-8518
 URL: https://issues.apache.org/jira/browse/SPARK-8518
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Xiangrui Meng
Assignee: Yanbo Liang
   Original Estimate: 168h
  Remaining Estimate: 168h

 We want to add basic log-linear models for survival analysis. The 
 implementation should match the result from R's survival package 
 (http://cran.r-project.org/web/packages/survival/index.html).
 Design doc from [~yanboliang]: 
 https://docs.google.com/document/d/1fLtB0sqg2HlfqdrJlNHPhpfXO0Zb2_avZrxiVoPEs0E/pub



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9175) BLAS.gemm fails to update matrix C when alpha==0 and beta!=1

2015-07-18 Thread Meihua Wu (JIRA)
Meihua Wu created SPARK-9175:


 Summary: BLAS.gemm fails to update matrix C when alpha==0 and 
beta!=1
 Key: SPARK-9175
 URL: https://issues.apache.org/jira/browse/SPARK-9175
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Meihua Wu


In the BLAS wrapper, gemm is supposed to update matrix C to be alpha * A * B + 
beta * C. However, the current implementation does not update C whenever 
alpha == 0, which is incorrect when beta is not equal to 1. 

Example:
{code}
val p = 3
val a = DenseMatrix.zeros(p, p)
val b = DenseMatrix.zeros(p, p)
var c = DenseMatrix.eye(p)
BLAS.gemm(0, a, b, 5, c)
{code}

c is unchanged in Spark 1.4 even though it should have been multiplied by 5 
element-wise.

The bug is caused by the following in BLAS.gemm:
{code}
if (alpha == 0.0) {
  logDebug("gemm: alpha is equal to 0. Returning C.")
}
{code}

Will submit PR to fix this.
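
A hypothetical sketch of a fix (the actual PR may differ): when alpha == 0, C 
still needs to be scaled by beta.
{code}
if (alpha == 0.0) {
  // alpha * A * B vanishes, but C must still become beta * C
  logDebug("gemm: alpha is equal to 0. Scaling C by beta.")
  var i = 0
  while (i < C.values.length) {
    C.values(i) *= beta
    i += 1
  }
}
{code}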



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org