One comment about:

"""
1) I agree the sorting method you suggested is a very efficient way to
handle the unordered categorical variables in binary classification
and regression. I propose we have a Spark ML Transformer to do the
sorting and encoding, bringing the benefits to many tree-based
methods. How about I open a JIRA for this?
"""
--> MLlib trees do this currently, so you could check out that code as an example. I'm not sure how this would work as a generic Transformer, though; it seems more like an internal part of space-partitioning algorithms.

On Tue, Oct 27, 2015 at 5:04 PM, Meihua Wu <rotationsymmetr...@gmail.com> wrote:
> Hi DB Tsai,
>
> Thank you again for your insightful comments!
>
> 1) I agree the sorting method you suggested is a very efficient way to
> handle the unordered categorical variables in binary classification
> and regression. I propose we have a Spark ML Transformer to do the
> sorting and encoding, bringing the benefits to many tree-based
> methods. How about I open a JIRA for this?
>
> 2) On L2/L1 regularization vs. learning rate (I use this name instead
> of "shrinkage" to avoid confusion), I have the following observations.
>
> Suppose G and H are the sums (over the data assigned to a leaf node) of
> the 1st and 2nd derivatives of the loss evaluated at f_m, respectively.
> Then for this leaf node:
>
> * With a learning rate eta: f_{m+1} = f_m - (G/H) * eta
>
> * With an L2 regularization coefficient lambda: f_{m+1} = f_m - G/(H + lambda)
>
> If H > 0 (convex loss), both approaches lead to "shrinkage":
>
> * For the learning-rate approach, the percentage of shrinkage is
> uniform across leaf nodes.
>
> * For L2 regularization, the percentage of shrinkage adapts to the
> number of instances assigned to a leaf node: more instances => larger
> G and H => less shrinkage. This behavior is intuitive to me: if the
> value estimated at a node is based on a large amount of data, the
> value should be reliable and less shrinkage is needed.
>
> I suppose we could have something similar for L1.
>
> I am not aware of theoretical results concluding which method is
> better; it is likely to depend on the data at hand. Implementing the
> learning rate is on my radar for version 0.2. I should be able to add
> it in a week or so. I will send you a note once it is done.
>
> Thanks,
>
> Meihua
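Below is a minimal, self-contained Scala sketch of the two updates compared above. This is not SparkXGBoost's actual code; the object name, method names, and the per-instance derivative values are made up for illustration.

```scala
// A sketch of the two leaf-value updates compared above. G is the sum of
// first derivatives and H the sum of second derivatives over the leaf's
// instances.
object LeafUpdateSketch {
  // Learning-rate ("shrinkage") update: scale the Newton step -G/H by eta.
  def withLearningRate(g: Double, h: Double, eta: Double): Double =
    -(g / h) * eta

  // L2-regularized update: damp the step by adding lambda to H. As the
  // leaf accumulates more instances, |G| and H grow and lambda matters less.
  def withL2(g: Double, h: Double, lambda: Double): Double =
    -g / (h + lambda)

  def main(args: Array[String]): Unit = {
    // Compare a small leaf with a large one, assuming each instance
    // contributes g = -0.5 and h = 0.25 (arbitrary illustrative values).
    for (n <- Seq(10, 1000)) {
      val gSum = -0.5 * n
      val hSum = 0.25 * n
      val newton = -gSum / hSum                   // unshrunk step: 2.0
      val lr  = withLearningRate(gSum, hSum, 0.1) // uniform shrinkage
      val reg = withL2(gSum, hSum, 10.0)          // adaptive shrinkage
      println(f"n=$n%4d  newton=$newton%.3f  lr=$lr%.3f  l2=$reg%.3f")
    }
  }
}
```

Running it prints the same shrunken step (0.2) for both leaf sizes under the learning rate, while the L2 update gives 0.4 for the 10-instance leaf but about 1.92 for the 1000-instance leaf, i.e. far less shrinkage where more data backs the estimate.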
> On Tue, Oct 27, 2015 at 1:02 AM, DB Tsai <dbt...@dbtsai.com> wrote:
> > Hi Meihua,
> >
> > For categorical features, the ordering issue can be solved by trying
> > all 2^(q-1) - 1 possible partitions of the q values into two groups.
> > However, that is computationally expensive. In Hastie's book, section
> > 9.2.4, the trees can be trained by sorting the categories by their
> > residuals and learning on them as if they were ordered; it can be
> > proven that this gives the optimal solution (a sketch is at the end
> > of this thread). I have a proof that this works for learning
> > regression trees through variance reduction.
> >
> > I'm also interested in understanding how the L1 and L2 regularization
> > within the boosting works (and whether it helps with overfitting more
> > than shrinkage does).
> >
> > Thanks.
> >
> > Sincerely,
> >
> > DB Tsai
> > ----------------------------------------------------------
> > Web: https://www.dbtsai.com
> > PGP Key ID: 0xAF08DF8D
> >
> >
> > On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu <rotationsymmetr...@gmail.com> wrote:
> >> Hi DB Tsai,
> >>
> >> Thank you very much for your interest and comments.
> >>
> >> 1) Feature sub-sampling is per-node, like random forest.
> >>
> >> 2) The current code heavily exploits the tree structure to speed up
> >> the learning (such as processing multiple learning nodes in one pass
> >> over the training data), so a generic GBM is likely to be a different
> >> codebase. Do you have any good references on efficient GBMs? I am more
> >> than happy to look into them.
> >>
> >> 3) The algorithm accepts training data as a DataFrame with the
> >> featureCol indexed by VectorIndexer. You can specify which variables
> >> are categorical in the VectorIndexer. Please note that currently all
> >> categorical variables are treated as ordered. If you want some
> >> categorical variables treated as unordered, you can pass the data
> >> through OneHotEncoder before the VectorIndexer (see the pipeline
> >> sketch at the end of this thread). I do have a plan to handle
> >> unordered categorical variables using the approach of RF in Spark ML
> >> (please see the roadmap in the README.md).
> >>
> >> Thanks,
> >>
> >> Meihua
> >>
> >>
> >>
> >> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai <dbt...@dbtsai.com> wrote:
> >>> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
> >>> you think you could implement a generic GBM and have it merged as
> >>> part of the Spark codebase?
> >>>
> >>> Sincerely,
> >>>
> >>> DB Tsai
> >>> ----------------------------------------------------------
> >>> Web: https://www.dbtsai.com
> >>> PGP Key ID: 0xAF08DF8D
> >>>
> >>>
> >>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
> >>> <rotationsymmetr...@gmail.com> wrote:
> >>>> Hi Spark User/Dev,
> >>>>
> >>>> Inspired by the success of XGBoost, I have created a Spark package
> >>>> for gradient boosted trees with a 2nd-order approximation of
> >>>> arbitrary user-defined loss functions.
> >>>>
> >>>> https://github.com/rotationsymmetry/SparkXGBoost
> >>>>
> >>>> Currently linear (normal) regression, binary classification, and
> >>>> Poisson regression are supported. You can extend it with other loss
> >>>> functions as well.
> >>>>
> >>>> L1 and L2 regularization, bagging, and feature sub-sampling are also
> >>>> employed to avoid overfitting.
> >>>>
> >>>> Thank you for testing. I am looking forward to your comments and
> >>>> suggestions. Bugs and improvements can be reported through GitHub.
> >>>>
> >>>> Many thanks!
> >>>>
> >>>> Meihua
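A sketch of the sorting trick DB Tsai cites from Hastie et al., section 9.2.4 (illustrative code, not from SparkXGBoost or MLlib): for a regression split on a categorical feature with q values, order the categories by their mean residual and scan only the q - 1 prefix splits of that ordering; minimizing the summed squared error of the two sides over these splits is equivalent to maximizing variance reduction, so the 2^(q-1) - 1 partition search is avoided.

```scala
// Find the best binary split of a categorical feature for regression by
// sorting categories by mean residual (Hastie et al., section 9.2.4).
object CategoricalSplitSketch {
  def bestCategoricalSplit(points: Seq[(Int, Double)]): (Set[Int], Double) = {
    // Order the categories by the mean residual of their instances.
    val ordered: Seq[Int] = points
      .groupBy(_._1)
      .toSeq
      .map { case (cat, ps) => (cat, ps.map(_._2).sum / ps.size) }
      .sortBy(_._2)
      .map(_._1)
    require(ordered.size >= 2, "need at least two categories to split")

    // Total squared error of one side around its own mean.
    def sse(xs: Seq[(Int, Double)]): Double = {
      val mean = xs.map(_._2).sum / xs.size
      xs.map { case (_, r) => (r - mean) * (r - mean) }.sum
    }

    // Scan only the q - 1 prefix splits of the ordering; minimizing the
    // two sides' summed SSE maximizes the variance reduction.
    val candidates = for (k <- 1 until ordered.size) yield {
      val left = ordered.take(k).toSet
      val (l, r) = points.partition { case (cat, _) => left(cat) }
      (left, sse(l) + sse(r))
    }
    candidates.minBy(_._2)
  }

  def main(args: Array[String]): Unit = {
    // Three categories with separated mean residuals: (category, residual).
    val pts = Seq((0, -1.0), (0, -0.8), (1, 0.9), (1, 1.1), (2, 0.0), (2, 0.2))
    val (left, err) = bestCategoricalSplit(pts)
    println(s"left categories = $left, total SSE = $err") // Set(0) splits off
  }
}
```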
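A sketch of the preprocessing pipeline referenced in point 3 of Meihua's Oct 26 email, built from stock Spark ML stages. The column names ("category", "x1", "x2", etc.) and the setMaxCategories value are assumptions for illustration; feeding the resulting "features" column to SparkXGBoost is left to the package's own API.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler, VectorIndexer}

// Map the raw string column to category indices.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// One-hot encode the column you want treated as unordered.
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")

// Assemble the dummy columns with the numeric features x1, x2.
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryVec", "x1", "x2"))
  .setOutputCol("rawFeatures")

// VectorIndexer flags low-cardinality columns as categorical; the one-hot
// 0/1 columns stay effectively binary.
val vectorIndexer = new VectorIndexer()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setMaxCategories(8)

val pipeline = new Pipeline()
  .setStages(Array(indexer, encoder, assembler, vectorIndexer))
// val prepared = pipeline.fit(df).transform(df) // df: your training DataFrame
```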