Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2014-03-02 Thread Vishal Santoshi
Should we maintain a (num_categories * num_of_features) matrix for
per-term learning rates in a num_categories-way classification?


for (int i = 0; i < num_categories; i++) {

  for (int j = 0; j < num_of_features; j++) {

    sum_of_squares[i][j] = sum_of_squares[i][j] + (beta[i][j] * beta[i][j]);

    learning_rates[i][j] = (initial_rate / Math.sqrt(sum_of_squares[i][j])) * beta[i][j];

  }

}


beta in the base class is rightly a ((num_categories - 1) * num_of_features)
matrix.
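For concreteness, a minimal self-contained Java sketch of that bookkeeping
(illustrative names only, not Mahout code). Note that the loop above
accumulates squared betas; the AdaGrad formulation in the notes cited below
accumulates squared gradients instead, which is what this sketch does:

class AdaGradRateTable {
  // One running total of squared gradients per (category, feature) cell.
  private final double[][] sumOfSquares;
  private final double initialRate;

  AdaGradRateTable(int numCategories, int numFeatures, double initialRate) {
    this.sumOfSquares = new double[numCategories][numFeatures];
    this.initialRate = initialRate;
  }

  // Accumulate the squared gradient g for cell (i, j), then return the
  // AdaGrad learning rate to apply to beta[i][j] on this step.
  double rateFor(int i, int j, double g) {
    sumOfSquares[i][j] += g * g;
    double total = sumOfSquares[i][j];
    return total == 0.0 ? initialRate : initialRate / Math.sqrt(total);
  }
}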


On Fri, Feb 28, 2014 at 11:57 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 I have been swamped.  Generally adagrad is a great idea. The code looks
 fine at first glance.  Certainly some sort of adagrad would be preferable
 to the hack that I put in.

 Sent from my iPhone

  On Feb 26, 2014, at 18:30, Vishal Santoshi vishal.santo...@gmail.com
 wrote:
 
  Ted,  Any feedback ?
 
 
  On Mon, Feb 24, 2014 at 2:58 PM, Vishal Santoshi
  vishal.santo...@gmail.comwrote:
 
  Hello Ted,
 
   This is regarding the AdaGrad update per feature. I have
  attached a file which reflects
  http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf (2)
 
 
 
  It does differ from OnlineLogisticRegression in the way it implements
 
  public double perTermLearningRate(int j) ;
 
 
  This class maintains two DenseVectors:

  /**
   * ADA per-term sum of squares of learning gradients
   */
  protected Vector perTermLSumOfSquaresOfGradients;

  /**
   * ADA per-term learning gradient
   */
  protected Vector perTermGradients;

  and it overrides the learn() method to update these two vectors
  accordingly.
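
  A rough self-contained sketch of that arrangement (the field names follow
  this email; everything else, including the guard for unseen terms, is an
  assumption rather than a quote from the attachment):

  import org.apache.mahout.math.DenseVector;
  import org.apache.mahout.math.Vector;

  class PerTermAdaGrad {
    // Names follow this email; the surrounding class is hypothetical.
    protected Vector perTermLSumOfSquaresOfGradients;
    protected Vector perTermGradients;

    PerTermAdaGrad(int numFeatures) {
      perTermLSumOfSquaresOfGradients = new DenseVector(numFeatures);
      perTermGradients = new DenseVector(numFeatures);
    }

    // Called from the overridden learn(): record the gradient for term j.
    void recordGradient(int j, double gradient) {
      perTermGradients.set(j, gradient);
      perTermLSumOfSquaresOfGradients.set(j,
          perTermLSumOfSquaresOfGradients.get(j) + gradient * gradient);
    }

    // AdaGrad-style replacement for perTermLearningRate(int j): scale by
    // 1/sqrt of the accumulated squared gradients; return 1.0 for a term
    // never seen, so the global rate alone governs its first update.
    public double perTermLearningRate(int j) {
      double total = perTermLSumOfSquaresOfGradients.get(j);
      return total == 0.0 ? 1.0 : 1.0 / Math.sqrt(total);
    }
  }

  The overridden learn() would call recordGradient(j, g) for each non-zero
  term of an instance before reading perTermLearningRate(j).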
 
 
 
 
  Please tell me if I am totally off here.
 
 
 
  Thank you for your help and Regards.
 
 
  Vishal Santoshi.
 
 
  PS: I had wrongly interpreted the code in the last 2 emails. Please ignore them.
 
 
 
  On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
  :-)
 
  Many leaks are *very* subtle.
 
  One leak that had me going for weeks was in a news wire corpus.  I
  couldn't
  figure out why the cross validation was so good and running the
 classifier
  on new data was so much worse.
 
  The answer was that the training corpus had near-duplicate articles.
  This
  means that there was leakage between the training and test corpora.
  This
  wasn't quite a target leak, but it was a leak.
 
  For target leaks, it is very common to have partial target leaks due to
  the
  fact that you learn more about positive cases after the moment that you
  had
  to select which case to investigate.  Suppose, for instance you are
  targeting potential customers based on very limited information.  If
 you
  make an enticing offer to the people you target, then those who accept
 the
  offer will buy something from you.  You will also learn some
 particulars
  such as name and address from those who buy from you.
 
  Looking retrospectively, it looks like you can target good customers
 who
  have names or addresses that are not null.  Without a good snapshot of
  each
  customer record at exactly the time that the targeting was done, you
  cannot
  know that *all* customers have a null name and address before you
 target
  them.  This sort of time machine leak can be enormously more subtle
 than
  this.
 
 
 
  On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan gkhn...@gmail.com
 wrote:
 
  Gokhan
 
 
  On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
  On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi 
  vishal.santo...@gmail.com
 
 
 
  Are we to assume that SGD is still a work in progress and
  implementations (
  Cross Fold, Online, Adaptive ) are too flawed to be realistically
  used
  ?
 
 
  They are too raw to be accepted uncritically, for sure.  They have
  been
  used successfully in production.
 
 
  The evolutionary algorithm seems to be the core of
  OnlineLogisticRegression,
  which in turn builds up to Adaptive/Cross Fold.
 
  b) for truly on-line learning where no repeated passes through the
  data..
 
  What would it take to get to an implementation ? How can any one
  help ?
 
 
  Would you like to help on this?  The amount of work required to get a
  distributed asynchronous learner up is moderate, but definitely not
  huge.
 
 
  Ted, do you describe a generic distributed learner for all kinds of
  online
  algorithms? Possibly zookeeper-coordinated and with #predict and
  #getFeedbackAndUpdateTheModel methods?
 
 
  I think that OnlineLogisticRegression is basically sound, but should
  get
  a
  better learning rate update equation.  That would largely make the
  Adaptive* stuff unnecessary, especially if OLR could be used in the
  distributed asynchronous learner.
 
 
 
 
 



Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2014-03-02 Thread Ted Dunning
Yes.  I think that maintaining a learning rate for every parameter that is
being learned is important.  It might help to make that sparse, but I
wouldn't think so.




On Sun, Mar 2, 2014 at 1:33 PM, Vishal Santoshi
vishal.santo...@gmail.comwrote:

 Should we maintain a (num_categories * num_of_features) matrix for
 per-term learning rates in a num_categories-way classification?


 for (int i = 0; i < num_categories; i++) {

   for (int j = 0; j < num_of_features; j++) {

     sum_of_squares[i][j] = sum_of_squares[i][j] + (beta[i][j] * beta[i][j]);

     learning_rates[i][j] = (initial_rate / Math.sqrt(sum_of_squares[i][j])) * beta[i][j];

   }

 }


 beta in the base class is rightly a ((num_categories - 1) * num_of_features)
 matrix.

 On Fri, Feb 28, 2014 at 11:57 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  I have been swamped.  Generally adagrad is a great idea. The code
 looks
  fine at first glance.  Certainly some sort of adagrad would be preferable
  to the hack that I put in.
 
  Sent from my iPhone
 
   On Feb 26, 2014, at 18:30, Vishal Santoshi vishal.santo...@gmail.com
  wrote:
  
   Ted,  Any feedback ?
  
  
   On Mon, Feb 24, 2014 at 2:58 PM, Vishal Santoshi
   vishal.santo...@gmail.comwrote:
  
   Hello Ted,
  
    This is regarding the AdaGrad update per feature. I have
   attached  a file which reflects
   http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf  ( 2 )
  
  
  
   It does differ from OnlineLogisticRegression in the way it implements
  
   public double perTermLearningRate(int j) ;
  
  
   This class maintains 2 Dense Vectors
  
   /**
  
   * ADA  Per Term Sum of Squares of Learning gradients
  
   */
  
   protected Vector perTermLSumOfSquaresOfGradients;
  
   /**
  
   * ADA Per Term Learning gradient
  
   */
  
   protected Vector perTermGradients;
  
   and it overrides the learn( ) method to  update these two vectors
   respectively.
  
  
  
  
   Please tell me if I am totally off here.
  
  
  
   Thank you for your help and Regards.
  
  
   Vishal Santoshi.
  
  
    PS: I had wrongly interpreted the code in the last 2 emails. Please ignore them.
  
  
  
   On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning ted.dunn...@gmail.com
  wrote:
  
   :-)
  
   Many leaks are *very* subtle.
  
   One leak that had me going for weeks was in a news wire corpus.  I
   couldn't
   figure out why the cross validation was so good and running the
  classifier
    on new data was so much worse.
  
   The answer was that the training corpus had near-duplicate articles.
   This
   means that there was leakage between the training and test corpora.
   This
   wasn't quite a target leak, but it was a leak.
  
   For target leaks, it is very common to have partial target leaks due
 to
   the
   fact that you learn more about positive cases after the moment that
 you
   had
   to select which case to investigate.  Suppose, for instance you are
   targeting potential customers based on very limited information.  If
  you
   make an enticing offer to the people you target, then those who
 accept
  the
   offer will buy something from you.  You will also learn some
  particulars
   such as name and address from those who buy from you.
  
   Looking retrospectively, it looks like you can target good customers
  who
   have names or addresses that are not null.  Without a good snapshot
 of
   each
   customer record at exactly the time that the targeting was done, you
   cannot
   know that *all* customers have a null name and address before you
  target
   them.  This sort of time machine leak can be enormously more subtle
  than
   this.
  
  
  
   On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan gkhn...@gmail.com
  wrote:
  
   Gokhan
  
  
   On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning ted.dunn...@gmail.com
 
   wrote:
  
   On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi 
   vishal.santo...@gmail.com
  
  
  
   Are we to assume that SGD is still a work in progress and
   implementations (
   Cross Fold, Online, Adaptive ) are too flawed to be realistically
   used
   ?
  
  
   They are too raw to be accepted uncritically, for sure.  They have
   been
   used successfully in production.
  
  
   The evolutionary algorithm seems to be the core of
   OnlineLogisticRegression,
   which in turn builds up to Adaptive/Cross Fold.
  
   b) for truly on-line learning where no repeated passes through
 the
   data..
  
   What would it take to get to an implementation ? How can any one
   help ?
  
  
   Would you like to help on this?  The amount of work required to
 get a
   distributed asynchronous learner up is moderate, but definitely not
   huge.
  
  
   Ted, do you describe a generic distributed learner for all kinds of
   online
   algorithms? Possibly zookeeper-coordinated and with #predict and
   #getFeedbackAndUpdateTheModel methods?
  
  
   I think that OnlineLogisticRegression is basically sound, but
 should
   get
   a
   better learning 

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2014-02-28 Thread Ted Dunning
I have been swamped.  Generally adagrad is a great idea. The code looks fine 
at first glance.  Certainly some sort of adagrad would be preferable to the 
hack that I put in. 

Sent from my iPhone

 On Feb 26, 2014, at 18:30, Vishal Santoshi vishal.santo...@gmail.com wrote:
 
 Ted,  Any feedback ?
 
 
 On Mon, Feb 24, 2014 at 2:58 PM, Vishal Santoshi
 vishal.santo...@gmail.comwrote:
 
 Hello Ted,
 
   This is regarding the AdaGrad update per feature. I have
 attached  a file which reflects
 http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf  ( 2 )
 
 
 
 It does differ from OnlineLogisticRegression in the way it implements
 
 public double perTermLearningRate(int j) ;
 
 
 This class maintains 2 Dense Vectors
 
 /**
 
 * ADA  Per Term Sum of Squares of Learning gradients
 
 */
 
 protected Vector perTermLSumOfSquaresOfGradients;
 
 /**
 
 * ADA Per Term Learning gradient
 
 */
 
 protected Vector perTermGradients;
 
 and it overrides the learn( ) method to  update these two vectors
 respectively.
 
 
 
 
 Please tell me if I am totally off here.
 
 
 
 Thank you for your help and Regards.
 
 
 Vishal Santoshi.
 
 
  PS: I had wrongly interpreted the code in the last 2 emails. Please ignore them.
 
 
 
 On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning ted.dunn...@gmail.comwrote:
 
 :-)
 
 Many leaks are *very* subtle.
 
 One leak that had me going for weeks was in a news wire corpus.  I
 couldn't
 figure out why the cross validation was so good and running the classifier
  on new data was so much worse.
 
 The answer was that the training corpus had near-duplicate articles.  This
 means that there was leakage between the training and test corpora.  This
 wasn't quite a target leak, but it was a leak.
 
 For target leaks, it is very common to have partial target leaks due to
 the
 fact that you learn more about positive cases after the moment that you
 had
 to select which case to investigate.  Suppose, for instance you are
 targeting potential customers based on very limited information.  If you
 make an enticing offer to the people you target, then those who accept the
 offer will buy something from you.  You will also learn some particulars
 such as name and address from those who buy from you.
 
 Looking retrospectively, it looks like you can target good customers who
 have names or addresses that are not null.  Without a good snapshot of
 each
 customer record at exactly the time that the targeting was done, you
 cannot
 know that *all* customers have a null name and address before you target
 them.  This sort of time machine leak can be enormously more subtle than
 this.
 
 
 
 On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan gkhn...@gmail.com wrote:
 
 Gokhan
 
 
 On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
 On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi 
 vishal.santo...@gmail.com
 
 
 
 Are we to assume that SGD is still a work in progress and
 implementations (
 Cross Fold, Online, Adaptive ) are too flawed to be realistically
 used
 ?
 
 
 They are too raw to be accepted uncritically, for sure.  They have
 been
 used successfully in production.
 
 
 The evolutionary algorithm seems to be the core of
 OnlineLogisticRegression,
 which in turn builds up to Adaptive/Cross Fold.
 
 b) for truly on-line learning where no repeated passes through the
 data..
 
 What would it take to get to an implementation ? How can any one
 help ?
 
 
 Would you like to help on this?  The amount of work required to get a
 distributed asynchronous learner up is moderate, but definitely not
 huge.
 
 
 Ted, do you describe a generic distributed learner for all kinds of
 online
 algorithms? Possibly zookeeper-coordinated and with #predict and
 #getFeedbackAndUpdateTheModel methods?
 
 
 I think that OnlineLogisticRegression is basically sound, but should
 get
 a
 better learning rate update equation.  That would largely make the
  Adaptive* stuff unnecessary, especially if OLR could be used in the
 distributed asynchronous learner.
 
 
 
 
 


Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2014-02-26 Thread Vishal Santoshi
Ted,  Any feedback ?


On Mon, Feb 24, 2014 at 2:58 PM, Vishal Santoshi
vishal.santo...@gmail.comwrote:

 Hello Ted,

    This is regarding the AdaGrad update per feature. I have
 attached  a file which reflects
 http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf  ( 2 )



 It does differ from OnlineLogisticRegression in the way it implements

 public double perTermLearningRate(int j) ;


 This class maintains 2 Dense Vectors

 /**

  * ADA  Per Term Sum of Squares of Learning gradients

  */

 protected Vector perTermLSumOfSquaresOfGradients;

 /**

  * ADA Per Term Learning gradient

  */

 protected Vector perTermGradients;

 and it overrides the learn( ) method to  update these two vectors
 respectively.




 Please tell me if I am totally off here.



 Thank you for your help and Regards.


 Vishal Santoshi.


  PS: I had wrongly interpreted the code in the last 2 emails. Please ignore them.



 On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning ted.dunn...@gmail.comwrote:

 :-)

 Many leaks are *very* subtle.

 One leak that had me going for weeks was in a news wire corpus.  I
 couldn't
 figure out why the cross validation was so good and running the classifier
  on new data was so much worse.

 The answer was that the training corpus had near-duplicate articles.  This
 means that there was leakage between the training and test corpora.  This
 wasn't quite a target leak, but it was a leak.

 For target leaks, it is very common to have partial target leaks due to
 the
 fact that you learn more about positive cases after the moment that you
 had
 to select which case to investigate.  Suppose, for instance you are
 targeting potential customers based on very limited information.  If you
 make an enticing offer to the people you target, then those who accept the
 offer will buy something from you.  You will also learn some particulars
 such as name and address from those who buy from you.

 Looking retrospectively, it looks like you can target good customers who
 have names or addresses that are not null.  Without a good snapshot of
 each
 customer record at exactly the time that the targeting was done, you
 cannot
 know that *all* customers have a null name and address before you target
 them.  This sort of time machine leak can be enormously more subtle than
 this.



 On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan gkhn...@gmail.com wrote:

  Gokhan
 
 
  On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
   On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi 
   vishal.santo...@gmail.com
  
   
   
Are we to assume that SGD is still a work in progress and
   implementations (
Cross Fold, Online, Adaptive ) are too flawed to be realistically
 used
  ?
   
  
   They are too raw to be accepted uncritically, for sure.  They have
 been
   used successfully in production.
  
  
The evolutionary algorithm seems to be the core of
OnlineLogisticRegression,
which in turn builds up to Adaptive/Cross Fold.
   
b) for truly on-line learning where no repeated passes through the
   data..
   
What would it take to get to an implementation ? How can any one
 help ?
   
  
   Would you like to help on this?  The amount of work required to get a
   distributed asynchronous learner up is moderate, but definitely not
 huge.
  
 
  Ted, do you describe a generic distributed learner for all kinds of
 online
  algorithms? Possibly zookeeper-coordinated and with #predict and
  #getFeedbackAndUpdateTheModel methods?
 
  
   I think that OnlineLogisticRegression is basically sound, but should
 get
  a
   better learning rate update equation.  That would largely make the
    Adaptive* stuff unnecessary, especially if OLR could be used in the
   distributed asynchronous learner.
  
 





Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2014-02-24 Thread Vishal Santoshi
Hello Ted,

  This is regarding the AdaGrad update per feature. I have
attached  a file which reflects
http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf ( 2 )



It does differ from OnlineLogisticRegression in the way it implements

public double perTermLearningRate(int j) ;


This class maintains 2 Dense Vectors

/**

 * ADA  Per Term Sum of Squares of Learning gradients

 */

protected Vector perTermLSumOfSquaresOfGradients;

/**

 * ADA Per Term Learning gradient

 */

protected Vector perTermGradients;

and it overrides the learn( ) method to  update these two vectors
respectively.




Please tell me if I am totally off here.



Thank you for your help and Regards.


Vishal Santoshi.


PS: I had wrongly interpreted the code in the last 2 emails. Please ignore them.



On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 :-)

 Many leaks are *very* subtle.

 One leak that had me going for weeks was in a news wire corpus.  I couldn't
 figure out why the cross validation was so good and running the classifier
 on new data was so much worse.

 The answer was that the training corpus had near-duplicate articles.  This
 means that there was leakage between the training and test corpora.  This
 wasn't quite a target leak, but it was a leak.

 For target leaks, it is very common to have partial target leaks due to the
 fact that you learn more about positive cases after the moment that you had
 to select which case to investigate.  Suppose, for instance you are
 targeting potential customers based on very limited information.  If you
 make an enticing offer to the people you target, then those who accept the
 offer will buy something from you.  You will also learn some particulars
 such as name and address from those who buy from you.

 Looking retrospectively, it looks like you can target good customers who
 have names or addresses that are not null.  Without a good snapshot of each
 customer record at exactly the time that the targeting was done, you cannot
 know that *all* customers have a null name and address before you target
 them.  This sort of time machine leak can be enormously more subtle than
 this.



 On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan gkhn...@gmail.com wrote:

  Gokhan
 
 
  On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
   On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi 
   vishal.santo...@gmail.com
  
   
   
Are we to assume that SGD is still a work in progress and
   implementations (
Cross Fold, Online, Adaptive ) are too flawed to be realistically
 used
  ?
   
  
   They are too raw to be accepted uncritically, for sure.  They have been
   used successfully in production.
  
  
The evolutionary algorithm seems to be the core of
OnlineLogisticRegression,
which in turn builds up to Adaptive/Cross Fold.
   
b) for truly on-line learning where no repeated passes through the
   data..
   
What would it take to get to an implementation ? How can any one
 help ?
   
  
   Would you like to help on this?  The amount of work required to get a
   distributed asynchronous learner up is moderate, but definitely not
 huge.
  
 
  Ted, do you describe a generic distributed learner for all kinds of
 online
  algorithms? Possibly zookeeper-coordinated and with #predict and
  #getFeedbackAndUpdateTheModel methods?
 
  
   I think that OnlineLogisticRegression is basically sound, but should
 get
  a
   better learning rate update equation.  That would largely make the
    Adaptive* stuff unnecessary, especially if OLR could be used in the
   distributed asynchronous learner.
  
 



Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2014-02-20 Thread Vishal Santoshi
Hey Ted,

 I presume that you would like an Adagrad-like solution to replace the
above?

Things that I could glean out.




 *  Maintain a simple d-dimensional vector to store a running
total of the squares of the gradients, where d is the number of terms.  Call
it gradients.




*  Based on

 Since the learning rate for each feature is quickly adapted, the
value for η is far less important than it is with SGD. I have used η = 1.0
for a very large number of different problems. The primary role of
η is to determine how much a feature changes the very first time it is
encountered, so in problems with large numbers of extremely rare features,
some additional care may be warranted.

 How important or even necessary is perTermLearningRate(j)?




*  double newValue = beta.getQuick(i, j) + gradientBase * learningRate *
perTermLearningRate(j) * instance.get(j);

   becomes

double newGradient = beta.getQuick(i, j) + (learningRate / Math.sqrt(
gradients(i))) * instance.get(j);

gradients(i) = gradients(i) + newGradient^2;
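
Read literally, that last line squares the updated weight rather than the
raw gradient; here is a hedged restatement with the accumulation applied to
the gradient term itself (my reading of the cited notes; all names assumed):

class AdaGradUpdate {
  // g is the raw gradient contribution for term j of this instance.
  static double step(double oldBeta, double g, double[] gradients,
                     int j, double learningRate) {
    gradients[j] += g * g;  // accumulate the squared gradient first
    if (gradients[j] == 0.0) {
      return oldBeta;  // zero gradient on an unseen term: no update
    }
    return oldBeta + (learningRate / Math.sqrt(gradients[j])) * g;
  }
}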





Does this make sense ? The only thing is that the abstract class changes.


Regards.




On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 :-)

 Many leaks are *very* subtle.

 One leak that had me going for weeks was in a news wire corpus.  I couldn't
 figure out why the cross validation was so good and running the classifier
 on new data was so much worse.

 The answer was that the training corpus had near-duplicate articles.  This
 means that there was leakage between the training and test corpora.  This
 wasn't quite a target leak, but it was a leak.

 For target leaks, it is very common to have partial target leaks due to the
 fact that you learn more about positive cases after the moment that you had
 to select which case to investigate.  Suppose, for instance you are
 targeting potential customers based on very limited information.  If you
 make an enticing offer to the people you target, then those who accept the
 offer will buy something from you.  You will also learn some particulars
 such as name and address from those who buy from you.

 Looking retrospectively, it looks like you can target good customers who
 have names or addresses that are not null.  Without a good snapshot of each
 customer record at exactly the time that the targeting was done, you cannot
 know that *all* customers have a null name and address before you target
 them.  This sort of time machine leak can be enormously more subtle than
 this.



 On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan gkhn...@gmail.com wrote:

  Gokhan
 
 
  On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
   On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi 
   vishal.santo...@gmail.com
  
   
   
Are we to assume that SGD is still a work in progress and
   implementations (
Cross Fold, Online, Adaptive ) are too flawed to be realistically
 used
  ?
   
  
   They are too raw to be accepted uncritically, for sure.  They have been
   used successfully in production.
  
  
The evolutionary algorithm seems to be the core of
OnlineLogisticRegression,
which in turn builds up to Adaptive/Cross Fold.
   
b) for truly on-line learning where no repeated passes through the
   data..
   
What would it take to get to an implementation ? How can any one
 help ?
   
  
   Would you like to help on this?  The amount of work required to get a
   distributed asynchronous learner up is moderate, but definitely not
 huge.
  
 
  Ted, do you describe a generic distributed learner for all kinds of
 online
  algorithms? Possibly zookeeper-coordinated and with #predict and
  #getFeedbackAndUpdateTheModel methods?
 
  
   I think that OnlineLogisticRegression is basically sound, but should
 get
  a
   better learning rate update equation.  That would largely make the
    Adaptive* stuff unnecessary, especially if OLR could be used in the
   distributed asynchronous learner.
  
 



Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2014-02-20 Thread Vishal Santoshi
I do see that regularize() has the prior (L1 and L2) depend on
perTermLearningRate(j)
...


On Thu, Feb 20, 2014 at 11:49 AM, Vishal Santoshi vishal.santo...@gmail.com
 wrote:

 Hey Ted,

  I presume that you would like an Adagrad-like solution to replace the
 above?

 Things that I could glean out.




  *  Maintain a simple d-dimensional vector to store a running
 total of the squares of the gradients, where d is the number of terms.  Call
 it gradients.




 *  Based on

  Since the learning rate for each feature is quickly adapted, the
 value for η is far less important than it is with SGD. I have used η = 1.0 for
 a very large number of different problems. The primary role of
 η is to determine how much a feature changes the very first time it is
 encountered, so in problems with large numbers of extremely rare features,
 some additional care may be warranted.

  How important or even necessary is perTermLearningRate(j)?




 *  double newValue = beta.getQuick(i, j) + gradientBase * learningRate *
 perTermLearningRate(j) * instance.get(j);

becomes

 double newGradient = beta.getQuick(i, j) + (learningRate / Math.sqrt(
 gradients(i))) * instance.get(j);

 gradients(i) = gradients(i) + newGradient^2;





 Does this make sense ? The only thing is that the abstract class changes.


 Regards.




 On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning ted.dunn...@gmail.comwrote:

 :-)

 Many leaks are *very* subtle.

 One leak that had me going for weeks was in a news wire corpus.  I
 couldn't
 figure out why the cross validation was so good and running the classifier
 on new data was so much worse.

 The answer was that the training corpus had near-duplicate articles.  This
 means that there was leakage between the training and test corpora.  This
 wasn't quite a target leak, but it was a leak.

 For target leaks, it is very common to have partial target leaks due to
 the
 fact that you learn more about positive cases after the moment that you
 had
 to select which case to investigate.  Suppose, for instance you are
 targeting potential customers based on very limited information.  If you
 make an enticing offer to the people you target, then those who accept the
 offer will buy something from you.  You will also learn some particulars
 such as name and address from those who buy from you.

 Looking retrospectively, it looks like you can target good customers who
 have names or addresses that are not null.  Without a good snapshot of
 each
 customer record at exactly the time that the targeting was done, you
 cannot
 know that *all* customers have a null name and address before you target
 them.  This sort of time machine leak can be enormously more subtle than
 this.



 On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan gkhn...@gmail.com wrote:

  Gokhan
 
 
  On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
   On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi 
   vishal.santo...@gmail.com
  
   
   
Are we to assume that SGD is still a work in progress and
   implementations (
Cross Fold, Online, Adaptive ) are too flawed to be realistically
 used
  ?
   
  
   They are too raw to be accepted uncritically, for sure.  They have
 been
   used successfully in production.
  
  
The evolutionary algorithm seems to be the core of
OnlineLogisticRegression,
which in turn builds up to Adaptive/Cross Fold.
   
b) for truly on-line learning where no repeated passes through the
   data..
   
What would it take to get to an implementation ? How can any one
 help ?
   
  
   Would you like to help on this?  The amount of work required to get a
   distributed asynchronous learner up is moderate, but definitely not
 huge.
  
 
  Ted, do you describe a generic distributed learner for all kinds of
 online
  algorithms? Possibly zookeeper-coordinated and with #predict and
  #getFeedbackAndUpdateTheModel methods?
 
  
   I think that OnlineLogisticRegression is basically sound, but should
 get
  a
   better learning rate update equation.  That would largely make the
    Adaptive* stuff unnecessary, especially if OLR could be used in the
   distributed asynchronous learner.
  
 





Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-12-29 Thread Ted Dunning
:-)

Many leaks are *very* subtle.

One leak that had me going for weeks was in a news wire corpus.  I couldn't
figure out why the cross validation was so good and running the classifier
on new data was so much worse.

The answer was that the training corpus had near-duplicate articles.  This
means that there was leakage between the training and test corpora.  This
wasn't quite a target leak, but it was a leak.

For target leaks, it is very common to have partial target leaks due to the
fact that you learn more about positive cases after the moment that you had
to select which case to investigate.  Suppose, for instance you are
targeting potential customers based on very limited information.  If you
make an enticing offer to the people you target, then those who accept the
offer will buy something from you.  You will also learn some particulars
such as name and address from those who buy from you.

Looking retrospectively, it looks like you can target good customers who
have names or addresses that are not null.  Without a good snapshot of each
customer record at exactly the time that the targeting was done, you cannot
know that *all* customers have a null name and address before you target
them.  This sort of time machine leak can be enormously more subtle than
this.



On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan gkhn...@gmail.com wrote:

 Gokhan


 On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi 
  vishal.santo...@gmail.com
 
  
  
   Are we to assume that SGD is still a work in progress and
  implementations (
   Cross Fold, Online, Adaptive ) are too flawed to be realistically used
 ?
  
 
  They are too raw to be accepted uncritically, for sure.  They have been
  used successfully in production.
 
 
   The evolutionary algorithm seems to be the core of
   OnlineLogisticRegression,
   which in turn builds up to Adaptive/Cross Fold.
  
   b) for truly on-line learning where no repeated passes through the
  data..
  
   What would it take to get to an implementation ? How can any one help ?
  
 
  Would you like to help on this?  The amount of work required to get a
  distributed asynchronous learner up is moderate, but definitely not huge.
 

 Ted, do you describe a generic distributed learner for all kinds of online
 algorithms? Possibly zookeeper-coordinated and with #predict and
 #getFeedbackAndUpdateTheModel methods?

 
  I think that OnlineLogisticRegression is basically sound, but should get
 a
  better learning rate update equation.  That would largely make the
   Adaptive* stuff unnecessary, especially if OLR could be used in the
  distributed asynchronous learner.
 



Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-12-04 Thread optimusfan




 We've been playing around with a number of different parameters, feature
 selection, etc. and are able to achieve pretty good results in
 cross-validation.

When you say cross validation, do you mean the magic cross validation that
the ALR uses?  Or do you mean your 20%?

I mean the 20%.  Does the ALR algorithm do its own cross validation?  I was 
under the impression that it did training and testing steps with a percentage 
split based on the number of something (CrossFoldLearners?) in the object.  Is 
that correct?  As I said, we've been holding back 20% to do our own cross 
validation.

  We have a ton of different metrics we're tracking on the results, most
 significant to this discussion is that it looks like we're achieving very
 good precision (typically .85 or .9) and a good f1-score (typically again
 .85 or .9).

These are extremely good results.   In fact they are good enough I would
starting thinking about a target leak.

The possibility of a target leak is interesting as it hadn't occurred to me 
previously.  However, thinking it through I'm less inclined to think it's a 
possibility.  We wrote a simple program to extract the model features and 
weights and I would think a leak would be obvious there, yes?  The terms we're 
seeing seem to make sense.

However, when we then take the models generated and try to apply them to
 some new documents, we're getting many more false positives than we would
 expect.  Documents that should have 2 categories are testing positive for
 16, which is well above what I'd expect.  By my math I should expect 2 true
 positives, plus maybe 4.4 (.10 false positives * 44 classes) additional
 false positives.


You said documents.  Where do these documents come from?

Sorry, to clarify all of our inputs are documents.  Specifically, they're 
technical (scientific) papers written by people at our company.  The documents 
are indexed in SOLR, and we use the Mahout lucene.vector tool to extract our data.  
We started our development of this process a couple of months ago and took an 
extract from SOLR at that time.  The new documents we're trying to classify 
after settling on a model are those that have come in to SOLR after that 
extraction took place.

One way to get results just like you describe is if you train on raw news
wire that is split randomly between training and test.  What can happen is
that stories that get edited and republished have a high chance of getting
at least one version in both training and test.  This means that the
supposedly independent test set actually has significant overlap with the
training set.  If your classifier over-fits, then the test set doesn't
catch the problem.

I don't believe this is happening, but it is worth checking into.  

Another way to get this sort of problem is if you do your training/test
randomly, but the new documents come from a later time.  If your classifier
is a good classifier, but is highly specific to documents from a particular
moment in time, then your test performance will be a realistic estimate of
performance for contemporaneous documents but will be much higher than
performance on documents from a later point in time.

The temporal aspect is an interesting one.  I will have to check on that.

A third option could happen if your training and test sets were somehow
scrubbed of poorly structured and invalid documents.  This often happens.
Then, in the real system, if the scrubbing is not done, the classifier may
fail because the new documents are not scrubbed in the same way as the
training documents.

I think we've handled this.  I'm processing new documents programmatically 
through an analysis chain that I believe accurately mimics the one that I 
indexed against in SOLR.  The results were complete garbage before I made them 
match exactly.  In addition, wouldn't I expect more false negatives than false 
positives if that was the case?

Well, I think that, almost by definition, you have an overfitting problem
of some kind.  The question is what kind.  The only thing that I think
you don't have is a frank target leak in your documents.  That would
(probably) have given you even higher scores on your test case.
Is there any easy way to detect an overfit?  We've noticed at least one 
interesting thing that seems to be typical of the bad models.  For each class a 
percentage confidence score is reported.  With our binary models obviously 
the choices are 0 or 1.   The bad models tend to be very certain in their 
answers -- e.g. it's either 99% certain it is or isn't a particular class.  Is 
that indicative of overfitting, or completely unrelated?

THANKS!
Ian

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-12-02 Thread Gokhan Capan
Gokhan


On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi 
 vishal.santo...@gmail.com

 
 
  Are we to assume that SGD is still a work in progress and
 implementations (
  Cross Fold, Online, Adaptive ) are too flawed to be realistically used ?
 

 They are too raw to be accepted uncritically, for sure.  They have been
 used successfully in production.


  The evolutionary algorithm seems to be the core of
  OnlineLogisticRegression,
  which in turn builds up to Adaptive/Cross Fold.
 
  b) for truly on-line learning where no repeated passes through the
 data..
 
  What would it take to get to an implementation ? How can any one help ?
 

 Would you like to help on this?  The amount of work required to get a
 distributed asynchronous learner up is moderate, but definitely not huge.


Ted, do you describe a generic distributed learner for all kinds of online
algorithms? Possibly zookeeper-coordinated and with #predict and
#getFeedbackAndUpdateTheModel methods?
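
Sketching that suggestion as a plain interface (purely hypothetical; the two
method names come from this message, nothing here is an existing Mahout API):

import org.apache.mahout.math.Vector;

// Hypothetical contract for a zookeeper-coordinated online learner.
interface DistributedOnlineLearner {
  // Score an instance against the current (possibly stale) local model.
  double predict(Vector instance);

  // Consume delayed ground truth and fold the resulting gradient into
  // the shared model state.
  void getFeedbackAndUpdateTheModel(Vector instance, int actual);
}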


 I think that OnlineLogisticRegression is basically sound, but should get a
 better learning rate update equation.  That would largely make the
  Adaptive* stuff unnecessary, especially if OLR could be used in the
 distributed asynchronous learner.



Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-12-02 Thread Ted Dunning
Inline


On Mon, Dec 2, 2013 at 8:55 AM, optimusfan optimus...@yahoo.com wrote:

 ... To accomplish this, we used AdaptiveLogisticRegression and trained 46
 binary classification models.  Our approach has been to do an 80/20 split
 on the data, holding the 20% back for cross-validation of the models we
 generate.


Sounds reasonable.


 We've been playing around with a number of different parameters, feature
 selection, etc. and are able to achieve pretty good results in
 cross-validation.


When you say cross validation, do you mean the magic cross validation that
the ALR uses?  Or do you mean your 20%?


  We have a ton of different metrics we're tracking on the results, most
 significant to this discussion is that it looks like we're achieving very
 good precision (typically .85 or .9) and a good f1-score (typically again
 .85 or .9).


These are extremely good results.   In fact they are good enough I would
starting thinking about a target leak.

 However, when we then take the models generated and try to apply them to
 some new documents, we're getting many more false positives than we would
 expect.  Documents that should have 2 categories are testing positive for
 16, which is well above what I'd expect.  By my math I should expect 2 true
 positives, plus maybe 4.4 (.10 false positives * 44 classes) additional
 false positives.


You said documents.  Where do these documents come from?

One way to get results just like you describe is if you train on raw news
wire that is split randomly between training and test.  What can happen is
that stories that get edited and republished have a high chance of getting
at least one version in both training and test.  This means that the
supposedly independent test set actually has significant overlap with the
training set.  If your classifier over-fits, then the test set doesn't
catch the problem.

Another way to get this sort of problem is if you do your training/test
randomly, but the new documents come from a later time.  If your classifier
is a good classifier, but is highly specific to documents from a particular
moment in time, then your test performance will be a realistic estimate of
performance for contemporaneous documents but will be much higher than
performance on documents from a later point in time.

A third option could happen if your training and test sets were somehow
scrubbed of poorly structured and invalid documents.  This often happens.
 Then, in the real system, if the scrubbing is not done, the classifier may
fail because the new documents are not scrubbed in the same way as the
training documents.

These are just a few of the ways that *I* have screwed up building
classifiers.  I am sure that there are more.

We suspected that perhaps our models were underfitting or overfitting,
 hence this post.  However, I'll take any and all suggestions for anything
 else we should be looking at.


Well, I think that, almost by definition, you have an overfitting problem
of some kind.  The question is what kind.  The only thing that I think
you don't have is a frank target leak in your documents.  That would
(probably) have given you even higher scores on your test case.


Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-11-28 Thread Vishal Santoshi
Absolutely. I will read through.  The idea is to first  fix the learning
rate update equation in OLR.
I think this code  in  OnlineLogisticRegression is the current equation ?

@Override
public double currentLearningRate() {
  return mu0 * Math.pow(decayFactor, getStep())
      * Math.pow(getStep() + stepOffset, forgettingExponent);
}
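
Spelled out, that schedule is (my transcription of the method above, not a
new formula):

  rate(step) = mu0 * decayFactor^step * (step + stepOffset)^forgettingExponent

i.e. one global decay shared by every term.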


I presume that you would like an Adagrad-like solution to replace the above?






On Wed, Nov 27, 2013 at 8:18 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi 
 vishal.santo...@gmail.com

 
 
  Are we to assume that SGD is still a work in progress and
 implementations (
  Cross Fold, Online, Adaptive ) are too flawed to be realistically used ?
 

 They are too raw to be accepted uncritically, for sure.  They have been
 used successfully in production.


  The evolutionary algorithm seems to be the core of
  OnlineLogisticRegression,
  which in turn builds up to Adaptive/Cross Fold.
 
  b) for truly on-line learning where no repeated passes through the
 data..
 
  What would it take to get to an implementation ? How can any one help ?
 

 Would you like to help on this?  The amount of work required to get a
 distributed asynchronous learner up is moderate, but definitely not huge.

 I think that OnlineLogisticRegression is basically sound, but should get a
 better learning rate update equation.  That would largely make the
  Adaptive* stuff unnecessary, especially if OLR could be used in the
 distributed asynchronous learner.



Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-11-28 Thread Ted Dunning
Yes.  Exactly.


On Thu, Nov 28, 2013 at 6:32 AM, Vishal Santoshi
vishal.santo...@gmail.comwrote:

 Absolutely. I will read through.  The idea is to first  fix the learning
 rate update equation in OLR.
 I think this code  in  OnlineLogisticRegression is the current equation ?

 @Override
 public double currentLearningRate() {
   return mu0 * Math.pow(decayFactor, getStep())
       * Math.pow(getStep() + stepOffset, forgettingExponent);
 }


 I presume that you would like an Adagrad-like solution to replace the above?






 On Wed, Nov 27, 2013 at 8:18 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

  On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi 
  vishal.santo...@gmail.com
 
  
  
   Are we to assume that SGD is still a work in progress and
  implementations (
   Cross Fold, Online, Adaptive ) are too flawed to be realistically used
 ?
  
 
  They are too raw to be accepted uncritically, for sure.  They have been
  used successfully in production.
 
 
   The evolutionary algorithm seems to be the core of
   OnlineLogisticRegression,
   which in turn builds up to Adaptive/Cross Fold.
  
   b) for truly on-line learning where no repeated passes through the
  data..
  
   What would it take to get to an implementation ? How can any one help ?
  
 
  Would you like to help on this?  The amount of work required to get a
  distributed asynchronous learner up is moderate, but definitely not huge.
 
  I think that OnlineLogisticRegression is basically sound, but should get
 a
  better learning rate update equation.  That would largely make the
   Adaptive* stuff unnecessary, especially if OLR could be used in the
  distributed asynchronous learner.
 



Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-11-27 Thread Vishal Santoshi
Hell Ted,

Are we to assume that SGD is still a work in progress and implementations (
Cross Fold, Online, Adaptive ) are too flawed to be realistically used ?
The evolutionary algorithm seems to be the core of OnlineLogisticRegression,
which in turn builds up to Adaptive/Cross Fold.

b) for truly on-line learning where no repeated passes through the data..

What would it take to get to an implementation ? How can any one help ?

Regards,





On Wed, Nov 27, 2013 at 2:26 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 Well, first off, let me say that I am much less of a fan now of the magical
 cross validation approach and adaptation based on that than I was when I
 wrote the ALR code.  There are definitely legs in the ideas, but my
 implementation has a number of flaws.

 For example:

 a) the way that I provide for handling multiple passes through the data is
 very easy to screw up.  I think that simply separating the data entirely
 might be a better approach.

 b) for truly on-line learning where no repeated passes through the data
 will ever occur, then cross validation is not the best choice.  Much better
 in those cases to use what Google researchers described in [1].

 c) it is clear from several reports that the evolutionary algorithm
 prematurely shuts down the learning rate.  I think that Adagrad-like
 learning rates are more reliable.  See [1] again for one of the more
 readable descriptions of this.  See also [2] for another view on adaptive
 learning rates.

 d) item (c) is also related to the way that learning rates are adapted in
 the underlying OnlineLogisticRegression.  That needs to be fixed.

 e) asynchronous parallel stochastic gradient descent with mini-batch
 learning is where we should be headed.  I do not have time to write it,
 however.

 All this aside, I am happy to help in any way that I can given my recent
 time limits.


 [1] http://research.google.com/pubs/pub41159.html

 [2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf



 On Tue, Nov 26, 2013 at 12:54 PM, optimusfan optimus...@yahoo.com wrote:

  Hi-
 
  We're currently working on a binary classifier using
  Mahout's AdaptiveLogisticRegression class.  We're trying to determine
  whether or not the models are suffering from high bias or variance and
 were
  wondering how to do this using Mahout's APIs?  I can easily calculate the
  cross validation error and I think I could detect high bias or variance
 if
  I could compare that number to my training error, but I'm not sure how to
  do this.  Or, any other ideas would be appreciated!
 
  Thanks,
  Ian



Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-11-27 Thread Vishal Santoshi
Sorry to spam, I never meant the "Hello" to come out as "Hell". Given a
little disappointment in the mail, I figure I would rather spam than be
misunderstood.



On Wed, Nov 27, 2013 at 10:07 AM, Vishal Santoshi vishal.santo...@gmail.com
 wrote:

 Hell Ted,

 Are we to assume that SGD is still a work in progress and implementations
 ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used ?
 The evolutionary algorithm seems to be the core of OnlineLogisticRegression,
 which in turn builds up to Adaptive/Cross Fold.

 b) for truly on-line learning where no repeated passes through the
 data..

 What would it take to get to an implementation ? How can any one help ?

 Regards,





 On Wed, Nov 27, 2013 at 2:26 AM, Ted Dunning ted.dunn...@gmail.comwrote:

 Well, first off, let me say that I am much less of a fan now of the
 magical
 cross validation approach and adaptation based on that than I was when I
 wrote the ALR code.  There are definitely legs in the ideas, but my
 implementation has a number of flaws.

 For example:

 a) the way that I provide for handling multiple passes through the data is
 very easy to screw up.  I think that simply separating the data entirely
 might be a better approach.

 b) for truly on-line learning where no repeated passes through the data
 will ever occur, then cross validation is not the best choice.  Much
 better
 in those cases to use what Google researchers described in [1].

 c) it is clear from several reports that the evolutionary algorithm
 prematurely shuts down the learning rate.  I think that Adagrad-like
 learning rates are more reliable.  See [1] again for one of the more
 readable descriptions of this.  See also [2] for another view on adaptive
 learning rates.

 d) item (c) is also related to the way that learning rates are adapted in
 the underlying OnlineLogisticRegression.  That needs to be fixed.

 e) asynchronous parallel stochastic gradient descent with mini-batch
 learning is where we should be headed.  I do not have time to write it,
 however.

 All this aside, I am happy to help in any way that I can given my recent
 time limits.


 [1] http://research.google.com/pubs/pub41159.html

 [2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf



 On Tue, Nov 26, 2013 at 12:54 PM, optimusfan optimus...@yahoo.com
 wrote:

  Hi-
 
  We're currently working on a binary classifier using
  Mahout's AdaptiveLogisticRegression class.  We're trying to determine
  whether or not the models are suffering from high bias or variance and
 were
  wondering how to do this using Mahout's APIs?  I can easily calculate
 the
  cross validation error and I think I could detect high bias or variance
 if
  I could compare that number to my training error, but I'm not sure how
 to
  do this.  Or, any other ideas would be appreciated!
 
  Thanks,
  Ian





Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-11-27 Thread Ted Dunning
No problem at all.  Kind of funny.



On Wed, Nov 27, 2013 at 7:08 AM, Vishal Santoshi
vishal.santo...@gmail.comwrote:

 Sorry to spam, I never meant the "Hello" to come out as "Hell". Given a
 little disappointment in the mail, I figure I would rather spam than be
 misunderstood.



 On Wed, Nov 27, 2013 at 10:07 AM, Vishal Santoshi 
 vishal.santo...@gmail.com
  wrote:

  Hell Ted,
 
  Are we to assume that SGD is still a work in progress and implementations
  ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used
 ?
  The evolutionary algorithm seems to be the core of
 OnlineLogisticRegression,
  which in turn builds up to Adaptive/Cross Fold.
 
  b) for truly on-line learning where no repeated passes through the
  data..
 
  What would it take to get to an implementation ? How can any one help ?
 
  Regards,
 
 
 
 
 
  On Wed, Nov 27, 2013 at 2:26 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
  Well, first off, let me say that I am much less of a fan now of the
  magical
  cross validation approach and adaptation based on that than I was when I
  wrote the ALR code.  There are definitely legs in the ideas, but my
  implementation has a number of flaws.
 
  For example:
 
  a) the way that I provide for handling multiple passes through the data
 is
  very easy to screw up.  I think that simply separating the data entirely
  might be a better approach.
 
  b) for truly on-line learning where no repeated passes through the data
  will ever occur, then cross validation is not the best choice.  Much
  better
  in those cases to use what Google researchers described in [1].
 
  c) it is clear from several reports that the evolutionary algorithm
  prematurely shuts down the learning rate.  I think that Adagrad-like
  learning rates are more reliable.  See [1] again for one of the more
  readable descriptions of this.  See also [2] for another view on
 adaptive
  learning rates.
 
  d) item (c) is also related to the way that learning rates are adapted
 in
  the underlying OnlineLogisticRegression.  That needs to be fixed.
 
  e) asynchronous parallel stochastic gradient descent with mini-batch
  learning is where we should be headed.  I do not have time to write it,
  however.
 
  All this aside, I am happy to help in any way that I can given my recent
  time limits.
 
 
  [1] http://research.google.com/pubs/pub41159.html
 
  [2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf
 
 
 
  On Tue, Nov 26, 2013 at 12:54 PM, optimusfan optimus...@yahoo.com
  wrote:
 
   Hi-
  
   We're currently working on a binary classifier using
   Mahout's AdaptiveLogisticRegression class.  We're trying to determine
   whether or not the models are suffering from high bias or variance and
  were
   wondering how to do this using Mahout's APIs?  I can easily calculate
  the
   cross validation error and I think I could detect high bias or
 variance
  if
   I could compare that number to my training error, but I'm not sure how
  to
   do this.  Or, any other ideas would be appreciated!
  
   Thanks,
   Ian
 
 
 



Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-11-27 Thread Ted Dunning
On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi vishal.santo...@gmail.com



 Are we to assume that SGD is still a work in progress and implementations (
 Cross Fold, Online, Adaptive ) are too flawed to be realistically used ?


They are too raw to be accepted uncritically, for sure.  They have been
used successfully in production.


 The evolutionary algorithm seems to be the core of
 OnlineLogisticRegression,
 which in turn builds up to Adaptive/Cross Fold.

 b) for truly on-line learning where no repeated passes through the data..

 What would it take to get to an implementation ? How can any one help ?


Would you like to help on this?  The amount of work required to get a
distributed asynchronous learner up is moderate, but definitely not huge.

I think that OnlineLogisticRegression is basically sound, but should get a
better learning rate update equation.  That would largely make the
Adaptive* stuff unnecessary, especially if OLR could be used in the
distributed asynchronous learner.


Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-11-26 Thread optimusfan
Hi-

We're currently working on a binary classifier using Mahout's 
AdaptiveLogisticRegression class.  We're trying to determine whether or not the 
models are suffering from high bias or variance and were wondering how to do 
this using Mahout's APIs?  I can easily calculate the cross validation error 
and I think I could detect high bias or variance if I could compare that number 
to my training error, but I'm not sure how to do this.  Or, any other ideas 
would be appreciated!

Thanks,
Ian
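
One concrete way to compare the two numbers, assuming a trained binary
OnlineLogisticRegression and labeled Mahout vectors are at hand (the Auc
collector is Mahout's; the harness around it is an illustrative sketch):

import java.util.List;

import org.apache.mahout.classifier.evaluation.Auc;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

class BiasVarianceCheck {
  // AUC of a binary model over a labeled set (labels must be 0 or 1).
  static double auc(OnlineLogisticRegression model,
                    List<Vector> instances, List<Integer> labels) {
    Auc collector = new Auc();
    for (int i = 0; i < instances.size(); i++) {
      collector.add(labels.get(i), model.classifyScalar(instances.get(i)));
    }
    return collector.auc();
  }
}

Comparing auc(model, train, trainLabels) against auc(model, heldOut,
heldOutLabels) gives the gap asked about above: two similarly poor numbers
point at high bias, while a large train/held-out gap points at high variance.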

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-11-26 Thread Ted Dunning
Well, first off, let me say that I am much less of a fan now of the magical
cross validation approach and adaptation based on that than I was when I
wrote the ALR code.  There are definitely legs in the ideas, but my
implementation has a number of flaws.

For example:

a) the way that I provide for handling multiple passes through the data is
very easy to screw up.  I think that simply separating the data entirely
might be a better approach.

b) for truly on-line learning where no repeated passes through the data
will ever occur, then cross validation is not the best choice.  Much better
in those cases to use what Google researchers described in [1].

c) it is clear from several reports that the evolutionary algorithm
prematurely shuts down the learning rate.  I think that Adagrad-like
learning rates are more reliable.  See [1] again for one of the more
readable descriptions of this.  See also [2] for another view on adaptive
learning rates.

d) item (c) is also related to the way that learning rates are adapted in
the underlying OnlineLogisticRegression.  That needs to be fixed.

e) asynchronous parallel stochastic gradient descent with mini-batch
learning is where we should be headed.  I do not have time to write it,
however.

All this aside, I am happy to help in any way that I can given my recent
time limits.


[1] http://research.google.com/pubs/pub41159.html

[2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf



On Tue, Nov 26, 2013 at 12:54 PM, optimusfan optimus...@yahoo.com wrote:

 Hi-

 We're currently working on a binary classifier using
 Mahout's AdaptiveLogisticRegression class.  We're trying to determine
 whether or not the models are suffering from high bias or variance and were
 wondering how to do this using Mahout's APIs?  I can easily calculate the
 cross validation error and I think I could detect high bias or variance if
 I could compare that number to my training error, but I'm not sure how to
 do this.  Or, any other ideas would be appreciated!

 Thanks,
 Ian