Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Should we maintain a ( num_categories * num_features ) matrix for per-term learning rates in a num_categories-way classification?

    for (int i = 0; i < num_categories; i++) {
        for (int j = 0; j < num_features; j++) {
            sum_of_squares[i][j] = sum_of_squares[i][j] + (beta[i][j] * beta[i][j]);
            learning_rates[i][j] = (initial_rate / Math.sqrt(sum_of_squares[i][j])) * beta[i][j];
        }
    }

beta in the base class is rightly a ( (num_categories - 1) * num_features ) matrix.

On Fri, Feb 28, 2014 at 11:57 PM, Ted Dunning ted.dunn...@gmail.com wrote:
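For reference, the bookkeeping asked about in this message can be sketched as a standalone class. This is only an illustration of the AdaGrad rule (rate = initial_rate / sqrt(sum of squared gradients)) with made-up names, not Mahout code; note that textbook AdaGrad accumulates squared *gradients*, whereas the loop above squares the coefficients in beta, which may be worth double-checking.

```java
// Standalone sketch of per-term AdaGrad learning rates for an n-way
// classifier: one accumulator and one rate per (category, feature) pair.
class PerTermRates {
    final double initialRate;
    final double[][] sumOfSquares;   // running sum of squared gradients
    final double[][] learningRates;  // current per-term rates

    PerTermRates(int numCategories, int numFeatures, double initialRate) {
        this.initialRate = initialRate;
        this.sumOfSquares = new double[numCategories][numFeatures];
        this.learningRates = new double[numCategories][numFeatures];
    }

    // AdaGrad: accumulate the squared gradient, then shrink the rate.
    void update(int i, int j, double gradient) {
        sumOfSquares[i][j] += gradient * gradient;
        learningRates[i][j] = initialRate / Math.sqrt(sumOfSquares[i][j]);
    }
}
```

With initialRate = 1.0, the first observed gradient of 2.0 for a term gives that term a rate of 1.0 / sqrt(4.0) = 0.5, and the rate only shrinks as more gradient mass accumulates.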
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Yes. I think that maintaining a learning rate for every parameter that is being learned is important. It might help to make that sparse, but I wouldn't think so.

On Sun, Mar 2, 2014 at 1:33 PM, Vishal Santoshi vishal.santo...@gmail.com wrote:
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
I have been swamped.

Generally AdaGrad is a great idea. The code looks fine at first glance. Certainly some sort of AdaGrad would be preferable to the hack that I put in.

Sent from my iPhone

On Feb 26, 2014, at 18:30, Vishal Santoshi vishal.santo...@gmail.com wrote:
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Ted, Any feedback?

On Mon, Feb 24, 2014 at 2:58 PM, Vishal Santoshi vishal.santo...@gmail.com wrote:
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Hello Ted,

This is regarding the AdaGrad update per feature. I have attached a file which reflects http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf ( 2 ). It does differ from OnlineLogisticRegression in the way it implements

    public double perTermLearningRate(int j);

This class maintains 2 dense vectors:

    /** ADA per-term sum of squares of learning gradients */
    protected Vector perTermLSumOfSquaresOfGradients;

    /** ADA per-term learning gradient */
    protected Vector perTermGradients;

and it overrides the learn() method to update these two vectors respectively. Please tell me if I am totally off here. Thank you for your help and regards.

Vishal Santoshi.

PS. I had wrongly interpreted the code in my last 2 emails. Please ignore them.

On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:
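The scheme described in this message, two dense per-term vectors plus an overridden perTermLearningRate(int j), might look roughly like the sketch below. It uses plain arrays as stand-ins for Mahout's Vector and is not the attached file or real Mahout API; the field names mirror the ones quoted above.

```java
// Sketch of the described AdaGrad variant (hypothetical, not Mahout code).
class AdaGradLearner {
    // Base rate; with AdaGrad its exact value matters far less than with SGD.
    final double mu0 = 1.0;
    final double[] perTermSumOfSquaresOfGradients; // ADA per-term sum of squared gradients
    final double[] perTermGradients;               // ADA per-term latest gradient

    AdaGradLearner(int numFeatures) {
        perTermSumOfSquaresOfGradients = new double[numFeatures];
        perTermGradients = new double[numFeatures];
    }

    // Replaces the fixed-decay per-term rate of OnlineLogisticRegression.
    double perTermLearningRate(int j) {
        double s = perTermSumOfSquaresOfGradients[j];
        return s == 0 ? mu0 : mu0 / Math.sqrt(s);
    }

    // Would be called from learn(): record the gradient seen for term j.
    void recordGradient(int j, double gradient) {
        perTermGradients[j] = gradient;
        perTermSumOfSquaresOfGradients[j] += gradient * gradient;
    }
}
```

A term that has never been seen keeps the base rate mu0; after one gradient of 3.0, its rate drops to mu0 / 3.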
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Hey Ted,

I presume that you would like an AdaGrad-like solution to replace the above? Things that I could glean out:

* Maintain a simple d-dimensional vector to store a running total of the squares of the gradients, where d is the number of terms. Say gradients.

* Based on "Since the learning rate for each feature is quickly adapted, the value for η is far less important than it is with SGD. I have used η = 1.0 for a very large number of different problems. The primary role of η is to determine how much a feature changes the very first time it is encountered, so in problems with large numbers of extremely rare features, some additional care may be warranted." -- how important or even necessary is perTermLearningRate(j)?

* The update

      double newValue = beta.getQuick(i, j)
          + gradientBase * learningRate * perTermLearningRate(j) * instance.get(j);

  becomes

      double newGradient = beta.getQuick(i, j)
          + (learningRate / Math.sqrt(gradients(i))) * instance.get(j);
      gradients(i) = gradients(i) + newGradient^2;

Does this make sense? The only thing is that the abstract class changes.

Regards.

On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:
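Cleaned up, the substitution proposed in this message could be sketched as below. The names are hypothetical and the original OnlineLogisticRegression update is only paraphrased in the comment; also note that textbook AdaGrad accumulates the squared *gradient*, whereas the email squares the updated coefficient, so the sketch follows the textbook form.

```java
// Sketch of replacing the fixed-decay update with an AdaGrad update.
class AdaGradUpdate {
    double learningRate = 1.0; // eta; with AdaGrad its exact value matters less
    double[] beta;             // per-term coefficients
    double[] gradients;        // running sum of squared gradients per term

    AdaGradUpdate(int d) { beta = new double[d]; gradients = new double[d]; }

    // Old style (paraphrased):
    //   newValue = beta[j] + gradientBase * learningRate
    //                      * perTermLearningRate(j) * x[j]
    // AdaGrad style:
    void learn(int j, double gradientBase, double xj) {
        double g = gradientBase * xj;   // gradient contribution for term j
        gradients[j] += g * g;          // accumulate the squared gradient first
        beta[j] += (learningRate / Math.sqrt(gradients[j])) * g;
    }
}
```

For example, with learningRate = 1.0, a first step with gradientBase = 1.0 and x[j] = 2.0 gives g = 2.0, an accumulator of 4.0, and a coefficient step of (1.0 / 2.0) * 2.0 = 1.0.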
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
I do see the regularize() has the prior ( L1 and L2 ) depend on perTermLearningRate(j) ...

On Thu, Feb 20, 2014 at 11:49 AM, Vishal Santoshi vishal.santo...@gmail.com wrote:
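The observation above is that regularize() scales the L1/L2 prior by perTermLearningRate(j), so switching to a per-term AdaGrad rate also changes how strongly each term is shrunk. A hypothetical sketch of such a rate-scaled L1 shrinkage step (not Mahout's actual implementation) looks like this:

```java
// Hypothetical lazy L1 shrinkage for one term, scaled by that term's
// learning rate, in the style of truncated-gradient regularization.
class PerTermPrior {
    static double l1Shrink(double betaJ, double lambda, double perTermRate) {
        double shrink = lambda * perTermRate;   // rarely-seen terms shrink less
        if (betaJ > shrink)  return betaJ - shrink;
        if (betaJ < -shrink) return betaJ + shrink;
        return 0.0;                             // small weights are zeroed out
    }
}
```

Because the shrinkage is proportional to the per-term rate, a term whose AdaGrad rate has decayed is regularized more gently than a fresh term, which is exactly why the prior and the rate cannot be changed independently.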
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
:-) Many leaks are *very* subtle.

One leak that had me going for weeks was in a news wire corpus. I couldn't figure out why the cross validation was so good and running the classifier on new data was so much worse. The answer was that the training corpus had near-duplicate articles. This means that there was leakage between the training and test corpora. This wasn't quite a target leak, but it was a leak.

For target leaks, it is very common to have partial target leaks due to the fact that you learn more about positive cases after the moment that you had to select which case to investigate. Suppose, for instance, you are targeting potential customers based on very limited information. If you make an enticing offer to the people you target, then those who accept the offer will buy something from you. You will also learn some particulars such as name and address from those who buy from you. Looking retrospectively, it looks like you can target good customers who have names or addresses that are not null. Without a good snapshot of each customer record at exactly the time that the targeting was done, you cannot know that *all* customers have a null name and address before you target them. This sort of time machine leak can be enormously more subtle than this.

On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan gkhn...@gmail.com wrote:

> On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning ted.dunn...@gmail.com wrote:
>
>> On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi vishal.santo...@gmail.com wrote:
>>
>>> Are we to assume that SGD is still a work in progress and the implementations ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used ?
>>
>> They are too raw to be accepted uncritically, for sure. They have been used successfully in production.
>>
>>> The evolutionary algorithm seems to be the core of OnlineLogisticRegression, which in turn builds up to Adaptive/Cross Fold. b) for truly on-line learning where no repeated passes through the data .. What would it take to get to an implementation ? How can anyone help ?
>>
>> Would you like to help on this? The amount of work required to get a distributed asynchronous learner up is moderate, but definitely not huge.
>
> Ted, do you describe a generic distributed learner for all kinds of online algorithms? Possibly zookeeper-coordinated and with #predict and #getFeedbackAndUpdateTheModel methods?

I think that OnlineLogisticRegression is basically sound, but should get a better learning rate update equation. That would largely make the Adaptive* stuff unnecessary, especially if OLR could be used in the distributed asynchronous learner.
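Gokhan's suggestion of a generic online learner with #predict and #getFeedbackAndUpdateTheModel methods could be captured by an interface along these lines. This is purely illustrative; no such interface exists in Mahout, and the trivial implementation is only there to make the contract concrete.

```java
// Hypothetical contract for a distributed asynchronous online learner.
// Replicas would exchange model updates asynchronously, e.g. through a
// ZooKeeper-coordinated channel, while serving predictions locally.
interface OnlineLearner<F> {
    // Score an instance with the current (possibly stale) local model.
    double predict(F features);

    // Fold observed feedback into the model.
    void getFeedbackAndUpdateTheModel(F features, double observedLabel);
}

// Trivial implementation: predicts the running mean of observed labels.
class MeanLearner implements OnlineLearner<double[]> {
    double sum = 0;
    long n = 0;
    public double predict(double[] features) { return n == 0 ? 0.0 : sum / n; }
    public void getFeedbackAndUpdateTheModel(double[] features, double label) {
        sum += label;
        n++;
    }
}
```

Any learner fitting this contract, including OnlineLogisticRegression with a better rate update, could then be dropped into the same distributed harness.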
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
>>> We've been playing around with a number of different parameters, feature selection, etc. and are able to achieve pretty good results in cross-validation.

>> When you say cross validation, do you mean the magic cross validation that the ALR uses? Or do you mean your 20%?

> I mean the 20%. Does the ALR algorithm do its own cross validation? I was under the impression that it did training and testing steps with a percentage split based on the number of something (CrossFoldLearners?) in the object. Is that correct?

>>> As I said, we've been holding back 20% to do our own cross validation. We have a ton of different metrics we're tracking on the results; most significant to this discussion is that it looks like we're achieving very good precision (typically .85 or .9) and a good f1-score (typically again .85 or .9).

>> These are extremely good results. In fact they are good enough I would start thinking about a target leak.

> The possibility of a target leak is interesting, as it hadn't occurred to me previously. However, thinking it through, I'm less inclined to think it's a possibility. We wrote a simple program to extract the model features and weights, and I would think a leak would be obvious there, yes? The terms we're seeing seem to make sense.

>>> However, when we then take the models generated and try to apply them to some new documents, we're getting many more false positives than we would expect. Documents that should have 2 categories are testing positive for 16, which is well above what I'd expect. By my math I should expect 2 true positives, plus maybe 4.4 (.10 false positives * 44 classes) additional false positives.

>> You said documents. Where do these documents come from?

> Sorry, to clarify: all of our inputs are documents. Specifically, they're technical (scientific) papers written by people at our company. The documents are indexed in SOLR, and we use the Mahout lucene vector to extract our data. We started our development of this process a couple of months ago and took an extract from SOLR at that time. The new documents we're trying to classify after settling on a model are those that have come in to SOLR after that extraction took place.

>> One way to get results just like you describe is if you train on raw news wire that is split randomly between training and test. What can happen is that stories that get edited and republished have a high chance of getting at least one version in both training and test. This means that the supposedly independent test set actually has significant overlap with the training set. If your classifier over-fits, then the test set doesn't catch the problem.

> I don't believe this is happening, but it is worth checking into.

>> Another way to get this sort of problem is if you do your training/test split randomly, but the new documents come from a later time. If your classifier is a good classifier, but is highly specific to documents from a particular moment in time, then your test performance will be a realistic estimate of performance for contemporaneous documents but will be much higher than performance on documents from a later point in time.

> The temporal aspect is an interesting one. I will have to check on that.

>> A third option could happen if your training and test sets were somehow scrubbed of poorly structured and invalid documents. This often happens. Then, in the real system, if the scrubbing is not done, the classifier may fail because the new documents are not scrubbed in the same way as the training documents.

> I think we've handled this. I'm processing new documents programmatically through an analysis chain that I believe accurately mimics the one that I indexed against in SOLR. The results were complete garbage before I made them match exactly. In addition, wouldn't I expect more false negatives than false positives if that was the case?

Well, I think that, almost by definition, you have an overfitting problem of some kind.
The question is what kind. The only thing that I think you don't have is a frank target leak in your documents. That would (probably) have given you even higher scores on your test case. Is there any easy way to detect an overfit? We've noticed at least one interesting thing that seems to be typical of the bad models. For each class a percentage confidence score is reported. With our binary models obviously the choices are 0 or 1. The bad models tend to be very certain in their answers -- e.g. it's either 99% certain it is or isn't a particular class. Is that indicative of overfitting, or completely unrelated? THANKS! Ian
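One cheap, generic way to quantify the "99% certain" symptom Ian describes is to compare log loss between training and held-out predictions: overconfident wrong answers are punished heavily, so an overfit model's held-out log loss blows up even when its accuracy looks tolerable. A minimal, library-free sketch (names are illustrative, not Mahout's API):

```java
// Log loss as an overconfidence detector: a model that always says 0.99 or
// 0.01 pays a large penalty for each held-out example it gets wrong.
public class LogLossCheck {

    static double logLoss(double[] predictedProb, int[] actual) {
        double eps = 1e-15;  // clip probabilities to avoid log(0)
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            double p = Math.min(Math.max(predictedProb[i], eps), 1 - eps);
            sum += actual[i] == 1 ? -Math.log(p) : -Math.log(1 - p);
        }
        return sum / actual.length;
    }

    public static void main(String[] args) {
        // both models get 3 of 4 right, but the overconfident one is
        // punished far harder for its single confident mistake
        double[] overconfident = {0.99, 0.99, 0.99, 0.01};
        double[] calibrated    = {0.80, 0.80, 0.80, 0.40};
        int[] labels           = {1, 1, 1, 1};
        System.out.printf("overconfident: %.3f, calibrated: %.3f%n",
                logLoss(overconfident, labels), logLoss(calibrated, labels));
    }
}
```

If held-out log loss is much worse than training log loss while raw accuracy looks similar, the "very certain" behavior is indeed an overfitting symptom rather than a coincidence.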
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Gokhan On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi vishal.santo...@gmail.com Are we to assume that SGD is still a work in progress and implementations ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used ? They are too raw to be accepted uncritically, for sure. They have been used successfully in production. The evolutionary algorithm seems to be the core of OnlineLogisticRegression, which in turn builds up to Adaptive/Cross Fold. b) for truly on-line learning where no repeated passes through the data.. What would it take to get to an implementation ? How can any one help ? Would you like to help on this? The amount of work required to get a distributed asynchronous learner up is moderate, but definitely not huge. Ted, are you describing a generic distributed learner for all kinds of online algorithms? Possibly zookeeper-coordinated and with #predict and #getFeedbackAndUpdateTheModel methods? I think that OnlineLogisticRegression is basically sound, but should get a better learning rate update equation. That would largely make the Adaptive* stuff unnecessary, especially if OLR could be used in the distributed asynchronous learner.
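Gokhan's #predict / #getFeedbackAndUpdateTheModel idea can be restated as a Java interface. This is entirely hypothetical — a sketch of the shape being proposed, not anything that exists in Mahout:

```java
// Hypothetical interface for the generic distributed online learner being
// discussed: workers score incoming examples with predict(), then feed the
// observed outcome back so the shared model can be updated. Coordination of
// model state across workers (e.g. via ZooKeeper) is deliberately omitted.
public interface OnlineLearner<F, L> {

    // score an incoming example against the current model state
    L predict(F features);

    // incorporate the observed label and update the model parameters
    void getFeedbackAndUpdateTheModel(F features, L observedLabel);
}
```

Any online algorithm with this shape — OLR with a better rate schedule included — could then plug into the distributed asynchronous learner Ted describes.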
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Inline On Mon, Dec 2, 2013 at 8:55 AM, optimusfan optimus...@yahoo.com wrote: ... To accomplish this, we used AdaptiveLogisticRegression and trained 46 binary classification models. Our approach has been to do an 80/20 split on the data, holding the 20% back for cross-validation of the models we generate. Sounds reasonable. We've been playing around with a number of different parameters, feature selection, etc. and are able to achieve pretty good results in cross-validation. When you say cross validation, do you mean the magic cross validation that the ALR uses? Or do you mean your 20%? We have a ton of different metrics we're tracking on the results, most significant to this discussion is that it looks like we're achieving very good precision (typically .85 or .9) and a good f1-score (typically again .85 or .9). These are extremely good results. In fact they are good enough I would start thinking about a target leak. However, when we then take the models generated and try to apply them to some new documents, we're getting many more false positives than we would expect. Documents that should have 2 categories are testing positive for 16, which is well above what I'd expect. By my math I should expect 2 true positives, plus maybe 4.4 (.10 false positives * 44 classes) additional false positives. You said documents. Where do these documents come from? One way to get results just like you describe is if you train on raw news wire that is split randomly between training and test. What can happen is that stories that get edited and republished have a high chance of getting at least one version in both training and test. This means that the supposedly independent test set actually has significant overlap with the training set. If your classifier over-fits, then the test set doesn't catch the problem. Another way to get this sort of problem is if you do your training/test randomly, but the new documents come from a later time.
If your classifier is a good classifier, but is highly specific to documents from a particular moment in time, then your test performance will be a realistic estimate of performance for contemporaneous documents but will be much higher than performance on documents from a later point in time. A third option could happen if your training and test sets were somehow scrubbed of poorly structured and invalid documents. This often happens. Then, in the real system, if the scrubbing is not done, the classifier may fail because the new documents are not scrubbed in the same way as the training documents. These are just a few of the ways that *I* have screwed up building classifiers. I am sure that there are more. We suspected that perhaps our models were underfitting or overfitting, hence this post. However, I'll take any and all suggestions for anything else we should be looking at. Well, I think that, almost by definition, you have an overfitting problem of some kind. The question is what kind. The only thing that I think you don't have is a frank target leak in your documents. That would (probably) have given you even higher scores on your test case.
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Absolutely. I will read through. The idea is to first fix the learning rate update equation in OLR. I think this code in OnlineLogisticRegression is the current equation ? @Override public double currentLearningRate() { return mu0 * Math.pow(decayFactor, getStep()) * Math.pow(getStep() + stepOffset, forgettingExponent); } I presume that you would like an Adagrad-like solution to replace the above ? On Wed, Nov 27, 2013 at 8:18 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi vishal.santo...@gmail.com Are we to assume that SGD is still a work in progress and implementations ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used ? They are too raw to be accepted uncritically, for sure. They have been used successfully in production. The evolutionary algorithm seems to be the core of OnlineLogisticRegression, which in turn builds up to Adaptive/Cross Fold. b) for truly on-line learning where no repeated passes through the data.. What would it take to get to an implementation ? How can any one help ? Would you like to help on this? The amount of work required to get a distributed asynchronous learner up is moderate, but definitely not huge. I think that OnlineLogisticRegression is basically sound, but should get a better learning rate update equation. That would largely make the Adaptive* stuff unnecessary, especially if OLR could be used in the distributed asynchronous learner.
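For contrast with the global decaying rate above, the Adagrad-style replacement being discussed keeps one running sum of squared gradients per feature and shrinks each feature's rate independently. This sketch is illustrative only — the names and shapes are not Mahout's actual OnlineLogisticRegression internals:

```java
// Hypothetical Adagrad-style per-term learning rate: each feature j gets
// rate mu0 / sqrt(sum of its squared gradients), so frequently-updated
// features cool down quickly while rare features keep learning fast.
public class AdagradRate {
    private final double mu0;                    // base learning rate
    private final double[] sumSquaredGradients;  // one accumulator per feature

    AdagradRate(double mu0, int numFeatures) {
        this.mu0 = mu0;
        this.sumSquaredGradients = new double[numFeatures];
    }

    // accumulate the squared gradient for feature j after each training example
    void update(int j, double gradient) {
        sumSquaredGradients[j] += gradient * gradient;
    }

    // per-term learning rate; the small epsilon keeps the rate finite
    // before a feature has seen any gradient at all
    double perTermLearningRate(int j) {
        return mu0 / Math.sqrt(sumSquaredGradients[j] + 1e-8);
    }
}
```

Unlike the `currentLearningRate()` formula quoted above, nothing here depends on the global step count, which is what avoids the premature shutdown of the learning rate Ted describes.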
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Yes. Exactly. On Thu, Nov 28, 2013 at 6:32 AM, Vishal Santoshi vishal.santo...@gmail.com wrote: Absolutely. I will read through. The idea is to first fix the learning rate update equation in OLR. I think this code in OnlineLogisticRegression is the current equation ? @Override public double currentLearningRate() { return mu0 * Math.pow(decayFactor, getStep()) * Math.pow(getStep() + stepOffset, forgettingExponent); } I presume that you would like an Adagrad-like solution to replace the above ? On Wed, Nov 27, 2013 at 8:18 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi vishal.santo...@gmail.com Are we to assume that SGD is still a work in progress and implementations ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used ? They are too raw to be accepted uncritically, for sure. They have been used successfully in production. The evolutionary algorithm seems to be the core of OnlineLogisticRegression, which in turn builds up to Adaptive/Cross Fold. b) for truly on-line learning where no repeated passes through the data.. What would it take to get to an implementation ? How can any one help ? Would you like to help on this? The amount of work required to get a distributed asynchronous learner up is moderate, but definitely not huge. I think that OnlineLogisticRegression is basically sound, but should get a better learning rate update equation. That would largely make the Adaptive* stuff unnecessary, especially if OLR could be used in the distributed asynchronous learner.
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Hell Ted, Are we to assume that SGD is still a work in progress and implementations ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used ? The evolutionary algorithm seems to be the core of OnlineLogisticRegression, which in turn builds up to Adaptive/Cross Fold. b) for truly on-line learning where no repeated passes through the data.. What would it take to get to an implementation ? How can any one help ? Regards, On Wed, Nov 27, 2013 at 2:26 AM, Ted Dunning ted.dunn...@gmail.com wrote: Well, first off, let me say that I am much less of a fan now of the magical cross validation approach and adaptation based on that than I was when I wrote the ALR code. There are definitely legs in the ideas, but my implementation has a number of flaws. For example: a) the way that I provide for handling multiple passes through the data is very easy to screw up. I think that simply separating the data entirely might be a better approach. b) for truly on-line learning where no repeated passes through the data will ever occur, then cross validation is not the best choice. Much better in those cases to use what Google researchers described in [1]. c) it is clear from several reports that the evolutionary algorithm prematurely shuts down the learning rate. I think that Adagrad-like learning rates are more reliable. See [1] again for one of the more readable descriptions of this. See also [2] for another view on adaptive learning rates. d) item (c) is also related to the way that learning rates are adapted in the underlying OnlineLogisticRegression. That needs to be fixed. e) asynchronous parallel stochastic gradient descent with mini-batch learning is where we should be headed. I do not have time to write it, however. All this aside, I am happy to help in any way that I can given my recent time limits. 
[1] http://research.google.com/pubs/pub41159.html [2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf On Tue, Nov 26, 2013 at 12:54 PM, optimusfan optimus...@yahoo.com wrote: Hi- We're currently working on a binary classifier using Mahout's AdaptiveLogisticRegression class. We're trying to determine whether or not the models are suffering from high bias or variance and were wondering how to do this using Mahout's APIs? I can easily calculate the cross validation error and I think I could detect high bias or variance if I could compare that number to my training error, but I'm not sure how to do this. Or, any other ideas would be appreciated! Thanks, Ian
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Sorry to spam, I never meant the Hello to come out as Hell. Given a little disappointment in the mail, I figure I'd rather spam than be misunderstood. On Wed, Nov 27, 2013 at 10:07 AM, Vishal Santoshi vishal.santo...@gmail.com wrote: Hell Ted, Are we to assume that SGD is still a work in progress and implementations ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used ? The evolutionary algorithm seems to be the core of OnlineLogisticRegression, which in turn builds up to Adaptive/Cross Fold. b) for truly on-line learning where no repeated passes through the data.. What would it take to get to an implementation ? How can any one help ? Regards, On Wed, Nov 27, 2013 at 2:26 AM, Ted Dunning ted.dunn...@gmail.com wrote: Well, first off, let me say that I am much less of a fan now of the magical cross validation approach and adaptation based on that than I was when I wrote the ALR code. There are definitely legs in the ideas, but my implementation has a number of flaws. For example: a) the way that I provide for handling multiple passes through the data is very easy to screw up. I think that simply separating the data entirely might be a better approach. b) for truly on-line learning where no repeated passes through the data will ever occur, then cross validation is not the best choice. Much better in those cases to use what Google researchers described in [1]. c) it is clear from several reports that the evolutionary algorithm prematurely shuts down the learning rate. I think that Adagrad-like learning rates are more reliable. See [1] again for one of the more readable descriptions of this. See also [2] for another view on adaptive learning rates. d) item (c) is also related to the way that learning rates are adapted in the underlying OnlineLogisticRegression. That needs to be fixed. e) asynchronous parallel stochastic gradient descent with mini-batch learning is where we should be headed. I do not have time to write it, however.
All this aside, I am happy to help in any way that I can given my recent time limits. [1] http://research.google.com/pubs/pub41159.html [2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf On Tue, Nov 26, 2013 at 12:54 PM, optimusfan optimus...@yahoo.com wrote: Hi- We're currently working on a binary classifier using Mahout's AdaptiveLogisticRegression class. We're trying to determine whether or not the models are suffering from high bias or variance and were wondering how to do this using Mahout's APIs? I can easily calculate the cross validation error and I think I could detect high bias or variance if I could compare that number to my training error, but I'm not sure how to do this. Or, any other ideas would be appreciated! Thanks, Ian
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
No problem at all. Kind of funny. On Wed, Nov 27, 2013 at 7:08 AM, Vishal Santoshi vishal.santo...@gmail.comwrote: Sorry to spam, I never meant the Hello to come out as Hell. Given a little disappointment in the mail, I figure I rather spam than be misunderstood, On Wed, Nov 27, 2013 at 10:07 AM, Vishal Santoshi vishal.santo...@gmail.com wrote: Hell Ted, Are we to assume that SGD is still a work in progress and implementations ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used ? The evolutionary algorithm seems to be the core of OnlineLogisticRegression, which in turn builds up to Adaptive/Cross Fold. b) for truly on-line learning where no repeated passes through the data.. What would it take to get to an implementation ? How can any one help ? Regards, On Wed, Nov 27, 2013 at 2:26 AM, Ted Dunning ted.dunn...@gmail.com wrote: Well, first off, let me say that I am much less of a fan now of the magical cross validation approach and adaptation based on that than I was when I wrote the ALR code. There are definitely legs in the ideas, but my implementation has a number of flaws. For example: a) the way that I provide for handling multiple passes through the data is very easy to screw up. I think that simply separating the data entirely might be a better approach. b) for truly on-line learning where no repeated passes through the data will ever occur, then cross validation is not the best choice. Much better in those cases to use what Google researchers described in [1]. c) it is clear from several reports that the evolutionary algorithm prematurely shuts down the learning rate. I think that Adagrad-like learning rates are more reliable. See [1] again for one of the more readable descriptions of this. See also [2] for another view on adaptive learning rates. d) item (c) is also related to the way that learning rates are adapted in the underlying OnlineLogisticRegression. That needs to be fixed. 
e) asynchronous parallel stochastic gradient descent with mini-batch learning is where we should be headed. I do not have time to write it, however. All this aside, I am happy to help in any way that I can given my recent time limits. [1] http://research.google.com/pubs/pub41159.html [2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf On Tue, Nov 26, 2013 at 12:54 PM, optimusfan optimus...@yahoo.com wrote: Hi- We're currently working on a binary classifier using Mahout's AdaptiveLogisticRegression class. We're trying to determine whether or not the models are suffering from high bias or variance and were wondering how to do this using Mahout's APIs? I can easily calculate the cross validation error and I think I could detect high bias or variance if I could compare that number to my training error, but I'm not sure how to do this. Or, any other ideas would be appreciated! Thanks, Ian
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi vishal.santo...@gmail.com Are we to assume that SGD is still a work in progress and implementations ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used ? They are too raw to be accepted uncritically, for sure. They have been used successfully in production. The evolutionary algorithm seems to be the core of OnlineLogisticRegression, which in turn builds up to Adaptive/Cross Fold. b) for truly on-line learning where no repeated passes through the data.. What would it take to get to an implementation ? How can any one help ? Would you like to help on this? The amount of work required to get a distributed asynchronous learner up is moderate, but definitely not huge. I think that OnlineLogisticRegression is basically sound, but should get a better learning rate update equation. That would largely make the Adaptive* stuff unnecessary, especially if OLR could be used in the distributed asynchronous learner.
Detecting high bias and variance in AdaptiveLogisticRegression classification
Hi- We're currently working on a binary classifier using Mahout's AdaptiveLogisticRegression class. We're trying to determine whether or not the models are suffering from high bias or variance and were wondering how to do this using Mahout's APIs? I can easily calculate the cross validation error and I think I could detect high bias or variance if I could compare that number to my training error, but I'm not sure how to do this. Or, any other ideas would be appreciated! Thanks, Ian
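The comparison the question asks for can be sketched without any Mahout API: compute the error rate on the training data, compute it on the held-out (or cross-validation) data, and look at the gap. The thresholds and names below are made up for illustration:

```java
// Rule-of-thumb diagnosis from a train/held-out error pair:
//   high bias (underfitting)  -> both errors are high and close together;
//   high variance (overfitting) -> training error is low but held-out error
//   is much higher.
public class BiasVarianceCheck {

    static String diagnose(double trainError, double heldOutError,
                           double acceptableError, double gapThreshold) {
        if (heldOutError - trainError > gapThreshold) {
            return "high variance (overfitting)";
        }
        if (trainError > acceptableError) {
            return "high bias (underfitting)";
        }
        return "looks ok";
    }

    public static void main(String[] args) {
        System.out.println(diagnose(0.02, 0.25, 0.10, 0.05)); // large gap
        System.out.println(diagnose(0.30, 0.32, 0.10, 0.05)); // both high
    }
}
```

Getting `trainError` just means running the trained model's classify step back over the training vectors and counting mistakes, the same way the 20% hold-out is scored.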
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Well, first off, let me say that I am much less of a fan now of the magical cross validation approach and adaptation based on that than I was when I wrote the ALR code. There are definitely legs in the ideas, but my implementation has a number of flaws. For example: a) the way that I provide for handling multiple passes through the data is very easy to screw up. I think that simply separating the data entirely might be a better approach. b) for truly on-line learning where no repeated passes through the data will ever occur, then cross validation is not the best choice. Much better in those cases to use what Google researchers described in [1]. c) it is clear from several reports that the evolutionary algorithm prematurely shuts down the learning rate. I think that Adagrad-like learning rates are more reliable. See [1] again for one of the more readable descriptions of this. See also [2] for another view on adaptive learning rates. d) item (c) is also related to the way that learning rates are adapted in the underlying OnlineLogisticRegression. That needs to be fixed. e) asynchronous parallel stochastic gradient descent with mini-batch learning is where we should be headed. I do not have time to write it, however. All this aside, I am happy to help in any way that I can given my recent time limits. [1] http://research.google.com/pubs/pub41159.html [2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf On Tue, Nov 26, 2013 at 12:54 PM, optimusfan optimus...@yahoo.com wrote: Hi- We're currently working on a binary classifier using Mahout's AdaptiveLogisticRegression class. We're trying to determine whether or not the models are suffering from high bias or variance and were wondering how to do this using Mahout's APIs? I can easily calculate the cross validation error and I think I could detect high bias or variance if I could compare that number to my training error, but I'm not sure how to do this. Or, any other ideas would be appreciated! Thanks, Ian