Re: [Rd] application to mentor syrfr package development for Google Summer of Code 2010

2010-03-10 Thread James Salsman
Michael,

Thanks for your reply with the information about the Eureqa API -- I
am forwarding it to the r-devel list below.

Dirk,

Will you please agree to referring to the syrfr package as symbolic
genetic algorithm regression of functions but not (yet) general
relations?  It would be best to refer to general relation regression
as a future package, something like 'syrr' and leave the
parametrization of the derivatives to that package.

May I please mentor, in consultation with Michael if necessary, work
on general function regressions while Chillu and John Nash work on the
derivative package necessary for general relation regressions?  Thank
you for your kind consideration.

Best regards,
James Salsman


On Tue, Mar 9, 2010 at 11:06 AM, Michael Schmidt md...@cornell.edu wrote:
 I think it's a great idea worth trying out. We have always done significance
 tests just on the final frontier of models as a post processing step. Moving
 this into the algorithm could focus the search more on significant higher
 quality solutions. One thing to beware of though is that using parsimony
 pressure just on the number of free parameters tends to focus the search on
 complex equations with no free parameters. So, some care should be taken how
 to implement it.

 We do see a measurable improvement using crossover versus just mutation on
 random test problems. Empirically, it doesn't seem necessary for all
 problems but also doesn't seem to ever inhibit the search.

 I didn't know that anyone was working on a SR package for R. Very cool! I'm
 happy to consult if you have any questions I can help with.

 You may also be interested that we just recently opened up the API for
 interacting with Eureqa servers:
 http://code.google.com/p/eureqa-api/
 If you know of anyone that might be interested in making a wrapper for R,
 please forward.

 Michael


 On Mon, Mar 8, 2010 at 5:45 PM, James Salsman jsals...@talknicer.com
 wrote:

 Michael,

 Thanks for your reply:

 On Mon, Mar 8, 2010 at 12:41 AM, Michael Schmidt md...@cornell.edu
 wrote:
 
  Thanks for contacting me. Eureqa takes into account the total size of an
  equation when comparing different candidate models. It attempts to find
  the
  set of possible equations that are non-dominated in both error and size.
  The
  final results is a short list consisting of the most accurate equation
  for
  increasing equation sizes.
 
  This is closely related to degrees of freedom, but not exactly the same

 That's very good, but I wonder whether we can perform automatic
 outlier exclusion that way.  We would need to keep the confidence
 interval, or at least the information necessary to derive it, accurate
 in every step of the genetic beam search.  Since the confidence
 intervals of extrapolation depend so heavily on the number of degrees
 of freedom of the fit (along with the residual standard error) it's a
 good idea to use a degree-of-freedom-adjusted F statistic instead of a
 post-hoc combination of equation size and residual standard error, I
 would think.  You might want to try it and see how it improves things.
  Confidence intervals, by representing the goodness of fit in the
 original units and domain of the dependent variable, are tremendously
 useful and sometimes make many kinds of tests which would otherwise be
 very laborious easy to eyeball.

 Being able to fit curves to one-to-many relations instead of strict
 one-to-one functions appeals to those working in the imaging domain,
 but not to as many traditional non-image statisticians. Regressing
 functions usually results in well-defined confidence intervals, but
 regressing general relations with derivatives produces confidence
 intervals which can also be relations.  Trying to figure out a
 spiral-shaped confidence interval probably appeals to astronomers more
 than most people.  So I am proposing that, for R's contemplated
 'syrfr' symbolic regression package, we do functions in a general
 genetic beam search framework, Chillu and John Nash can do derivatives
 in the new 'adinr' package, and then we can try to put them together,
 extend the syrfr package with a parameter indicating to fit relations
 with derivatives instead of functions, to try to replicate your work
 on Eureqa using d.o.f-adjusted F statistics as a heuristic beam search
 evaluation function.

 Have you quantified the extent to which using the crossover rule in
 the equation tree search is an improvement over mutation alone in
 symbolic regression?  I am glad that Chillu and Dirk have already
 supported that; there is no denying its utility.

 Would you like to co-mentor this project?
 http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
 I've already stepped forward, so you could do as much or as little as
 you like if you wanted to co-mentor and Dirk agreed to that
 arrangement.

 Best regards,
 James Salsman


  On Mon, Mar 8, 2010 at 2:49 AM, James Salsman jsals...@talknicer.com
  wrote:
 
  I meant that development

[Rd] application to mentor syrfr package development for Google Summer of Code 2010

2010-03-07 Thread James Salsman
Per http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010
-- and http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
-- I am applying to mentor the Symbolic Regression for R (syrfr)
package for the Google Summer of Code 2010.

I propose the following test which an applicant would have to pass in
order to qualify for the topic:

1. Describe each of the following terms as they relate to statistical
regression: categorical, periodic, modular, continuous, bimodal,
log-normal, logistic, Gompertz, and nonlinear.

2. Explain which parts of http://bit.ly/tablecurve were adopted in
SigmaPlot and which weren't.

3. Use the 'outliers' package to improve a regression fit maintaining
the correct extrapolation confidence intervals as are between those
with and without outlier exclusions in proportion to the confidence
that the outliers were reasonably excluded.  (Show your R transcript.)

4. Explain the relationship between degrees of freedom and correlated
independent variables.

Best regards,

James Salsman
jsals...@talknicer.com
http://talknicer.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] application to mentor syrfr package development for Google Summer of Code 2010

2010-03-07 Thread James Salsman
Chillu,

If I understand your concern, you want to lay the foundation for
derivatives so that you can implement the search strategies described
in Schmidt and Lipson (2010) --
http://www.springerlink.com/content/l79v2183725413w0/ -- is that
right? It is not clear to me how well this generalized approach will
work in practice, but there is no reason not to proceed in parallel to
establish a framework under which you could implement the metrics
proposed by Schmidt and Lipson in the contemplated syrfr package.

I have expanded the test I proposed with two more questions -- at
http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
-- specifically:

5. Critique http://sites.google.com/site/gptips4matlab/

6. Use anova to compare the goodness-of-fit of a SSfpl nls fit with a
linear model of your choice. How can your characterize the
degree-of-freedom-adjusted goodness of fit of nonlinear models?

I believe pairwise anova.nls is the optimal comparison for nonlinear
models, but there are several good choices for approximations,
including the residual standard error, which I believe can be adjusted
for degrees of freedom, as can the F statistic which TableCurve uses;
see: http://en.wikipedia.org/wiki/F-test#Regression_problems

Best regards,
James Salsman


On Sun, Mar 7, 2010 at 7:35 PM, Chidambaram Annamalai
quantumeli...@gmail.com wrote:
 It's been a while since I proposed syrfr and I have been constantly in
 contact with the many people in the R community and I wasn't able to find a
 mentor for the project. I later got interested in the Automatic
 Differentiation proposal (adinr) and, on consulting with a few others within
 the R community, I mailed John Nash (who proposed adinr in the first place)
 if he'd be willing to take me up on the project. I got a positive reply only
 a few hours ago and it was my mistake to have not removed the syrfr proposal
 in time from the wiki, as being listed under proposals looking for mentors.

 While I appreciate your interest in the syrfr proposal I am afraid my
 allegiances have shifted towards the adinr proposal, as I got convinced that
 it might interest a larger group of people and it has wider scope in
 general.

 I apologize for having caused this trouble.

 Best Regards,
 Chillu

 On Mon, Mar 8, 2010 at 6:41 AM, James Salsman jsals...@talknicer.com
 wrote:

 Per http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010
 -- and
 http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
 -- I am applying to mentor the Symbolic Regression for R (syrfr)
 package for the Google Summer of Code 2010.

 I propose the following test which an applicant would have to pass in
 order to qualify for the topic:

 1. Describe each of the following terms as they relate to statistical
 regression: categorical, periodic, modular, continuous, bimodal,
 log-normal, logistic, Gompertz, and nonlinear.

 2. Explain which parts of http://bit.ly/tablecurve were adopted in
 SigmaPlot and which weren't.

 3. Use the 'outliers' package to improve a regression fit maintaining
 the correct extrapolation confidence intervals as are between those
 with and without outlier exclusions in proportion to the confidence
 that the outliers were reasonably excluded.  (Show your R transcript.)

 4. Explain the relationship between degrees of freedom and correlated
 independent variables.

 Best regards,

 James Salsman
 jsals...@talknicer.com
 http://talknicer.com

 __
 R-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-devel



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] application to mentor syrfr package development for Google Summer of Code 2010

2010-03-07 Thread James Salsman
Chillu, I meant that development on both a syrfr R package capable of
using either F statistics or parametric derivatives should proceed in
parallel with your work on such a derivatives package. You are right
that genetic algorithm search (and general best-first search --
http://en.wikipedia.org/wiki/Best-first_search -- of which genetic
algorithms are various special cases) can be very effectively
parallelized, too.

In any case, thank you for pointing out Eureqa --
http://ccsl.mae.cornell.edu/eureqa -- but I can see no evidence there
or in the user manual or user forums that Eureqa is considering
degrees of freedom in its goodness-of-fit estimation.  That is a
serious problem which will typically result in invalid symbolic
regression.  I am sending this message also to Michael Schmidt so that
he might be able to comment on the extent to which Eureqa adjusts for
degrees of freedom in his fit evaluations.

Best regards,
James Salsman

On Sun, Mar 7, 2010 at 10:39 PM, Chidambaram Annamalai
quantumeli...@gmail.com wrote:

 If I understand your concern, you want to lay the foundation for
 derivatives so that you can implement the search strategies described
 in Schmidt and Lipson (2010) --
 http://www.springerlink.com/content/l79v2183725413w0/ -- is that
 right?

 Yes. Basically traditional naive error estimators or fitness functions
 fail miserably when used in SR with implicit equations because they
 immediately close in on best fits like f(x) = x - x and other trivial
 solutions. In such cases no amount of regularization and complexity
 penalizing methods will help since x - x is fairly simple by most measures
 of complexity and it does have zero error. So the paper outlines such
 problems associated with direct error estimators and thus they infer the
 triviality of the fit by probing its estimates around nearby points and
 seeing if it does follow the pattern dictated by the data points -- ergo
 derivatives.

 Also, somewhat like a side benefit, this method also enables us to perform
 regression on closed loops and other implicit equations since the fitness
 functions are based only on derivatives. The specific form of the error is
 equation 1.2 which is what, I believe, comprises of the internals of the
 evaluation procedure used in Eureqa.

 You are correct in pointing out that there is no reason to not work in
 parallel, since GAs generally have a more or less fixed form
 (evaluate-reproduce cycle) which is quite easily parallelized. I have used
 OpenMP in the past, in which it is fairly trivial to parallelize well formed
 for loops.

 Chillu

 It is not clear to me how well this generalized approach will
 work in practice, but there is no reason not to proceed in parallel to
 establish a framework under which you could implement the metrics
 proposed by Schmidt and Lipson in the contemplated syrfr package.

 I have expanded the test I proposed with two more questions -- at
 http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
 -- specifically:

 5. Critique http://sites.google.com/site/gptips4matlab/

 6. Use anova to compare the goodness-of-fit of a SSfpl nls fit with a
 linear model of your choice. How can your characterize the
 degree-of-freedom-adjusted goodness of fit of nonlinear models?

 I believe pairwise anova.nls is the optimal comparison for nonlinear
 models, but there are several good choices for approximations,
 including the residual standard error, which I believe can be adjusted
 for degrees of freedom, as can the F statistic which TableCurve uses;
 see: http://en.wikipedia.org/wiki/F-test#Regression_problems

 Best regards,
 James Salsman


 On Sun, Mar 7, 2010 at 7:35 PM, Chidambaram Annamalai
 quantumeli...@gmail.com wrote:
  It's been a while since I proposed syrfr and I have been constantly in
  contact with the many people in the R community and I wasn't able to
  find a
  mentor for the project. I later got interested in the Automatic
  Differentiation proposal (adinr) and, on consulting with a few others
  within
  the R community, I mailed John Nash (who proposed adinr in the first
  place)
  if he'd be willing to take me up on the project. I got a positive reply
  only
  a few hours ago and it was my mistake to have not removed the syrfr
  proposal
  in time from the wiki, as being listed under proposals looking for
  mentors.
 
  While I appreciate your interest in the syrfr proposal I am afraid my
  allegiances have shifted towards the adinr proposal, as I got convinced
  that
  it might interest a larger group of people and it has wider scope in
  general.
 
  I apologize for having caused this trouble.
 
  Best Regards,
  Chillu
 
  On Mon, Mar 8, 2010 at 6:41 AM, James Salsman jsals...@talknicer.com
  wrote:
 
  Per http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010
  -- and
 
  http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
  -- I am applying to mentor the Symbolic Regression for R (syrfr)
  package