Re: [boost] RE: Any interest in a stats class
At Tuesday 2003/02/25 09:10, you wrote:

Please remember that stats can be more general. I frequently use stats for complex types. In that case, mean is also complex, but var is scalar. The proposed implementation doesn't address this.

You sure lost me. Would you care to point out _where_ the proposed implementation lacks?

Victor A. Wagner Jr. http://rudbek.com
The five most dangerous words in the English language: "There oughta be a law"
___
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
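The point about complex types can be made concrete with a small sketch. It is not code from the thread: `complex_summary` and `summarize` are hypothetical names, and the variance is taken here to mean the average of |z - mean|^2 with an n - 1 divisor, which is one common convention and an assumption on my part. Under that reading the mean is complex while the variance is a real scalar, whereas accumulating `x*x` on complex samples would yield a complex "sum of squares".

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical result type: complex mean, real-valued variance.
struct complex_summary {
    std::complex<double> mean;
    double variance;  // average of |z - mean|^2, n - 1 divisor (assumed convention)
};

// Two-pass computation over complex samples.
complex_summary summarize(const std::vector<std::complex<double> >& zs)
{
    assert(zs.size() > 1);  // the n - 1 divisor needs at least two samples

    std::complex<double> sum(0.0, 0.0);
    for (std::size_t i = 0; i < zs.size(); ++i) {
        sum += zs[i];
    }
    const std::complex<double> mean = sum / static_cast<double>(zs.size());

    double squared_dev = 0.0;
    for (std::size_t i = 0; i < zs.size(); ++i) {
        // std::norm returns the squared magnitude, which is always real
        squared_dev += std::norm(zs[i] - mean);
    }

    complex_summary result = { mean, squared_dev / static_cast<double>(zs.size() - 1) };
    return result;
}
```

A class template with a single `value_type` for both mean and variance cannot express this split, which may be what the original remark was getting at.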
[boost] Re: Any interest in a stats class
Somewhere in the E.U., 25/02/2003

Hello

In article [EMAIL PROTECTED], Jason D Schmidt [EMAIL PROTECTED] wrote:

I know this is well after the discussion on the stats class has ended, but I think I have a good idea here. Scott Kirkwood proposed a class that behaves something like this:

    stats myStats;
    for (int i = 0; i < 100; ++i) {
        myStats.add(i);
    }
    cout << "Average: " << myStats.getAverage() << "\n";
    cout << "Max: " << myStats.getMax() << "\n";
    cout << "Standard deviation: " << myStats.getStd() << "\n";

In one of my classes in grad school, I found it quite useful and efficient to do statistics on the fly like this, so this stats class interests me. Anyway, Scott has already alluded to the point I'm about to make. I think it's important and useful for this stats class to integrate with the STL well. This example code was inspired by the PointAverage example from Effective STL p. 161:

    #include <algorithm>  // std::for_each
    #include <cmath>      // std::sqrt
    #include <cstddef>    // std::size_t

    // this class reports statistics
    template <typename value_type>
    class stats
    {
    public:
        stats(const std::size_t n, const value_type sum, const value_type sum_sqr):
            m_n(n), m_sum(sum), m_sum_sqr(sum_sqr)
        {}
        value_type sum() const
        { return m_sum; }
        value_type mean() const
        { return m_sum/m_n; }
        // sample variance (n-1 divisor); needs at least two samples
        value_type var() const
        { return (m_sum_sqr - m_sum*m_sum/m_n) / (m_n-1); }
        value_type delta() const // aka, standard dev
        { return std::sqrt(var()); }
    private:
        // the count is kept as value_type so the divisions are not integer divisions
        value_type m_n, m_sum, m_sum_sqr;
    };

    // this class accumulates results that can be used to
    // compute meaningful statistics
    // (no std::unary_function base: it was removed in C++17 and
    // std::for_each does not need it)
    template <typename value_type>
    class stats_accum
    {
    public:
        stats_accum(): n(0), sum(0), sum_sqr(0)
        {}
        // use this to operate on each value in a range
        void operator()(const value_type& x)
        {
            ++n;
            sum += x;
            sum_sqr += x*x;
        }
        stats<value_type> result() const
        { return stats<value_type>(n, sum, sum_sqr); }
    private:
        std::size_t n;
        value_type sum, sum_sqr;
    };

    int main()
    {
        typedef float value_type;
        const std::size_t n(10);
        value_type f[n] = {0, 2, 3, 4, 5, 6, 7, 8, 9, 8};
        // accumulate stats over a range of iterators;
        // std::for_each returns the functor it was given
        const stats<value_type> my_stats =
            std::for_each(f, f+n, stats_accum<value_type>()).result();
        const value_type m = my_stats.mean();
        const value_type d = my_stats.delta(); // aka, standard deviation
        return 0;
    }

In this example, what is the advantage over filling a valarray and using a stat class which takes that as a constructor argument? You would get sum for free, and hopefully (yeah, right...) operations on valarrays could be hardware accelerated, whereas direct coding might not be. That is, at least, one of the ideas I encoded in the file I just uploaded on Yahoo (statistical_descriptor.h.gz).

This seems to be pretty similar to what Scott has proposed, and it turns out that this method is very fast. In my tests it has been nearly as fast as if we got rid of the classes and used a hand-written loop. It's certainly much faster than storing the data in a std::valarray object, and using functions that calculate the mean and standard deviation separately. This is just a neat application of Scott's idea. I think this stats class could be pretty useful for scientific computing, and in this example it works very well with the STL and has great performance. I'd like to see more code like this in Boost, but most of my work is numerical. Take my opinion or leave it.

Jason Schmidt

I agree with you that if the cardinality of the population is not known then your approach is still usable whereas mine is not realistic. But in that case you might have to reset the class periodically (if you are doing statistics on the fly and want to just test a sample). Your method might also be useful when the amount of data is too big to be properly placed at once in memory.

So, we need classes for sequences, either in memory or via some iterator, one dimensional or multi dimensional, and we also need classes for (experimental) densities. We also need generators for the usual densities. Since we already have implementations of random, we should hitch our code to it. This also ties in with the request for special functions such as erf.

Since we now have uBlas, we can also try to aim for more complex statistical constructs such as Gaussian Mixture Models, though to train the Neural Networks which produce them, we also need good optimisation code, which we lack completely at present (and which in turn usually needs some LA code). Anybody want to try to get the COOOL (http://coool.mines.edu/) people aboard Boost?

See you soon

Hubert Holin
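For comparison, the valarray alternative raised in this message might look roughly like the sketch below. It is not the contents of statistical_descriptor.h.gz, which is not reproduced in the thread; `summary` and `describe` are hypothetical names, and the n - 1 divisor is an assumption. Only standard `std::valarray` operations are used (`sum()`, valarray minus scalar, and `operator*=`).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <valarray>

// Hypothetical result type for this sketch.
struct summary {
    double mean;
    double std_dev;  // sample standard deviation, n - 1 divisor (assumed)
};

// Whole-array computation: all the data must already be in memory.
summary describe(const std::valarray<double>& v)
{
    assert(v.size() > 1);  // the n - 1 divisor needs at least two samples
    const double n = static_cast<double>(v.size());

    const double mean = v.sum() / n;   // sum() comes for free with valarray

    std::valarray<double> d = v - mean;  // element-wise deviation from the mean
    d *= d;                              // element-wise square
    const double squared_dev = d.sum();

    summary s = { mean, std::sqrt(squared_dev / (n - 1.0)) };
    return s;
}
```

The trade-off discussed above shows up directly: this version needs the population size and the whole data set up front, while the accumulator functor can be fed one value at a time.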
RE: [boost] Re: Any interest in a stats class
http://scicomp.ewha.ac.kr/netlib/cephes/ for example, but many others according to Google.

(My attempts at using F2C were less than satisfying from a style point of view. NOT Fortran to C++, if one wanted that ...)

Paul

Dr Paul A Bristow, hetp Chromatography
Prizet Farmhouse, Kendal, Cumbria, LA8 8AB UK
+44 1539 561830 Mobile +44 7714 33 02 04
mailto:[EMAIL PROTECTED]

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Hubert Holin
Sent: Monday, February 17, 2003 12:35 PM
To: [EMAIL PROTECTED]
Subject: [boost] Re: Any interest in a stats class

Somewhere in the E.U., 17/02/2003

In article [EMAIL PROTECTED], Paul A. Bristow [EMAIL PROTECTED] wrote:

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Hubert Holin
Sent: Friday, February 14, 2003 1:25 PM
To: [EMAIL PROTECTED]
Subject: [boost] Re: Any interest in a stats class

Somewhere in the E.U., 14/02/2003

There still is the question of whether similarity with NR is a problem or not (the language in which the techniques are implemented is different, but implementations of the techniques themselves are of course basically similar since they refer to the same math construction).

I cannot see this being a serious problem unless we simply lift the NR in C++ code verbatim. (Most of it is still in old C style for one thing, despite the recent reissue).

Yes, on that front we should be safe, but then IANAL...

I am hoping that with uBlas, we can contribute more numerical stuff. I have some Gaussian Mixture Models code that I should be rewriting in the not too distant future (currently based on an old version of TNT, and most of the important pre-processing needed has to be done elsewhere, for the then lack of svd).

This would be a most welcome development. uBLAS seems a good starting point.

My old files provide number_of_samples, max, min, first_max_index, first_min_index, mean, median, variance, standard_deviation, average_deviation, skewness and kurtosis for sequences (where appropriate), number_of_bins, mass, first_mode_value, first_mode, mean, median, variance, standard_deviation, average_deviation, skewness and kurtosis for densities (where appropriate).

Sounds a pretty good selection.

I'll upload my old file in a moment, for inspirational input, and make a note in the Wiki, if I can get that to work.

Finally, there is the unsolved matter of the math functions we still badly need.

Err, I kind of forgot which ones were requested...

Well, all the items in Stephen Moshier's Cephes collection, say: erf, gamma, beta, incomplete, gaussian etc etc. However, we didn't seem to get far with agreeing the format for these. My naive assumption that double erf(double) style functions would be enough was criticised by those who wanted fancier solutions, some far fancier.

I either forgot or missed that thread (I lost quite a bit of data and hence memory during my OS upgrade, thanks to a faulty ftp server...). Would you have a pointer handy?

In my view getting this far would be a major step forward. There are major problems in accuracy even at double, let alone long double. There was also talk of an NIST project but I haven't heard of any progress yet.

I just checked the DLMF website (http://dlmf.nist.gov/), and it seems they are moving forward albeit slowly (book and free web document in 2004). At any rate, that document will not, as I understand, include actual implementation in a computer language of the functions et al., so we should just go ahead and code, perhaps using existing Fortran implementations as guidelines (though obviously having the document would make the coding *MUCH* easier :-) ).

Paul

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jeff Garland
Sent: Tuesday, February 11, 2003 4:19 PM
To: Boost mailing list
Subject: RE: [boost] Any interest in a stats class

Scott K wrote:

Hi all, I have a small family of statistics classes which I have used from time to time. The one I use most often is simply called stats. Here's an example of its use:

...details snipped...

I'm sure there are folks interested in statistical (and other) functions. I've developed exactly this sort of class in the past so I understand the utility. However, I suspect some of us would hope statistical algorithms to be formulated as STL Algorithm extensions. Specifically concerning statistics see:

http://www.crystalclearsoftware.com/cgi-bin/boost_wiki/wiki.pl?STLAlgorithmExtensions/StatisticsAlgorithms

and more generally:

http://www.crystalclearsoftware.com/cgi-bin/boost_wiki
[boost] Re: Any interest in a stats class
Somewhere in the E.U., 14/02/2003

Hello

In article [EMAIL PROTECTED], Paul A. Bristow [EMAIL PROTECTED] wrote:

Stats are definitely a must-have for Boost, but as ever, the presentation is not so easy to agree upon.

I agree statistical utilities are a must. As many of us likely do, I have a few things I can contribute, which I needed for some past work (to work with (multi-dimensional) sequences of values, and with densities of distributions).

There still is the question of whether similarity with NR is a problem or not (the language in which the techniques are implemented is different, but implementations of the techniques themselves are of course basically similar since they refer to the same math construction).

I am hoping that with uBlas, we can contribute more numerical stuff. I have some Gaussian Mixture Models code that I should be rewriting in the not too distant future (currently based on an old version of TNT, and most of the important pre-processing needed has to be done elsewhere, for the then lack of svd).

But it is also crucial to get the most accurate answer, and be able to prove it. For example, B D McCullough, American Statistician Nov 1998 52(4), 358 and 1999 53(2) 149-159 assessed several stats packages, and some came out rather badly - you can guess which was worst, by far!

NIST provide some test datasets http://www.itl.nist.gov/div898/strd/ against which code can be judged (and some naive algorithms fail badly).

Although I can see the benefits of an STL-style, I also have some difficulty in imagining how the results returned can be other than reals? Even if we 'input' integer types, although sum can sensibly also be integer, I have some difficulty in seeing how the mean, variance etc are useful as integer types? And to expose the unsuspecting user to the risk of surprise seems unhelpful?

Benefits from STL-style would be most obvious if it can be applied to a circular buffer into which new data can be fed while stats can be recalculated Kalman filter style.

While calculating the mean and variance, it is probably worth calculating the two higher moments, skew and kurtosis, too. And of course the median (and some percentiles) are also often more useful than the mean.

My old files provide number_of_samples, max, min, first_max_index, first_min_index, mean, median, variance, standard_deviation, average_deviation, skewness and kurtosis for sequences (where appropriate), number_of_bins, mass, first_mode_value, first_mode, mean, median, variance, standard_deviation, average_deviation, skewness and kurtosis for densities (where appropriate).

Finally, there is the unsolved matter of the math functions we still badly need.

Err, I kind of forgot which ones were requested...

Confidence intervals are more informative than standard deviations etc.

Paul

Dr Paul A Bristow, hetp Chromatography
Prizet Farmhouse, Kendal, Cumbria, LA8 8AB UK
+44 1539 561830 Mobile +44 7714 33 02 04
mailto:[EMAIL PROTECTED]

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jeff Garland
Sent: Tuesday, February 11, 2003 4:19 PM
To: Boost mailing list
Subject: RE: [boost] Any interest in a stats class

Scott K wrote:

Hi all, I have a small family of statistics classes which I have used from time to time. The one I use most often is simply called stats. Here's an example of its use:

...details snipped...

I'm sure there are folks interested in statistical (and other) functions. I've developed exactly this sort of class in the past so I understand the utility. However, I suspect some of us would hope statistical algorithms to be formulated as STL Algorithm extensions. Specifically concerning statistics see:

http://www.crystalclearsoftware.com/cgi-bin/boost_wiki/wiki.pl?STLAlgorithmExtensions/StatisticsAlgorithms

and more generally:

http://www.crystalclearsoftware.com/cgi-bin/boost_wiki/wiki.pl?STLAlgorithmExtensions

We definitely need volunteers to take these rough Wiki musings and convert them into actual documented libraries. I'm not sure this is what you had in mind, but I, for one, would welcome your effort either way!

Jeff

See you soon

HH
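The remark that naive algorithms fail badly on the NIST datasets usually refers to the "sum and sum of squares" formula, which can lose most of its significant digits when the mean is large compared with the spread. One widely used alternative is Welford's single-pass update, sketched below. This is a swapped-in technique, not something posted in the thread; `running_stats` is a hypothetical name, and accumulating in `double` regardless of the input type is an assumption that follows the point above about results being reals.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical accumulator using Welford's update (not a class from the thread).
class running_stats
{
public:
    running_stats() : n_(0), mean_(0.0), m2_(0.0) {}

    // Feed one sample; the signature also lets it be passed to std::for_each.
    void operator()(double x)
    {
        ++n_;
        const double delta = x - mean_;
        mean_ += delta / static_cast<double>(n_);
        m2_ += delta * (x - mean_);  // second factor uses the updated mean
    }

    std::size_t count() const { return n_; }
    double mean() const { return mean_; }

    // Sample variance with an n - 1 divisor.
    double variance() const
    {
        assert(n_ > 1);  // undefined for fewer than two samples
        return m2_ / static_cast<double>(n_ - 1);
    }

    double std_dev() const { return std::sqrt(variance()); }

private:
    std::size_t n_;
    double mean_;  // running mean
    double m2_;    // running sum of squared deviations from the mean
};
```

Because each update only touches the current sample, the same object could be fed from a stream or a buffer as new data arrives, which is close in spirit to the on-the-fly use described earlier in the thread. Skewness and kurtosis can be accumulated with similar recurrences, but they are left out to keep the sketch short.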
RE: [boost] Re: Any interest in a stats class
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Hubert Holin
Sent: Friday, February 14, 2003 1:25 PM
To: [EMAIL PROTECTED]
Subject: [boost] Re: Any interest in a stats class

Somewhere in the E.U., 14/02/2003

There still is the question of whether similarity with NR is a problem or not (the language in which the techniques are implemented is different, but implementations of the techniques themselves are of course basically similar since they refer to the same math construction).

I cannot see this being a serious problem unless we simply lift the NR in C++ code verbatim. (Most of it is still in old C style for one thing, despite the recent reissue).

I am hoping that with uBlas, we can contribute more numerical stuff. I have some Gaussian Mixture Models code that I should be rewriting in the not too distant future (currently based on an old version of TNT, and most of the important pre-processing needed has to be done elsewhere, for the then lack of svd).

This would be a most welcome development. uBLAS seems a good starting point.

My old files provide number_of_samples, max, min, first_max_index, first_min_index, mean, median, variance, standard_deviation, average_deviation, skewness and kurtosis for sequences (where appropriate), number_of_bins, mass, first_mode_value, first_mode, mean, median, variance, standard_deviation, average_deviation, skewness and kurtosis for densities (where appropriate).

Sounds a pretty good selection.

Finally, there is the unsolved matter of the math functions we still badly need.

Err, I kind of forgot which ones were requested...

Well, all the items in Stephen Moshier's Cephes collection, say: erf, gamma, beta, incomplete, gaussian etc etc. However, we didn't seem to get far with agreeing the format for these. My naive assumption that double erf(double) style functions would be enough was criticised by those who wanted fancier solutions, some far fancier.

In my view getting this far would be a major step forward. There are major problems in accuracy even at double, let alone long double. There was also talk of an NIST project but I haven't heard of any progress yet.

Paul

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jeff Garland
Sent: Tuesday, February 11, 2003 4:19 PM
To: Boost mailing list
Subject: RE: [boost] Any interest in a stats class

Scott K wrote:

Hi all, I have a small family of statistics classes which I have used from time to time. The one I use most often is simply called stats. Here's an example of its use:

...details snipped...

I'm sure there are folks interested in statistical (and other) functions. I've developed exactly this sort of class in the past so I understand the utility. However, I suspect some of us would hope statistical algorithms to be formulated as STL Algorithm extensions. Specifically concerning statistics see:

http://www.crystalclearsoftware.com/cgi-bin/boost_wiki/wiki.pl?STLAlgorithmExtensions/StatisticsAlgorithms

and more generally:

http://www.crystalclearsoftware.com/cgi-bin/boost_wiki/wiki.pl?STLAlgorithmExtensions

We definitely need volunteers to take these rough Wiki musings and convert them into actual documented libraries. I'm not sure this is what you had in mind, but I, for one, would welcome your effort either way!

Jeff

See you soon

HH
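As an illustration of what the plain `double erf(double)` style buys in a statistics setting, the sketch below computes the standard normal cumulative distribution from the complementary error function. It relies on `std::erfc` from `<cmath>`, which only entered the C++ standard with C++11, well after this thread, so it is an assumption about the reader's toolchain rather than something available to the posters. `normal_cdf` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: standard normal CDF, Phi(x) = erfc(-x / sqrt(2)) / 2.
// Uses std::erfc (C++11 <cmath>), not a Boost or Cephes routine.
double normal_cdf(double x)
{
    assert(!std::isnan(x));  // a NaN input would silently propagate otherwise
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}
```

Writing the formula with `erfc` rather than `1 + erf(...)` avoids subtracting two nearly equal numbers in the lower tail, which touches on the accuracy concern raised above. A two-sided coverage probability for a normal interval can then be written as `normal_cdf(z) - normal_cdf(-z)`.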
[boost] Re: Any interest in a stats class
Well what do you know... The order_2_accumulator class on that page looks just like my stats class. I threw in min and max and have more functions, but otherwise it's the same.

-Scott

Jeff Garland wrote:

... Specifically concerning statistics see:

http://www.crystalclearsoftware.com/cgi-bin/boost_wiki/wiki.pl?STLAlgorithmExtensions/StatisticsAlgorithms