[ https://issues.apache.org/jira/browse/MATH-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088996#comment-13088996 ]
Phil Steitz commented on MATH-449:
----------------------------------

Good point on the stored data version. This is really our first foray into meaningful management of missing data, and now is a great time to start dealing with it. In the correlation package, at this point, we can fairly easily support casewise deletion, pairwise deletion, or both, so it is probably best to make it configurable. Also, we need to agree on and advertise the fact that NaNs should be used to signal missing data. Let's start by implementing things this way in the new storeless covariance classes and then open new tickets to add support for missing data, first in the rest of the correlation package and then in regression.

One thing that is bugging me a little is convincing myself that if we allow pairwise deletion, the covariance matrix will be legitimate (i.e., have all of the analytical properties associated with a covariance matrix). Also, are there negative implications that I have not thought about to using NaNs to signal missing data?

> Storeless covariance
> --------------------
>
>                 Key: MATH-449
>                 URL: https://issues.apache.org/jira/browse/MATH-449
>             Project: Commons Math
>          Issue Type: Improvement
>            Reporter: Patrick Meyer
>            Assignee: Phil Steitz
>             Fix For: 3.1
>
>         Attachments: MATH-449.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Currently there is no storeless version for computing the covariance.
> However, Pebay (2008) describes algorithms for on-line covariance
> computations, [http://infoserve.sandia.gov/sand_doc/2008/086212.pdf]. I have
> provided a simple class for implementing this algorithm. It would be nice to
> have this integrated into org.apache.commons.math.stat.correlation.Covariance.
> {code}
> //This code is granted for inclusion in the Apache Commons under the terms of the ASL.
> public class StorelessCovariance {
>     private double deltaX = 0.0;
>     private double deltaY = 0.0;
>     private double meanX = 0.0;
>     private double meanY = 0.0;
>     private double N = 0;
>     private Double covarianceNumerator = 0.0;
>     private boolean unbiased = true;
>
>     // unbiased = true divides by N - 1; false divides by N
>     public StorelessCovariance(boolean unbiased) {
>         this.unbiased = unbiased;
>     }
>
>     // Update the running means and the co-moment with one (x, y) pair;
>     // a pair with a null value is skipped.
>     public void increment(Double x, Double y) {
>         if (x != null && y != null) {
>             N++;
>             deltaX = x - meanX;
>             deltaY = y - meanY;
>             meanX += deltaX / N;
>             meanY += deltaY / N;
>             covarianceNumerator += ((N - 1.0) / N) * deltaX * deltaY;
>         }
>     }
>
>     public Double getResult() {
>         if (unbiased) {
>             return covarianceNumerator / (N - 1.0);
>         } else {
>             return covarianceNumerator / N;
>         }
>     }
> }
> {code}
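
The casewise and pairwise deletion options discussed in the comment could be sketched roughly as follows. This is only an illustration, not the MATH-449 patch or any Commons Math API: the class MissingDataCovarianceSketch, its nested PairAccumulator, and the covariance(double[][], boolean) signature are hypothetical names chosen for this example. It assumes NaN marks a missing value and reuses the update rule from the class quoted in the issue.

```java
// Hypothetical sketch: none of these names are part of Commons Math.
final class MissingDataCovarianceSketch {

    /** Storeless accumulator for one pair of variables; NaN marks a missing value. */
    static final class PairAccumulator {
        private long n;
        private double meanX;
        private double meanY;
        private double numerator;

        void increment(double x, double y) {
            if (Double.isNaN(x) || Double.isNaN(y)) {
                return; // incomplete pair: ignore it
            }
            n++;
            double deltaX = x - meanX;
            double deltaY = y - meanY;
            meanX += deltaX / n;
            meanY += deltaY / n;
            numerator += ((n - 1.0) / n) * deltaX * deltaY;
        }

        /** Unbiased (n - 1) estimate; NaN when fewer than two complete pairs were seen. */
        double getResult() {
            return n < 2 ? Double.NaN : numerator / (n - 1.0);
        }
    }

    /**
     * Covariance matrix of the columns of data (assumed non-empty and rectangular).
     * pairwise = true: entry (i, j) uses every row where columns i and j are both present.
     * pairwise = false: casewise deletion, any row containing a NaN is dropped entirely.
     */
    static double[][] covariance(double[][] data, boolean pairwise) {
        int p = data[0].length;
        PairAccumulator[][] acc = new PairAccumulator[p][p];
        for (int i = 0; i < p; i++) {
            for (int j = i; j < p; j++) {
                acc[i][j] = new PairAccumulator();
            }
        }
        for (double[] row : data) {
            if (!pairwise && hasNaN(row)) {
                continue; // casewise deletion: skip the whole row
            }
            for (int i = 0; i < p; i++) {
                for (int j = i; j < p; j++) {
                    acc[i][j].increment(row[i], row[j]);
                }
            }
        }
        double[][] cov = new double[p][p];
        for (int i = 0; i < p; i++) {
            for (int j = i; j < p; j++) {
                cov[i][j] = acc[i][j].getResult();
                cov[j][i] = cov[i][j]; // mirror the upper triangle
            }
        }
        return cov;
    }

    static boolean hasNaN(double[] row) {
        for (double v : row) {
            if (Double.isNaN(v)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        double nan = Double.NaN;
        double[][] data = { {1, 2, nan}, {2, 4, 1}, {3, 6, 2}, {nan, 8, 3} };
        System.out.println("pairwise cov(x,y) = " + covariance(data, true)[0][1]);
        System.out.println("casewise cov(x,y) = " + covariance(data, false)[0][1]);
    }
}
```

In this toy data set, pairwise deletion estimates cov(x,y) from the three rows where both x and y are present, while casewise deletion uses only the two fully observed rows, so the two estimates differ. Because each pairwise entry may come from a different subset of rows, the resulting matrix is not guaranteed to be positive semidefinite, which is the concern raised in the comment.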