On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote:
> Hi Patrick,
> 
> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
> > > I agree that it would be nice to have a constructor that allows you to
> > > specify the ranking algorithm only.
> > 
> > > As for NaN and the Spearman correlation, maybe we should add a default
> > > strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is
> > encountered. R uses this treatment of missing data and forces users to
> > choose how to handle it. If we implemented something like listwise or
> > pairwise deletion it could be used in other classes too. As such, treatment
> > of missing data should be part of a larger discussion and handled in a more
> > comprehensive and systematic way.
> 
> I think this additional option makes sense, but I am forwarding this
> discussion to the dev mailing list, where it is better suited.

I'm wary of having CM handle "missing" data.
For one thing, we'd have to define a convention to represent missing data,
and there is no good way to do that in Java. Using NaN for this purpose in a
low-level library is not a good idea IMHO. Moreover, any convention might not
suit some user applications, forcing the application's developer to filter
the data anyway in order to convert their representation to CM's. Rather
than running two redundant filtering passes, I'd assume that CM gets clean
input on which its algorithms can operate. As usual, the input is subjected
to precondition checks, and exceptions are thrown if the data is not clean
enough.
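
A minimal sketch of the kind of precondition check I have in mind, in plain
Java. The names `InputChecks` and `checkNoNaN` are made up for illustration
and are not part of CM; a real CM check would presumably throw one of the
library's own exception types rather than `IllegalArgumentException`.

```java
// Hypothetical helper, not CM API: reject input that contains NaN
// instead of silently interpreting NaN as "missing".
final class InputChecks {
    private InputChecks() {}

    /** Throws if any element is NaN; returns normally otherwise. */
    static void checkNoNaN(double[] data) {
        for (int i = 0; i < data.length; i++) {
            if (Double.isNaN(data[i])) {
                throw new IllegalArgumentException("NaN at index " + i);
            }
        }
    }
}
```

The point of the design is that the check only reports the problem; deciding
what to do with the offending values stays with the caller.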

In summary: data validation (in the sense of discarding input) should be
done _before_ calling CM routines.
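
For illustration, here is roughly what filtering on the caller's side could
look like: drop every index at which either array holds NaN, so that both
arrays stay the same length (the asymmetric removal is what triggers the
length mismatch reported in the quoted thread). This is only a sketch;
`PairwiseFilter` and `dropNaNPairs` are hypothetical names, not CM API, and
NaN as the "missing" marker is the caller's own convention here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical caller-side filter, not CM API.
final class PairwiseFilter {
    private PairwiseFilter() {}

    /**
     * Returns {xFiltered, yFiltered}, keeping only the index pairs
     * at which both x[i] and y[i] are non-NaN.
     */
    static double[][] dropNaNPairs(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("array lengths differ");
        }
        // Collect the indices to keep.
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < x.length; i++) {
            if (!Double.isNaN(x[i]) && !Double.isNaN(y[i])) {
                keep.add(i);
            }
        }
        // Copy the surviving pairs into two arrays of equal length.
        double[] fx = new double[keep.size()];
        double[] fy = new double[keep.size()];
        for (int k = 0; k < keep.size(); k++) {
            fx[k] = x[keep.get(k)];
            fy[k] = y[keep.get(k)];
        }
        return new double[][] {fx, fy};
    }
}
```

The two filtered arrays can then be passed to whichever CM routine is needed,
which never has to know how the application encoded its missing values.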


Regards,
Gilles

> Thomas
> 
> > -----Original Message-----
> > From: Thomas Neidhart [mailto:thomas.neidh...@gmail.com] 
> > Sent: Wednesday, November 07, 2012 8:09 AM
> > To: u...@commons.apache.org
> > Subject: Re: [math] correlation analysis with NaNs
> > 
> > On 11/07/2012 01:38 PM, Patrick Meyer wrote:
> >> You are getting values like 2.5 because of the default ties strategy. 
> >> If you do not want to use that method, create an instance of 
> >> RankingAlgorithm with a different ties strategy and pass it to the 
> >> constructor for the SpearmansCorrelation. This approach also gives you 
> >> control over the method for dealing with NaNs. Something like,
> >>
> >> //create data matrix (rows = observations, 2 columns)
> >> double[] column1 = new double[]{Double.NaN, 1, 2};
> >> double[] column2 = new double[]{10, 2, 10};
> >> Array2DRowRealMatrix mydata =
> >>     new Array2DRowRealMatrix(column1.length, 2);
> >> for (int i = 0; i < column1.length; i++) {
> >>    mydata.setEntry(i, 0, column1[i]);
> >>    mydata.setEntry(i, 1, column2[i]);
> >> }
> >>
> >> //compute correlation with an explicit ranking algorithm
> >> NaturalRanking ranking = new NaturalRanking(NaNStrategy.FIXED,
> >>     TiesStrategy.RANDOM);
> >> SpearmansCorrelation spearman = new SpearmansCorrelation(mydata, ranking);
> >>
> >> Try that.
> > 
> > Hi,
> > 
> > this will not really help imho.
> > 
> > As far as I can see, there are at least two problems with the current use of
> > the RankingAlgorithm in the SpearmansCorrelation class:
> > 
> >  * there is no way to select the ranking algorithm in the constructor
> >    without passing the values at the same time
> >  * the NaNStrategy.REMOVED does not work symmetrically, i.e. it removes
> >    the NaN only from the input array where it occurs but not in the
> >    corresponding array, thus rendering it useless as it will result in
> >    exceptions (array lengths differ)
> > 
> > Would you be able to create an issue for this on the issue tracker and
> > provide the test case?
> > 
> > Thanks,
> > 
> > Thomas
