On Fri, Aug 26, 2011 at 11:41 AM, Mark Janikas <[email protected]> wrote:
> I wonder if my last statement is essentially the only answer... which I
> wanted to avoid...
>
> Should I just use combinations of the columns and try and construct the
> corrcoef() (then ID whether NaNs are present), or use the condition number
> to ID the singularity? I just wanted to avoid the whole k! algorithm.
>
> MJ
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Mark Janikas
> Sent: Friday, August 26, 2011 10:35 AM
> To: Discussion of Numerical Python
> Subject: Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix
>
> I actually use the VIF when the design matrix can be inverted.... I do it
> the quick and dirty way as opposed to the step regression:
>
> 1. Calc the correlation coefficient of the matrix (w/o the intercept)
> 2. Return the diagonal of the inversion of the correlation matrix in step 1.
>
> Again, the problem lies in the multiple column relationship... I wouldn't
> be able to run sub regressions at all when the columns are perfectly
> collinear.
>
> MJ
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Skipper Seabold
> Sent: Friday, August 26, 2011 10:28 AM
> To: Discussion of Numerical Python
> Subject: Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix
>
> On Fri, Aug 26, 2011 at 1:10 PM, Mark Janikas <[email protected]> wrote:
> > Hello All,
> >
> > I am trying to identify columns of a matrix that are perfectly collinear.
> > It is not that difficult to identify when two columns are identical or
> > have zero variance, but I do not know how to ID when the culprit is of a
> > higher order, i.e. columns 1 + 2 + 3 = column 4. NUM.corrcoef(matrix.T)
> > will return NaNs when the matrix is singular, and LA.cond(matrix.T) will
> > provide a very large condition number, but they do not tell me which
> > columns are causing the problem. For example:
> >
> > zt = numpy.array([[ 1.  ,  1.  ,  1.  ,  1.  ,  1.  ],
> >                   [ 0.25,  0.1 ,  0.2 ,  0.25,  0.5 ],
> >                   [ 0.75,  0.9 ,  0.8 ,  0.75,  0.5 ],
> >                   [ 3.  ,  8.  ,  0.  ,  5.  ,  0.  ]])
> >
> > How can I identify that columns 0, 1, 2 are the issue because column 1 +
> > column 2 = column 0?
> >
> > Any input would be greatly appreciated. Thanks much,
>
> The way that I know to do this in a regression context for (near
> perfect) multicollinearity is VIF. It's long been on my todo list for
> statsmodels.
>
> http://en.wikipedia.org/wiki/Variance_inflation_factor
>
> Maybe there are other ways with decompositions. I'd be happy to hear about
> them.
>
> Please post back if you write any code to do this.

Why not svd?

In [13]: u,d,v = svd(zt)

In [14]: d
Out[14]:
array([  1.01307066e+01,   1.87795095e+00,   3.03454566e-01,
         3.29253945e-16])

In [15]: u[:,3]
Out[15]: array([ 0.57735027, -0.57735027, -0.57735027,  0.        ])

In [16]: dot(u[:,3], zt)
Out[16]:
array([ -7.77156117e-16,  -6.66133815e-16,  -7.21644966e-16,
        -7.77156117e-16,  -8.88178420e-16])

Chuck
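
For reference, a minimal sketch of turning the SVD check above into an
automated test that reports which columns participate in an exact
dependency; the tolerance values here are assumptions, not something
anyone posted in the thread:

    import numpy as np

    # zt from the thread: each row of zt is a column of the original
    # design matrix (the earlier posts pass matrix.T around).
    zt = np.array([[ 1.  ,  1.  ,  1.  ,  1.  ,  1.  ],
                   [ 0.25,  0.1 ,  0.2 ,  0.25,  0.5 ],
                   [ 0.75,  0.9 ,  0.8 ,  0.75,  0.5 ],
                   [ 3.  ,  8.  ,  0.  ,  5.  ,  0.  ]])

    u, d, v = np.linalg.svd(zt)

    # Singular values that are tiny relative to the largest one signal an
    # exact linear dependency; this cutoff is an assumed rank tolerance.
    tol = d[0] * max(zt.shape) * np.finfo(float).eps
    for i in np.where(d < tol)[0]:
        # The matching left singular vector holds the coefficients of the
        # dependency among the rows of zt, i.e. the original columns.
        cols = np.where(np.abs(u[:, i]) > 1e-8)[0]
        print("Columns involved in a dependency:", cols)

Run on the zt above, this prints indices 0, 1, 2, matching the nonzero
entries of u[:,3] in Chuck's session.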
