Re: [Numpy-discussion] saving groups of numpy arrays to disk

2011-08-26 Thread Paul Anton Letnes

On 25. aug. 2011, at 23.49, David Warde-Farley wrote:

> On 2011-08-25, at 2:42 PM, Chris.Barker wrote:
> 
>> On 8/24/11 9:22 AM, Anthony Scopatz wrote:
>>>   You can use Python pickling, if you do *not* have a requirement for:
>> 
> >> I can't recall why, but it seems pickling of numpy arrays has been 
> >> fragile and not very performant.
>> 
>> I like the npy / npz format, built in to numpy, if you don't need:
>> 
>>>   - access from non-Python programs
> 
> While I'm not aware of reader implementations for any other language, NPY is 
> a dirt-simple and well-documented format designed by Robert Kern, and should 
> be readable without too much trouble from any language that supports binary 
> I/O. The full spec is at
> 
> https://github.com/numpy/numpy/blob/master/doc/neps/npy-format.txt
> 
> It should be especially trivial to read arrays of simple scalar numeric 
> dtypes, but reading compound dtypes is also doable.
> 
> For NPZ, use a standard zip file reading library to access individual files 
> in the archive, which are in .npy format (or just unzip it by hand first -- 
> it's a normal .zip file with a special extension).
> 
> David
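For what it's worth, a minimal sketch of reading both formats from Python
(the 'arrays.npz' file name is made up, and the header helpers in
numpy.lib.format are used only for brevity -- the spec above is all you need
to reimplement them in another language):

    import zipfile
    import numpy as np
    from numpy.lib import format as npy_format

    # 'arrays.npz' is a hypothetical archive, e.g. created with
    # np.savez('arrays.npz', a=np.arange(3), b=np.eye(2)).
    with zipfile.ZipFile('arrays.npz') as zf:
        for name in zf.namelist():           # each member is an ordinary .npy file
            with zf.open(name) as f:
                npy_format.read_magic(f)     # checks the magic string and version
                shape, fortran_order, dtype = npy_format.read_array_header_1_0(f)
                # dtype.str (e.g. '<f8') includes an explicit byte-order character
                print("%s  shape=%s  dtype=%s" % (name, shape, dtype.str))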

Out of curiosity: is the .npy format guaranteed to be independent of 
architecture (endianness and similar issues)?

Paul



Re: [Numpy-discussion] saving groups of numpy arrays to disk

2011-08-26 Thread Derek Homeier
On 25.08.2011, at 8:42PM, Chris.Barker wrote:

> On 8/24/11 9:22 AM, Anthony Scopatz wrote:
>>You can use Python pickling, if you do *not* have a requirement for:
> 
> I can't recall why, but it seems pickling of numpy arrays has been 
> fragile and not very performant.
> 
Hmm, the pure Python version might be, but I've used cPickle for a long time 
and never noticed any stability problems. And it is still noticeably faster than 
pytables, in my experience. Still, for the sake of a standardised format I'd 
go with HDF5 any time now (and usually prefer h5py when starting anything 
new -- my pytables implementation mentioned above is likely not the most 
efficient compared to cPickle). 

But with the usual disclaimers, you should be able to simply use cPickle 
as a drop-in replacement in the example below.
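For concreteness, a minimal h5py sketch of the same job (the file and group 
names are just placeholders, and this is only one of several reasonable layouts):

    import numpy as np
    import h5py

    arrays = {'a': np.arange(10), 'b': np.random.rand(3, 4)}

    # write: one dataset per array inside a group
    f = h5py.File('groups.h5', 'w')
    grp = f.create_group('tree')
    for name, arr in arrays.items():
        grp.create_dataset(name, data=arr)
    f.close()

    # read everything back into a plain dict
    f = h5py.File('groups.h5', 'r')
    loaded = dict((name, f['tree'][name][...]) for name in f['tree'])
    f.close()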

Cheers,
Derek

On 21.08.2011, at 2:24PM, Pauli Virtanen wrote:

> You can use Python pickling, if you do *not* have a requirement for:
> 
> - real persistence, i.e., being able to easily read the data years later
> - a standard data format
> - access from non-Python programs
> - safety against malicious parties (unpickling can execute some code
>  in the input -- although this is possible to control)
> 
> then you can use Python pickling:
> 
>   import pickle
> 
>   file = open('out.pck', 'wb')
>   pickle.dump(tree, file, protocol=pickle.HIGHEST_PROTOCOL)
>   file.close()
> 
>   file = open('out.pck', 'rb')
>   tree = pickle.load(file)
>   file.close()



[Numpy-discussion] array_equal and array_equiv comparison functions for structured arrays

2011-08-26 Thread Derek Homeier
Hi,

as the subject says, the array_* comparison functions currently do not operate 
on structured/record arrays. Pull request 
https://github.com/numpy/numpy/pull/146
implements these comparisons.

There are two commits, differing in their interpretation of whether two 
arrays with different field names but identical data are equivalent; i.e.

res = array_equiv(array((1,2), dtype=[('i','i4'),('v','f8')]),
                  array((1,2), dtype=[('n','i4'),('f','f8')]))

is True in the current HEAD, but False in its parent.
Feedback and additional comments are invited. 
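As a rough illustration of the distinction (this is not the code in the pull 
request), a field-name-agnostic check can be emulated by comparing the fields 
positionally:

    from numpy import array, array_equal

    a = array((1, 2.0), dtype=[('i', 'i4'), ('v', 'f8')])
    b = array((1, 2.0), dtype=[('n', 'i4'), ('f', 'f8')])

    # Compare the field values positionally, ignoring the field names; this
    # mirrors the "True" interpretation described above.
    same_data = (len(a.dtype.names) == len(b.dtype.names) and
                 all(array_equal(a[fa], b[fb])
                     for fa, fb in zip(a.dtype.names, b.dtype.names)))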

Cheers,
Derek



Re: [Numpy-discussion] saving groups of numpy arrays to disk

2011-08-26 Thread Chris.Barker
On 8/26/11 5:04 AM, Derek Homeier wrote:
> Hmm, the pure Python version might be, but I've used cPickle for a long time
> and never noticed any stability problems.


well, here is the NEP:

https://github.com/numpy/numpy/blob/master/doc/neps/npy-format.txt

It addresses the whys and hows of the format.

-CHB


-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] saving groups of numpy arrays to disk

2011-08-26 Thread Robert Kern
On Fri, Aug 26, 2011 at 07:04, Derek Homeier
 wrote:
> On 25.08.2011, at 8:42PM, Chris.Barker wrote:
>
>> On 8/24/11 9:22 AM, Anthony Scopatz wrote:
>>>    You can use Python pickling, if you do *not* have a requirement for:
>>
>> I can't recall why, but it seems pickling of numpy arrays has been
>> fragile and not very performant.
>>
> Hmm, the pure Python version might be, but I've used cPickle for a long time
> and never noticed any stability problems.

IIRC, there have been one or two releases where we accidentally broke
the ability to load some old pickles. I think that's the kind of
fragility Chris meant. As for the other kind of stability, we have
at times had problems passing unpickled arrays to linear algebra
functions. This is because the SSE instructions used by the optimized
linear algebra package require aligned memory, but the unpickling
machinery did not give us such an option. We do some nasty hacks to
make unpickling performant: the unpickling machinery reads the actual
byte data in as a str object, then passes that to a numpy function to
reconstruct the array object, and we simply reuse the memory underlying
the str object. This is a hack, but it's the only way to avoid copying
potentially large amounts of data, and it is the cause of the unaligned
memory.
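If the alignment issue ever bites, a small workaround sketch (using only the 
documented flags/require machinery, not anything specific to this thread) is 
to force an aligned copy after unpickling:

    import pickle
    import numpy as np

    arr = pickle.loads(pickle.dumps(np.arange(10.0), pickle.HIGHEST_PROTOCOL))

    if not arr.flags.aligned:
        # np.require copies only when necessary and guarantees the requested flags
        arr = np.require(arr, requirements=['ALIGNED'])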

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco


[Numpy-discussion] Identifying Colinear Columns of a Matrix

2011-08-26 Thread Mark Janikas
Hello All,

I am trying to identify columns of a matrix that are perfectly collinear.  It 
is not that difficult to identify when two columns are identical or have zero 
variance, but I do not know how to ID when the culprit is of a higher order, 
i.e. columns 1 + 2 + 3 = column 4.  NUM.corrcoef(matrix.T) will return NaNs 
when the matrix is singular, and LA.cond(matrix.T) will provide a very large 
condition number... but they do not tell me which columns are causing the 
problem.   For example:

zt = numpy.array([[ 1.  ,  1.  ,  1.  ,  1.  ,  1.  ],
                  [ 0.25,  0.1 ,  0.2 ,  0.25,  0.5 ],
                  [ 0.75,  0.9 ,  0.8 ,  0.75,  0.5 ],
                  [ 3.  ,  8.  ,  0.  ,  5.  ,  0.  ]])

How can I identify that columns 0,1,2 are the issue because: column 1 + column 
2 = column 0?

Any input would be greatly appreciated.  Thanks much,

MJ



Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix

2011-08-26 Thread Mark Janikas
As you will note, since most of the functions work on rows, the matrix in 
question has been transposed.



Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix

2011-08-26 Thread Skipper Seabold
On Fri, Aug 26, 2011 at 1:10 PM, Mark Janikas  wrote:
> Hello All,
>
>
>
> I am trying to identify columns of a matrix that are perfectly collinear.
> It is not that difficult to identify when two columns are identical or have
> zero variance, but I do not know how to ID when the culprit is of a higher
> order, i.e. columns 1 + 2 + 3 = column 4.  NUM.corrcoef(matrix.T) will
> return NaNs when the matrix is singular, and LA.cond(matrix.T) will provide
> a very large condition number... but they do not tell me which columns are
> causing the problem.   For example:
>
>
>
> zt = numpy.array([[ 1.  ,  1.  ,  1.  ,  1.  ,  1.  ],
>                   [ 0.25,  0.1 ,  0.2 ,  0.25,  0.5 ],
>                   [ 0.75,  0.9 ,  0.8 ,  0.75,  0.5 ],
>                   [ 3.  ,  8.  ,  0.  ,  5.  ,  0.  ]])
>
>
>
> How can I identify that columns 0,1,2 are the issue because: column 1 +
> column 2 = column 0?
>
>
>
> Any input would be greatly appreciated.  Thanks much,
>

The way that I know to do this in a regression context for (near
perfect) multicollinearity is VIF. It's long been on my todo list for
statsmodels.

http://en.wikipedia.org/wiki/Variance_inflation_factor
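A rough sketch along those lines (not statsmodels code; the 1e-12 floor is 
arbitrary):

    import numpy as np

    def vif(X):
        # X: observations in rows, predictors in columns. Regress each column
        # on the remaining ones and turn the R**2 into 1 / (1 - R**2);
        # perfectly collinear columns give a huge VIF.
        n_cols = X.shape[1]
        vifs = np.empty(n_cols)
        for j in range(n_cols):
            y = X[:, j]
            others = np.delete(X, j, axis=1)
            coef = np.linalg.lstsq(others, y)[0]
            resid = y - np.dot(others, coef)
            ss_res = np.sum(resid ** 2)
            ss_tot = np.sum((y - y.mean()) ** 2)
            r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
            vifs[j] = 1.0 / max(1.0 - r2, 1e-12)
        return vifs

    # for the example in this thread the variables are the *rows* of zt,
    # so it would be called as vif(zt.T)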

Maybe there are other ways with decompositions. I'd be happy to hear about them.

Please post back if you write any code to do this.

Skipper


Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix

2011-08-26 Thread Mark Janikas
I actually use the VIF when the design matrix can be inverted. I do it the 
quick and dirty way, as opposed to the step regression:

1. Calc the correlation coefficient of the matrix (w/o the intercept)
2. Return the diagonal of the inversion of the correlation matrix in step 1.
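In code that is roughly (a sketch; X is assumed to hold the predictors, without 
the intercept, with observations in rows):

    import numpy as np

    corr = np.corrcoef(X, rowvar=0)        # correlation matrix of the columns
    vifs = np.diag(np.linalg.inv(corr))    # diagonal of its inverse
    # np.linalg.inv raises LinAlgError when the columns are perfectly
    # collinear, which is exactly the breakdown described next.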

Again, the problem lies in the multiple column relationship... I wouldn't be 
able to run sub regressions at all when the columns are perfectly collinear.

MJ



Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix

2011-08-26 Thread Mark Janikas
I wonder if my last statement is essentially the only answer... which I wanted 
to avoid... 

Should I just use combinations of the columns and try and construct the 
corrcoef() (then ID whether NaNs are present), or use the condition number to 
ID the singularity?  I just wanted to avoid the whole k! algorithm.

MJ



Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix

2011-08-26 Thread Fernando Perez
On Fri, Aug 26, 2011 at 7:41 PM, Mark Janikas  wrote:
> I wonder if my last statement is essentially the only answer... which I 
> wanted to avoid...
>
> Should I just use combinations of the columns and try and construct the 
> corrcoef() (then ID whether NaNs are present), or use the condition number to 
> ID the singularity?  I just wanted to avoid the whole k! algorithm.
>

This is a completely naive, off-the-top-of-my-head reply, so most
likely completely wrong.  But wouldn't a Gram-Schmidt-type process let
you identify things here?  You're effectively looking for n vectors
that belong to an m-dimensional subspace with n>m.  As you walk
through the G-S process you could probably track the projections and
identify when one of the n-m excess vectors is 'emptied out' by
the G-S projections, and would have the info of what it projected
into.
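Something along those lines, as a rough sketch (the tolerance is arbitrary):

    import numpy as np

    def dependent_columns(A, tol=1e-10):
        # return indices of columns that are (numerically) linear
        # combinations of the columns processed before them
        basis = []
        dependent = []
        for j in range(A.shape[1]):
            v = A[:, j].astype(float)
            for q in basis:
                v = v - np.dot(q, v) * q      # remove the part already spanned
            if np.linalg.norm(v) < tol * max(np.linalg.norm(A[:, j]), 1.0):
                dependent.append(j)           # this column was 'emptied out'
            else:
                basis.append(v / np.linalg.norm(v))
        return dependent

For the zt example (with the variables as the columns of zt.T) this should flag
index 2, since column 2 = column 0 - column 1.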

I don't remember the details of G-S so perhaps there's  a really
obvious reason why the above is dumb and doesn't work.  But just in
case it gets you thinking in the right direction... (and I'll learn
something from the corrections)

Cheers,

f


Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix

2011-08-26 Thread Charles R Harris
On Fri, Aug 26, 2011 at 11:41 AM, Mark Janikas  wrote:

> I wonder if my last statement is essentially the only answer... which I
> wanted to avoid...
>
> Should I just use combinations of the columns and try and construct the
> corrcoef() (then ID whether NaNs are present), or use the condition number
> to ID the singularity?  I just wanted to avoid the whole k! algorithm.
>
> MJ
>
Why not svd?

In [13]: u,d,v = svd(zt)

In [14]: d
Out[14]:
array([  1.01307066e+01,   1.87795095e+00,   3.03454566e-01,
         3.29253945e-16])

In [15]: u[:,3]
Out[15]: array([ 0.57735027, -0.57735027, -0.57735027,  0.])

In [16]: dot(u[:,3], zt)
Out[16]:
array([ -7.77156117e-16,  -6.66133815e-16,  -7.21644966e-16,
        -7.77156117e-16,  -8.88178420e-16])

Chuck


Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix

2011-08-26 Thread josef.pktd
On Fri, Aug 26, 2011 at 1:41 PM, Mark Janikas  wrote:
> I wonder if my last statement is essentially the only answer... which I 
> wanted to avoid...
>
> Should I just use combinations of the columns and try and construct the 
> corrcoef() (then ID whether NaNs are present), or use the condition number to 
> ID the singularity?  I just wanted to avoid the whole k! algorithm.
>
> MJ

Partial answer in a different context. I have written a function that
only adds columns if they maintain invertibility, using brute force:
add each column sequentially and check whether the matrix is singular.
Don't add the columns that are already included as a linear combination.
(But this doesn't tell you which columns are in the collinear vector.)
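A brute-force version of that idea (a sketch, not the actual function) keeps 
a column only if it raises the numerical rank of the columns kept so far:

    import numpy as np

    def independent_columns(X, tol=1e-10):
        kept = []
        current = np.empty((X.shape[0], 0))
        for j in range(X.shape[1]):
            candidate = np.column_stack((current, X[:, j]))
            if np.linalg.matrix_rank(candidate, tol=tol) > len(kept):
                current = candidate
                kept.append(j)
        return kept   # the columns *not* in 'kept' are the redundant ones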

I did this for categorical variables, so sequence was predefined.

Just finding a non-singular subspace would be easier: PCA, SVD, or
the scikits.learn matrix decompositions (?).

(Factor models and Johansen's cointegration tests are also just doing
matrix decompositions that identify subspaces.)

Maybe rotation in Factor Analysis is able to identify the vectors, but
I don't have much idea about that.

Josef



Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix

2011-08-26 Thread Mark Janikas
Charles!  That looks like it could be a winner!  It looks like you always 
choose the last column of the U matrix and ID the columns that have the same 
values?  It works when I add extra columns as well!  BTW, sorry for my lack of 
knowledge... but what was the point of the dot multiply at the end?  That they 
add up to essentially zero, indicating singularity?  Thanks so much!

MJ



Re: [Numpy-discussion] NA mask C-API documentation

2011-08-26 Thread Christopher Jordan-Squire
Regarding ufuncs and NA's, all the mechanics of handling NA from a
ufunc are in the PyUFunc_FromFuncAndData function, right? So the ufunc
creation docs don't have to be updated to include NA's?

-Chris JS

On Wed, Aug 24, 2011 at 7:08 PM, Mark Wiebe  wrote:
> I've added C-API documentation to the missingdata branch. The .rst file
> (beware of the github rst parser though, it drops some of the content) is
> here:
> https://github.com/m-paradox/numpy/blob/missingdata/doc/source/reference/c-api.maskna.rst
> and I made a small example module which goes with it here:
> https://github.com/m-paradox/spdiv
> Cheers,
> Mark


Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix

2011-08-26 Thread Charles R Harris
On Fri, Aug 26, 2011 at 12:38 PM, Mark Janikas  wrote:

> Charles!  That looks like it could be a winner!  It looks like you always
> choose the last column of the U matrix and ID the columns that have the same
> values?  It works when I add extra columns as well!  BTW, sorry for my lack
> of knowledge… but what was the point of the dot multiply at the end?  That
> they add up to essentially zero, indicating singularity?  Thanks so much!
>

The indicator of collinearity is the singular value in d; the corresponding
column in u represents the linear combination of rows that is ~0, and the
corresponding row in v represents the linear combination of columns that is
~0. If you have several combinations that are ~0, of course you can add them
together and get another. Basically, if you take the rows in v corresponding
to small singular values, you get a basis for the null space of the
matrix, and the corresponding columns in u are a basis for the orthogonal
complement of the range of the matrix. If that is getting a bit technical
you can just play around with things.
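Concretely, a small sketch of turning that into column indices (the 
tolerances are picked arbitrarily; zt is the array from earlier in the thread):

    import numpy as np

    zt = np.array([[ 1.  ,  1.  ,  1.  ,  1.  ,  1.  ],
                   [ 0.25,  0.1 ,  0.2 ,  0.25,  0.5 ],
                   [ 0.75,  0.9 ,  0.8 ,  0.75,  0.5 ],
                   [ 3.  ,  8.  ,  0.  ,  5.  ,  0.  ]])

    u, d, v = np.linalg.svd(zt)
    for k in np.nonzero(d < 1e-10 * d.max())[0]:
        # rows of zt (i.e. columns of the original design matrix) with a
        # sizeable entry in u[:, k] take part in one collinear combination
        print(np.nonzero(np.abs(u[:, k]) > 1e-8)[0])   # -> [0 1 2] here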



Chuck


Re: [Numpy-discussion] NA mask C-API documentation

2011-08-26 Thread Mark Wiebe
On Fri, Aug 26, 2011 at 11:47 AM, Christopher Jordan-Squire  wrote:

> Regarding ufuncs and NA's, all the mechanics of handling NA from a
> ufunc are in the PyUFunc_FromFuncAndData function, right? So the ufunc
> creation docs don't have to be updated to include NA's?
>

That's correct, any ufunc will automatically support NAs with a propagation
approach. It's probably worth mentioning this in the ufunc docs.

I've added some additional type resolution and loop selection functions, but
I'd rather keep them private in NumPy for a version or two so improvements
can be made as experience is gained with them. Unfortunately some aspects of
this are in public headers because of how the API is designed; ideally more
of the classes' struct layouts should be hidden from the ABI, just as I've
done in deprecating that access for PyArrayObject.

-Mark




Re: [Numpy-discussion] Identifying Colinear Columns of a Matrix

2011-08-26 Thread josef.pktd
On Fri, Aug 26, 2011 at 2:57 PM, Charles R Harris
 wrote:
> The indicator of collinearity is the singular value in d; the corresponding
> column in u represents the linear combination of rows that is ~0, and the
> corresponding row in v represents the linear combination of columns that is
> ~0. If you have several combinations that are ~0, of course you can add them
> together and get another. Basically, if you take the rows in v corresponding
> to small singular values, you get a basis for the null space of the
> matrix, and the corresponding columns in u are a basis for the orthogonal
> complement of the range of the matrix. If that is getting a bit technical
> you can just play around with things.

Interpretation is a bit difficult if there is more than one zero singular value:

>>> zt2 = np.vstack((zt, zt[2,:] + zt[3,:]))
>>> zt2
array([[ 1.  ,  1.  ,  1.  ,  1.  ,  1.  ],
       [ 0.25,  0.1 ,  0.2 ,  0.25,  0.5 ],
       [ 0.75,  0.9 ,  0.8 ,  0.75,  0.5 ],
       [ 3.  ,  8.  ,  0.  ,  5.  ,  0.  ],
       [ 3.75,  8.9 ,  0.8 ,  5.75,  0.5 ]])
>>> u,d,v = np.linalg.svd(zt2)
>>> d
array([  1.51561431e+01,   1.91327688e+00,   3.25113875e-01,
         1.05664844e-15,   5.29054218e-16])
>>> u[:,-2:]
array([[ 0.59948553, -0.12496837],
       [-0.59948553,  0.12496837],
       [-0.51747833, -0.48188813],
       [ 0.0820072 , -0.60685651],
       [-0.0820072 ,  0.60685651]])

Josef



Re: [Numpy-discussion] How to output array with indexes to a text file?

2011-08-26 Thread Brett Olsen
On Thu, Aug 25, 2011 at 2:10 PM, Paul Menzel
 wrote:
> is there an easy way to also save the indexes of an array (columns, rows
> or both) when outputting it to a text file. For saving an array to a
> file I only found `savetxt()` [1] which does not seem to have such an
> option. Adding indexes manually is doable but I would like to avoid
> that.
> Is there a way to accomplish that task without reserving the 0th row or
> column to store the indexes?
>
> I want to process these text files to produce graphs and MetaPost’s [2]
> graph package needs these indexes. (I know about Matplotlib [3], but I
> would like to use MetaPost.)
>
>
> Thanks,
>
> Paul

Why don't you just write a wrapper for numpy.savetxt that adds the
indices?  E.g.:

In [1]: import numpy as N

In [2]: a = N.arange(6,12).reshape((2,3))

In [3]: a
Out[3]:
array([[ 6,  7,  8],
   [ 9, 10, 11]])

In [4]: def save_with_indices(filename, output):
   ...:     (rows, cols) = output.shape
   ...:     tmp = N.hstack((N.arange(1, rows+1).reshape((rows, 1)), output))
   ...:     tmp = N.vstack((N.arange(cols+1).reshape((1, cols+1)), tmp))
   ...:     N.savetxt(filename, tmp, fmt='%8i')
   ...:

In [5]: N.savetxt('noidx.txt', a, fmt='%8i')

In [6]: save_with_indices('idx.txt', a)

'noidx.txt' looks like:
       6        7        8
       9       10       11
'idx.txt' looks like:
       0        1        2        3
       1        6        7        8
       2        9       10       11

~Brett


Re: [Numpy-discussion] the build and installation process

2011-08-26 Thread Ralf Gommers
On Thu, Aug 25, 2011 at 2:23 PM, srean  wrote:

> Hi,
>
>  I would like to know a bit about how the installation process works. Could
> you point me to a resource. In particular I want to know how the site.cfg
> configuration works. Is it numpy/scipy specific or is it standard with
> distutils. I googled for site.cfg and distutils but did not find any
> authoritative document.


There is not much more than what's described in the site.cfg.example file
that's in the numpy source tree root dir. As far as I know the site.cfg name
is numpy specific, but python distutils uses a distutils.cfg file in the
same format.
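For orientation, a site.cfg usually looks something like the following (the 
paths and library names here are purely illustrative -- site.cfg.example 
lists the section names and keys that numpy.distutils actually understands):

    [DEFAULT]
    library_dirs = /usr/local/lib
    include_dirs = /usr/local/include

    [atlas]
    atlas_libs = lapack, f77blas, cblas, atlas

    [mkl]
    library_dirs = /opt/intel/mkl/lib/intel64
    include_dirs = /opt/intel/mkl/include
    mkl_libs = mkl_rt
    lapack_libs =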

>
> I believe many new users trip up on the installation process, especially in
> trying to substitute their favourite library in place of the standard one. So a
> canonical document explaining the process will be very helpful.
>
> http://docs.scipy.org/doc/numpy/user/install.html
>

The most up-to-date descriptions for each OS can be found at
http://www.scipy.org/Installing_SciPy

>
> does cover some of the important points but it's a bit sketchy, and has a
> "this is all that you need to know" flavor. It doesn't quite enable the reader
> to fix his own problems. So a resource that is somewhere in between reading
> all the sources that get invoked during installation and building, and the
> current install document, would be very welcome.
>
> English is not my native language, but if there is any way I can help, I
> would do so gladly.
>

If the above docs don't help as much as you'd want, please point out the
most problematic points. The install instructions are a wiki, so you can make
changes yourself. Especially for things like linking to specific versions
of MKL there is not enough (or only outdated) info; any contributions there
will be very useful.

Cheers,
Ralf


> -- srean
>