Thank you all for your suggestions. I'm taking the time to look into all of them properly.
Giorgio: A cursory glance looks promising there, thanks.

Zachary: Brilliant, that looks great.

val: What you are saying is correct. At the moment I'm taking 2 lines of
context around each instance of the term in question; I might change that
to 1, which will hopefully reduce the number of dimensions.

Dave

On 5/15/07, val <[EMAIL PROTECTED]> wrote:
> Dave,
>    I may be totally wrong, but I have an intuitive feeling that
> your problem may be reformulated with a focus on separating a "basic"
> physical (vs. mathematical) 'core' from the terms which depend on a
> reasonable "small parameter". In other words, my point is
> to build a simplified model problem from the physically important
> components, one that would generate a *core* problem of 100-500-component
> vectors (that is, to build the "physical" principal components
> first, 'manually', based on common/physical sense, guesses, and heuristics).
>    To me, vectors and matrices of 5-10K components or more are
> a sign of a suboptimal problem formulation and of a need to reformulate
> it in "more physical" terms - *if* that means something significant in
> your concrete situation. For instance, noise reduction can be done at the
> problem-formulation level first, by introducing "regularizing" terms such
> as artificial friction, viscosity, or coupling, which tend to create
> a dramatic regularization effect, stabilizing the problem and thus
> making it easy to interpret in physical (vs. purely mathematical) terms.
> Monte Carlo techniques are also an option for this type of
> problem. Of course, more physical details would be helpful in
> better understanding your problem.
>    good luck,
>    val
>
> ----- Original Message -----
> From: "Dave P. Novakovic" <[EMAIL PROTECTED]>
> To: "Discussion of Numerical Python" <numpy-discussion@scipy.org>
> Sent: Sunday, May 13, 2007 2:46 AM
> Subject: Re: [Numpy-discussion] very large matrices.
>
> > They are very large numbers indeed. Thanks for giving me a wake-up call.
> > Currently my data is represented as vectors in a vectorset, a typical
> > sparse representation.
> >
> > I reduced the problem significantly by removing lots of noise. I'm
> > basically recording traces of a term's occurrence throughout a corpus
> > and doing an analysis of the eigenvectors.
> >
> > I reduced my matrix to 4863 x 4863 by filtering the original corpus.
> > Now when I attempt svd, I'm hitting a memory error in the svd routine.
> > Is there a hard upper limit on the size of a matrix for these
> > calculations?
> >
> >   File "/usr/lib/python2.4/site-packages/numpy/linalg/linalg.py", line 575, in svd
> >     vt = zeros((n, nvt), t)
> > MemoryError
> >
> > Cheers
> >
> > Dave
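For what it's worth, one way that might sidestep a MemoryError like the one
above is to keep the term matrix sparse and ask for only the first few
singular triplets instead of the full decomposition. A rough sketch - it
assumes a SciPy recent enough to ship the ARPACK-based
scipy.sparse.linalg.svds wrapper, and the random matrix is only a stand-in
for the real co-occurrence data:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Stand-in for the real 4863 x 4863 term-context matrix; actual data
# would be loaded into a CSR matrix instead of generated randomly.
n = 4863
A = sp.random(n, n, density=0.01, format="csr", random_state=0)

# Ask ARPACK for the k largest singular triplets only.  u is (n, k) and
# vt is (k, n), so memory stays O(n*k) rather than O(n*n).
k = 50
u, s, vt = svds(A, k=k)

# Sort into descending order of singular value for convenience.
order = np.argsort(s)[::-1]
s, u, vt = s[order], u[:, order], vt[order, :]

print(s[:10])  # the ten largest singular values

Whether a k of a few dozen is enough depends on how many of the leading
vectors are actually needed, but it keeps everything well inside 4 GB of RAM.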
> > On 5/13/07, Anne Archibald <[EMAIL PROTECTED]> wrote:
> >> On 12/05/07, Dave P. Novakovic <[EMAIL PROTECTED]> wrote:
> >>
> >> > core 2 duo with 4gb RAM.
> >> >
> >> > I've heard about iterative svd functions. I actually need a complete
> >> > svd, with all eigenvalues (not LSI). I'm actually more interested in
> >> > the individual eigenvectors.
> >> >
> >> > As an example, a single row could probably have about 3000 non-zero
> >> > elements.
> >>
> >> I think you need to think hard about whether your problem can be done
> >> in another way.
> >>
> >> First of all, the singular values (as returned from the svd) are not
> >> eigenvalues - eigenvalue decomposition is a much harder problem,
> >> numerically.
> >>
> >> Second, your full non-sparse matrix will be 8*75000*75000 bytes, or
> >> about 42 gibibytes. Put another way, the representation of your data
> >> alone is ten times the size of the RAM on the machine you're using.
> >>
> >> Third, your matrix has 225 000 000 nonzero entries; assuming a perfect
> >> sparse representation with no extra bytes (at least two bytes per
> >> entry is typical, usually more), that's 1.7 GiB.
> >>
> >> Recall that basically any dense matrix decomposition is at least
> >> O(N^3), so you can expect on the order of 10^14 floating-point
> >> operations to be required. This is actually the *least* significant
> >> constraint; pushing stuff into and out of disk caches will be taking
> >> most of your time.
> >>
> >> Even if you can represent your matrix sparsely (using only a couple of
> >> gibibytes), you've said you want the full set of eigenvectors, which
> >> is not likely to be a sparse matrix - so your result is back up to 42
> >> GiB. And you should expect an eigenvalue algorithm, if it even
> >> survives massive roundoff problems, to require something like that
> >> much working space; thus your problem probably has a working size of
> >> something like 84 GiB.
> >>
> >> SVD is a little easier, if that's what you want, but the full solution
> >> is twice as large, though if you discard entries corresponding to
> >> small singular values it might be quite reasonable. You'll still need
> >> some fairly specialized code, though. Which form are you looking for?
> >>
> >> Solving your problem in a reasonable amount of time, as described and
> >> on the hardware you specify, is going to require some very specialized
> >> algorithms; you could try looking for an out-of-core eigenvalue
> >> package, but I'd first look to see if there's any way you can simplify
> >> your problem - getting just one eigenvector, maybe.
> >>
> >> Anne
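On Anne's last point: if a single dominant eigenvector (or singular vector)
would do, it can be computed with nothing but repeated sparse matrix-vector
products, so the 75000 x 75000 array never has to exist in dense form. A
minimal power-iteration sketch on A^T A; the function name, iteration count,
and tolerance are illustrative, not anything from the thread:

import numpy as np
import scipy.sparse as sp

def dominant_singular_pair(A, n_iter=500, tol=1e-8):
    # Power iteration on A^T A: converges to the leading right singular
    # vector of A, with ||A^T A v|| tending to sigma_max**2.  Only
    # matrix-vector products are used, so A can stay sparse and memory
    # is O(nnz(A) + n) rather than O(n**2).
    n = A.shape[1]
    v = np.random.rand(n)
    v /= np.linalg.norm(v)
    sigma = 0.0
    for _ in range(n_iter):
        w = A.T.dot(A.dot(v))
        norm_w = np.linalg.norm(w)
        sigma_new = np.sqrt(norm_w)
        v = w / norm_w
        if abs(sigma_new - sigma) < tol * sigma_new:
            return sigma_new, v
        sigma = sigma_new
    return sigma, v

# Toy usage on a random sparse matrix standing in for the real data:
A = sp.random(2000, 2000, density=0.01, format="csr", random_state=1)
sigma_max, v_max = dominant_singular_pair(A)
print(sigma_max)

The same idea (a Lanczos/Arnoldi iteration rather than plain power
iteration) is what ARPACK does under the hood, so a sparse or out-of-core
eigensolver is the natural next step if more than one vector is needed.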