Thank you all for your suggestions. I'm taking the time to look into
all of them properly.

Giorgio: From a cursory glance that looks promising, thanks.

Zachary: Brilliant, that looks great.

val: What you're saying is correct. At the moment I'm taking 2 lines
of context around each instance of the term in question; I might
change that to 1, which should hopefully reduce the number of dimensions.
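
Roughly, the windowing looks like the sketch below (illustrative only,
with made-up names - not my actual code). Dropping the width from 2 to 1
shrinks each window from 5 lines to 3, so fewer distinct co-occurring
terms (i.e. dimensions) survive.

    def context_windows(lines, term, width=1):
        # Yield the slice of `lines` within `width` lines of each
        # occurrence of `term`.
        for i, line in enumerate(lines):
            if term in line.split():
                lo = max(0, i - width)
                yield lines[lo:i + width + 1]

    def cooccurrence_counts(lines, term, width=1):
        # Count the terms co-occurring with `term` in its context windows;
        # the keys of this dict become the dimensions of the term's vector.
        counts = {}
        for window in context_windows(lines, term, width):
            for token in " ".join(window).split():
                counts[token] = counts.get(token, 0) + 1
        return counts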

Dave

On 5/15/07, val <[EMAIL PROTECTED]> wrote:
> Dave,
>     I may be totally wrong, but I have an intuitive feeling that
> your problem could be reformulated with a focus on separating a "basic"
> physical (vs. mathematical) 'core' from the terms that depend on a
> reasonably "small parameter".  In other words, my point is
> to build a simplified model problem from the physically important
> components, which would generate a *core* problem with 100-500
> component vectors (that is, to build the "physical" principal components
> first 'manually', based on common/physical sense, guesses, heuristics).
>     To me, vectors and matrices of 5-10K components or more are
> a sign of suboptimal problem formulation and of a need to reformulate it
> in "more physical" terms - *if* that means something significant in your
> concrete situation.  For instance, noise reduction can be done at the
> problem-formulation level first, by introducing "regularizing" terms such
> as artificial friction, viscosity, coupling, etc., which tend to have
> a dramatic regularization effect, stabilizing the problem and thus
> making it easy to interpret in physical (vs. purely mathematical) terms.
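>
>     To illustrate with a toy sketch (made-up names, not a recipe for your
> particular problem): a small "regularizing" damping term added to an
> ill-conditioned least-squares problem stabilizes the solution, e.g.
>
>       import numpy as np
>
>       def ridge_solve(A, b, damping=1e-3):
>           # Tikhonov ("ridge") damping: solve (A^T A + damping*I) x = A^T b.
>           # The damping term plays the role of artificial friction/viscosity,
>           # stabilizing an ill-conditioned problem at the formulation level.
>           n = A.shape[1]
>           return np.linalg.solve(A.T.dot(A) + damping * np.eye(n), A.T.dot(b))
>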
>     The Monte-Carlo technique is also an option for this type of
> problem.  Of course, more physical details would help in
> understanding your problem better.
>     good luck,
> val
>
> ----- Original Message -----
> From: "Dave P. Novakovic" <[EMAIL PROTECTED]>
> To: "Discussion of Numerical Python" <numpy-discussion@scipy.org>
> Sent: Sunday, May 13, 2007 2:46 AM
> Subject: Re: [Numpy-discussion] very large matrices.
>
>
> > They are very large numbers indeed. Thanks for giving me a wake-up call.
> > Currently my data is represented as vectors in a vectorset, a typical
> > sparse representation.
> >
> > I reduced the problem significantly by removing lots of noise. I'm
> > basically recording traces of a term's occurrence throughout a corpus
> > and doing an analysis of the eigenvectors.
> >
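> > For concreteness, a toy version of that pipeline (made-up names, not
> > my real code) looks something like:
> >
> >     from scipy import sparse
> >
> >     def term_trace_matrix(contexts, vocab):
> >         # Rows = terms, columns = contexts; most entries are zero, so a
> >         # sparse matrix (LIL for incremental building, CSR for
> >         # arithmetic) stores only the nonzeros.
> >         index = dict((t, i) for i, t in enumerate(vocab))
> >         mat = sparse.lil_matrix((len(vocab), len(contexts)))
> >         for j, context in enumerate(contexts):
> >             for token in context.split():
> >                 if token in index:
> >                     mat[index[token], j] += 1
> >         return mat.tocsr()
> >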
> > I reduced my matrix to 4863 x 4863 by filtering the original corpus.
> > Now when I attempt the svd, I hit a memory error in the svd routine.
> > Is there a hard upper limit on the size of a matrix for these
> > calculations?
> >
> >  File "/usr/lib/python2.4/site-packages/numpy/linalg/linalg.py", line
> > 575, in svd
> >    vt = zeros((n, nvt), t)
> > MemoryError
> >
> > Cheers
> >
> > Dave
> >
> >
> > On 5/13/07, Anne Archibald <[EMAIL PROTECTED]> wrote:
> >> On 12/05/07, Dave P. Novakovic <[EMAIL PROTECTED]> wrote:
> >>
> >> > core 2 duo with 4gb RAM.
> >> >
> >> > I've heard about iterative svd functions. I actually need a complete
> >> > svd, with all eigenvalues (not LSI). I'm actually more interested in
> >> > the individual eigenvectors.
> >> >
> >> > As an example, a single row could probably have about 3000 non zero
> >> > elements.
> >>
> >> I think you need to think hard about whether your problem can be done
> >> in another way.
> >>
> >> First of all, the singular values (as returned from the svd) are not
> >> eigenvalues - eigenvalue decomposition is a much harder problem,
> >> numerically.
> >>
> >> Second, your full non-sparse matrix will be 8*75000*75000 bytes, or
> >> about 42 gibibytes. Put another way, the representation of your data
> >> alone is ten times the size of the RAM on the machine you're using.
> >>
> >> Third, your matrix has 225 000 000 nonzero entries; even assuming a
> >> perfect sparse representation with no overhead beyond the 8-byte values
> >> (in practice at least a couple of extra bytes per entry, usually more),
> >> that's 1.7 GiB.
> >>
> >> Recall that dense factorizations such as the SVD or an
> >> eigendecomposition cost at least O(N^3), so you can expect on the order
> >> of 10^14 floating-point operations. This is actually the *least*
> >> significant constraint; pushing data into and out of disk caches will
> >> take most of your time.
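> >>
> >> In numbers, as a quick sanity check (Python):
> >>
> >>     n = 75000
> >>     dense_bytes  = 8.0 * n * n    # ~4.5e10 bytes, about 42 GiB
> >>     nnz          = 3000 * n       # 2.25e8 nonzero entries
> >>     sparse_bytes = 8.0 * nnz      # ~1.8e9 bytes, about 1.7 GiB (values only)
> >>     flops        = float(n) ** 3  # ~4.2e14 floating-point operations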
> >>
> >> Even if you can represent your matrix sparsely (using only a couple of
> >> gibibytes), you've said you want the full set of eigenvectors, which
> >> is not likely to be a sparse matrix - so your result is back up to 42
> >> GiB. And you should expect an eigenvalue algorithm, if it even
> >> survives massive roundoff problems, to require something like that
> >> much working space; thus your problem probably has a working size of
> >> something like 84 GiB.
> >>
> >> SVD is a little easier, if that's what you want, but the full solution
> >> is twice as large, though if you discard entries corresponding to
> >> small values it might be quite reasonable. You'll still need some
> >> fairly specialized code, though. Which form are you looking for?
> >>
> >> Solving your problem in a reasonable amount of time, as described and
> >> on the hardware you specify, is going to require some very specialized
> >> algorithms; you could try looking for an out-of-core eigenvalue
> >> package, but I'd first look to see if there's any way you can simplify
> >> your problem - getting just one eigenvector, maybe.
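> >>
> >> For example, something along these lines (a rough sketch only: it
> >> assumes your matrix fits in memory as a scipy sparse matrix, and uses
> >> the ARPACK-based svds from scipy.sparse.linalg as a stand-in for
> >> whatever iterative solver you end up with):
> >>
> >>     from scipy import sparse
> >>     from scipy.sparse.linalg import svds
> >>
> >>     # Stand-in data; in practice A would be your sparse term matrix.
> >>     A = sparse.rand(5000, 5000, density=0.001, format='csr')
> >>     # Ask for only the k largest singular triplets: the result is n*k
> >>     # numbers instead of n*n, and nothing dense of size n*n is formed.
> >>     u, s, vt = svds(A, k=10)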
> >>
> >> Anne
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
