Re: [R] operations on sparse matrices, and dense intermediary steps

2009-10-27 Thread Martin Maechler
Hi Jose,

> "JQ" == Jose Quesada 
> on Sat, 24 Oct 2009 23:49:08 +0200 writes:

JQ> -BEGIN PGP SIGNED MESSAGE-
JQ> Hash: SHA1

JQ> Hi,

JQ> I'm doing some basic operations on large sparse matrices, for example
JQ> getting a row.
JQ> it takes  close to 30 seconds on a 3Ghz machine, and shots the memory
JQ> usage up to the sky.

Hmm, I can hardly beleave the memory shooting, are you sure about
that ?? .. see below

JQ> I suspect there are dense intermediary steps (which, if true would
JQ> defeat the purpose of trying to use sparse representaitons).

JQ> As much as I try understanding the hierarchy of Matrix classes, it's a
JQ> mystery to me.
JQ> Is subsetting sparse matrices memory-intensive? Does it have to do with
JQ> features of the language, such as pass-by-value bu default?
JQ> Or am I doing something inneficent without knowing?

JQ> (Note: example below would only work with 64bit R and lots of memory;
JQ> reduce size of matrix 2-3 orders of magnitude for 32-bit R)

[... rSpMatrix()  from  help(spMatrix) .. ]

JQ> ## (original ir a term x doc matrix of the entire wikipedia)
JQ> mm <- rSpMatrix(793251, 1027355, nnz = 205746204)

JQ> # make it column based:
JQ> mm <- as(mm, "CsparseMatrix")

{ Note that the above Tsparse -> Csparse is not really beneficial,
  if your main issue is things like  subsetting and sub-assignment:
  these are implemented to go via TsparseMatrix operations.
  However, this is really a side-issue }


JQ> a=mm[1,,drop=F]#this takes close to 30 seconds on a 3Ghz machine

Well, you have nnz := 205 million non-zero entries  in your matrix.
This is quite substantial, also with lots of RAM.

The subsetting indeed uses several dense vectors of size 'nnz'
and it's just these operations (vector operations only) that use
the time (and memory) you see.
Most of these vectors are logical or integers --> ~ 800 MB each,
but of course the x slot of M is 8 * 205 = 1.6 GB alone.
So even thought you have a sparse matrix, it *is* large in
itself, and dealing with it and its components is not inexpensive.

Note that in R  m[i,,]  needs to work for very general i,
e.g., for  i <- c(1,1,1,2:7, 1:10100)
or i a logical vector of length nrow(m), or even a character
vector of rownames ... all these work for 'Matrix' as for
traditional R matrices, so, the function that does the
subsetting needs to be written generally enough.

But of course, we are happy for improvement patches to the Matrix
package, and this part of the code could probably be sped up
by a factor of 10 or 20, e.g., by moving that part of the code
to C; it's currently the code starting at around 
line 194 in  Matrix/R/Tsparse.R, 

setMethod("[", signature(x = "TsparseMatrix", i = "index", j = "missing",
 drop = "logical"),
  function (x, i, j, ..., drop) { ## select rows
...
...
...
  })

which you seem to offer to improve ;-)

{{ A less nice alternative to moving it to C,
   might be to special case  M[i,] , M[i,j]  etc 
   for the special case where  i (and j) is just of length 1. }}

Thank you for the reproducible code, 
and thanks in advance for your improvement patches 
or other propositions...


Best regards,
Martin Maechler, ETH Zurich

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] operations on sparse matrices, and dense intermediary steps

2009-10-24 Thread Jose Quesada
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi,

I'm doing some basic operations on large sparse matrices, for example
getting a row.
it takes  close to 30 seconds on a 3Ghz machine, and shots the memory
usage up to the sky.
I suspect there are dense intermediary steps (which, if true would
defeat the purpose of trying to use sparse representaitons).

As much as I try understanding the hierarchy of Matrix classes, it's a
mystery to me.
Is subsetting sparse matrices memory-intensive? Does it have to do with
features of the language, such as pass-by-value bu default?
Or am I doing something inneficent without knowing?

(Note: example below would only work with 64bit R and lots of memory;
reduce size of matrix 2-3 orders of magnitude for 32-bit R)
## libraries
library(Matrix)


rSpMatrix <- function(nrow, ncol, nnz, rand.x = function(n)
round(rnorm(nnz), 2))
{
## Purpose: random sparse matrix
## 
--
## Arguments: (nrow,ncol): dimension
##  nnz  :  number of non-zero entries
## rand.x:  random number generator for 'x' slot
## 
--
## Author: Martin Maechler, Date: 14.-16. May 2007
stopifnot((nnz <- as.integer(nnz)) >= 0,
nrow >= 0, ncol >= 0,
nnz <= nrow * ncol)
spMatrix(nrow, ncol,
 i = sample(nrow, nnz, replace 
= TRUE),
 j = sample(ncol, nnz, replace 
= TRUE),
 x = rand.x(nnz))
}


## (original ir a term x doc matrix of the entire wikipedia)
mm <- rSpMatrix(793251, 1027355, nnz = 205746204)
# make it column based:
mm <- as(mm, "CsparseMatrix")
a=mm[1,,drop=F]#this takes close to 30 seconds on a 3Ghz machine

Thanks,
- -Jose
- --
=== I'm learning a new keyboard layout. I type slowly. Please excuse
typos and slower-that-usual responses. Thanks for your patience===
Jose Quesada, PhD.

Max Planck Institute,
Center for Adaptive Behavior and Cognition -ABC-,
Lentzeallee 94, office 224, 14195 Berlin

http://www.josequesada.name/
http://twitter.com/Quesada

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (MingW32)

iQIcBAEBAgAGBQJK43ZUAAoJEGMobUVGH+HKsZEP/1+R/x2Qv9Kc1LYNP01Hvirz
n8ZVz/fGm+uCt/n4F+cXdXntZ3yoUVZcPy7LwMIo5W4x/saS5cMjBg6Z8hsDvXEQ
2KiiPYJ46C6hEOGJwrpYfOZwm7t5aIMMhaP182P19ziKhpp/Gn4joTnXUzigNHzt
vIPBOBgdZdPvC0h9ByvtCmMSX3roQPv1nMIojrPC0+EzfIVVNIcsfQYYacOyhcuM
/RR97aOzHvcdCln7FbrIT2I0SeeVH3scFGN7q7KFi6Sy+KZmQsv7FxfnUAvf5R+l
DAcm0ekeThksSmJE/Td1220ZguaORjyMHwAKfJH+wiXei24+N+Xf22469g2gCEWb
hiNUzhLHXOSY6mKZ80LKMbhUM4JhKs7K1HImwQmVDa/1UU1WwsjrzZ2fHHqnsjQi
Uysrttu1nbTT5Yvn9CT8gedM7A78sIddjpi1PavbRVJl7/eDN5PGgwilQ70DNetJ
PY8QvLGlA4GGtdvTzxFVP2VK0QgfxmRedrEwqxR1AlIZRn9iK8jrZCXTGLeXe5LX
BX7PQXs11ZfuW/kzDBstqobCrxSERRb/HP5BlY+mKmZ0SieoVkpLeJ7PdeHu+r31
8bsZPkeHaO5MR1KrG155JOK1vgmHPpfSiq0lNUT99hncoFKBvweDnE3Etx+tBra7
h9lxbDeHl6xkbCJj/LDb
=ZVWO
-END PGP SIGNATURE-

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.