[algogeeks] Re: Effictive storage of large document-term-matrix?

Varun Soundararajan Sun, 21 May 2006 07:33:03 -0700

The Term Document matrix is a perfect example of implementing Sparse matrix memory implementation. I have had examples where I was able to represent a 5000 words cross 50,000 documents (a little more than that), much efficiently using the in-memory representation techniques of sparse memories.

For example, you can consider an implementation like this:
001000
010000
000000
000000

the matrix is 4 x 4 matrix, which i now assume to be sparse and can have values 0 and 1 only.

then one representation could be,
4 4
0 2
1 1

(in the 0th row, the 2nd element is alone non zero)
in the 1st rwo the 1st element is alone non zero
(this is one of the file format(s) for sparse matrices -there are several others).

Let say you want to multiply, all you need to do is take AND of the rows of the first matrix with the columns of the second and sum them up. Since the matrix is sparse (if you had sufficiently worked on IR , you will know that the TD matrix is always sparse), it makes sense to write utility routines to make a sparse matrix's row representation to col representation. There are also existing standards on sparse matrices representation; a bit of googling can help you out.

Regards
Varun

[algogeeks] Re: Effictive storage of large document-term-matrix?

Reply via email to