Re: Regarding PCA implementation

2011-04-29 Thread Ted Dunning
Yep. That is a problem. Emit a constant key and a pair containing an integer and a vector. Add them up separately. Divide at the end. The initial value for the integer from the mapper should be 1.

On Thu, Apr 28, 2011 at 8:26 PM, Vckay wrote:
> However, the problem that I can see is that t
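The scheme in this reply (a constant key, a (count, vector) pair as the value, separate sums, one division at the end) can be sketched in plain Python. This is only an illustration of the idea, not Mahout or Hadoop code: the names `mapper`, `combine`, and `column_means` and the key `"mean"` are hypothetical, and the "job" runs in memory over a list of rows.

```python
from functools import reduce

def mapper(row):
    # Hypothetical mapper: every row goes to the same constant key,
    # with an initial count of 1 alongside the row vector.
    return ("mean", (1, list(row)))

def combine(a, b):
    # Used as both combiner and reducer: add the counts and the
    # vectors separately, so partial results can be merged in any order.
    (n1, v1), (n2, v2) = a, b
    return (n1 + n2, [x + y for x, y in zip(v1, v2)])

def column_means(rows):
    # Simulate the job in memory: map each row, merge all pairs,
    # and divide by the total count only at the very end.
    pairs = [mapper(r)[1] for r in rows]
    count, total = reduce(combine, pairs)
    return [x / count for x in total]

rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
means = column_means(rows)
```

Because the count travels with the partial sum, a combiner that has already merged several rows still yields the correct mean when its output is merged with others. Dividing early, inside the combiner, would weight partial averages incorrectly.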

Re: Regarding PCA implementation

2011-04-28 Thread Lance Norskog
The payload for the K/V pair includes a count of how many raw items the combiner merged. This is how wordcount works: the combiners send the word as the key and the count as the payload.

Lance

On Thu, Apr 28, 2011 at 8:26 PM, Vckay wrote:
> On Wed, Apr 27, 2011 at 8:41 PM, Ted Dunning wrote:
>

Re: Regarding PCA implementation

2011-04-28 Thread Vckay
On Wed, Apr 27, 2011 at 8:41 PM, Ted Dunning wrote:
> On Wed, Apr 27, 2011 at 5:28 PM, Vckay wrote:
>
> > Assuming the data is available as a text file with rows representing
> > measurements,
>
> A org.apache.mahout.math.hadoop.DistributedRowMatrix is the traditional
> approach to this.

Re: Regarding PCA implementation

2011-04-27 Thread Ted Dunning
No. This is much better than crazy. It is exactly what LinearOperators are good for.

On Wed, Apr 27, 2011 at 8:21 PM, Jake Mannix wrote:
> Thinking on it a little bit further, this is not so bad: Let's say we had a
> finished patch to the idea discussed in MAHOUT-672 - virtual distributed ma

Re: Regarding PCA implementation

2011-04-27 Thread Jonathan Traupman
On Wed, Apr 27, 2011 at 8:21 PM, Jake Mannix wrote:
> > I would love to know the answer to this question.
>
> Thinking on it a little bit further, this is not so bad: Let's say we had a
> finished patch to the idea discussed in MAHOUT-672 - virtual distributed
> matrices, where in this case,

Re: Regarding PCA implementation

2011-04-27 Thread Jake Mannix
On Wed, Apr 27, 2011 at 6:41 PM, Ted Dunning wrote:
>
> > 3. Now that I have the centered data, computing the covariance matrix
> > shouldn't be too hard if I have represented my matrix as a distributed
> > row matrix. I can then use "times" to produce the covariance matrix.
>
> Actually, t

Re: Regarding PCA implementation

2011-04-27 Thread Ted Dunning
On Wed, Apr 27, 2011 at 5:28 PM, Vckay wrote:
> Assuming the data is available as a text file with rows representing
> measurements,

An org.apache.mahout.math.hadoop.DistributedRowMatrix is the traditional approach to this.

> 1. Have a dataCenteringDriver that calls an empiricalMeanGenerator d

Regarding PCA implementation

2011-04-27 Thread Vckay
Hello all, I am trying to implement PCA using some of the libraries from Mahout. I am following the TODO list posted here: https://issues.apache.org/jira/browse/MAHOUT-512 . I understand the idea behind PCA conceptually, but I am rather new to both Hadoop and Mahout. Here is what I think the