The payload for the K/V pair includes a counter of how many raw items the combiner merged. This is how wordcount works: the combiners send the word as the key and the count as the payload.
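Concretely, the trick looks roughly like this (a rough, untested sketch; it assumes whitespace-separated rows of doubles, carries the count in an extra trailing component of the emitted vector instead of a custom Writable, and all class names are made up for illustration):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Wire SumCombiner in as the combiner and MeanReducer as the single reducer
// (numReduceTasks = 1); the constant key sends every partial sum to that one reducer.
public class MeanVectorSketch {

  private static final IntWritable CONSTANT_KEY = new IntWritable(0);

  // Parses one whitespace-separated row of doubles into a vector of length d,
  // then emits a vector of length d + 1 whose last slot carries the raw-item
  // count (1 per input row).
  public static class MeanMapper
      extends Mapper<LongWritable, Text, IntWritable, VectorWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().trim().split("\\s+");
      Vector augmented = new DenseVector(fields.length + 1);
      for (int i = 0; i < fields.length; i++) {
        augmented.set(i, Double.parseDouble(fields[i]));
      }
      augmented.set(fields.length, 1.0);   // the "how many raw items" counter
      ctx.write(CONSTANT_KEY, new VectorWritable(augmented));
    }
  }

  // Combiner: sum the partial vectors; the counts in the last slot add up too,
  // so the reducer still knows how many raw rows went into each partial sum.
  public static class SumCombiner
      extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<VectorWritable> values, Context ctx)
        throws IOException, InterruptedException {
      Vector sum = null;
      for (VectorWritable vw : values) {
        sum = (sum == null) ? new DenseVector(vw.get().size()).plus(vw.get())
                            : sum.plus(vw.get());
      }
      ctx.write(key, new VectorWritable(sum));
    }
  }

  // Single reducer: divide the summed vector by the accumulated count and drop
  // the counter slot, leaving the empirical mean.
  public static class MeanReducer
      extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<VectorWritable> values, Context ctx)
        throws IOException, InterruptedException {
      Vector sum = null;
      for (VectorWritable vw : values) {
        sum = (sum == null) ? new DenseVector(vw.get().size()).plus(vw.get())
                            : sum.plus(vw.get());
      }
      int d = sum.size() - 1;
      double count = sum.get(d);
      Vector mean = sum.viewPart(0, d).divide(count);
      ctx.write(key, new VectorWritable(mean));
    }
  }
}

The same summing code serves as combiner and reducer; the division happens exactly once, in the single reducer, so partial merging in the combiner never changes the result.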
Lance

On Thu, Apr 28, 2011 at 8:26 PM, Vckay <darkvc...@gmail.com> wrote:
> On Wed, Apr 27, 2011 at 8:41 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> On Wed, Apr 27, 2011 at 5:28 PM, Vckay <darkvc...@gmail.com> wrote:
>>
>> > Assuming the data is available as a text file with rows representing
>> > measurements,
>>
>> An org.apache.mahout.math.hadoop.DistributedRowMatrix is the traditional
>> approach to this.
>
> All right, I will do that.
>
>> > 1. Have a dataCenteringDriver that calls an empiricalMeanGenerator
>> > driver. This would compute the empirical mean. I have done this. I
>> > unfortunately have not found an elegant way of using a single MapReduce
>> > + combiner job to finally generate one single output vector.
>>
>> Use a combiner and a single reducer.
>>
>> > Currently, I am generating key/value pairs where the key refers to the
>> > index of the vector and the value refers to the mean for that dimension,
>> > and am saving them.
>>
>> I think you should use a constant key and a value that is the original
>> vector. Then define combiner and reducer to sum the values and output a
>> key and vector. All of the combiner outputs will go to the same reducer,
>> which will output a single vector.
>
> I am not completely sure I understand what you are saying here. (I am
> probably missing something really obvious.) Let me go over this once more:
> the mapper transmits every vector with a constant key. The combiner takes
> in all the vectors, adds them up, and sends the reducer the aggregated
> output as <constant key, vector>. However, the problem that I can see is
> that the reducer won't know the number of values that were aggregated by
> the combiner, which it needs in order to compute the mean. I am talking
> about the issue mentioned here:
> http://philippeadjiman.com/blog/2010/01/14/hadoop-tutorial-series-issue-4-to-use-or-not-to-use-a-combiner/
>
>> > 2. Assuming I get a vector out of the empiricalMeanGenerator phase, I am
>> > planning on using the VectorCache as a way of passing this vector on to
>> > the job that takes the input matrix (now a DistributedRowMatrix) to
>> > center the data.
>>
>> Sounds about right.
>>
>> But when did the input get converted? I think it should be converted
>> before starting.
>
> Yep. Sorry. I messed up the workflow.
>
>> > 3. Now that I have the centered data, computing the covariance matrix
>> > shouldn't be too hard if I have represented my matrix as a distributed
>> > row matrix. I can then use "times" to produce the covariance matrix.
>>
>> Actually, this is liable to be a disaster because the covariance matrix
>> will be dense after you subtract the mean.
>>
>> If your data is dense to begin with, this is no big deal. If your data is
>> sparse, then this is a huge deal.
>>
>> A more subtle point is that the computational cost of PCA on a dense
>> matrix is something like k n^3, where k is the number of eigenvalues you
>> want and n is the size of the matrix. If you can compute PCA for an input
>> of size N on a single machine in time T, then you are going to have a
>> horrendous scaling problem. The issue is that by moving to network
>> transfer instead of memory transfer, you are losing several orders of
>> magnitude in speed. That means that even matching the speed of the
>> sequential algorithm will take quite a few machines.
>>
>> If you increase the size of the problem by, say, a factor of 10 to make
>> up for this, you now have a problem that will take a thousand times
>> longer than before, and you are still going to be in the hole.
>>
>> This sort of thing is why almost all really, really large-scale data
>> mining exercises work with sparse matrices. That drops most algorithms to
>> something like O(n), so that adding more horsepower does some real damage
>> to the size of n and so that increasing n is relatively survivable. In
>> fact, O(n) algorithms are almost a defining property of scalable data
>> mining.
>>
>> Back to your algorithm, I have two questions:
>>
>> a) Can you do the SVD of the original matrix rather than the eigenvalue
>> computation of the covariance? I think that this is likely to be
>> numerically better.
>>
>> b) Is there some perturbation trick you can do to avoid the mean-shift
>> problem? I know that you can deal with (A - \lambda I), but you have
>> (A - e m') where e is the vector of all ones.
>
> This is an interesting issue: I briefly thought about it when I saw a
> thread where you had mentioned this issue previously. However, the good
> thing is that I probably don't have to worry about this in a first-stage
> implementation, because I plan to test the algorithm on image data, which
> is pretty dense. I believe, however, that there are ways of doing
> distributed PCA by allowing each node to work on a chunk of the matrix
> separately. For example,
> http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4016296&tag=1 . I will
> probably look at that once I am done with the first step of wrapping my
> head around the Mahout library and how a simple PCA implementation can be
> done using it.
>
>> > 4. Last lap: use the DistributedLanczosSolver to produce the
>> > eigenvectors.
>> >
>> > What do you guys think?
>>
>> The last step sounds good!

-- 
Lance Norskog
goks...@gmail.com
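For reference on Ted's question (b), here is a rough sketch of the implicit-centering idea, written against the in-memory org.apache.mahout.math classes rather than the distributed ones (the class and method names below are made up for illustration, not an existing Mahout API): the centered matrix A - e m' is never materialized; only its products against vectors are corrected, so a sparse A stays sparse.

import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

// A is n x d with one measurement per row, mean is the d-dimensional empirical
// mean m, and e is the all-ones n-vector (never built explicitly).
public class ImplicitCenteringSketch {

  // (A - e m') x = A x - e (m . x): compute A x, then subtract the scalar m . x
  // from every component.
  static Vector centeredTimes(Matrix a, Vector mean, Vector x) {
    double shift = mean.dot(x);
    return a.times(x).plus(-shift);
  }

  // (A - e m')' y = A' y - m * sum(y): compute A' y, then subtract sum(y) * m.
  static Vector centeredTransposeTimes(Matrix a, Vector mean, Vector y) {
    double total = y.zSum();
    return a.transpose().times(y).minus(mean.times(total));
  }

  // Composing the two gives the centered Gram/covariance operator that a
  // Lanczos-style solver applies repeatedly: v -> (A - e m')' (A - e m') v.
  static Vector covarianceTimes(Matrix a, Vector mean, Vector v) {
    return centeredTransposeTimes(a, mean, centeredTimes(a, mean, v));
  }
}

The same two corrections can be bolted onto whatever distributed matrix-vector product is already available, which is all a Lanczos or SVD pass actually needs from the centered matrix.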