The payload for the K/V pair includes a counter of how many raw items the combiner merged. This is how wordcount works: the combiners send the word as the key and the count as the payload.
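Concretely, the trick looks roughly like this (a rough, untested sketch; it assumes whitespace-separated rows of doubles, carries the count in an extra trailing component of the emitted vector instead of a custom Writable, and all class names are made up for illustration):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Wire SumCombiner in as the combiner and MeanReducer as the single reducer
// (numReduceTasks = 1); the constant key sends every partial sum to that one reducer.
public class MeanVectorSketch {

  private static final IntWritable CONSTANT_KEY = new IntWritable(0);

  // Parses one whitespace-separated row of doubles into a vector of length d,
  // then emits a vector of length d + 1 whose last slot carries the raw-item
  // count (1 per input row).
  public static class MeanMapper
      extends Mapper<LongWritable, Text, IntWritable, VectorWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().trim().split("\\s+");
      Vector augmented = new DenseVector(fields.length + 1);
      for (int i = 0; i < fields.length; i++) {
        augmented.set(i, Double.parseDouble(fields[i]));
      }
      augmented.set(fields.length, 1.0);   // the "how many raw items" counter
      ctx.write(CONSTANT_KEY, new VectorWritable(augmented));
    }
  }

  // Combiner: sum the partial vectors; the counts in the last slot add up too,
  // so the reducer still knows how many raw rows went into each partial sum.
  public static class SumCombiner
      extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<VectorWritable> values, Context ctx)
        throws IOException, InterruptedException {
      Vector sum = null;
      for (VectorWritable vw : values) {
        sum = (sum == null) ? new DenseVector(vw.get().size()).plus(vw.get())
                            : sum.plus(vw.get());
      }
      ctx.write(key, new VectorWritable(sum));
    }
  }

  // Single reducer: divide the summed vector by the accumulated count and drop
  // the counter slot, leaving the empirical mean.
  public static class MeanReducer
      extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<VectorWritable> values, Context ctx)
        throws IOException, InterruptedException {
      Vector sum = null;
      for (VectorWritable vw : values) {
        sum = (sum == null) ? new DenseVector(vw.get().size()).plus(vw.get())
                            : sum.plus(vw.get());
      }
      int d = sum.size() - 1;
      double count = sum.get(d);
      Vector mean = sum.viewPart(0, d).divide(count);
      ctx.write(key, new VectorWritable(mean));
    }
  }
}

The same summing code serves as combiner and reducer; the division happens exactly once, in the single reducer, so partial merging in the combiner never changes the result.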
Lance

On Thu, Apr 28, 2011 at 8:26 PM, Vckay <darkvc...@gmail.com> wrote:
> On Wed, Apr 27, 2011 at 8:41 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> On Wed, Apr 27, 2011 at 5:28 PM, Vckay <darkvc...@gmail.com> wrote:
>>
>> > Assuming the data is available as a text file with rows representing
>> > measurements,
>>
>> An org.apache.mahout.math.hadoop.DistributedRowMatrix is the traditional
>> approach to this.
>
> All right, I will do that.
>
>> > 1. Have a dataCenteringDriver that calls an empiricalMeanGenerator
>> > driver. This would compute the empirical mean. I have done this. I
>> > unfortunately have not found an elegant way of using a single MapReduce
>> > + combiner job to finally generate one single output vector.
>>
>> Use a combiner and a single reducer.
>>
>> > Currently, I am generating key/value pairs where the key refers to the
>> > index of the vector and the value refers to the mean for that dimension,
>> > and am saving them.
>>
>> I think you should use a constant key and a value that is the original
>> vector. Then define combiner and reducer to sum the values and output a
>> key and vector. All of the combiner outputs will go to the same reducer,
>> which will output a single vector.
>
> I am not completely sure I understand what you are saying here. (I am
> probably missing something really obvious.) Let me go over this once more:
> the mapper transmits every vector with a constant key. The combiner takes
> in all the vectors, adds them up, and sends the reducer the aggregated
> output as <constant key, vector>. However, the problem that I can see is
> that the reducer won't know the number of values that were aggregated by
> the combiner, which it needs in order to compute the mean. I am talking
> about the issue mentioned here:
> http://philippeadjiman.com/blog/2010/01/14/hadoop-tutorial-series-issue-4-to-use-or-not-to-use-a-combiner/
>
>> > 2. Assuming I get a vector out of the empiricalMeanGenerator phase, I am
>> > planning on using the VectorCache as a way of passing this vector on to
>> > the job that takes the input matrix (now a DistributedRowMatrix) to
>> > center the data.
>>
>> Sounds about right.
>>
>> But when did the input get converted? I think it should be converted
>> before starting.
>
> Yep. Sorry. I messed up the workflow.
>
>> > 3. Now that I have the centered data, computing the covariance matrix
>> > shouldn't be too hard if I have represented my matrix as a distributed
>> > row matrix. I can then use "times" to produce the covariance matrix.
>>
>> Actually, this is liable to be a disaster because the covariance matrix
>> will be dense after you subtract the mean.
>>
>> If your data is dense to begin with, this is no big deal. If your data is
>> sparse, then this is a huge deal.
>>
>> A more subtle point is that the computational cost of PCA on a dense
>> matrix is something like k n^3, where k is the number of eigenvalues you
>> want and n is the size of the matrix. If you can compute PCA for an input
>> of size N on a single machine in time T, then you are going to have a
>> horrendous scaling problem. The issue is that by moving to network
>> transfer instead of memory transfer, you are losing several orders of
>> magnitude in speed. That means that even matching the speed of the
>> sequential algorithm will take quite a few machines.
>>
>> If you increase the size of the problem by, say, a factor of 10 to make
>> up for this, you now have a problem that will take a thousand times
>> longer than before, and you are still going to be in the hole.
>>
>> This sort of thing is why almost all really, really large-scale data
>> mining exercises work with sparse matrices. That drops most algorithms to
>> something like O(n), so that adding more horsepower does some real damage
>> to the size of n and so that increasing n is relatively survivable. In
>> fact, O(n) algorithms are almost a defining property of scalable data
>> mining.
>>
>> Back to your algorithm, I have two questions:
>>
>> a) Can you do the SVD of the original matrix rather than the eigenvalue
>> computation of the covariance? I think that this is likely to be
>> numerically better.
>>
>> b) Is there some perturbation trick you can do to avoid the mean-shift
>> problem? I know that you can deal with (A - \lambda I), but you have
>> (A - e m') where e is the vector of all ones.
>
> This is an interesting issue: I briefly thought about it when I saw a
> thread where you had mentioned this issue previously. However, the good
> thing is that I probably don't have to worry about this in a first-stage
> implementation, because I plan to test the algorithm on image data, which
> is pretty dense. I believe, however, that there are ways of doing
> distributed PCA by allowing each node to work on a chunk of the matrix
> separately. For example,
> http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4016296&tag=1 . I will
> probably look at that once I am done with the first step of wrapping my
> head around the Mahout library and how a simple PCA implementation can be
> done using it.
>
>> > 4. Last lap: use the DistributedLanczosSolver to produce the
>> > eigenvectors.
>> >
>> > What do you guys think?
>>
>> The last step sounds good!

-- 
Lance Norskog
goks...@gmail.com
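For reference on Ted's question (b), here is a rough sketch of the implicit-centering idea, written against the in-memory org.apache.mahout.math classes rather than the distributed ones (the class and method names below are made up for illustration, not an existing Mahout API): the centered matrix A - e m' is never materialized; only its products against vectors are corrected, so a sparse A stays sparse.

import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

// A is n x d with one measurement per row, mean is the d-dimensional empirical
// mean m, and e is the all-ones n-vector (never built explicitly).
public class ImplicitCenteringSketch {

  // (A - e m') x = A x - e (m . x): compute A x, then subtract the scalar m . x
  // from every component.
  static Vector centeredTimes(Matrix a, Vector mean, Vector x) {
    double shift = mean.dot(x);
    return a.times(x).plus(-shift);
  }

  // (A - e m')' y = A' y - m * sum(y): compute A' y, then subtract sum(y) * m.
  static Vector centeredTransposeTimes(Matrix a, Vector mean, Vector y) {
    double total = y.zSum();
    return a.transpose().times(y).minus(mean.times(total));
  }

  // Composing the two gives the centered Gram/covariance operator that a
  // Lanczos-style solver applies repeatedly: v -> (A - e m')' (A - e m') v.
  static Vector covarianceTimes(Matrix a, Vector mean, Vector v) {
    return centeredTransposeTimes(a, mean, centeredTimes(a, mean, v));
  }
}

The same two corrections can be bolted onto whatever distributed matrix-vector product is already available, which is all a Lanczos or SVD pass actually needs from the centered matrix.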