On Tuesday 25 March 2008, Hao Zheng wrote:
> 1. Sect. 4.1 Algorithm Time Complexity Analysis.
> The paper assumes m >> n, i.e. that there are far more training
> instances than features, and its datasets do indeed have very few
> features. But this may not be true for many tasks, e.g. text
> classification, where feature dimensions reach 10^4-10^5. Will the
> analysis still hold then?

What I could directly read from the very same section of the paper: the 
analysis will not hold in this case for those algorithms that require matrix 
inversions or eigendecompositions, as long as these operations are not 
executed in parallel. The authors did not implement parallel versions of 
these operations; the reason they state is that in their datasets m >> n.

The authors themselves note that there is extensive research on parallelising 
eigendecomposition and matrix inversion as well. So if we assume we do have a 
matrix package that can do these operations in a distributed way, IMHO the 
analysis in the paper should still hold even for algorithms that require 
these steps.
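
To make the m >> n point a bit more concrete, here is a minimal sketch 
(Python/numpy; the function names and the chunking are mine, not from the 
paper) of the summation form the paper uses, for plain linear regression: 
each map task emits only an n x n matrix and an n-vector computed from its 
chunk of the m instances, and only the final n x n solve stays serial.

import numpy as np

def map_task(X_chunk, y_chunk):
    # Each mapper sees only its chunk of the m training instances and
    # emits small sufficient statistics: an n x n matrix and an n-vector.
    # Cost: O(m_chunk * n^2), independent of all other chunks.
    return X_chunk.T @ X_chunk, X_chunk.T @ y_chunk

def reduce_task(partials):
    # The reducer just sums the per-chunk statistics.
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return A, b

# Toy data: m = 10000 instances, n = 5 features, split over 4 "cores".
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=10_000)

partials = [map_task(Xc, yc)
            for Xc, yc in zip(np.array_split(X, 4), np.array_split(y, 4))]
A, b = reduce_task(partials)

# Only this n x n solve (O(n^3)) is left serial - cheap while n is small,
# but dominant once n reaches 10^4-10^5 as in text classification.
theta = np.linalg.solve(A, b)

So the m-dependent work parallelises nicely, and the question above is 
exactly about the serial O(n^3) tail that remains when n gets large.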


> 2. Sect. 4.1, too.
> "reduce phase can minimize communication by combining data as it's
> passed back; this accounts for the log P factor" - could you help me
> figure out how log P is calculated?

My guess (corrections very welcome): if the partial results from the map 
tasks are combined pairwise on their way back, instead of being sent one by 
one to a single reducer, the combining forms a binary tree of depth log P 
over P cores - that tree depth would be the log P factor. Anyone else who can 
help out here?
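
A toy illustration of that reading (names made up, not the paper's actual 
implementation): pairwise combining halves the number of live partial 
results each round, so P partials are merged in ceil(log2 P) rounds.

import math

def tree_reduce(partials, combine):
    # Combine P partial results pairwise; each round halves the number
    # of live partials, so there are ceil(log2(P)) communication rounds
    # instead of P - 1 sequential sends to one reducer.
    rounds = 0
    while len(partials) > 1:
        paired = [combine(partials[i], partials[i + 1])
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:          # odd one out waits for the next round
            paired.append(partials[-1])
        partials = paired
        rounds += 1
    return partials[0], rounds

P = 16
total, rounds = tree_reduce(list(range(P)), lambda a, b: a + b)
assert total == sum(range(P))
assert rounds == math.ceil(math.log2(P))   # 4 rounds for 16 cores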


> 3. Sect 5.4 Results and Discussion.
> "SVM gets about 13.6% speed up on average over 16 cores" - is that
> 13.6% or a factor of 13.6? From Figure 2, it seems it should be 13.6?

The axes of the graphs do not have clear titles, but I would agree that it 
should be a speedup factor of 13.6 rather than 13.6%.


> 4. Sect 5.4, too.
> "Finally, the above are runs on multiprocessor machines." Whether
> multiprocessor or multicore, it still runs on a single machine with
> shared memory.

The main motivation for the paper was the rise of multi-core machines, which 
call for parallel algorithms even if one does not have a cluster available.


> But actually, M/R is meant for multiple machines, which involves much
> higher inter-machine communication cost. So the results of the paper may
> be questionable?

I think you should not expect to get the exact same speedups on multi-machine 
clusters. Still, I think one can expect faster computation for large datasets 
even in that setting: in the summation form each machine scans its local m/P 
share of the data, while only the small O(n^2) summary statistics cross the 
network, so for m >> n computation should keep dominating communication. What 
do others think?
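
A toy back-of-envelope model of that intuition (all names and constants 
invented by me, nothing here is measured): per iteration each machine does 
~ m*n^2/P work and ships an n x n summary up a log2(P)-deep tree.

import math

def toy_speedup(m, n, P, comm_cost_per_float=0.0):
    # Very crude model: serial cost ~ m * n^2 flops; parallel cost is the
    # per-machine share plus shipping an n x n summary up a log2(P) tree.
    # comm_cost_per_float is an invented knob for how expensive the network
    # is relative to one flop (0 approximates shared memory).
    serial = m * n ** 2
    parallel = serial / P + comm_cost_per_float * n ** 2 * math.log2(P)
    return serial / parallel

# Shared-memory-ish: communication is free, speedup approaches P = 16.
print(toy_speedup(m=1_000_000, n=10, P=16, comm_cost_per_float=0.0))
# Cluster-ish: expensive network shaves the speedup (~9.8 here),
# but it stays well above 1 because m >> n.
print(toy_speedup(m=1_000_000, n=10, P=16, comm_cost_per_float=1e4))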



Isabel


-- 
There is no TRUTH.  There is no REALITY.  There is no CONSISTENCY. There are 
no ABSOLUTE STATEMENTS.   I'm very probably wrong.
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xmpp://[EMAIL PROTECTED]>
