Isabel, thanks for your answer. On the 4th question: we can probably still gain some speedup on multi-machine clusters, but I suspect we should also model the communication cost explicitly, since it is non-trivial in that setting. What do you think? (I have also added quick sketches on questions 1 and 2 below the quoted mail.)
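To make the concern concrete, here is a toy back-of-envelope model I put together - my own sketch, not from the paper, and all constants are made-up placeholders. One iteration scans m examples with n features; the reduce then ships P partial results of n doubles over the network:

    # Toy speedup model for one map-reduce iteration on P machines.
    # flops and bytes_per_sec are made-up placeholders, not measurements.
    def speedup(m, n, p, flops=1e9, bytes_per_sec=1e8):
        compute = m * n / flops            # one single-machine pass over the data
        comm = p * n * 8 / bytes_per_sec   # gathering P partial results of n doubles
        return compute / (compute / p + comm)

    for m in (10_000, 1_000_000):
        print(m, [round(speedup(m, n=100, p=p), 1) for p in (2, 4, 8, 16)])

With these placeholder numbers the small dataset tops out around 5x while the large one stays close to linear - which would support the point that the win depends on dataset size, but also shows the communication term cannot be ignored.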
On 3/26/08, Isabel Drost <[EMAIL PROTECTED]> wrote:
> On Tuesday 25 March 2008, Hao Zheng wrote:
> > 1. Sect. 4.1 Algorithm Time Complexity Analysis.
> > The paper assumes m >> n, i.e. there are many more training instances
> > than features. Its datasets do have very few features, but this may not
> > be true for many tasks, e.g. text classification, where feature
> > dimensions reach 10^4-10^5. Will the analysis still hold then?
>
> What I could directly read from the paper in the very same section: the
> analysis will not hold in this case for algorithms that require matrix
> inversions or eigen decompositions, as long as these operations are not
> executed in parallel. The authors did not implement parallel versions of
> these operations - the reason they state is that in their datasets m >> n.
>
> The authors state themselves that there is extensive research on
> parallelising eigen decomposition and matrix inversion as well - so if we
> assume we have a matrix package that can do these operations in a
> distributed way, IMHO the analysis in the paper should still hold even
> for algorithms that require these steps.
>
> > 2. Sect. 4.1, too.
> > "reduce phase can minimize communication by combining data as it's
> > passed back; this accounts for the logP factor" - could you help me
> > figure out how logP is calculated?
>
> Anyone else who can help out here?
>
> > 3. Sect. 5.4 Results and Discussion.
> > "SVM gets about 13.6% speed up on average over 16 cores" - should that
> > be 13.6% or 13.6x? From Figure 2, it looks like it should be 13.6x.
>
> The axes on the graphs do not have clear titles, but I would agree that
> it should be 13.6x as well.
>
> > 4. Sect. 5.4, too.
> > "Finally, the above are runs on multiprocessor machines." Whether
> > multiprocessor or multicore, it still runs on a single machine with
> > shared memory.
>
> The main motivation for the paper was the rise of multi-core machines,
> which call for parallel algorithms even when one does not have a cluster
> available.
>
> > But M/R is actually meant for multi-machine setups, which involve much
> > higher inter-machine communication cost. So the results of the paper
> > may be questionable?
>
> I think you should not expect to get the exact same speedups on
> multi-machine clusters. Still, I think one can expect faster computation
> for large datasets even in this setting. What do others think?
>
> Isabel
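On question 1, here is how I convinced myself where the analysis bends. Taking a normal-equations-style linear regression as an example (my illustration, not the paper's table): forming the Gram matrix costs O(m n^2) and parallelizes over examples, while the O(n^3) inversion stays serial unless a parallel solver is used. With hypothetical text-classification dimensions the serial term dominates:

    # Compare the parallelizable m*n^2/P term with the serial n^3 term.
    # Dimensions below are hypothetical, chosen only to show the two regimes.
    def terms(m, n, p):
        return m * n**2 / p, n**3   # (parallel part, serial part)

    print(terms(m=100_000, n=100, p=16))     # m >> n: the parallel part dominates
    print(terms(m=10_000, n=100_000, p=16))  # n >> m: the serial inversion dominates

So I agree: with a distributed matrix package the n^3 term would parallelize too, and the analysis should carry over.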
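On question 2, my current guess - not confirmed by the authors - is that the logP factor comes from merging the P partial results pairwise in a binary tree, so the reduce finishes in ceil(log2(P)) rounds instead of P - 1 sequential merges. A minimal sketch:

    import math

    # Pairwise tree combine: each round merges adjacent pairs in parallel,
    # so P partial results are reduced in ceil(log2(P)) rounds.
    def tree_reduce(parts):
        rounds = 0
        while len(parts) > 1:
            parts = [parts[i] + parts[i + 1] if i + 1 < len(parts) else parts[i]
                     for i in range(0, len(parts), 2)]
            rounds += 1
        return parts[0], rounds

    total, rounds = tree_reduce(list(range(16)))
    print(total, rounds, math.ceil(math.log2(16)))  # prints: 120 4 4

If each round runs in parallel across machines, the combine time grows with log P rather than P.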