On Tuesday 25 March 2008, Hao Zheng wrote:
> 1. Sect. 4.1 Algorithm Time Complexity Analysis.
> The paper assumes m >> n, that is, that there are far more training
> instances than features. Its datasets do have very few features, but
> this may not be true for many tasks, e.g. text classification, where
> feature dimensions reach 10^4-10^5. Will the analysis still hold
> then?
What I could read directly from the paper in that very section: the
analysis will not hold in this case for those algorithms that require
matrix inversions or eigen decompositions, as long as these operations
are not executed in parallel. The authors did not implement parallel
versions of these operations - the reason they give is precisely that
in their datasets m >> n. They also state that there is extensive
research on parallelising eigen decomposition and matrix inversion, so
if we assume a matrix package that can do these operations in a
distributed way, IMHO the analysis in the paper should still hold even
for the algorithms that require these steps. The first sketch in the
postscript below illustrates why the serial inversion is harmless when
m >> n but becomes the bottleneck for high-dimensional data.

> 2. Sect. 4.1, too.
> "reduce phase can minimize communication by combining data as it's
> passed back; this accounts for the log P factor". Could you help me
> figure out how log P is calculated?

Anyone else who can help out here? My own reading is that the partial
results of the P cores are combined pairwise, tree-fashion, as they
are passed back, so only log P sequential combining rounds are needed
rather than P - 1; the second sketch in the postscript illustrates
this, but I may well be wrong.

> 3. Sect. 5.4 Results and Discussion
> "SVM gets about 13.6% speed up on average over 16 cores" - is it
> 13.6% or 13.6x? From figure 2, it seems it should be 13.6x.

The axes of the graphs do not have clear titles, but I would agree
that it should be a 13.6x speedup.

> 4. Sect. 5.4, too.
> "Finally, the above are runs on multiprocessor machines." No matter
> whether multiprocessor or multicore, it runs on a single machine
> with shared memory.

The main motivation for the paper was the rise of multi-core machines
that call for parallel algorithms even when no cluster is available.

> But actually, M/R is for multi-machine setups, which involve much
> higher costs for inter-machine communication. So the results of the
> paper may be questionable?

I think you should not expect the exact same speedups on multi-machine
clusters. Still, I think one can expect faster computation for large
datasets even in that setting.

What do others think?

Isabel
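PS: Here is a minimal sketch (mine, not the paper's code, with made-up
function and variable names) of the summation form Chu et al. use, for
the linear regression case: each map task computes partial sums over
its shard, the reduce step adds them up, and only the final n x n
solve is serial. With m >> n that serial solve is negligible; for
text-like n around 10^4-10^5 it would dominate unless the inversion
itself were parallelised as well.

import numpy as np
from multiprocessing import Pool

def map_partial_sums(shard):
    """Map phase: partial X^T X (n x n) and X^T y (n) for one shard."""
    X, y = shard
    return X.T @ X, X.T @ y

def fit_linear_regression(shards, processes=4):
    with Pool(processes) as pool:
        partials = pool.map(map_partial_sums, shards)
    # Reduce phase: sum the per-shard statistics.
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    # Serial step: solving the n x n system costs O(n^3) regardless
    # of m - cheap for small n, the bottleneck for large n.
    return np.linalg.solve(A, b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, P = 100_000, 10, 4    # m >> n, as in the paper's datasets
    X, w = rng.normal(size=(m, n)), rng.normal(size=n)
    y = X @ w
    shards = [(X[i::P], y[i::P]) for i in range(P)]
    print(np.allclose(fit_linear_regression(shards), w))   # True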
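PPS: And a toy illustration (again my own reading, not taken from the
paper) of where the log P factor comes from: if the reduce phase
combines the P partial results pairwise as they are passed back, the
number of live values halves each round, so only ceil(log2(P))
sequential communication rounds are needed instead of the P - 1 steps
a single reducer adding everything in a chain would need.

import math

def tree_reduce(values, combine):
    rounds = 0
    while len(values) > 1:
        # One parallel round: adjacent pairs are merged simultaneously.
        values = [combine(values[i], values[i + 1])
                  if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds

total, rounds = tree_reduce(list(range(16)), lambda a, b: a + b)
print(total, rounds, math.ceil(math.log2(16)))   # 120 4 4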
--
There is no TRUTH. There is no REALITY. There is no CONSISTENCY.
There are no ABSOLUTE STATEMENTS. I'm very probably wrong.

Web: <http://www.isabel-drost.de>
IM: <xmpp://[EMAIL PROTECTED]>