Hi Ted,

> If you are running SGD on a single node, just open the HDFS files directly.
> You won't have significant benefit to locality unless the files are
> relatively small.

Good point. However, whether it applies may depend on the network
topology of the cluster:

Reasonably fast implementations of SGD are bandwidth-bound even when
reading from local disk on typical machines. Depending on the
topology, the rack-local bandwidth may be an order of magnitude
higher than the bandwidth you get when reading from a node in another
rack. So I believe there is value in data locality for SGD.
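
To put rough numbers on that, here is a minimal back-of-the-envelope
sketch. It assumes I/O is the only bottleneck, and the dataset size
and both bandwidth figures are hypothetical values picked to show a
10x gap, not measurements from any real cluster.

```python
def epoch_seconds(dataset_gb: float, bandwidth_mb_per_s: float) -> float:
    """Time to stream the dataset once if I/O is the only bottleneck."""
    return dataset_gb * 1024 / bandwidth_mb_per_s


# Hypothetical figures, chosen only to illustrate an order-of-magnitude gap.
DATASET_GB = 100
RACK_LOCAL_MB_S = 100.0  # assumed read bandwidth within the rack
CROSS_RACK_MB_S = 10.0   # assumed read bandwidth across racks

rack_local = epoch_seconds(DATASET_GB, RACK_LOCAL_MB_S)
cross_rack = epoch_seconds(DATASET_GB, CROSS_RACK_MB_S)

print(f"rack-local: {rack_local / 60:.1f} min per pass")
print(f"cross-rack: {cross_rack / 60:.1f} min per pass")
```

For a bandwidth-bound learner the time per pass scales inversely with
read bandwidth, so under these assumptions a cross-rack read makes
every pass over the data ten times slower.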

Your point is, of course, universally true for sequential algorithms
that are CPU-bound, such as batch learning schemes.

Take care,

Markus
