Hi Ted,

> If you are running SGD on a single node, just open the HDFS files directly.
> You won't have significant benefit to locality unless the files are
> relatively small.

Good point. However, how far it applies may depend on the network topology of the cluster. Reasonably fast implementations of SGD are bandwidth-bound even when reading from local disk on typical machines. Depending on the topology, the rack-local bandwidth may be an order of magnitude higher than the bandwidth you get when reading from a node in another rack. So I believe there is value in data locality for SGD.

Your point is of course universally true for sequential algorithms that are CPU-bound, such as batch learning schemes.

Take care,
Markus
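
P.S. A rough back-of-envelope sketch of the argument, in case it helps. Every number here is an illustrative assumption, not a measurement from any cluster, and `epoch_seconds` is just a hypothetical helper. It models one pass over the data as limited by whichever is slower, reading or updating, assuming the two overlap perfectly.

```python
def epoch_seconds(data_gb, read_mb_per_s, sgd_mb_per_s):
    """Estimate the time for one pass over the data.

    Assumes reading and SGD updates overlap perfectly, so the slower
    of the two determines the total time.
    """
    data_mb = data_gb * 1024
    io_time = data_mb / read_mb_per_s    # time spent waiting on bytes
    cpu_time = data_mb / sgd_mb_per_s    # time spent on updates
    return max(io_time, cpu_time)


# Illustrative assumptions only.
DATA_GB = 100
SGD_MB_PER_S = 400          # rate at which a fast learner consumes data
RACK_LOCAL_MB_PER_S = 100   # assumed rack-local read bandwidth
CROSS_RACK_MB_PER_S = 10    # assumed to be an order of magnitude lower

rack_local = epoch_seconds(DATA_GB, RACK_LOCAL_MB_PER_S, SGD_MB_PER_S)
cross_rack = epoch_seconds(DATA_GB, CROSS_RACK_MB_PER_S, SGD_MB_PER_S)

print(f"rack-local: {rack_local:.0f} s, cross-rack: {cross_rack:.0f} s")
print(f"slowdown from losing locality: {cross_rack / rack_local:.1f}x")
```

If you set `SGD_MB_PER_S` below both read rates, for example 5, the learner becomes CPU-bound and the two estimates coincide, which is the case where locality stops mattering.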
