We effectively have this situation on a significant fraction of our workload as well. Much of our data is summarized hourly, then encrypted and compressed, which makes it unsplittable. This means that the map processes are often not local to the data, since each file is typically spread to only two or three datanodes. The result is that locality drops from the typical 80-90% to about 20-30% on a small cluster.
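That 20-30% figure is roughly what a random-placement model would predict. A minimal sketch, assuming (hypothetically) one unsplittable block per file, the default replication factor, and a pessimistic scheduler that has no free slot on a replica node and so places the map task uniformly at random:

```python
# Back-of-the-envelope estimate of map-task data locality for
# unsplittable files. Model assumptions (not from the thread): each
# file is a single block replicated to `replicas` of `nodes` datanodes,
# and a busy scheduler places the map task on a uniformly random node.

def expected_locality(nodes: int, replicas: int) -> float:
    """Probability that a randomly placed map task lands on a node
    holding one of the file's replicas."""
    return min(1.0, replicas / nodes)

if __name__ == "__main__":
    # A small cluster: replication 3 on 10 nodes gives ~30% locality,
    # in line with the 20-30% mentioned above.
    print(f"10-node cluster, 3 replicas: {expected_locality(10, 3):.0%}")
    # Random placement gets worse as the cluster grows.
    print(f"100-node cluster, 3 replicas: {expected_locality(100, 3):.0%}")
```

This also shows why splittable files help: many blocks per file spread replicas across far more nodes, so the scheduler almost always finds a local slot.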
The result is measurably lower performance, but the difference is dominated by the limited parallelism and the cost of decryption.

Also, when people talk about "not decent" disks, they are often referring primarily to seek time and rotational latency in random-access job mixes. Even pretty poor disks are useful with map-reduce because so much of the code reads large chunks of the disk in highly sequential order. Moreover, even a poor disk adds to the overall aggregate read and write bandwidth of the cluster ... One or another map task might take a bit longer on a disk-bound job, but the overall result should still be better than not having the disk and node at all.

On 1/20/08 8:29 AM, "Allen Wittenauer" <[EMAIL PROTECTED]> wrote:
>
> On 1/18/08 3:29 PM, "Jason Venner" <[EMAIL PROTECTED]> wrote:
>> We were thinking of doing this with some machines that do not have
>> decent disks but have plenty of netbandwidth.
>
> We were doing it for a while, in particular for our data loaders*....
> but that was months and months ago.
>
> I don't remember any specific complaints about speed, but you obviously
> lose some of the data locality capabilities. Depending upon your workload
> and how your network is configured (bandwidth -and- latency), you might be
> ok.
>
> * - this is to force the data to get spread out amongst the data nodes vs.
> filling up the node that the data is being loaded from.