Ok, ran the job with 1 reducer and with 16 reducers; results pretty much confirm Konstantin's point ...
48 reducers = 1 hour
32 reducers = 44 minutes
1 reducer = 47 minutes
16 reducers = 41 minutes

So there is a little benefit to using more than one reducer, but not much ;-)

On Tue, Sep 6, 2011 at 5:47 PM, Timothy Potter <[email protected]> wrote:

> Hi Ted / Konstantin,
>
> Thanks for the feedback. You are correct that some reducers are doing
> nothing, but many are doing real work, albeit for a very short period of
> time. I'll run this with 1 reducer and post back my results.
>
> Cheers,
> Tim
>
>
> On Tue, Sep 6, 2011 at 3:33 PM, Konstantin Shmakov <[email protected]> wrote:
>
>> K-means has mappers, combiners and reducers, and in my experience the
>> mappers and combiners are responsible for most of the performance. In
>> fact, most k-means jobs will use 1 reducer by default.
>>
>> Did you verify that multiple reducers are actually doing something?
>>
>> Mappers write each point with its cluster assignment, combiners read
>> these points and produce intermediate centroids, and the reducer does a
>> minimal amount of work producing the final centroids from the
>> intermediate ones. One possibility is that by specifying #reducers you
>> partially disable the combiner optimization that runs on the mapper
>> nodes - this could result in more shuffling and more data sent between
>> nodes.
>>
>> -- Konstantin
>>
>> On Tue, Sep 6, 2011 at 9:40 AM, Ted Dunning <[email protected]> wrote:
>>
>> > It could also have to do with context switch rate, memory set size or
>> > memory bandwidth contention. Having too many threads of certain kinds
>> > can cause contention on all kinds of resources.
>> >
>> > Without detailed internal diagnostics, these can be very hard to tease
>> > out. Looking at load average is a good first step. If you can get to
>> > some memory diagnostics about cache miss rates, you might be able to
>> > get further.
>> >
>> > On Tue, Sep 6, 2011 at 10:50 AM, Timothy Potter <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> I'm running a distributed k-means clustering job on a 16-node EC2
>> >> cluster (xlarge instances). I've experimented with 3 reducers per node
>> >> (mapred.reduce.tasks=48) and 2 reducers per node
>> >> (mapred.reduce.tasks=32). In one of my runs (k=120) on 6m vectors with
>> >> roughly 20k dimensions, I've seen a 25% improvement in job performance
>> >> using 2 reducers per node instead of 3 (~45 mins to do 10 iterations
>> >> with 32 reducers vs. ~1 hour with 48 reducers). The input data and
>> >> initial clusters are the same in both cases.
>> >>
>> >> My sense was that maybe I was over-utilizing resources with 3 reducers
>> >> per node, but in fact the load average remains healthy (< 4 on xlarge
>> >> instances with 4 virtual cores) and the nodes do not swap or do
>> >> anything else obviously wrong. So the only other variable I can think
>> >> of is that the number and size of the output files from one iteration,
>> >> which are sent as input to the next iteration, may have something to
>> >> do with the performance difference. As further evidence for this
>> >> hunch, when running the job with 11k-dimension vectors the improvement
>> >> was only about 10% -- so the gains grow with the amount of data.
>> >>
>> >> Does anyone else have any insights into what might be leading to this
>> >> result?
>> >>
>> >> Cheers,
>> >> Tim
>> >>
>> >
>>
>>
>> --
>> ksh:
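
As a footnote for anyone finding this thread later, here is a minimal sketch of the map/combine/reduce shape Konstantin describes, just to make the combiner point concrete. This is not Mahout's actual k-means code; the class names, the "kmeans.centroids" config key, and the "count|vector" text encoding are invented for illustration. The mapper assigns each point to its nearest centroid, the combiner collapses a mapper's output into one partial (count, sum) per cluster on the map side, and the reducer only merges a handful of partials and divides - which is why one reducer is normally enough and why forcing many reducers mostly adds shuffle.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansIterationSketch {
  // Parses a comma-separated vector, e.g. "1.0,2.0" -> {1.0, 2.0}.
  static double[] parse(String csv) {
    String[] s = csv.split(",");
    double[] v = new double[s.length];
    for (int i = 0; i < s.length; i++) v[i] = Double.parseDouble(s[i]);
    return v;
  }

  // Formats a vector back into the same comma-separated form.
  static String format(double[] v) {
    StringBuilder sb = new StringBuilder();
    for (double x : v) sb.append(sb.length() == 0 ? "" : ",").append(x);
    return sb.toString();
  }

  // Running (count, sum-vector) accumulator for "count|v1,v2,..." records.
  static final class Partial {
    long count; double[] sum;
    void add(String record) {
      String[] s = record.split("\\|");
      double[] v = parse(s[1]);
      if (sum == null) sum = new double[v.length];
      for (int i = 0; i < v.length; i++) sum[i] += v[i];
      count += Long.parseLong(s[0]);
    }
  }

  // Assigns each point (one CSV line) to its nearest centroid, emitting (cluster, "1|point").
  public static class AssignMapper extends Mapper<Object, Text, IntWritable, Text> {
    private double[][] centroids;

    @Override
    protected void setup(Context ctx) {
      // Current centroids are serialized into the job config by the driver, e.g. "0,0;5,5".
      String[] rows = ctx.getConfiguration().get("kmeans.centroids").split(";");
      centroids = new double[rows.length][];
      for (int i = 0; i < rows.length; i++) centroids[i] = parse(rows[i]);
    }

    @Override
    protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
      double[] p = parse(value.toString());
      int best = 0; double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centroids.length; c++) {
        double d = 0;
        for (int i = 0; i < p.length; i++) d += (p[i] - centroids[c][i]) * (p[i] - centroids[c][i]);
        if (d < bestDist) { bestDist = d; best = c; }
      }
      ctx.write(new IntWritable(best), new Text("1|" + value));
    }
  }

  // Combiner: collapses each mapper's points into one partial (count, sum) per cluster.
  public static class SumCombiner extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable cluster, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
      Partial p = new Partial();
      for (Text t : values) p.add(t.toString());
      ctx.write(cluster, new Text(p.count + "|" + format(p.sum)));
    }
  }

  // Reducer: merges the few surviving partials and divides by the count to get each new centroid.
  public static class CentroidReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable cluster, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
      Partial p = new Partial();
      for (Text t : values) p.add(t.toString());
      for (int i = 0; i < p.sum.length; i++) p.sum[i] /= p.count;  // cluster mean = new centroid
      ctx.write(cluster, new Text(format(p.sum)));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("kmeans.centroids", args[2]);  // e.g. "0,0;5,5" for k=2 in two dimensions
    Job job = new Job(conf, "kmeans iteration sketch");
    job.setJarByClass(KMeansIterationSketch.class);
    job.setMapperClass(AssignMapper.class);
    job.setCombinerClass(SumCombiner.class);
    job.setReducerClass(CentroidReducer.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(1);  // same knob as -Dmapred.reduce.tasks=N on the command line
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The relevant design point: with the combiner enabled, each map task forwards roughly one small record per cluster instead of every assigned point, so the reduce side sees almost no data no matter how mapred.reduce.tasks is set.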
