Hi Ted / Konstantin,

Thanks for the feedback. You are correct that some reducers are doing
nothing, but many are doing real work, albeit for a very short period of
time. I'll run this with 1 reducer and post back my results.
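
For reference, this is roughly how I'll force the single-reducer run from
my driver code (just a sketch against the old mapred API; the class name
below is a placeholder, not Mahout's actual driver):

  import org.apache.hadoop.mapred.JobConf;

  public class SingleReducerRunSketch {
    public static void configure(JobConf conf) {
      // Roughly the same effect as passing -Dmapred.reduce.tasks=1 on the
      // command line. The combiners still do the per-node aggregation, so
      // only the intermediate centroids get shuffled to that one reducer.
      conf.setNumReduceTasks(1);
    }
  }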

Cheers,
Tim

On Tue, Sep 6, 2011 at 3:33 PM, Konstantin Shmakov <[email protected]> wrote:

> K-means has mappers, combiners and reducers, and my experience with
> k-means is that the mappers and combiners are responsible for most of
> the performance.  In fact, most k-means jobs will use 1 reducer by
> default.
>
> Did you verify that multiple reducers are actually doing something?
>
> Mappers write each point with its cluster assignment, combiners read
> these points and produce intermediate centroids, and the reducer does a
> minimal amount of work, producing the final centroids from the
> intermediate ones. One possibility is that by specifying #reducers you
> partially disable the combiner optimization that runs on the mapper
> nodes - this could result in more shuffling and data sent between
> nodes.
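>
> To make that flow concrete, the combine step looks roughly like the
> sketch below. This is a hand-written illustration, not the actual
> Mahout code (the SumCount and CentroidCombiner names are made up), but
> it shows why the combiner matters: each map task ships one small
> (sum, count) record per cluster instead of every assigned point.
>
>   import java.io.DataInput;
>   import java.io.DataOutput;
>   import java.io.IOException;
>   import org.apache.hadoop.io.IntWritable;
>   import org.apache.hadoop.io.Writable;
>   import org.apache.hadoop.mapreduce.Reducer;
>
>   public class KMeansCombineSketch {
>
>     // Partial result for one cluster: a running sum of points and the
>     // number of points folded into that sum. The reducer adds these
>     // partials together and divides to get the final centroid.
>     public static class SumCount implements Writable {
>       double[] sum = new double[0];
>       long count = 0;
>
>       public void write(DataOutput out) throws IOException {
>         out.writeLong(count);
>         out.writeInt(sum.length);
>         for (double d : sum) out.writeDouble(d);
>       }
>
>       public void readFields(DataInput in) throws IOException {
>         count = in.readLong();
>         sum = new double[in.readInt()];
>         for (int i = 0; i < sum.length; i++) sum[i] = in.readDouble();
>       }
>     }
>
>     // Mappers emit one SumCount per point (count = 1); the combiner
>     // collapses everything a map task saw for a cluster into a single
>     // record, so very little data is shuffled to the reducer.
>     public static class CentroidCombiner
>         extends Reducer<IntWritable, SumCount, IntWritable, SumCount> {
>       @Override
>       protected void reduce(IntWritable clusterId,
>                             Iterable<SumCount> partials, Context ctx)
>           throws IOException, InterruptedException {
>         SumCount total = new SumCount();
>         for (SumCount p : partials) {
>           if (total.sum.length == 0) total.sum = new double[p.sum.length];
>           for (int i = 0; i < p.sum.length; i++) total.sum[i] += p.sum[i];
>           total.count += p.count;
>         }
>         ctx.write(clusterId, total);
>       }
>     }
>   }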
>
> -- Konstantin
>
> On Tue, Sep 6, 2011 at 9:40 AM, Ted Dunning <[email protected]> wrote:
> > It could also have to do with context switch rate, memory set size or
> > memory bandwidth contention.  Having too many threads of certain kinds
> > can cause contention on all kinds of resources.
> >
> > Without detailed internal diagnostics, these can be very hard to tease
> > out.  Looking at load average is a good first step.  If you can get to
> > some memory diagnostics about cache miss rates, you might be able to
> > get further.
> >
> > On Tue, Sep 6, 2011 at 10:50 AM, Timothy Potter
> > <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I'm running a distributed k-means clustering job on a 16-node EC2
> >> cluster (xlarge instances). I've experimented with 3 reducers per node
> >> (mapred.reduce.tasks=48) and 2 reducers per node
> >> (mapred.reduce.tasks=32). In one of my runs (k=120) on 6m vectors with
> >> roughly 20k dimensions, I've seen a 25% improvement in job performance
> >> using 2 reducers per node instead of 3 (~45 mins to do 10 iterations
> >> with 32 reducers vs. ~1 hour with 48 reducers). The input data and
> >> initial clusters are the same in both cases.
> >>
> >> My sense was that maybe I was over-utilizing resources with 3 reducers
> >> per node, but in fact the load average remains healthy (< 4 on xlarge
> >> instances with 4 virtual cores) and the nodes do not swap or do
> >> anything obvious like that. So the only other variable I can think of
> >> here is that the number and size of output files from one iteration,
> >> which are sent as input to the next iteration, may have something to
> >> do with the performance difference. As further evidence for this
> >> hunch, running the job on vectors with 11k dimensions, the improvement
> >> was only about 10% -- so the performance gains are better with more
> >> data.
> >>
> >> Does anyone else have insights into what might be leading to this
> >> result?
> >>
> >> Cheers,
> >> Tim
> >>
> >
>
>
>
> --
> ksh:
>
