[p2p-hackers] Practical clustering

David Barrett Thu, 09 Mar 2006 13:28:24 -0800

So I’ve enjoyed the discussion on the mathematical foundation and taxonomy of networks vis-à-vis clustering, but what techniques have you found useful on a more nuts-and-bolts level? It seems to me the clustering problem can be broken down into (roughly):

1) Measuring the “distance” between any two nodes: How have you defined ‘distance’ in this context: physical distance, latency, network hops, throughput, etc? And for your selection, what techniques have you used to measure it (IP geolocation, synthetic coordinate systems, embedded traceroute, network monitoring, etc)?

2) Measuring the “strength” of any given node: How have you defined ‘strength’ in this context: uptime, CPU, memory, upload capacity, download capacity, etc? How have you measured these?

3) Join algorithm: When a new node comes into the network, how do you determine its placement? Is it a top-down, recursive approach, or a bottom-up, emergent approach?

4) Optimization algorithm: As new nodes come and go (or even as a “background process”) how do you optimize the network by promoting and demoting nodes, and on what basis?

5) Repair algorithm: Given that any node can disappear at any time without warning, how do you mitigate the effect of this and clean up afterwards?

And not all clustering need be in a network sense. With iGlance, for example, I have a “video buddy list” feature where you have a super low-bandwidth video stream to all your peers. My initial thought for this was to have each client build up a “pose library” in order to optimize for the massive similarity between frames in a video stream of you sitting in front of your computer. For example, there’s the “looking at screen” pose and “talking on phone” pose, and “away from computer” pose, and so on. My goal was to accumulate this pose library over time, and then just send out indices into this library.

I’d classify the clustering algorithm I used for my pose library generation as follows:

1) Distance: Use an image differencing function that roughly equated to “number of pixels changed” (it was more sophisticated than this, but not by much)

2) Strength: If a new image coming in is within some tolerance of an existing image, just increment the “strength” of the old image and discard the new. Thus we’d gradually identify which images were most common, and call those the strongest poses.

3) Join algorithm: Trickle a new image down from top to bottom. At each node, if it’s within tolerance, discard and increment its strength. If outside tolerance, compare to children and pick whichever is within second tolerance. If none are, call it a new child.

4) Optimization algorithm: Set a fixed limit on the number of nodes allowed in the tree (based on how much memory I was willing to allocate to the problem) and – before a node creates a new leaf – discard the weakest child branch if we’re at our limit. I think there was also a fan-out limit and when overrun, I’d measure the distances between all children, find the pair that was closest, and then demote the weaker under the stronger.

5) Repair algorithm: Not applicable to this problem because images didn’t just disappear without warning.

Anyway, so that’s a clustering approach I used, and it worked surprisingly well. The big problems were it had too many “magic numbers” (merge threshold, split threshold, total node numbers, etc) and it was hard choosing the right values. That and at the end of the day, the number of poses you have even staring at your computer is actually quite high, and simply sending a 1FPS video feed creates a better experience, is easier to implement, and is sufficiently low bandwidth for my needs. I’m sure with enough tweaking it could be made to work and support a huge number of peers, but meh. Work for a future day. Assuming I even kept the source, when it’d be just like me to throw it away.

So, that’s my real-world clustering story. What story do you have, and what lessons did you learn?

-david

_______________________________________________
p2p-hackers mailing list
p2p-hackers@zgp.org
http://zgp.org/mailman/listinfo/p2p-hackers
_______________________________________________
Here is a web page listing P2P Conferences:
http://www.neurogrid.net/twiki/bin/view/Main/PeerToPeerConferences

[p2p-hackers] Practical clustering

Reply via email to