So I’ve enjoyed the discussion on the mathematical foundation
and taxonomy of networks vis-à-vis clustering, but what techniques have you
found useful on a more nuts-and-bolts level? It seems to me the clustering
problem can be broken down into (roughly): 1) Measuring the “distance”
between any two nodes: How have you defined ‘distance’ in this
context: physical distance, latency, network hops, throughput, etc? And for your
selection, what techniques have you used to measure it (IP geolocation, synthetic
coordinate systems, embedded traceroute, network monitoring, etc)? 2) Measuring the “strength”
of any given node: How have you defined ‘strength’ in this
context: uptime, CPU, memory, upload capacity, download capacity, etc? How
have you measured these? 3) Join algorithm: When a new
node comes into the network, how do you determine its placement? Is it a
top-down, recursive approach, or a bottom-up, emergent approach? 4) Optimization algorithm: As
new nodes come and go (or even as a “background process”) how do
you optimize the network by promoting and demoting nodes, and on what basis? 5) Repair algorithm: Given
that any node can disappear at any time without warning, how do you mitigate
the effect of this and clean up afterwards? And not all clustering need be in a network sense. With
iGlance, for example, I have a “video buddy list” feature where you
have a super low-bandwidth video stream to all your peers. My initial thought
for this was to have each client build up a “pose library” in order
to optimize for the massive similarity between frames in a video stream of you
sitting in front of your computer. For example, there’s the “looking
at screen” pose and “talking on phone” pose, and “away
from computer” pose, and so on. My goal was to accumulate this pose
library over time, and then just send out indices into this library. I’d classify the clustering algorithm I used for my
pose library generation as follows: 1) Distance: Use an image
differencing function that roughly equated to “number of pixels changed”
(it was more sophisticated than this, but not by much) 2) Strength: If a new image
coming in is within some tolerance of an existing image, just increment the “strength”
of the old image and discard the new. Thus we’d gradually identify which
images were most common, and call those the strongest poses. 3) Join algorithm: Trickle a
new image down from top to bottom. At each node, if it’s within tolerance,
discard and increment its strength. If outside tolerance, compare to children
and pick whichever is within second tolerance. If none are, call it a new
child. 4) Optimization algorithm: Set
a fixed limit on the number of nodes allowed in the tree (based on how much memory
I was willing to allocate to the problem) and – before a node creates a new
leaf – discard the weakest child branch if we’re at our limit. I
think there was also a fan-out limit and when overrun, I’d measure the distances
between all children, find the pair that was closest, and then demote the
weaker under the stronger. 5) Repair algorithm: Not
applicable to this problem because images didn’t just disappear without
warning. Anyway, so that’s a clustering approach I used, and it
worked surprisingly well. The big problems were it had too many “magic
numbers” (merge threshold, split threshold, total node numbers, etc) and
it was hard choosing the right values. That and at the end of the day, the
number of poses you have even staring at your computer is actually quite high,
and simply sending a 1FPS video feed creates a better experience, is easier to
implement, and is sufficiently low bandwidth for my needs. I’m sure with
enough tweaking it could be made to work and support a huge number of peers, but
meh. Work for a future day. Assuming I even kept the source, when it’d
be just like me to throw it away. So, that’s my real-world clustering story. What story
do you have, and what lessons did you learn? -david |
_______________________________________________ p2p-hackers mailing list p2p-hackers@zgp.org http://zgp.org/mailman/listinfo/p2p-hackers _______________________________________________ Here is a web page listing P2P Conferences: http://www.neurogrid.net/twiki/bin/view/Main/PeerToPeerConferences