I can see I'm going to have to drop the microphone mathematics.
matt massie wrote:
so i'm pretty certain g3 will be a pure xml beast. no more xdr messages
on the wire. here's my thinking on this... in no particular order..
I'm going to shock you by saying I don't like this. I know, you're asking
yourself why someone who's watching a dual-processor E420R take over 10
seconds to parse a 3.6MB gmetad output is against the idea of using more
XML elsewhere in the program design.
It's very portable, I'm not arguing that point. On the monitoring cores I
am worried about speed and CPU cycles - I want the monitoring core to be
very high in one respect, very low in the other.
[insert joke here.]
our old messages were not grouped together. and while they were very
small messages.. each message has a 52 byte header and the minimum
ethernet packet size is 64 octets. which means that we put a full
64-octet frame on the wire for each 8-12 byte message (and the header
is 6x the size of the data!).
Why not walk the metric tree and send a branch at a time as an XDR? Or
send the information about the metric tree layout in separate XDR packets
on an on-demand or periodic basis?
another problem with having each individual metric multicast its own data
is that it disconnects related data.. e.g. CPU (user,sys,nice,idle).
since these 4 related metrics are sent at different times they might not
always represent the same time slice (and therefore might not add up to
exactly 100%.. it's not always good to give 110%).
Sometimes they don't add up to 100% anyway. I think this happens mainly on
Solaris.
"That's Carl's fault. He's new."
"Sorry. My bad."
the solution is to group the metrics together somehow so they're sent at
the same time. we could do that using xdr or xml... but which is more
efficient... in terms of network and CPU?
Different users will answer this question differently. Ganglia's not being
used in just one situation. People managing a few large clusters will say
that CPU usage is more important than network usage (especially if the jobs
being run are CPU-intensive except at either end where there's a relatively
short burst of network traffic).
People linking smaller clusters over a wider area will answer the opposite
- it's worth chewing up a few more CPU cycles if it means using a smaller
percentage of a slow link.
I got an idea.
of course without the newlines and formatting, the length of this example
is 135 bytes... which contains 4 metrics expressed explicitly. in the
past each gmond had a metric lookup table compiled in, which reduced the
message size. the explicit message format will mean that all wire data
sources (gmond, gmetric, etc.) will use the same format. it also
means we have no more metric collisions since everything is explicit.
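The actual 135-byte XML example didn't survive in this message, so here is a purely hypothetical reconstruction of what a compact, explicit chunk carrying four related CPU metrics might look like — the HOST/MU tag and attribute names and the values are all invented for illustration:

```python
# Hypothetical sketch only: one explicit, hierarchical XML message
# grouping the four related CPU metrics. Tag names, attribute names,
# and values are invented; the real g3 format may differ entirely.

xml = (
    '<HOST N="mm56">'
    '<MU N="cpu_user" V="23"/>'
    '<MU N="cpu_sys" V="4"/>'
    '<MU N="cpu_nice" V="0"/>'
    '<MU N="cpu_idle" V="73"/>'
    '</HOST>'
)

print(len(xml), "bytes without newlines or formatting")
```

Even this naive sketch lands well under the 240 bytes the four separate flat messages would cost, and the grouping keeps the four values in the same time slice.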
Still think we could try sending metrics out in an XDR table with a
hashed-up value for "metric name" which corresponds to an entry in a
previously-transmitted metric attribute lookup table... keeps the
transmitted data simple, after all.
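A minimal sketch of that lookup-table idea, with invented names throughout — CRC32 stands in for whatever hash would actually be used, and the attribute table shape is made up:

```python
# Sketch of the hashed metric-name idea: the wire carries a small
# integer hash plus a value; the receiver resolves the hash against a
# metric attribute table that was transmitted once, earlier.
# All names here are hypothetical.

import zlib

# Attribute table, sent once (or on demand) per data source.
ATTRS = {
    "cpu_user": {"units": "%"},
    "cpu_idle": {"units": "%"},
}

# Receiver builds hash -> name from the attribute table.
LOOKUP = {zlib.crc32(name.encode()): name for name in ATTRS}

def decode(metric_hash: int, value: float):
    """Resolve a compact wire record against the lookup table."""
    name = LOOKUP[metric_hash]
    return name, value, ATTRS[name]["units"]

wire_record = (zlib.crc32(b"cpu_user"), 23.0)   # all that travels
print(decode(*wire_record))
```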
so.. with the current message method.. this message takes at
least 60 + 60 + 60 + 60 = 240 bytes... and it's flat.
Apples, meet Oranges. Oranges, meet Apples. :) I'm sure a
carefully-thought-out XDR scheme wouldn't provide numbers like that...
this new explicit xml format will take 52 + 135 = 187 bytes. more info
sent using less bandwidth... hierarchical too.
How much longer does it take to parse the 187 bytes of XML versus the 240
bytes of XDR? Is there even a difference?
i'm sure we could think of a way to build an explicit hierarchical xdr
format which could rival the efficiency of this xml format.. but it would
not be nearly as accessible to developers. imagine how easy it would be
to plug an app directly into the xml wire ... almost fun. woohoo!
Isn't the plug-in API going to handle that? If they want to put an app
that communicates on the wire, theoretically they would link in
libganglia... (libg3?).
in the past i thought an xdr format would be more efficient on the CPU
side of things.. because i could send the metric branch name "/g/cpu" or
whatever as an xdr_array so it doesn't need to be taken apart/parsed on
the receiving end... just read the data from each array element. there
are tools which use the xdr description file (which we would provide) but
they are MUCH less available and easy to use than xml parsers.
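The pre-split branch idea might look something like this sketch, which packs the path components as an XDR array by hand. The encoding rules (big-endian u32 length, data padded to a 4-byte boundary) follow the XDR standard, RFC 4506; the packet layout itself is invented:

```python
# Send "/g/cpu" pre-split into components so the receiver reads array
# elements instead of parsing a string. XDR string encoding per
# RFC 4506: big-endian u32 length + data padded to 4 bytes.

import struct

def xdr_string(s: bytes) -> bytes:
    """Encode one XDR string (length-prefixed, 4-byte padded)."""
    pad = (4 - len(s) % 4) % 4
    return struct.pack(">I", len(s)) + s + b"\x00" * pad

def xdr_path(components):
    """Encode a branch path as an XDR array: count, then each string."""
    out = struct.pack(">I", len(components))
    for c in components:
        out += xdr_string(c.encode())
    return out

packet = xdr_path(["g", "cpu"])   # the "/g/cpu" branch, pre-split
print(len(packet), "bytes")
```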
Has there been a tremendous outcry from tool developers that the Ganglia
information isn't as accessible as they'd like it to be? If they want XML
they can query a monitoring core, can't they?
also.. parsing the branch name can be made very efficient using regex
libraries... which use precompiled patterns for matching..
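A sketch of the precompiled-pattern approach — the branch layout (`/<root>/<subsystem>`) is hypothetical:

```python
# Compile the branch-name pattern once at startup, reuse it for every
# message. Layout "/<root>/<subsystem>" is an invented example.

import re

BRANCH = re.compile(r"^/(?P<root>[^/]+)/(?P<subsystem>[^/]+)$")

m = BRANCH.match("/g/cpu")
print(m.group("root"), m.group("subsystem"))
```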
Could it win a bake-off against a similarly tuned XDR method? In terms of
speed, CPU and scalability?
this leads into thoughts from the local wire format to the wide area
format.
i love trees. i have hugged many trees in my life and have been very
lucky that none of them have hugged me back. remember the childhood
trauma of watching dorothy get pummeled with apples from the living
forest? all because she picked fruit from a tree (hmmm).. now back to the
yellow brick road (btw... think out there now.. big.. the world...)
Really? Makes me think of something from some big fantasy movie that came
out last year that had that dude from the Matrix in it. Dungeons and
Dragons or something.
right now gmetad uses a very simple aggregation model. that will not
scale (as we have painfully experienced). imagine a single DNS server
with every host/ip pair in the world being served from it. ha!
what we need is
1. a URL like way of expressing the data we want
2. replace the aggregation model with a delegation model.
3. [you get to this below, but put it on the list, dammit!] A QUERY MODEL!
Not many database apps that talk to a SQL-using back-end are written
without usage of the "WHERE" or "LIMIT" clauses. :)
If I didn't have other coding commitments, I'd probably try and hack this
into gmetad *now* ...
first.. the URL business... here is an example of a g3 URL...
/World/USA/California/Berkeley/UCB/Millennium Cluster//mm56/cpu/number
i'm thinking grand here.. but i really believe that in the end we will
create a true internet overlay which will empower the internet in ways
that haven't been possible before.
You're perilously close to using a Wired buzzword like "digital divide"
right here. You may need to be deprogrammed.
so.. this URL only uses a single delimiter "/". feel free to debate what
you think this delimiter should be.. ':' might be a nice way to do it...
World:US:California:Berkeley:UCB:Millennium::mm56:cpu:number
.. i actually like the look of this a little more.. it's easier to read.
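Splitting that colon form shows the structure directly, including the empty component the '::' produces between the cluster name and the host:

```python
# The colon-delimited g3 path splits cleanly; the '::' yields an
# empty component marking the cluster/host boundary.

path = "World:US:California:Berkeley:UCB:Millennium::mm56:cpu:number"
parts = path.split(":")
print(parts)
```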
My eyes skimmed over the double-delimiter the first time I read it. :)
this is not complete XML at all.. don't want it to be too busy.. i know
steve wagner could handle seeing all the tags since he likes to read raw
xml streams but i'm not sure about the rest of you. :)
I pipe 'em to grep, actually.
'telnet gmetad-host 8651 | grep "HOST " | wc -l'
The reason I always view the raw XML (or pipe it through grep) is that I
don't want any parsing of the data to be done that I don't know about.
btw, mu means a "metric unit". we can change that name but i like how it
matches with organizational unit AND i love the concept of mu from
buddhism matched with the MIU puzzle introduced to me by Hofstadter,
Godel, Escher and Bach. i ramble. (if you want to learn more google "MU
Puzzle").
And it's also a Revenge of the Nerds reference.
so.. let's get back to the delegation model side of things.
For me, the purpose of the metadaemon is to handle requests from monitoring
apps. The metadaemon should be the only thing polling any of the
monitoring cores (which are, after all, on systems that should be working
on producing widgets). It's not entirely clear from this section whether
you're referring just to the "nearest" metadaemon (yay) or actually
referring to an individual monitoring core (boo). So the first thing I
thought of when I read this section was, "Great, can I turn it off?"
Also, does this address the possibility of multiple metadaemons for the
same data source? People might wanna cluster their metadaemons you know...
i wish XPath/XQuery was mature and there was nice multi-platform support.
i don't see that right now and i'm not sure how long it will be until it
happens. most of the good XQuery stuff out there is written in Java. i
don't know if we want to start developing Java code. maybe ...
Hmmmm... that might be fun on my Sun metadaemon box. :)
[on second thought, I'm not sure a :) is appropriate at this point...]
i'm thinking POSIX regular expressions might be the way to go...
I'm still not entirely convinced that working with strings is the key to
high speed, low CPU usage and high scalability...
i should have the g3 house ready to move into very soon... with a nice
tree in the front yard.
Just remember to add windows, floors, doors and wallpaper in every room.
Otherwise your Sims won't like it and they'll get very depressed and start
slapping each other.
It'll be just like this list!
Oh. Right. My idea.
A metric pipelining plug-in with multicast and unicast support. The
plug-in would have to be configured with a list of nodes that it's
responsible for (or an entire cluster - maybe we could just use URLs?) and
a reporting interval for each. Just like the metadaemon, in reverse.
Every interval seconds, it transmits the appropriate chunk of metrics in
XML to its configured destination. On receiving the metric chunk, it's
treated just as if it had originated locally, and gets re-transmitted over
the locally-configured multicast channel (obviously this only works if we
*don't* break the pipelined data into individual metric chunks).
This would actually increase Ganglia scalability (at the price of some
latency over pipelined links) because it allows a finer degree of control
over multicast traffic, and each individual node in a very large cluster
doesn't have to deal with 50,000 small packets per second being firehosed
at it (instead it's dealing with a few thousand larger packets closer to
the MTU value).
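The packing idea in that last paragraph can be sketched like this — the MTU and record sizes are illustrative, not anything Ganglia actually does:

```python
# Sketch of the pipelining plug-in's batching: accumulate small
# per-metric records and flush them as one chunk when adding another
# record would exceed the MTU, instead of one packet per metric.

MTU = 1500  # illustrative Ethernet MTU

def batch(records, mtu=MTU):
    """Greedily pack small metric records into MTU-sized chunks."""
    chunks, current, size = [], [], 0
    for rec in records:
        if current and size + len(rec) > mtu:
            chunks.append(b"".join(current))
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        chunks.append(b"".join(current))
    return chunks

records = [b"x" * 60] * 100        # 100 small 60-byte metric records
chunks = batch(records)
print(len(chunks), "packets instead of", len(records))
```

Receivers fan the chunk back out on the local multicast channel exactly as described above, trading a little latency on the pipelined link for far fewer packets per second.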
I can see that being a lot of fun for slow links... heck, after releasing
the source it should only be a matter of time before people turn that into
a notifier plug-in. :)
OK, that's all for now, I think...