Hi Eugen,

Algorithmically, game tree search is basically a century ahead of many other sciences, because the brilliant minds have been busy with it. For those brilliant guys it was possible to make CASH with it.
In math there are still many challenges where you could design one kick-butt algorithm, but you won't get rich with it. As a result, from Alan Turing up to the latest Einstein, they all put their focus on game tree search. I'm moving towards robotics now, in fact: building a robot. Not the robot itself, as I suck at building robots, but the software part, which so far hasn't really been developed very well for robots. Still unexplored area, for civil use that is.

But as for the chess programs now: they combine a bunch of algorithms, and every single one of them profits big time (exponentially) from caching. That caching is of course random access. So the cluster we're looking at has a number of nodes you can probably count on one hand, yet I intend to put 4 network cards (4 rails, for the insiders here) into each single machine. Machine is a big word; it will be a stand-alone mainboard of course, to save costs. So the price of each network card is fairly important. As it stands now, the old Quadrics QM500-B network cards, which you can pick up for $30 each or so on eBay, look most promising. At home I have a fully working QM400 setup which does 2.1 us one-way ping-pong latency, so I'd guess a blocked read has a latency not much above that.

I can choose myself whether I want to do reads of 128 bytes or 256 bytes; no big deal in fact. The table is scattered through the RAM, so each read is a random read from RAM. With 4 nodes that of course means 25% odds it's a local RAM read (no network read at all then), and 75% odds it's somewhere in the gigabytes of RAM of a remote machine.

As it stands now, 4x InfiniBand has a blocked-read latency that's too slow, and I don't know for which sockets 4x works: in all the test reports I read, 4x InfiniBand only works on the old socket 604. So I'm not sure it works for socket 1366, let alone socket 1155; those have a different memory architecture, so it's never certain that a much older network card doing DMA will work with them. Also I hear nothing about putting several cards in one machine.
I want at least 4 rails of course from those old crap cards. You'll argue that for 4x InfiniBand this is not very cost effective, as the price of 4 cards and 4 cables is already going to be nearly 400 dollars. That's also what I noticed. But if I put 2x QM500-B in, for example, a P6T Professional, that's going to be cheaper than $200 including the cables, and I'd guess it will be able to deliver over a million blocked reads per second. By already doing 8 probes per read, which is 192-256 bytes currently, I already 'bandwidth optimized' the algorithm. Back in the days when Leiserson at MIT ran Cilkchess and other engines on the Origin 3800 there and some Sun supercomputers, they slowly requested a single probe of, what will it have been, 8-12 bytes.

So far it seems 4 InfiniBand 4x cards can deliver me only 400k blocked reads a second, which is a workable number in fact (the amount I need depends largely upon how fast the node is) for a single-socket machine. Yet I'm not aware whether InfiniBand allows multiple rails. Does it? The QM400 cards I have here can, I'd guess, deliver around 1.2 million blocked reads a second with 4 rails, which already allows a lot faster nodes.

The ideal kick-butt machine so far is a simple Supermicro mainboard with 4x PCI-X and 4 sockets. Now it'll depend upon which CPUs I can get cheapest whether that's Intel or AMD. If the 8-core socket 1366 CPUs are going to be cheap at 22 nm, those are of course, with some watercooling, say clocked to 4.5 GHz, going to be kick-butt nodes. Those mainboards allow "only" 2 rails, which definitely means that the QM400 cards, not to mention 4x InfiniBand, are underperformers.

Up to 24 nodes, InfiniBand has cheap switches. But it seems only the newer InfiniBand cards have a sufficient latency, and all of them are far over $500, so that's far outside the budget. Even then they still can't beat a single QM500-B card. It's more than clear that the top500 sporthall hardly needs bandwidth, let alone latency.
I saw that a cluster in that same sporthall top500, with simple built-in gigabit that isn't even DMA, was only 2x slower than the same machines equipped with InfiniBand. Now some will cry here that gigabit CAN have reasonable one-way ping-pongs, not to mention the $5k Solarflare 10 gigabit ethernet cards, yet in all sanity we must be honest that the built-in gigabit, for practical performance reasons, is more like 500 microseconds latency if you have all cores busy. In fact even the realtime Linux kernel will centrally lock every UDP packet you ship or receive. Ugly, ugly. That's no comparison with the latencies of the HPC cards of course; whether you use MPI or SHMEM doesn't really matter there, the difference is that huge.

As a result it seems there was never much of a push towards great network cards. That might change now with GPUs kicking butt, though those of course need massive bandwidth, not latency. For my tiny cluster, latency is what matters. Usually a one-way ping-pong is a good representation of the speed of blocked reads, Quadrics excepted, as SHMEM there allows blocked reads that are faster than twice the cost of an MPI one-way ping-pong.

Quadrics is dead and gone. Old junk. My cluster also will be old junk probably, with the exception maybe of the CPUs. Yet if I don't find sponsorship for the CPUs, of course I'm on a budget there as well.

On Nov 7, 2011, at 12:35 PM, Eugen Leitl wrote:

> On Mon, Nov 07, 2011 at 11:10:50AM +0000, John Hearns wrote:
>> Vincent,
>> I cannot answer all of your questions.
>> I have a couple of answers:
>>
>> Regarding MPI, you will be looking for OpenMPI
>>
>> You will need a subnet manager running somewhere on the fabric.
>> These can either run on the switch or on a host.
>> If you are buying this equipment from eBay I would imagine you will be
>> running the Open Fabrics subnet manager
>> on a host on your cluster, rather than on a switch.
>> I might be wrong - depends if the switch has a SM license.
>
> Assuming ebay-sourced equipment, what price tag
> are we roughly looking at, per node, assuming small
> (8-16 nodes) cluster sizes?
>
> --
> Eugen* Leitl leitl http://leitl.org
> ______________________________________________________________
> ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
> 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
> _______________________________________________
> Beowulf mailing list, [email protected] sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
