Re: Systematic Process to Reduce Linux OS Jitter

2017-01-05 Thread Jean-Philippe BEMPEL
Hi,

Those servers seem nice, but there is a huge no-go for us and our workload: 
there is only one socket.
Current CPUs have an L3 cache that is shared among the cores, so depending 
on your workload, other processes and threads can pollute this cache level.
In our application this pollution causes a 2x or greater increase in 
latency compared to a dedicated socket for the critical threads (thread 
affinity + isolcpus/cpuset).
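
The isolation recipe mentioned above (thread affinity on cores carved out
with isolcpus/cpuset) can be sketched minimally. This is an illustrative
example, not our production code: it assumes Linux and a kernel booted with
something like isolcpus=2,3; the core numbers are made up.

```python
# Minimal sketch of pinning a critical thread to an isolated core.
# Assumes Linux, and that the kernel command line included e.g.
# isolcpus=2,3 so the scheduler keeps other tasks off those cores
# (core numbers here are illustrative).
import os

def pin_to_core(core: int) -> set:
    """Restrict the calling thread to a single CPU core."""
    os.sched_setaffinity(0, {core})   # 0 = the calling thread
    return os.sched_getaffinity(0)    # read back the new mask

if __name__ == "__main__":
    print(pin_to_core(0))
```

Note that pinning alone does not stop L3 pollution on a shared socket,
which is why we insist on a dedicated socket for the critical threads.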

Cheers

On Thursday, January 5, 2017 at 10:49:15 PM UTC+1, NeT MonK wrote:
> […]

Re: Systematic Process to Reduce Linux OS Jitter

2017-01-05 Thread NeT MonK
Thank you for this post. 

My question is: where do you run your setup? In market colocation? 

Where I work we have a bunch of Ciara servers: 
http://www.ciaratech.com/category.php?id_cat1=3&id_cat2=61&lang=en
with Intel i7 CPUs, Kingston RAM, Asus motherboards, watercooled and 
overclocked to 4.9 GHz.

We will soon receive some Blackcore servers to PoC: 
http://www.exactatech.com/hft/

Ciara performs far better than HP Gen8 Xeon servers on our application. 

On Monday, December 26, 2016 at 12:05:31 PM UTC+1, Lex Barringer wrote:
> […]

Re: Systematic Process to Reduce Linux OS Jitter

2016-12-27 Thread Gil Tene


On Monday, December 26, 2016 at 10:28:48 AM UTC-8, Daniel Eloff wrote:
>
> >Not many people know this but the safe sizes for 64-bit computing memory 
> are the following in GiB; 16, 64, 256. Anything else and you're going to be 
> risking a lot of cache misses and misalignments of data which has a huge 
> latency penalty both on the hardware and in the software.
>
> Can someone explain how this works to me? As I understand it, caches work 
> from the physical addresses and are indexed based on the lowest bits of the 
> address. At the gigabyte level only bits above 30 would see an uneven 
> distribution, and only then with odd memory sizes (e.g. 48gb.) I don't see 
> how 32gb can be worse than 16gb or 64gb from a caching or alignment 
> perspective. Obviously the more memory in active use by the system, the 
> more chances of cache conflicts, but apart from that I don't understand it.
>

Yeah, GB-level memory sizes have no impact on cache misalignment issues. 
As you note, cache lines are 64 bytes, and physical addresses are 
distributed across the system at page granularity, so even when doing 
wider accesses (e.g. two adjacent cache lines for 128 bytes) you'd be 
aligned and hitting the same DIMM no matter what mix of DIMM sizes you 
end up using.
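
A quick illustration of why total DRAM size can't matter here (the cache
parameters below are a typical L1, chosen for illustration, not taken from
the thread): the set index comes from low address bits only, so addresses a
gigabyte apart land in the same set.

```python
# Cache set selection uses only low address bits: with 64-byte lines
# and 64 sets (a typical 32 KiB, 8-way L1), the set index is bits 6-11
# of the physical address. Bits at the GB level (>= bit 30) never
# enter the calculation.
LINE_BYTES = 64
NUM_SETS = 64

def cache_set(phys_addr: int) -> int:
    return (phys_addr // LINE_BYTES) % NUM_SETS

# Two addresses 1 GiB apart map to the same set, so 16 GB vs 32 GB of
# total capacity cannot change alignment or conflict behavior.
assert cache_set(0x12345678) == cache_set(0x12345678 + (1 << 30))
```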

There are certainly some impacts that come from choosing memory sizes, but 
those mostly have to do with filling the various memory controller channels 
evenly (to make sure all that memory bandwidth is usable), and with the 
depth (number of ranks) that each channel ends up driving (which can affect 
access speed).

In systems with 3 memory channels per socket (Intel 55xx and 56xx), the 
"natural" balanced sizes for 2-socket systems were actually multiples of 
6 (e.g. 24 GB, 48 GB, 72 GB, 144 GB), and normal power-of-2 memory sizes 
(e.g. 64 GB) would actually result in unbalanced memory controller loads. 
From the E5-26xx on, sockets have had 4 memory channels, moving the natural 
sizes back to multiples of 8. So yes, 64, 128, 256, 512, but also 96, 192, 
384 and 768.
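
The balancing arithmetic is simple to check. A small sketch, under the
usual assumption of equal-size DIMMs spread evenly across channels:

```python
# A total capacity is "natural" when it splits evenly across every
# memory channel in the box using equal-size DIMMs.
def balanced(total_gb: int, sockets: int, channels_per_socket: int) -> bool:
    return total_gb % (sockets * channels_per_socket) == 0

# 2 sockets x 3 channels (Intel 55xx/56xx): multiples of 6 balance,
# while a power of two like 64 GB does not.
assert balanced(48, 2, 3) and not balanced(64, 2, 3)
# 2 sockets x 4 channels (E5-26xx on): 64 GB balances again, and so
# do 96, 192 and 384.
assert balanced(64, 2, 4) and balanced(96, 2, 4)
```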

The number-of-ranks thing is more complicated, since a single DIMM can have 
differing numbers of ranks (1, 2, 4, 8). On most E5-26xx systems, you can 
drive up to 3 DIMMs per memory channel (for a total of 12 DIMMs per socket, 
24 DIMMs per 2-socket system), and up to 8 ranks per channel. But at least 
in some of those systems, the frequency and latency of DRAM access may be 
worse when more DIMMs and/or ranks are populated in a channel. When looking 
for maximum DRAM performance, you typically only populate 8 (RDIMMs) or 16 
(LRDIMMs) per system on current systems (e.g. see your server vendor's 
memory configuration guide for how these choices affect memory 
frequencies). Since the cheapest DIMMs to use at this time appear to be 
32 GB DIMMs (cheaper per GB than 16 GB or smaller), this probably means 
256 GB or 512 GB on newer systems.

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Systematic Process to Reduce Linux OS Jitter

2016-12-26 Thread Dan Eloff
>Not many people know this but the safe sizes for 64-bit computing memory
are the following in GiB; 16, 64, 256. Anything else and you're going to be
risking a lot of cache misses and misalignments of data which has a huge
latency penalty both on the hardware and in the software.

Can someone explain how this works to me? As I understand it, caches work
from the physical addresses and are indexed based on the lowest bits of the
address. At the gigabyte level, only bits above 30 would see an uneven
distribution, and only then with odd memory sizes (e.g. 48 GB). I don't see
how 32 GB can be worse than 16 GB or 64 GB from a caching or alignment
perspective. Obviously, the more memory in active use by the system, the
more chances of cache conflicts, but apart from that I don't understand it.



Re: Systematic Process to Reduce Linux OS Jitter

2016-12-26 Thread Greg Young
@Gil this should be a blog post.

On Mon, Dec 26, 2016 at 5:09 PM, Gil Tene wrote:
> […]

Re: Systematic Process to Reduce Linux OS Jitter

2016-12-26 Thread Gil Tene
One of the biggest reasons folks tend to stay away from the consumer CPUs 
in this space (like the i7-6950X you mentioned below) is the lack of ECC 
memory support. I really wish Intel provided ECC support in those chips, 
but they don't. And ECC is usually a must when driving hardware performance 
to the edge, especially in FinServ. The nightmare scenarios that happen 
when you aggressively choose your parts and push their performance to the 
edge (and even if you don't) with no ECC are very real. The soft-error-
correcting capabilities (ECC is usually SECDED) are crucial for keeping 
actually-wrong computation results from occurring on a regular basis from 
simple things like cosmic-ray effects on your DRAM, and with the many-GB 
capacities we have in those servers, going without a cosmic-ray-driven bit 
flip in DRAM is simply not realistic.
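
As an aside, the SECDED codes mentioned above are Hamming-style codes. A
toy Hamming(7,4) sketch shows the single-error-correction half of the idea;
this is illustrative only — real DRAM ECC operates on 64-bit words with 8
check bits, not on 4-bit nibbles:

```python
# Hamming(7,4): 4 data bits protected by 3 parity bits, able to
# correct any single flipped bit -- the "SEC" in SECDED.
def encode(nibble: int) -> list:
    d = [(nibble >> i) & 1 for i in range(4)]    # data bits d1..d4
    p1 = d[0] ^ d[1] ^ d[3]                      # covers positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                      # covers positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]                      # covers positions 4,5,6,7
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]  # codeword positions 1..7

def correct(word: list) -> list:
    w = list(word)
    syndrome = 0
    for bit in range(3):                         # recompute each parity
        parity = 0
        for pos in range(1, 8):
            if pos & (1 << bit):
                parity ^= w[pos - 1]
        syndrome |= parity << bit
    if syndrome:                                 # syndrome = error position
        w[syndrome - 1] ^= 1
    return w

# A "cosmic ray" flipping any single bit is detected and repaired:
codeword = encode(0b1011)
for i in range(7):
    hit = list(codeword)
    hit[i] ^= 1                                  # flip one bit
    assert correct(hit) == codeword
```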

To move from hand-waving to actual numbers for the notion that ECC is 
critical (and to hopefully scare the s*&t out of people running business 
stuff with no soft-error-correcting hardware), the 2009 Google study paper 
makes for a good read. It covers field data collected between 2006 and 
2008. Fast-forward to section 3.1 if you are looking for some per-machine 
summary numbers. The simple takeaway is this: even with ECC support, you 
have a ~1% chance of your machine experiencing an Uncorrectable Error (UE) 
once per year. But the chance of a machine encountering a Correctable Error 
(CE) at least once per year is somewhere in the 12-50% range, and the 
machines that do (which can be as many as half) will see those errors 
hundreds of times per year (so once every day or two).

One-liner summary: without hardware ECC support, random bits are probably 
flipping in your system memory, undetected, on a daily basis.

I believe the current ECC-capable chips that come closest to the i7-6950X 
you mentioned below are the E5-1680 v4 (for 1-socket setups, peaks at 
4.0 GHz) and either the E5-2687W v4 or the E5-2697A v4 (peaking at 3.5 and 
3.6 GHz respectively, though on the 2697 you'd probably need to carefully 
leave a core idle to get there). The E3 series (e.g. E3-1280 v5) usually 
gets the latest cores first, but their core counts tend to be tiny (4 
physical cores compared to 8-12 in the others listed above).

On Monday, December 26, 2016 at 3:05:31 AM UTC-8, Lex Barringer wrote:
> […]

Re: Systematic Process to Reduce Linux OS Jitter

2016-12-26 Thread Lex Barringer
I realize this post is a little late to the party, but it's good for people 
looking at tweaking their hardware and software for high-frequency binary 
options trading, including crypto-currency on the various exchanges. 

A note to all people seeking to create ultra-low-latency systems, not just 
network components and accessories: the clock rate (clock speed) of the 
CPU, its multipliers vs. the memory multipliers, the voltages, the clock 
speeds of the memory modules and the CAS timing (as well as the other 
associated memory timing parameters) can all have a huge impact on overall 
system performance, let alone on its actual reaction speed. One of the most 
important areas is the ratio of the multipliers in the system itself. 

While many operations are handled in hardware by the NIC and its dedicated 
FIFO buffers, it is still necessary to tune the system hardware to get the 
best performance with the lowest overhead.

Depending on whether you use an Intel Xeon or the newer 68xx/69xx-series 
Intel i7 Extreme parts, you may get better trading performance from a 
consumer-grade (non-Xeon) processor. I recommend the Intel i7-6950X: it is 
comparable in latency, and it can handle memory speeds well in excess of 
2400 MHz (2.4 GHz). The key here is to find memory and motherboards that 
can handle DDR4-3200 memory at CAS 14. If you can get faster memory with 
the same CAS latency and your motherboard supports it, do so.

I've used the following configuration:

Asus X99-E WS workstation board
Intel i7-6950X CPU
64 GiB of CAS 14 @ 3200 MHz RAM
1 Intel P3608 4 TB drive (the faster, bigger brother to the consumer 
Intel 750 NVMe SSD, on a PCIe card)
1 NewWave Design & Verification V5022 quad-channel 10 Gigabit Ethernet FPGA 
server card

The NIC on the motherboard is used for PXE netboot; once the computer is 
booted, it loads the trading software and then starts it. The V5022 is used 
for the actual trading because it is very high speed and ultra low latency. 
You can run these computers with or without heads (monitors plugged into 
the video ports). I can't emphasize strongly enough that logs from each 
trading computer must not be stored in the computer's memory, nor on the 
NVMe or any other local disk on the trading machines. You want to make 
these computers as hardened as possible, so you will need a dedicated 
computer to receive, store and display the messages from each machine in a 
hardened environment. Your competition may send in hackers to try to take 
down your network and computer systems; don't make it easy for them by 
keeping local logs of activity occurring over your networks and on the 
computers themselves. The computer that gathers all the logs need not be 
top of the line: a quad core, or a dual core with hyper-threading, is 
sufficient. You could get by with an Intel i3-6100-based system to save 
money; 16 GiB of RAM is plenty for what this machine will be doing. 
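
A minimal sketch of the "no local logs" advice: ship records to the central 
collector over the network rather than writing to any local disk. The use
of Python's stdlib syslog handler and the collector address are
illustrative assumptions, not the poster's actual setup:

```python
# Forward log records to a central collector over UDP syslog instead
# of keeping any local file. The collector address is a placeholder;
# point it at the dedicated log-gathering machine.
import logging
import logging.handlers

def make_remote_logger(collector_host: str, port: int = 514) -> logging.Logger:
    logger = logging.getLogger("trading")
    handler = logging.handlers.SysLogHandler(address=(collector_host, port))
    logger.addHandler(handler)   # network handler only -- no FileHandler
    logger.setLevel(logging.INFO)
    return logger

log = make_remote_logger("127.0.0.1")   # placeholder: your collector's host
log.info("order gateway up")            # sent as a UDP datagram
```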

A note about memory size: if you're worrying about clock jitter and jitter 
from other sources, having a specific memory size can either work for or 
against you. Not many people know this, but the safe sizes for 64-bit 
computing memory are the following, in GiB: 16, 64, 256. Anything else and 
you're going to be risking a lot of cache misses and misalignment of data, 
which carries a huge latency penalty both in the hardware and in the 
software. Many people are tempted to put the maximum amount of RAM in their 
system when they're doing trading, but you have to realize how the memory 
is actually accessed by the hardware and by the operating system (in this 
case, various distributions of Linux). I see people using 6, 8, 12, 24, 36, 
48, 72, 96, 128, 224 or 512 GiB of RAM; strange sizes like these can give 
you problems because of the way the memory managers in Linux are designed. 
While, technically, yes, the kernel can handle very large memory systems 
and odd sizes like these, it's not a good practice for system designers and 
builders to get into. Something else to note: the more chips a RAM module 
has, the more likely you are to see clock jitter, which can lead to some 
unpleasant effects that are very hard to track down from the software side.

You also need to keep your systems and network switches below 50 degrees 
centigrade (ideally 45 C): not only does it extend the life of your 
equipment, it also keeps latency and jitter within acceptable limits. The 
warmer the components are, the more unpredictable they become.
 
If you want more processing power, I would suggest the Intel Xeon Phi 
co-processor cards, which run a Linux kernel on each card to manage the 
software kernels (written in OpenCL or some other computing language to run 
on the cards). This requires additional programming, debugging and 
profiling; it's not a plug-and-play solution and can't be used as an 
automatic extension of the main CPU in the system.