[amibroker] Re: Dual-core vs. quad-core

markhoff Tue, 13 May 2008 12:14:02 -0700

@dloyer123: Yes, please let us know your benchmark results. Since I'm
also thinking about a hardware upgrade, it would be helpful to see
them ...


@Tomasz: thanks for your detailed explanations!

Best regards,
Markus

--- In [email protected], "dloyer123" <[EMAIL PROTECTED]> wrote:
>
> Nice, tight loop.  It is good to see someone that has made the effort 
> to make the most out of every cycle and the result shows.
> 
> My new E8400 (45nm 3GHz, dual core) system should arrive tomorrow.  
> The first thing I will do will be to benchmark it running ami.  I run 
> portfolio backtests over a few years of 5 minute data over a thousand 
> or so symbols.  Plenty of data to overflow the cache, but still fit 
> in memory.  No trig.  
> 
> I'll post what I find.
> 
> If what you say is true, and one core alone fills the memory 
> bandwidth, then there should be a net loss of performance while 
> running two copies of ami.  
> 
> 
> 
> --- In [email protected], "Tomasz Janeczko" <groups@> wrote:
> >
> > Hello,
> > 
> > FYI: a SINGLE processor core running an AFL formula is able to saturate
> > memory bandwidth in the majority of the most common operations/functions
> > if the total size of the arrays used in a given formula exceeds the DATA cache size.
> > 
> > You need to understand that AFL runs with native assembly speed
> > when using array operations. 
> > A simple array multiplication like this
> > 
> > X = Close  * H; // array multiplication
> > 
> > gets compiled to just 8 assembly instructions:
> > 
> > loop:    8B 54 24 58          mov         edx,dword ptr [esp+58h]
> > 00465068 46                   inc         esi                       ; increase counters
> > 00465069 83 C0 04             add         eax,4
> > 0046506C 3B F7                cmp         esi,edi
> > 0046506E D9 44 B2 FC          fld         dword ptr [edx+esi*4-4]   ; get element of close array
> > 00465072 D8 4C 08 FC          fmul        dword ptr [eax+ecx-4]     ; multiply by element of high array
> > 00465076 D9 58 FC             fstp        dword ptr [eax-4]         ; store result
> > 00465079 7C E9                jl          loop                      ; continue until all elements are processed
> > 
> > As you can see, there are three 4-byte memory accesses per loop iteration
> > (two 4-byte reads and one 4-byte write).
> > 
> > On my (2-year-old) 2GHz Athlon 64 X2, a single iteration of this loop
> > takes 6 nanoseconds (see benchmark code below).
> > So, during 6 nanoseconds we have 8 bytes read and 4 bytes stored.
> > That's 8/(6e-9) bytes per second = 1333 MB per second read
> > and 4/(6e-9) bytes per second = 667 MB per second write simultaneously, i.e. 2GB/sec combined!
> > 
> > Now if you look at memory benchmarks:
> > http://community.compuserve.com/n/docs/docDownload.aspx?webtag=ws-pchardware&guid=6827f836-8c33-4063-aaf5-c93605dd1dc6
> > you will see that 2GB/s is THE LIMIT of system memory speed on
> > Athlon 64 (DDR2 dual channel).
> > And that's considering the fact that the Athlon has a superior-to-Intel
> > on-die integrated memory controller (HyperTransport).
> > 
> > // benchmark code - for accurate results run it on LARGE arrays
> > // (intraday database, 1-minute interval, 50K bars or more)
> > GetPerformanceCounter(1); // reset the counter
> > for(k = 0; k < 1000; k++ ) X = C * H; 
> > "Time per single iteration [s]="+1e-3*GetPerformanceCounter()/(1000*BarCount); 
> > 
> > Only really complex operations that use *lots* of FPU (floating point) cycles,
> > such as trigonometric (sin/cos/tan) functions, are slow enough for the memory
> > to keep up.
> > 
> > Of course one may say that I am using an "old" processor, and that new
> > computers have faster RAM, and that's true,
> > but processor speeds increase FASTER than bus speeds and the gap between
> > processor and RAM becomes larger and larger, so with newer CPUs the
> > situation will be worse, not better.
> > 
> > 
> > Best regards,
> > Tomasz Janeczko
> > amibroker.com
> > ----- Original Message ----- 
> > From: "dloyer123" <dloyer123@>
> > To: <[email protected]>
> > Sent: Tuesday, May 13, 2008 5:02 PM
> > Subject: [amibroker] Re: Dual-core vs. quad-core
> > 
> > 
> > > All of the cores have to share the same front-side bus and northbridge.
> > > The northbridge connects the CPU to memory and has limited bandwidth.
> > > 
> > > If several cores are running memory-hungry applications, the front-side
> > > bus will saturate.
> > > 
> > > The L2 cache helps for most applications, but not if you are burning
> > > through a few GB of quote data.  The L2 cache is just 4-8MB.
> > > 
> > > The newer multi-core systems have much faster front-side buses and that
> > > trend is likely to continue.
> > > 
> > > So, it would be nice if AMI could support running on multiple cores, even
> > > if it was just running different optimization passes on different
> > > cores.  That would saturate the front-side bus, but take advantage of all
> > > of the memory bandwidth you have.  It would really help those multi-day
> > > walk-forward runs.
> > > 
> > > 
> > > 
> > > --- In [email protected], "markhoff" <markhoff@> wrote:
> > >>
> > >> 
> > >> If you have a runtime penalty when running 2 independent AB jobs on a
> > >> Core Duo CPU, it might be caused by too little memory (swapping to disk)
> > >> or by other tasks which are also running (e.g. a web browser, audio
> > >> streamer or whatever). You can check this with a process explorer,
> > >> which shows each task's CPU utilisation. Similarly, 4 AB jobs on a Core
> > >> Quad should have nearly no penalty in runtime.
> > >> 
> > >> Tomasz stated that multi-threaded optimization does not scale well with
> > >> the number of CPUs, but it is not clear to me why this is the case. In my
> > >> understanding, AA optimization is a sequential process of running the
> > >> same AFL script with different parameters. If I have an AFL with a
> > >> significantly long runtime per optimization step (e.g. 1 minute), the
> > >> overhead for the multi-threading should become quite small and
> > >> independent tasks should scale nearly with the number of CPUs (as long
> > >> as there is sufficient memory; n threads might need n times more
> > >> memory than a single thread). For sure the situation is different if
> > >> my single optimization run takes only a few milliseconds or seconds; then
> > >> the overhead for multi-thread management goes up ...
> > >> 
> > >> Maybe Tomasz can give some detailed comments on that issue?
> > >> 
> > >> Best regards,
> > >> Markus
> > >> 
> > > 

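The bandwidth arithmetic in Tomasz's message (6 ns per loop iteration, 8 bytes
read and 4 bytes written per iteration) can be re-checked with a few lines of
Python. This is only a sketch: Python is not used anywhere in the thread, and
all names below are made up for illustration. The per_job_bandwidth() helper
is an assumed, deliberately simple model (jobs split a fixed memory bandwidth
limit equally once they exceed it), not a measurement and not how AmiBroker or
the memory controller actually schedule accesses.

```python
# Figures quoted in the thread for the X = Close * H loop.
ITERATION_TIME_S = 6e-9   # 6 ns per loop iteration (2GHz Athlon, per Tomasz)
READ_BYTES = 8            # two 4-byte reads per iteration
WRITE_BYTES = 4           # one 4-byte write per iteration


def bandwidth_bytes_per_sec(bytes_per_iteration, seconds_per_iteration):
    """Sustained bandwidth implied by a fixed amount of traffic per iteration."""
    return bytes_per_iteration / seconds_per_iteration


def per_job_bandwidth(limit, demand_per_job, jobs):
    """ASSUMED model: each job gets what it asks for until the total demand
    exceeds the limit; beyond that the limit is shared equally."""
    return min(demand_per_job, limit / jobs)


read_bw = bandwidth_bytes_per_sec(READ_BYTES, ITERATION_TIME_S)
write_bw = bandwidth_bytes_per_sec(WRITE_BYTES, ITERATION_TIME_S)
combined_bw = read_bw + write_bw

if __name__ == "__main__":
    print(f"read:     {read_bw / 1e6:.0f} MB/s")
    print(f"write:    {write_bw / 1e6:.0f} MB/s")
    print(f"combined: {combined_bw / 1e9:.2f} GB/s")
    # If one job alone already needs the whole (assumed) 2 GB/s limit,
    # two jobs get half each under this model, so nothing is gained in total.
    share = per_job_bandwidth(limit=2e9, demand_per_job=combined_bw, jobs=2)
    print(f"per job with 2 jobs: {share / 1e9:.2f} GB/s")
```

Under this model a second copy of the job adds no aggregate throughput once a
single copy saturates the memory bus, which is the scenario dloyer123 plans to
test with his benchmark.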