Hello,

FYI: a SINGLE processor core running an AFL formula is able to saturate memory 
bandwidth in the majority of common operations/functions
if the total size of the arrays used in a given formula exceeds the DATA cache size.

You need to understand that AFL runs at native assembly speed
when using array operations.
A simple array multiplication like this

X = Close  * H; // array multiplication

gets compiled to just 8 assembly instructions:

loop:    8B 54 24 58          mov         edx,dword ptr [esp+58h]
00465068 46                   inc         esi                       ; increase counters
00465069 83 C0 04             add         eax,4
0046506C 3B F7                cmp         esi,edi
0046506E D9 44 B2 FC          fld         dword ptr [edx+esi*4-4]   ; get element of close array
00465072 D8 4C 08 FC          fmul        dword ptr [eax+ecx-4]     ; multiply by element of high array
00465076 D9 58 FC             fstp        dword ptr [eax-4]         ; store result
00465079 7C E9                jl          loop                      ; continue until all elements are processed

As you can see, there are three 4-byte memory accesses per loop iteration 
(two 4-byte reads and one 4-byte write).

On my (2-year-old) 2 GHz Athlon 64 X2, a single iteration of this loop takes 6 
nanoseconds (see benchmark code below).
So during those 6 nanoseconds we read 8 bytes and store 4 bytes. That's 
8/(6e-9) bytes per second = 1333 MB per second read
and 4/(6e-9) = 667 MB per second written simultaneously, i.e. 2 GB/sec combined!

Now if you look at memory benchmarks:
http://community.compuserve.com/n/docs/docDownload.aspx?webtag=ws-pchardware&guid=6827f836-8c33-4063-aaf5-c93605dd1dc6
you will see that 2 GB/s is THE LIMIT of system memory speed on the Athlon 64 (DDR2 
dual channel).
And that's considering the fact that the Athlon has a superior-to-Intel on-die 
integrated memory controller (HyperTransport).

// benchmark code - for accurate results run it on LARGE arrays (intraday
// database, 1-minute interval, 50K bars or more)
GetPerformanceCounter(1); 
for( k = 0; k < 1000; k++ ) X = C * H; 
"Time per single iteration [s]=" + 1e-3 * GetPerformanceCounter() / (1000 * BarCount); 

Only really complex operations that use *lots* of FPU (floating point) cycles,
such as trigonometric functions (sin/cos/tan), are slow enough for the memory
to keep up.

Of course one may say that I am using an "old" processor and that new computers 
have faster RAM, and that's true,
but processor speeds increase FASTER than bus speeds, so the gap between 
processor and RAM
grows larger and larger. With newer CPUs the situation will be worse, not 
better.


Best regards,
Tomasz Janeczko
amibroker.com
----- Original Message ----- 
From: "dloyer123" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, May 13, 2008 5:02 PM
Subject: [amibroker] Re: Dual-core vs. quad-core


> All of the cores have to share the same front bus and northbridge.  
> The northbridge connects the cpu to memory and has limited bandwidth.
> 
> If several cores are running memory hungry applications, the front 
> bus will saturate.
> 
> The L2 cache helps for most applications, but not if you are burning 
> through a few G of quote data.  The L2 cache is just 4-8MB.
> 
> The newer multi core systems have much faster front buses and that 
> trend is likely to continue.
> 
> So, it would be nice if AMI could support running multi cores, even 
> if it was just running different optimization passes on different 
> cores.  That would saturate the front bus, but take advantage of all 
> of the memory bandwidth you have.  It would really help those multi 
> day walkforward runs.
> 
> 
> 
> --- In [email protected], "markhoff" <[EMAIL PROTECTED]> wrote:
>>
>> 
>> If you have a runtime penalty when running 2 independent AB jobs on a
>> Core Duo CPU it might be caused by too little memory (swapping to disk)
>> or other tasks which are also running (e.g. a web browser, audio
>> streamer or whatever). You can check this with a process explorer
>> which shows each task's CPU utilisation. Similarly, 4 AB jobs on a Core
>> Quad should have nearly no penalty in runtime.
>> 
>> Tomasz stated that multi-threaded optimization does not scale well with
>> the CPU number, but it is not clear to me why this is the case. In my
>> understanding, AA optimization is a sequential process of running the
>> same AFL script with different parameters. If I have an AFL with a
>> significantly long runtime per optimization step (e.g. 1 minute), the
>> overhead for the multi-threading should become quite small and
>> independent tasks should scale nearly with the number of CPUs (as long
>> as there is sufficient memory; n threads might need n times more
>> memory than a single thread). For sure the situation is different if
>> my single optimization run takes only a few millisecs or seconds;
>> then the overhead for multi-thread management goes up ...
>> 
>> Maybe Tomasz can give some detailed comments on that issue?
>> 
>> Best regards,
>> Markus
>> 
> 
> 
