[amibroker] Re: Freakishly fast backtest using 64 cores

Paul Ho Tue, 05 Aug 2008 16:35:23 -0700

Thanks for your insight.
I hope you don't mind sharing a little more detail.
You said:
> To get the best performance, my AFL code makes one pass over the 
> data, calling a Dll.  The Dll takes all of the data needed by the 
> calculation and loads a copy to the video card.  This upload is slow, 
> the entire upload takes about 45 seconds for all 1000 symbols. 
> 
> Once all of the data is uploaded, the Dll loads a "kernel" into the 
> graphics cores that performs the actual computation and generates the 
> trade list.


Normally AB loads the data from the database as needed and calls a 
function in a dll, passing the data in arrays or whatever as arguments 
of the function. The function is called for every ticker in the 
watchlist, and the data pertaining to that symbol is passed each time. 
I wonder how you do a "single pass" over the data, because AB passes 
the data as part of the argument regardless of how many optimizations 
it has previously run with the same data. I just wonder how you do it.
Cheers,
Paul.

--- In amibroker@yahoogroups.com, "dloyer123" <[EMAIL PROTECTED]> wrote:
>
> This uses the mid-range video card that happened to come with my 
> system, a 9800GT.  The newer 260 and 280 cards are 3 to 4 times 
> faster.  The 260 can be found at Best Buy for $300.  Some laptops 
> have compatible cards as well. 
> 
> The video card has its own memory; mine has 512MB, and some have as 
> much as 1GB.  This memory is very fast once it has been loaded from 
> the main system.  Nvidia has a professional line of products that 
> have much more memory.  
> 
> To get the best performance, my AFL code makes one pass over the 
> data, calling a Dll.  The Dll takes all of the data needed by the 
> calculation and loads a copy to the video card.  This upload is slow, 
> the entire upload takes about 45 seconds for all 1000 symbols. 
> 
> Once all of the data is uploaded, the Dll loads a "kernel" into the 
> graphics cores that performs the actual computation and generates the 
> trade list.  This part is very fast and performs all of the same 
> functions that my AFL version does.  The resulting trade list is the 
> same.  
> 
> Because the data is loaded into video memory, it can be reused for 
> many passes over the data with different optimization values.  So, 
> hundreds of combinations of optimization values can be tried per 
> second.  
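
The upload-once, reuse-many-times pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: DeviceStore is a hypothetical stand-in for video memory, run_pass uses a toy momentum rule rather than the system from the post, and no real GPU or AmiBroker API is called.

```python
# Sketch only: "DeviceStore" stands in for video memory; no real GPU API is used.
from itertools import product


class DeviceStore:
    """Holds per-symbol price arrays after a single upload pass."""

    def __init__(self):
        self.closes = {}   # symbol -> list of closing prices
        self.uploads = 0   # counts the slow host-to-device copies

    def upload(self, symbol, closes):
        # Called once per symbol during the single pass over the data.
        self.closes[symbol] = list(closes)
        self.uploads += 1


def run_pass(store, lookback, threshold):
    """One optimization pass: a toy momentum rule applied to every symbol.

    Returns a trade list of (symbol, bar_index) entries. The rule is a
    placeholder (an assumption), not the system from the post.
    """
    trades = []
    for symbol, closes in store.closes.items():
        for i in range(lookback, len(closes)):
            change = closes[i] / closes[i - lookback] - 1.0
            if change > threshold:
                trades.append((symbol, i))
    return trades


store = DeviceStore()
data = {
    "AAA": [10.0, 10.5, 11.0, 10.8, 11.6],
    "BBB": [20.0, 19.5, 19.0, 19.8, 20.9],
}
for symbol, closes in data.items():   # the one slow pass over the data
    store.upload(symbol, closes)

# Many fast passes reuse the resident data with different parameters.
results = {
    (lookback, threshold): run_pass(store, lookback, threshold)
    for lookback, threshold in product([1, 2], [0.03, 0.06])
}
```

Each symbol is copied exactly once, while the four parameter combinations all read the same resident arrays. That is why the cost of the slow upload is spread across every optimization pass.
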
> 
> For non-optimization runs, the Dll just loads one symbol into video 
> memory and processes it.  Counting the overhead of moving data to the 
> video card and extracting the trade list for a single symbol, the 
> result is similar to AFL code alone.  This lets me test the code and 
> make sure it is correct.
> 
> This approach works best when the data only needs to be loaded once, 
> then reused many times.  It also works best when there is a lot of 
> data to work with. 
> 
> What is more interesting to me, and what would be more useful for 
> others, would be a general driver that requires no Dll changes to 
> modify the system.  The performance would not be as good as hand 
> optimized code, but would still be much better than AFL code alone.  
> It would take trading system design to a whole new level.  It would 
> provide enough performance to make working with intraday data as 
> easy as daily data is today.
> 
> Writing such a driver would be hard, but I have already done some 
> prototypes and design work.  I am tempted to do it for my own use.  
> If I made it available to others, supporting it would be a PITA.  
> 
> 
> 
> 
> --- In amibroker@yahoogroups.com, "Paul Ho" <paul.tsho@> wrote:
> >
> > I'm very interested.
> > Could you elaborate a bit more?
> > What model of Nvidia chipset are you using, and with how much memory?
> > Not sure exactly what you mean when you say:
> > "It uses AmiBroker to load the symbol data and perform calculations 
> > that do not depend on the optimization parameters. Once loaded into 
> > video memory, repeated passes can be made with different parameters, 
> > avoiding any overhead."
> > Can you give me some examples? I presume when your dll is called, AB
> > passes one or more arrays of data belonging to 1 symbol, is that true?
> > Not sure exactly what the rest means either. How many functions are
> > you running in your dll, and what does each of them do?
> > Great of you to share your insight.
> > Cheers
> > Paul.
> >  
> > 
> > 
> >   _____  
> > 
> > From: amibroker@yahoogroups.com [mailto:[EMAIL PROTECTED] On Behalf
> > Of dloyer123
> > Sent: Tuesday, 5 August 2008 9:19 AM
> > To: amibroker@yahoogroups.com
> > Subject: [amibroker] Freakishly fast backtest using 64 cores
> > 
> > 
> > 
> > Greetings,
> > 
> > I ported part of my AFL backtest code to a plugin that takes 
> > advantage of the graphics math cores on the video card that are 
> > normally used for 3D graphics. 
> > 
> > I was able to get a several-thousand-fold performance improvement 
> > over AFL code alone.
> > 
> > My goal was to reduce the 25 seconds AFL code alone uses for a single 
> > portfolio level back test to less than 1 second, allowing multi-day 
> > optimization and walk-forward runs to complete in a more reasonable 
> > time, and also just to see how fast I could get it to run.
> > 
> > The backtest runs over 1 year of 5-minute bars for about 1000 
> > symbols. With AmiBroker alone, 1 year of data normally takes 25 
> > seconds and 6 months of data takes 18 seconds. A typical optimization 
> > run takes hundreds of these passes per walk-forward step, taking 
> > hours.
> > 
> > Using the Nvidia CUDA API, running on my mid-range video card, it 
> > was much faster. Much, much, much faster. How fast?
> > 
> > It reduced the run time from 25s to... 4.4ms. That is more than 
> > 200 runs per second! 
> > 
> > I didn't believe the timing when I saw it at first. So, I put 1,000 
> > runs in a loop and sure enough, it ran 1,000 iterations in about 
> > 4 1/2 seconds. This far exceeded my goal and expectations.
> > 
> > The resulting trade list matches that obtained by the AFL version of 
> > this code. 
> > 
> > I estimate that it is processing 32GB of bar data/sec.
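
The figures quoted above can be cross-checked with a little arithmetic. The sketch below uses only numbers stated in the post (25 s, 4.4 ms, 1,000 iterations, 32GB/s); treating 1GB as 10^9 bytes is an assumption.

```python
afl_seconds = 25.0      # one portfolio backtest in AFL alone (from the post)
gpu_seconds = 4.4e-3    # one pass on the video card (from the post)
throughput = 32e9       # estimated bytes of bar data per second (1GB = 1e9 assumed)

speedup = afl_seconds / gpu_seconds          # how many times faster
passes_per_second = 1.0 / gpu_seconds        # backtests per second
loop_seconds = 1000 * gpu_seconds            # the 1,000-iteration timing loop
bytes_per_pass = throughput * gpu_seconds    # data touched in a single pass

print(f"speedup:          {speedup:,.0f}x")
print(f"passes per sec:   {passes_per_second:.0f}")
print(f"1,000 passes:     {loop_seconds:.1f} s")
print(f"data per pass:    {bytes_per_pass / 1e6:.1f} MB")
```

This gives a speedup of roughly 5,700-fold, about 227 passes per second, about 4.4 seconds for 1,000 passes, and about 141MB of bar data read per pass. All of these are consistent with the "several thousand fold", "more than 200 per second" and "about 4 1/2 seconds" figures in the post.
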
> > 
> > Getting this to work at peak performance was tricky. Most of what I 
> > have learned about code optimization does not apply. 
> > 
> > It uses AmiBroker to load the symbol data and perform calculations 
> > that do not depend on the optimization parameters. Once loaded into 
> > video memory, repeated passes can be made with different parameters, 
> > avoiding any overhead. 
> > 
> > For non backtest/optimization runs, the code just evaluates one 
> > symbol and passes the data back to AmiBroker buy/sell/short/cover 
> > arrays, making it easy to test, validate and visualize the trades. 
> > There is very little performance gain in this case. 
> > 
> > There are problems, however. To run optimizations at peak speed, I 
> > cannot use AmiBroker to calculate the optimization goal function. 
> > So, I am in the process of writing code to match signals and 
> > calculate the portfolio fitness function. Once I do this, I will be 
> > able to perform full optimizations and walk forwards 3 orders of 
> > magnitude faster than is possible with AmiBroker alone.
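
Matching signals and scoring the result, as described above, might look roughly like the Python sketch below. It is a simplified illustration, not the author's code. The names match_signals and fitness are hypothetical, only long trades are handled (no short/cover), one position is held at a time, and the goal function is simply total profit per share with no commissions or position sizing.

```python
def match_signals(buy, sell, close):
    """Pair each buy signal with the next sell signal, one position at a time.

    Returns a list of (entry_bar, exit_bar, profit_per_share) tuples.
    """
    trades = []
    entry = None
    for i, price in enumerate(close):
        if entry is None and buy[i]:
            entry = (i, price)                      # open a long position
        elif entry is not None and sell[i]:
            trades.append((entry[0], i, price - entry[1]))
            entry = None                            # flat again
    return trades


def fitness(trades):
    """Toy goal function (an assumption): total profit per share."""
    return sum(profit for _, _, profit in trades)


close = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]
buy = [1, 0, 0, 1, 0, 0]
sell = [0, 0, 1, 0, 0, 1]

trades = match_signals(buy, sell, close)
score = fitness(trades)
```

Here the two round trips (bars 0 to 2 and bars 3 to 5) earn 2.0 and 3.0 per share, so the score is 5.0. A portfolio-level version would sum such scores across symbols, which is the part that would also need to run without AmiBroker in the loop.
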
> > 
> > Also, this is not general purpose code. Changing the system code 
> > means changing a dll written in C. However, there is no reason that 
> > this could not be made more general. 
> > 
> > I have made some prototypes of "Cuda" versions of basic AFL 
> > functions. The idea is to queue the function calls into a definition 
> > executed by a micro kernel running on the graphics cores. The result 
> > would be the ability to use the full power of the graphics cores by 
> > modifying AFL code to use CUDA-aware versions, with no changes to C 
> > code. It would be an interesting, but big project.
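
The idea of queuing function calls into a definition that a small interpreter then executes can be illustrated with the Python sketch below. Everything in it is an assumption made for illustration: the Program class, the "MA" and "GT" opcodes, and the run function that plays the role of the micro kernel are hypothetical and do not come from the prototypes mentioned in the post.

```python
class Program:
    """Records function calls as instructions instead of running them."""

    def __init__(self):
        self.ops = []   # each entry: (opcode, input registers, constant)

    def _emit(self, opcode, inputs, const=None):
        self.ops.append((opcode, inputs, const))
        # Register 0 is the input price array; the k-th queued op writes register k.
        return len(self.ops)

    def ma(self, src, periods):
        """Queue a simple moving average of register src."""
        return self._emit("MA", (src,), periods)

    def gt(self, a, b):
        """Queue an element-wise a > b comparison."""
        return self._emit("GT", (a, b))


def run(program, close):
    """Stand-in for the micro kernel: interprets the queued ops in order."""
    regs = [list(close)]
    for opcode, inputs, const in program.ops:
        if opcode == "MA":
            src = regs[inputs[0]]
            out = [None] * len(src)
            for i in range(const - 1, len(src)):
                window = src[i - const + 1 : i + 1]
                if None not in window:          # skip bars without enough history
                    out[i] = sum(window) / const
        elif opcode == "GT":
            a, b = regs[inputs[0]], regs[inputs[1]]
            out = [x is not None and y is not None and x > y
                   for x, y in zip(a, b)]
        else:
            raise ValueError(f"unknown opcode {opcode}")
        regs.append(out)
    return regs[-1]   # output of the last queued op


p = Program()
fast = p.ma(0, 2)            # 2-bar average of the input
slow = p.ma(0, 3)            # 3-bar average of the input
signal = p.gt(fast, slow)    # true where the fast average is above the slow one

result = run(p, [1.0, 2.0, 3.0, 2.0, 1.0])
```

Changing the trading rule here means queuing a different list of ops, while the interpreter stays untouched. That mirrors the goal of changing the system from the AFL side with no changes to the C code.
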
> >
>
