Re: Vim 7 performance notes

2007-02-10 Thread Alexei Alexandrov
Hi Bram Moolenaar, you wrote:

 It's Windows, but on Linux it should be similar.
 
 I would not assume it's similar until there is proof.
 
Of course. I'm going to investigate it there.

 
 This sounds like a bug in getc().  It should know that only one thread
 is reading the file and skip the locking.  I don't think we should fix
 library bugs in Vim, unless it's crashing or another significant problem.
 
It can't be a bug. I might be missing what you mean, but I can't see how it can 
know that only one thread is reading a file. It doesn't have a clue whether you 
gave this FILE * to other threads or not. It tries to be lightweight - as I 
described in a separate mail it uses InterlockedIncrement/Decrement but they 
are not that lightweight - they don't require switching to kernel mode but 
still take about 200 cycles of CPU each.

The only optimization that I see could be avoiding blocking (and even trying to 
block) in case if there is only one thread in current process and if there is 
guarantee that this particular call is guaranteed not to create any threads. 
But 1) it still may be expensive and 2) Vim has some background threads anyway, 
probably.

 Perhaps you can already get a big increase by not compiling for
 debugging?  With MSVC this usually has a big impact.  Also largely
 defeats profiling with debugging enabled.
 
I do _all_ performance measurements using optimized version of binary with 
symbols. This is just a must for performance tuning. 

-- 
Alexei Alexandrov


Re: Vim 7 performance notes

2007-02-10 Thread Bram Moolenaar

Alexei Alexandrov wrote:

 Hi Bram Moolenaar, you wrote:
 
  
  It sounds like keeping only 1024 bytes would already work for most
  situations.  That would be an acceptable amount to keep allocated at
  all times.  So why don't we use this as the initial size, and when it
  grows larger we free it when finished.  The growth size can be doubled
  each time perhaps.
 
 I chose 8192/16384 pair because it's the closest to original 1
 bytes. 1 itself would also be fine but I like round numbers...
 
 The patch with changes which, I think, close to what you describe
 above is attached. Could you please take a look at it?

It's starting to look better, less disadvantages and should still give a
fair speedup.

1 bytes is OK for when it's going to be freed again soon.  If you
want to keep the memory allocated something less would be more
appropriate.  And thus you need to start low, but could increase it in
larger steps (since it's going to be freed on return anyway).

The growarray is used in lots of places.  Adding another field to it
will cause more memory to be used.  Isn't it easier to make another
version of ga_grow() that increases the growsize when allocating another
block?

Instead of doubling each time, which is going to be big chunks quickly,
another way would be to first allocate one block at the start size of
about 1000 bytes, then set the growsize to 1.  So we grow at the
same speed as before.  Then no extra field or function is needed, it's
local change in the regexp code.

Something like:
if (regstack.ga_data == NULL)
{
ga_init2(regstack, 1, 1000);
ga_grow(regstack, 1);
}
regstack.ga_growsize = 1;

I do wonder if the memory needs to be cleared when re-using the
allocated memory.  Can you check that?

-- 
A meeting is an event at which the minutes are kept and the hours are lost.

 /// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net   \\\
///sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\download, build and distribute -- http://www.A-A-P.org///
 \\\help me help AIDS victims -- http://ICCF-Holland.org///


Re: Vim 7 performance notes

2007-02-08 Thread Alexei Alexandrov
Hi Mikolaj Machowski, you wrote:

 
 When testing it with VST it gave 3.4% speed improvements (the same
 metodology - 3 tests before and after, choose the best results).
 

Well, it's not that much but it's still positive result. :-)

-- 
Alexei Alexandrov


Re: Vim 7 performance notes

2007-02-08 Thread Bram Moolenaar

Alexei Alexandrov wrote:

 I've also noticed that Vim spends somewhat significant time on startup
 loading spell files (I have 2 languages in my .vimrc: set
 spelllang=en,ru). The time is mostly spent in
 EnterCriticalSection/LeaveCriticalSection with getc() upper the stack.
 The reason for this is CRT blocking since the runtime is
 multithreaded. It's Windows, but on Linux it should be similar.

I would not assume it's similar until there is proof.

 As far as I understand, Vim doesn't access the spell file from
 multiple threads. Thus, the situation can be improved a lot: on Linux
 by using getc_unlocked. On Windows, starting VS 2005 there is a
 function _getc_nolock. Before VS 2005 this function can be emulated by
 macro:
 
 #define _getc_nolock(_stream) (--(_stream)-_cnt = 0 ? \
   0xff  *(_stream)-_ptr++ : _filbuf(_stream))
 
 By switching to non-blocking getc() in spell.c I was able to reduce
 Vim startup time from about 0.9 seconds to about 0.7 seconds.

This sounds like a bug in getc().  It should know that only one thread
is reading the file and skip the locking.  I don't think we should fix
library bugs in Vim, unless it's crashing or another significant problem.

Perhaps you can already get a big increase by not compiling for
debugging?  With MSVC this usually has a big impact.  Also largely
defeats profiling with debugging enabled.

-- 
From know your smileys:
 :-EHas major dental problems

 /// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net   \\\
///sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\download, build and distribute -- http://www.A-A-P.org///
 \\\help me help AIDS victims -- http://ICCF-Holland.org///


Re: Vim 7 performance notes

2007-02-07 Thread Alexei Alexandrov
Hi George V. Reilly, you wrote:

 
 How did you measure the time in EnterCriticalSection and 
 LeaveCriticalSection?

I discovered the problem by using a performance tuning tool which uses sampling 
approach to get statistical profile data. The intrusivity of this class of 
tools is very low. I verified that the problem exists by compiling Vim with 
unlocked getc() and measuring the difference (without any tool, just by $ time 
...).

 If there's no lock contention, these routines are little more than 
 InterlockedIncrement and InterlockedDecrement, without a kernel transition
 or blocking.

You're absolutely right, it doesn't need to switch the context when the CS is 
free. But InterlockedIncrement/Decrement is not that cheap. On uniprocessor 
machine it takes about 200 cycles of CPU cycles per atomic operation. Thus, 
EnterCS/LeaveCS pair will take about 400 cycles of CPU. The program which 
confirms these numbers is attached. So it means that to read 1 Mbyte of data 
with locking getc (this is roughly the size of Russian + English spl files) we 
need to pay 400e6 cycles for these useless attempts to syncronize. Given the 
frequency of my machine 2.4 GHz = 2400e6 we get that 400e6 means 1/6 of second 
which is 0.17 seconds - exactly the speedup I observed.

And remember that you need to pay this price on every Vim startup.

 
 In other words, if you're seeing significant time in Enter/LeaveCS, I 
 can think of two causes. Either your measurement tool has perturbed the 
 results, or there really is some multithreaded lock contention. The 
 former seems more likely, as Vim is single-threaded, but who knows what 
 some DLLs in the Vim process might be doing.

No, there isn't any contention. The critical section in Microsoft 
multi-threaded CRT is per-FILE* so it's impossible that any guy competes with 
you unless you give them the FILE *. As far as I can say, descriptor to opened 
spell file is absolutetly private inside spell.c

Also, the numbers above show that the overhead is exactly this without any 
contention. If there were competition, the overhead would be much bigger.

 
 I would be vary wary of using the _getc_nolock macro until we understand 
 why you are seeing those results.
 

-- 
Alexei Alexandrov

test_cs.c_
Description: Binary data


Re: Vim 7 performance notes

2007-02-07 Thread Mikolaj Machowski
Dnia środa 07 luty 2007, Alexei Alexandrov napisał:
 Hi Mikolaj Machowski, you wrote:
  Nice work. Could you send or place somewhere patches? I'd like to test
  them on more complex regexps.

 Here it is. Note that the biggest speed-up is observed when regexp is
 matched a lot of times. The regexp mechanism itself is not affected at
 all here - so if you have one regexp which runs very long you won't
 probably notice any major difference.

When testing it with VST it gave 3.4% speed improvements (the same
metodology - 3 tests before and after, choose the best results).

m.



Re: Vim 7 performance notes

2007-02-06 Thread Alexei Alexandrov
Hi Mikolaj Machowski, you wrote:

 Nice work. Could you send or place somewhere patches? I'd like to test
 them on more complex regexps.

Here it is. Note that the biggest speed-up is observed when regexp is matched a 
lot of times. The regexp mechanism itself is not affected at all here - so if 
you have one regexp which runs very long you won't probably notice any major 
difference.

-- 
Alexei Alexandrov


regexp.c.diff
Description: Binary data


Re: Vim 7 performance notes

2007-02-06 Thread Alexei Alexandrov
Hi Alexei Alexandrov, you wrote:

 Hi Bram et al.,
 
 I'm doing some performance investigations of Vim code trying to understand
 whether there are any possibilities to improve it.

I've also noticed that Vim spends somewhat significant time on startup loading 
spell files (I have 2 languages in my .vimrc: set spelllang=en,ru). The time is 
mostly spent in EnterCriticalSection/LeaveCriticalSection with getc() upper the 
stack. The reason for this is CRT blocking since the runtime is multithreaded. 
It's Windows, but on Linux it should be similar.

As far as I understand, Vim doesn't access the spell file from multiple 
threads. Thus, the situation can be improved a lot: on Linux by using 
getc_unlocked. On Windows, starting VS 2005 there is a function _getc_nolock. 
Before VS 2005 this function can be emulated by macro:

#define _getc_nolock(_stream) (--(_stream)-_cnt = 0 ? \
  0xff  *(_stream)-_ptr++ : _filbuf(_stream))

By switching to non-blocking getc() in spell.c I was able to reduce Vim startup 
time from about 0.9 seconds to about 0.7 seconds.

-- 
Alexei Alexandrov


Re: Vim 7 performance notes

2007-02-06 Thread George V. Reilly

Alexei Alexandrov wrote:

I've also noticed that Vim spends somewhat significant time on startup loading 
spell files (I have 2 languages in my .vimrc: set spelllang=en,ru). The time is 
mostly spent in EnterCriticalSection/LeaveCriticalSection with getc() upper the 
stack. The reason for this is CRT blocking since the runtime is multithreaded. 
It's Windows, but on Linux it should be similar.

As far as I understand, Vim doesn't access the spell file from multiple 
threads. Thus, the situation can be improved a lot: on Linux by using 
getc_unlocked. On Windows, starting VS 2005 there is a function _getc_nolock. 
Before VS 2005 this function can be emulated by macro:

#define _getc_nolock(_stream) (--(_stream)-_cnt = 0 ? \
  0xff  *(_stream)-_ptr++ : _filbuf(_stream))

By switching to non-blocking getc() in spell.c I was able to reduce Vim startup 
time from about 0.9 seconds to about 0.7 seconds.
  


How did you measure the time in EnterCriticalSection and 
LeaveCriticalSection? If there's no lock contention, these routines are 
little more than InterlockedIncrement and InterlockedDecrement, without 
a kernel transition or blocking. If the lock is already held, then by 
definition, EnterCriticalSection has to block until the lock is 
available. Similarly, if LeaveCriticalSection detects that there are 
other callers waiting, it will signal one of the waiters.


In other words, if you're seeing significant time in Enter/LeaveCS, I 
can think of two causes. Either your measurement tool has perturbed the 
results, or there really is some multithreaded lock contention. The 
former seems more likely, as Vim is single-threaded, but who knows what 
some DLLs in the Vim process might be doing.


I would be vary wary of using the _getc_nolock macro until we understand 
why you are seeing those results.


--
/George V. Reilly  [EMAIL PROTECTED]
http://www.georgevreilly.com/blog
The biggest mistake is not learning from all your other mistakes.



Re: Vim 7 performance notes

2007-02-05 Thread Alexei Alexandrov
Hi Bram Moolenaar, you wrote:

 
 Speed should be OK this way, but it does keep up to 32 Kbyte allocated.
 That may not seem much, but if we do this in many places it quickly adds
 up.

Any keep limit greater than initial size (e.g. 16384 bytes) will give the 
same effect in many cases. By many cases here I mean cases when the stack 
just doesn't grow more than 512 bytes (very usual for syntax highlighting). Is 
16384 still too big? Regexp matching and syntax highlighting is one of the most 
important things in VIM so maybe it's tolerable? Also, as far as I understand, 
previosly (version 6) VIM used program stack as regstack, right? Compared with 
program stack solution current solution with the change proposed is much better 
because program stack never shrinks - once it grows to 1M, for example, the 
memory won't be ever given back. At least, this is so on Windows, Linux and I 
guess most Unixes.
 
 Can you show the benchmarks you used to see the performance and the
 stack space that is being used?  Otherwise we keep guessing the best
 numbers.

OK. Platform used for investigations: x86, Windows XP SP 2. Pentium 4 
Northwood, 2.4 GHz, 512M RAM.
I did 2 things: understanding stack usage and performance measurement. To 
understand the stack usage I added some simple logging to regexp.c: printing 
ga_maxlen before regstack and backpos clearing and forced the arrays to have 
the grow size 1 (so that ga_maxlen will be high watermark in bytes). Of course, 
for performance investigations I used version with normal grow size and without 
logging.

The version with logging was used to perform the following:

1. With syntax highlighting on, the following files were viewed from gg to G 
(with PgDn) and the following high watermark of stack size was observed: 
spell.c (444 bytes), $VIMRUNTIME/filetype.vim (820 bytes), big text file with a 
lot of syntax errors (252 bytes)

2. Command

gvim -c vimgrep /a\(.\)*z/ *.c | q

was executed in VIM 7 source directory. Stack watermark - 31008 bytes. This is 
example of non-optimal regexp which tends to take a lot of stack space.

Similar other test cases were tried leading to the following conclusions: 1) 
there is a lot of vim_regexec_both calls during syntax highlighting which work 
in very shallow stacks (1K); 2) when user searches for something with regexp 
there are cases when regular expression can require big amount of memory (10K).

The performance measurements were done against original version (7.0.188) and 
modified regexp.c (initial: 8192, keep limit: 16384). Each measurement was 
performed 3 times, minimal time was picked up.

First, I test the syntax highlighting speed:
Command: gvim.exe -f $VIMRUNTIME/filetype.vim -c for i in range(199) | redraw! 
| endfor | q
Original version: 10.6 seconds
Modified version: 8.5 seconds
The difference is about 25%.

Second, I did some grepping through Vim sources again:
Command: gvim.exe -c vimgrep /a.*z/ *.c | q
Original version: 6.6 seconds
Modified version: 5.6 seconds
The difference is about 15%.

 Coding detail: please don't use if (!number), use if (number == 0),
 that is so much easier to read.  Checking if ga_data is NULL would be
 simpler.

Got it - no problem.

-- 
Alexei Alexandrov


Re: Vim 7 performance notes

2007-02-05 Thread Bram Moolenaar

Alexei Alexandrov wrote:

 OK. Platform used for investigations: x86, Windows XP SP 2. Pentium 4
 Northwood, 2.4 GHz, 512M RAM.
 I did 2 things: understanding stack usage and performance measurement.
 To understand the stack usage I added some simple logging to regexp.c:
 printing ga_maxlen before regstack and backpos clearing and forced the
 arrays to have the grow size 1 (so that ga_maxlen will be high
 watermark in bytes). Of course, for performance investigations I used
 version with normal grow size and without logging.
 
 The version with logging was used to perform the following:
 
 1. With syntax highlighting on, the following files were viewed from
 gg to G (with PgDn) and the following high watermark of stack size was
 observed: spell.c (444 bytes), $VIMRUNTIME/filetype.vim (820 bytes),
 big text file with a lot of syntax errors (252 bytes)

It sounds like keeping only 1024 bytes would already work for most
situations.  That would be an acceptable amount to keep allocated at
all times.  So why don't we use this as the initial size, and when it
grows larger we free it when finished.  The growth size can be doubled
each time perhaps.

 2. Command
 
 gvim -c vimgrep /a\(.\)*z/ *.c | q
 
 was executed in VIM 7 source directory. Stack watermark - 31008 bytes.
 This is example of non-optimal regexp which tends to take a lot of
 stack space.

Right, this may happen and stack size wil greatly depend on the line
length.

 Similar other test cases were tried leading to the following
 conclusions: 1) there is a lot of vim_regexec_both calls during syntax
 highlighting which work in very shallow stacks (1K); 2) when user
 searches for something with regexp there are cases when regular
 expression can require big amount of memory (10K).
 
 The performance measurements were done against original version
 (7.0.188) and modified regexp.c (initial: 8192, keep limit: 16384).
 Each measurement was performed 3 times, minimal time was picked up.
 
 First, I test the syntax highlighting speed:
 Command: gvim.exe -f $VIMRUNTIME/filetype.vim -c for i in range(199) | 
 redraw! | endfor | q
 Original version: 10.6 seconds
 Modified version: 8.5 seconds
 The difference is about 25%.
 
 Second, I did some grepping through Vim sources again:
 Command: gvim.exe -c vimgrep /a.*z/ *.c | q
 Original version: 6.6 seconds
 Modified version: 5.6 seconds
 The difference is about 15%.

That's very useful, thanks for diving into this.

-- 
From know your smileys:
 8}}   Glasses, big nose, beard

 /// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net   \\\
///sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\download, build and distribute -- http://www.A-A-P.org///
 \\\help me help AIDS victims -- http://ICCF-Holland.org///


Re: Vim 7 performance notes

2007-02-04 Thread Yakov Lerner

On 2/4/07, Bram Moolenaar [EMAIL PROTECTED] wrote:


Alexei Alexandrov wrote:

 I'm doing some performance investigations of Vim code trying to
 understand whether there are any possibilities to improve it.
 Currently I've made the following observations (all investigations are
 done on Windows):

 Redundant work is done during regexp operations in syntax
 highlighting. On some files it is very noticable. The stack of the
 hotspot is ...  syn_current_attr  syn_regexec  vim_regexec_multi 
 vim_regexec_both  regtry  regmatch  ga_grow  alloc_clear  memset.
 So alloc_clear spends quite a few clockticks in lalloc() and memset().
 The reason for this is pessimistically big grow size for regset
 growing array:

 ga_init2(regstack, 1, 1);

 This is not very good: many regexp operations don't go deep -
 non-match is detected very quickly. But even one element on the stack
 will lead to allocating at least 1 bytes (which should be fast
 with good CRT memory allocator) and (worse) initializing these 1
 bytes with zeros (won't be that fast).

 One possible solution would be to keep regstack alive across calls to
 vim_regexec_both, but I'm not sure if it's can be done safely. What I
 did was replacing the grow size with smaller number and making the
 grow size for growing arrays dynamic with increase of 25%:

 --- regexp.c (revision 136)
 +++ regexp.c (working copy)
 @@ -3350,7 +3350,7 @@
  /* Init the regstack empty.  Use an item size of 1 byte, since we push
   * different things onto it.  Use a large grow size to avoid reallocating
   * it too often. */
 -ga_init2(regstack, 1, 1);
 +ga_init2(regstack, 1, 64);

  /* Init the backpos table empty. */
  ga_init2(backpos, sizeof(backpos_T), 10);

 --- misc2.c (revision 136)
 +++ misc2.c (working copy)
 @@ -1905,6 +1905,7 @@

  {
   if (n  gap-ga_growsize)
   n = gap-ga_growsize;
 +gap-ga_growsize += (gap-ga_growsize  2);
   len = gap-ga_itemsize * (gap-ga_len + n);
   pp = alloc_clear((unsigned)len);
   if (pp == NULL)

 With this change I can see serious performance improvements, but I'm
 not sure if they are safe.
 Bram, does it look making any sense?

This is a tradeoff between allocating too much and allocating too often.
What works best completely depends on the implementation of malloc() and
free().  I know that there are many implementations where they are quite
slow.  Then allocating larger chunks less often is better.  On some
systems they are fast and the chunks can be smaller.


Gnu malloc (glibc) is exceptionally fast, iirc. It is possible
to benchmark the malloc speed during  the ./configure  time.
And auto-select the initital size depending on the results.

The procmail this similar technique in configure: It automatically
benchmarks  it's own builtin strstr() vs system's strstr() and selects
the one which is faster.

Yakov


Re: Vim 7 performance notes

2007-02-04 Thread Bram Moolenaar

Alexei Alexandrov wrote:

 
  The idea to gradually increase the chunk size makes sense, but not
  everywhere.  For the syntax stack it's probably better to start with a
  stack that is mostly needed, then growing quite quickly (say double the
  chunk size every time).  That's because when recursive things are
  involved we need much more space.  And copying the stack to another
  place is more expensive than clearing with zeroes.
 
  Perhaps you can do some investigation about what the size mostly ends up
  to be.  Then use that with a special version of ga_grow() that increases
  the chunk size every time.
 
 I've also tried the approach with persisting the regstack and backpos
 allocation across calls to vim_regexec_both. It seems to work even
 better than increasing the grow size in my particular test cases. And
 I can't think up any situation when it can slow things down. Could you
 please take a look at the patch attached and provide your opinion?

Speed should be OK this way, but it does keep up to 32 Kbyte allocated.
That may not seem much, but if we do this in many places it quickly adds
up.

Can you show the benchmarks you used to see the performance and the
stack space that is being used?  Otherwise we keep guessing the best
numbers.

Coding detail: please don't use if (!number), use if (number == 0),
that is so much easier to read.  Checking if ga_data is NULL would be
simpler.

-- 
From know your smileys:
 :q vi user saying, How do I get out of this damn emacs editor?

 /// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net   \\\
///sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\download, build and distribute -- http://www.A-A-P.org///
 \\\help me help AIDS victims -- http://ICCF-Holland.org///


Vim 7 performance notes

2007-02-01 Thread Alexei Alexandrov
Hi Bram et al.,

I'm doing some performance investigations of Vim code trying to understand 
whether there are any possibilities to improve it.
Currently I've made the following observations (all investigations are done on 
Windows):

Redundant work is done during regexp operations in syntax highlighting. On some 
files it is very noticable. The stack of the hotspot is ...  syn_current_attr 
 syn_regexec  vim_regexec_multi  vim_regexec_both  regtry  regmatch  
ga_grow  alloc_clear  memset. So alloc_clear spends quite a few clockticks in 
lalloc() and memset().
The reason for this is pessimistically big grow size for regset growing array:

ga_init2(regstack, 1, 1);

This is not very good: many regexp operations don't go deep - non-match is 
detected very quickly. But even one element on the stack will lead to 
allocating at least 1 bytes (which should be fast with good CRT memory 
allocator) and (worse) initializing these 1 bytes with zeros (won't be that 
fast).

One possible solution would be to keep regstack alive across calls to 
vim_regexec_both, but I'm not sure if it's can be done safely. What I did was 
replacing the grow size with smaller number and making the grow size for 
growing arrays dynamic with increase of 25%:

--- regexp.c (revision 136)
+++ regexp.c (working copy)
@@ -3350,7 +3350,7 @@
 /* Init the regstack empty.  Use an item size of 1 byte, since we push
  * different things onto it.  Use a large grow size to avoid reallocating
  * it too often. */
-ga_init2(regstack, 1, 1);
+ga_init2(regstack, 1, 64);

 /* Init the backpos table empty. */
 ga_init2(backpos, sizeof(backpos_T), 10);

--- misc2.c (revision 136)
+++ misc2.c (working copy)
@@ -1905,6 +1905,7 @@

 {
  if (n  gap-ga_growsize)
  n = gap-ga_growsize;
+gap-ga_growsize += (gap-ga_growsize  2);
  len = gap-ga_itemsize * (gap-ga_len + n);
  pp = alloc_clear((unsigned)len);
  if (pp == NULL)

With this change I can see serious performance improvements, but I'm not sure 
if they are safe.
Bram, does it look making any sense?

-- 
Alexei Alexandrov

Re: Vim 7 performance notes

2007-02-01 Thread Martin Stubenschrott
On Thu, Feb 01, 2007 at 11:20:16PM +0300, Alexei Alexandrov wrote:
 Hi Bram et al.,
 
 I'm doing some performance investigations of Vim code trying to understand 
 whether there are any possibilities to improve it.
 Currently I've made the following observations (all investigations are done 
 on Windows):
 
 Redundant work is done during regexp operations in syntax highlighting. On 
 some files it is very noticable. The stack of the hotspot is ...  
 syn_current_attr  syn_regexec  vim_regexec_multi  vim_regexec_both  
 regtry  regmatch  ga_grow  alloc_clear  memset. So alloc_clear spends 
 quite a few clockticks in lalloc() and memset().
 The reason for this is pessimistically big grow size for regset growing array:
 
 ga_init2(regstack, 1, 1);
 
 This is not very good: many regexp operations don't go deep - non-match is 
 detected very quickly. But even one element on the stack will lead to 
 allocating at least 1 bytes (which should be fast with good CRT memory 
 allocator) and (worse) initializing these 1 bytes with zeros (won't be 
 that fast).

I am not sure if this is really relevant to vim, because I don't have a
clue which regexp matching algorithm it is using, but you might want to
look into this article, when it comes to regexp speed:

http://swtch.com/~rsc/regexp/regexp1.html

regards,

Martin