Re: ntdll: Fixed some heap allocation stalls

2012-11-07 Thread Alexandre Julliard
Steaphan Greene  writes:

> On 11/03/2012 10:28 AM, Matteo Bruni wrote:
>> Speaking of which, it might be a nice followup patch to add some free
>> lists usage stats, to get some idea of what different applications do
>> in that regard.
>
> Well, my code instrumentation consisted of fprintf()s, which I
> redirected through grep, sort, and uniq.  So, I don't have anything
> sophisticated to share on that end.

Actually, what would be useful would be a test program that replicates
the allocation pattern of that app, so we can experiment with various
settings and measure the difference.

-- 
Alexandre Julliard
julli...@winehq.org




Re: ntdll: Fixed some heap allocation stalls

2012-11-03 Thread Steaphan Greene

On 11/03/2012 10:28 AM, Matteo Bruni wrote:

2012/11/3 Steaphan Greene:

On 11/03/2012 09:04 AM, Matteo Bruni wrote:

2012/11/2 Steaphan Greene:

Running a game in wine showed it performing terribly.  I traced this to
the
fact that it allocates and deallocates tiny memory chunks over and over
(I
suspect it's in C++ and passing things by value everywhere).  This led to
huge stalls because the heap bins weren't fine-grained enough (these
differed in size in steps of 8 bytes, but the bins differed by 16+, so it
spent a lot of time searching each bin for a bigger block).  I added more
fine-grained sizes to the smaller end of this, and now it runs faster in
wine than it does natively. :)

This was run on Debian squeeze, amd64.

Note, this is my first submission to wine in nearly 15 years.  So, of
course, everything has changed with how this works now.  Hope I have this
all right.

---
   dlls/ntdll/heap.c |4 +++-
   1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/dlls/ntdll/heap.c b/dlls/ntdll/heap.c
index a9044714..eb7406b 100644
--- a/dlls/ntdll/heap.c
+++ b/dlls/ntdll/heap.c
@@ -116,7 +116,9 @@ C_ASSERT( sizeof(ARENA_LARGE) % LARGE_ALIGNMENT == 0
);
   /* Max size of the blocks on the free lists */
   static const SIZE_T HEAP_freeListSizes[] =
   {
-0x10, 0x20, 0x30, 0x40, 0x60, 0x80, 0x100, 0x200, 0x400, 0x1000,
~0UL
+0x10, 0x18, 0x20, 0x28, 0x30, 0x38, 0x40, 0x48, 0x50, 0x58, 0x60,
0x68,
+0x70, 0x78, 0x80, 0x88, 0x90, 0x98, 0xA0, 0xA8, 0xB0, 0xB8, 0xC0,
0xC8,
+0xD0, 0xD8, 0xE0, 0E88, 0xF0, 0xF8, 0x100, 0x200, 0x400, 0x1000,
~0UL
   };
   #define HEAP_NB_FREE_LISTS
(sizeof(HEAP_freeListSizes)/sizeof(HEAP_freeListSizes[0]))

That 0E88 looks quite wrong ;)

Apart from that, although I'm not an expert for this code, this patch
looks reasonable to me. Maybe we don't want so many free lists, but I
don't see big downsides for that (e.g. the linear search can be
replaced by a binary search, if need be). Maybe adding a smaller list
at the start (e.g. 0x8) makes sense too?


Yep, that's a typo.  Don't know how that got past me.  Sorry.  Should I
resend a corrected version?


Yes. You should also add a (try 2) to the email subject.


Done.



I did try with fewer (every 16 instead of every 8), and, though it was still
a dramatic improvement, it was still slow.

I was thinking about e.g. going every 16 after 0x80, or some similar
pseudo-exponential growing, but that really depends on the allocation
pattern of the applications.


That might be fine, though I haven't found any evidence that this number 
of bins has any significant downside other than one pointer per bin per 
heap, which seems tiny for the gains it produces.  It does search them 
linearly, which, as you mention, is easy to fix with a binary search, 
but this number is still so small, it's really not that significant.


With a larger number of bins, it might be more of an issue, but if it 
were also setup with a regular pattern of bins, then finding the right 
bin could be a simple O(1) calculation, and even faster than a binary 
search.


I might be curious enough right now to implement that whole thing 
myself, if it's desired.




Speaking of which, it might be a nice followup patch to add some free
lists usage stats, to get some idea of what different applications do
in that regard.


Well, my code instrumentation consisted of fprintf()s, which I 
redirected through grep, sort, and uniq.  So, I don't have anything 
sophisticated to share on that end.



--
Steaphan Greene





Re: ntdll: Fixed some heap allocation stalls

2012-11-03 Thread Steaphan Greene

On 11/03/2012 02:09 AM, Matej Špindler wrote:

On 03. 11. 2012 01:20, Dan Kegel wrote:

Which game were you testing?




League of Legends has similar issues with memory allocation.

This patch does something similar:
http://uz.sns.it/~ranma42/iLoL/spectator-fix-v2/0001-ntdll-Improve-performace-of-heap-allocation-v2.patch 



I have it in my tree since February and no problems so far.


Yes, that patch looks like it is targeted at speeding larger allocations 
(where mine speeds smaller ones). These are similar fixes for opposite 
ends of the spectrum. And, the two should be simple to merge. Though, I 
don't believe the 0x08 on that one actually does anything (since these 
values include the arena overhead, and the smallest allocation I've seen 
is 8). However, I haven't looked at many examples, and I believe if an 
allocation request of 4 is possible, then that *would* hit that "0x08" bin.


--
Steaphan Greene





Re: ntdll: Fixed some heap allocation stalls

2012-11-03 Thread Matteo Bruni
2012/11/3 Steaphan Greene :
> On 11/03/2012 09:04 AM, Matteo Bruni wrote:
>>
>> 2012/11/2 Steaphan Greene:
>>>
>>> Running a game in wine showed it performing terribly.  I traced this to
>>> the
>>> fact that it allocates and deallocates tiny memory chunks over and over
>>> (I
>>> suspect it's in C++ and passing things by value everywhere).  This led to
>>> huge stalls because the heap bins weren't fine-grained enough (these
>>> differed in size in steps of 8 bytes, but the bins differed by 16+, so it
>>> spent a lot of time searching each bin for a bigger block).  I added more
>>> fine-grained sizes to the smaller end of this, and now it runs faster in
>>> wine than it does natively. :)
>>>
>>> This was run on Debian squeeze, amd64.
>>>
>>> Note, this is my first submission to wine in nearly 15 years.  So, of
>>> course, everything has changed with how this works now.  Hope I have this
>>> all right.
>>>
>>> ---
>>>   dlls/ntdll/heap.c |4 +++-
>>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/dlls/ntdll/heap.c b/dlls/ntdll/heap.c
>>> index a9044714..eb7406b 100644
>>> --- a/dlls/ntdll/heap.c
>>> +++ b/dlls/ntdll/heap.c
>>> @@ -116,7 +116,9 @@ C_ASSERT( sizeof(ARENA_LARGE) % LARGE_ALIGNMENT == 0
>>> );
>>>   /* Max size of the blocks on the free lists */
>>>   static const SIZE_T HEAP_freeListSizes[] =
>>>   {
>>> -0x10, 0x20, 0x30, 0x40, 0x60, 0x80, 0x100, 0x200, 0x400, 0x1000,
>>> ~0UL
>>> +0x10, 0x18, 0x20, 0x28, 0x30, 0x38, 0x40, 0x48, 0x50, 0x58, 0x60,
>>> 0x68,
>>> +0x70, 0x78, 0x80, 0x88, 0x90, 0x98, 0xA0, 0xA8, 0xB0, 0xB8, 0xC0,
>>> 0xC8,
>>> +0xD0, 0xD8, 0xE0, 0E88, 0xF0, 0xF8, 0x100, 0x200, 0x400, 0x1000,
>>> ~0UL
>>>   };
>>>   #define HEAP_NB_FREE_LISTS
>>> (sizeof(HEAP_freeListSizes)/sizeof(HEAP_freeListSizes[0]))
>>
>> That 0E88 looks quite wrong ;)
>>
>> Apart from that, although I'm not an expert for this code, this patch
>> looks reasonable to me. Maybe we don't want so many free lists, but I
>> don't see big downsides for that (e.g. the linear search can be
>> replaced by a binary search, if need be). Maybe adding a smaller list
>> at the start (e.g. 0x8) makes sense too?
>
>
> Yep, that's a typo.  Don't know how that got past me.  Sorry.  Should I
> resend a corrected version?
>

Yes. You should also add a (try 2) to the email subject.

> 0x08 was left off because these values include the overhead of the arena
> info, so the smallest requested size of 8 actually searches for larger than
> 8 (12, I think).

Ah, good point, I misread the code. Yes, it makes perfectly sense as it is.

>
> I did try with fewer (every 16 instead of every 8), and, though it was still
> a dramatic improvement, it was still slow.
>

I was thinking about e.g. going every 16 after 0x80, or some similar
pseudo-exponential growing, but that really depends on the allocation
pattern of the applications.

Speaking of which, it might be a nice followup patch to add some free
lists usage stats, to get some idea of what different applications do
in that regard.




Re: ntdll: Fixed some heap allocation stalls

2012-11-03 Thread Steaphan Greene

On 11/03/2012 09:04 AM, Matteo Bruni wrote:

2012/11/2 Steaphan Greene:

Running a game in wine showed it performing terribly.  I traced this to the
fact that it allocates and deallocates tiny memory chunks over and over (I
suspect it's in C++ and passing things by value everywhere).  This led to
huge stalls because the heap bins weren't fine-grained enough (these
differed in size in steps of 8 bytes, but the bins differed by 16+, so it
spent a lot of time searching each bin for a bigger block).  I added more
fine-grained sizes to the smaller end of this, and now it runs faster in
wine than it does natively. :)

This was run on Debian squeeze, amd64.

Note, this is my first submission to wine in nearly 15 years.  So, of
course, everything has changed with how this works now.  Hope I have this
all right.

---
  dlls/ntdll/heap.c |4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/dlls/ntdll/heap.c b/dlls/ntdll/heap.c
index a9044714..eb7406b 100644
--- a/dlls/ntdll/heap.c
+++ b/dlls/ntdll/heap.c
@@ -116,7 +116,9 @@ C_ASSERT( sizeof(ARENA_LARGE) % LARGE_ALIGNMENT == 0 );
  /* Max size of the blocks on the free lists */
  static const SIZE_T HEAP_freeListSizes[] =
  {
-0x10, 0x20, 0x30, 0x40, 0x60, 0x80, 0x100, 0x200, 0x400, 0x1000, ~0UL
+0x10, 0x18, 0x20, 0x28, 0x30, 0x38, 0x40, 0x48, 0x50, 0x58, 0x60, 0x68,
+0x70, 0x78, 0x80, 0x88, 0x90, 0x98, 0xA0, 0xA8, 0xB0, 0xB8, 0xC0, 0xC8,
+0xD0, 0xD8, 0xE0, 0E88, 0xF0, 0xF8, 0x100, 0x200, 0x400, 0x1000, ~0UL
  };
  #define HEAP_NB_FREE_LISTS  
(sizeof(HEAP_freeListSizes)/sizeof(HEAP_freeListSizes[0]))

That 0E88 looks quite wrong ;)

Apart from that, although I'm not an expert for this code, this patch
looks reasonable to me. Maybe we don't want so many free lists, but I
don't see big downsides for that (e.g. the linear search can be
replaced by a binary search, if need be). Maybe adding a smaller list
at the start (e.g. 0x8) makes sense too?


Yep, that's a typo.  Don't know how that got past me.  Sorry.  Should I 
resend a corrected version?


0x08 was left off because these values include the overhead of the arena 
info, so the smallest requested size of 8 actually searches for larger 
than 8 (12, I think).


I did try with fewer (every 16 instead of every 8), and, though it was 
still a dramatic improvement, it was still slow.


--
Steaphan Greene





Re: ntdll: Fixed some heap allocation stalls

2012-11-03 Thread Matteo Bruni
2012/11/2 Steaphan Greene :
> Running a game in wine showed it performing terribly.  I traced this to the
> fact that it allocates and deallocates tiny memory chunks over and over (I
> suspect it's in C++ and passing things by value everywhere).  This led to
> huge stalls because the heap bins weren't fine-grained enough (these
> differed in size in steps of 8 bytes, but the bins differed by 16+, so it
> spent a lot of time searching each bin for a bigger block).  I added more
> fine-grained sizes to the smaller end of this, and now it runs faster in
> wine than it does natively. :)
>
> This was run on Debian squeeze, amd64.
>
> Note, this is my first submission to wine in nearly 15 years.  So, of
> course, everything has changed with how this works now.  Hope I have this
> all right.
>
> ---
>  dlls/ntdll/heap.c |4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/dlls/ntdll/heap.c b/dlls/ntdll/heap.c
> index a9044714..eb7406b 100644
> --- a/dlls/ntdll/heap.c
> +++ b/dlls/ntdll/heap.c
> @@ -116,7 +116,9 @@ C_ASSERT( sizeof(ARENA_LARGE) % LARGE_ALIGNMENT == 0 );
>  /* Max size of the blocks on the free lists */
>  static const SIZE_T HEAP_freeListSizes[] =
>  {
> -0x10, 0x20, 0x30, 0x40, 0x60, 0x80, 0x100, 0x200, 0x400, 0x1000, ~0UL
> +0x10, 0x18, 0x20, 0x28, 0x30, 0x38, 0x40, 0x48, 0x50, 0x58, 0x60, 0x68,
> +0x70, 0x78, 0x80, 0x88, 0x90, 0x98, 0xA0, 0xA8, 0xB0, 0xB8, 0xC0, 0xC8,
> +0xD0, 0xD8, 0xE0, 0E88, 0xF0, 0xF8, 0x100, 0x200, 0x400, 0x1000, ~0UL
>  };
>  #define HEAP_NB_FREE_LISTS  
> (sizeof(HEAP_freeListSizes)/sizeof(HEAP_freeListSizes[0]))

That 0E88 looks quite wrong ;)

Apart from that, although I'm not an expert for this code, this patch
looks reasonable to me. Maybe we don't want so many free lists, but I
don't see big downsides for that (e.g. the linear search can be
replaced by a binary search, if need be). Maybe adding a smaller list
at the start (e.g. 0x8) makes sense too?




Re: ntdll: Fixed some heap allocation stalls

2012-11-03 Thread Matej Špindler

On 03. 11. 2012 01:20, Dan Kegel wrote:

Which game were you testing?




League of Legends has similar issues with memory allocation.

This patch does something similar:
http://uz.sns.it/~ranma42/iLoL/spectator-fix-v2/0001-ntdll-Improve-performace-of-heap-allocation-v2.patch

I have it in my tree since February and no problems so far.




RE: ntdll: Fixed some heap allocation stalls

2012-11-02 Thread Daniel Lehman
>> Which game were you testing?

i dunno about the game, but i saw similar stalls in our software

a huge batch process that consumed about 1 GB on Windows would both run slower 
and eat up about 2 GB on Wine before crashing

i made a similar change to Steaphan's that not only improved performance, it 
brought the memory usage down to roughly what it was on Windows - and didn't 
crash

i was planning on submitting it at some point if i could think of a good test 
to validate it.  the only difference was that i added more lists, but i'd say 
adding more free lists definitely makes it faster

daniel






re: ntdll: Fixed some heap allocation stalls

2012-11-02 Thread Dan Kegel
Which game were you testing?