On 27.11.2012 23:28, Jakub Jermar wrote:
> On 11/27/2012 01:27 AM, Petr Koupý wrote:
>>> Revision 1712 added PIO tracing, which may be contributing to this. 
>>> [...]
>>>
>>> You may also want to try to disable the tracing code completely and see
>>> what happens.
>>
>> Thanks for the tip. I tried to disable PIO tracing but the CPU utilization 
>> did not change. It got me wondering whether the issue is really caused by 
>> revision 1712 or not. So I tried 1711 and it is happening there as well.
>> On the other hand, 1710 and below are not affected by this issue. I don't
>> see anything obvious in the 1711 diff that could manifest like that. Maybe
>> a syscall was introduced into some frequently executed code path which
>> was syscall-free previously. But that is a pure guess on my part. You will
>> probably shed more light on it, Jakub, as you are the author of 1711.
> 
> Hm, this seems to be related to the userspace stack size, even though I
> don't understand why (yet). For me, the kernel utilization is around 26%
> on amd64 with 8K stacks and becomes over 40% with 1M stacks for the
> i8042 task.
> 
> The reason I don't understand it is that there should be no real
> difference: the amount of allocated physical memory and the number of
> mapped pages should be similar in both cases (assuming the threads and
> fibrils remain confined to 8K of stack anyway).
> 
> I was, for a short time, suspecting as_area_destroy() could be the
> culprit (i.e. going through the 1M range instead of an 8K range and
> freeing pages from there), but then realized that since it only
> inspects the used space in the area, i.e. one or two pages, it should
> be as fast for 1M stacks as it is for 8K stacks.
> 
> I will keep searching.

Ok, here is a hypothesis:

The increased stack size forces the kernel to allocate more last-level
page tables. This, together with the sub-optimal cleanup of those page
tables, contributes to the increased kernel utilization.

Rationale:

On my VirtualBox, there are approximately 500 i8042 IRQs per second when
continuously moving the mouse. Each of these IRQs is handled by an
interrupt fibril, for which we need to create a stack.

Now, with 4K pages, 4B PTEs and 1024-entry page tables, a last-level
page table maps 4M of virtual address space. With 8K stacks (in fact 12K
counting the empty guard page), one last-level page table can map some
341 stacks, so the kernel only has to allocate about 2 pages per second
for the last-level page tables.

The situation changes dramatically when the stacks are 1M large, because
a last-level page table can then map only 3 stacks, with 4K guard pages
in between. In order to keep up with the virtual memory demand, the
kernel needs to allocate about 167 pages per second for the last-level
page tables to provide enough stacks for ~500 fibrils, which is ~80
times more than with the 8K stacks.
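
To make the arithmetic explicit, here is a back-of-the-envelope
calculation (a standalone sketch, not HelenOS code; the 500 IRQs/s
figure is just my VirtualBox measurement from above):

  #include <stdio.h>

  #define PAGE_SIZE       (4UL * 1024)        /* 4K pages */
  #define PTES_PER_TABLE  1024UL              /* 4B PTEs, one page per table */
  #define TABLE_COVERAGE  (PAGE_SIZE * PTES_PER_TABLE)  /* 4M per table */
  #define GUARD_SIZE      PAGE_SIZE           /* one unmapped guard page */
  #define IRQS_PER_SEC    500UL               /* measured i8042 IRQ rate */

  static void report(const char *label, unsigned long stack_size)
  {
      unsigned long footprint = stack_size + GUARD_SIZE;  /* stack + guard */
      unsigned long stacks_per_table = TABLE_COVERAGE / footprint;
      /* New last-level page tables needed per second, rounded up. */
      unsigned long tables_per_sec =
          (IRQS_PER_SEC + stacks_per_table - 1) / stacks_per_table;

      printf("%s: %lu stacks per table, ~%lu new tables/s\n",
          label, stacks_per_table, tables_per_sec);
  }

  int main(void)
  {
      report("8K stacks", 8UL * 1024);      /* -> 341 stacks, ~2 tables/s */
      report("1M stacks", 1024UL * 1024);   /* -> 3 stacks, ~167 tables/s */
      return 0;
  }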

Of course, the situation is slightly more complicated than this, because
the finishing interrupt fibrils will be releasing their stacks in
parallel with the new allocations, so some of the new allocations will
be able to reuse the emptied PTEs. But I think the above worst-case
situation illustrates the problem well.

Also, with the 1M stacks containing only one or two valid pages,
page_mapping_remove() will spend more time figuring out whether it is
safe to deallocate the last-level page table, because the valid entries
will be farther away from each other than when there are only minimal
gaps, as is the case with 8K stacks. This particular inefficiency could
be improved, but I am not sure how much it contributes to the problem.
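
For illustration, this is the kind of emptiness check I have in mind (a
simplified sketch of the decision, not the actual page_mapping_remove()
code; pte_t and the 'present' bit are stand-ins for the real
structures). With densely packed 8K stacks the walk hits a still-valid
entry almost immediately and can bail out, whereas with sparse 1M
stacks it has to skim over long runs of empty slots first:

  #include <stdbool.h>
  #include <stddef.h>

  #define PTES_PER_TABLE  1024

  typedef struct {
      unsigned present : 1;   /* entry maps a valid frame */
  } pte_t;

  /*
   * After one mapping has been removed, decide whether the whole
   * last-level page table is now empty and its frame can be freed.
   */
  static bool table_is_empty(pte_t *table)
  {
      for (size_t i = 0; i < PTES_PER_TABLE; i++) {
          if (table[i].present)
              return false;   /* still in use, must not be freed */
      }
      return true;            /* no valid entries left, safe to free */
  }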

Jakub
