On 27.11.2012 23:28, Jakub Jermar wrote:
> On 11/27/2012 01:27 AM, Petr Koupý wrote:
>>> Revision 1712 added PIO tracing, which may be contributing to this.
>>> [...]
>>>
>>> You may also want to try to disable the tracing code completely and see
>>> what happens.
>>
>> Thanks for the tip. I tried to disable PIO tracing, but the CPU
>> utilization did not change. It got me wondering whether the issue is
>> really caused by revision 1712 or not. So I tried 1711 and it is
>> happening there as well. On the other hand, 1710 and below are not
>> affected by this issue. I don't see anything obvious in the 1711 diff
>> that could manifest like that. Maybe a syscall was introduced into some
>> frequently executed code path which was previously syscall-free. But
>> that is a pure guess on my part. You will probably shed more light on
>> it, Jakub, as you are the author of 1711.
>
> Hm, this seems to be related to the userspace stack size, even though I
> don't understand why (yet). For me, the kernel utilization of the i8042
> task is around 26% on amd64 with 8K stacks and goes over 40% with 1M
> stacks.
>
> The reason I don't understand why is that there should be no real
> difference, because the amount of allocated physical memory and mapped
> pages should be similar in both cases (assuming the threads and fibrils
> remain confined to 8K of stack anyway).
>
> I was, for a short time, suspecting that as_area_destroy() could be the
> culprit (i.e. going through the 1M range instead of an 8K range and
> freeing pages from there), but then realized that since it only inspects
> the used space in the area, i.e. one or two pages, it should be equally
> fast for 1M stacks as for 8K stacks.
>
> I will keep searching.
Ok, here is a hypothesis: the increased stack size forces the kernel to
allocate more last-level page tables, and this, together with the
sub-optimal cleanup of the page tables, contributes to the increased
kernel utilization.

Rationale: on my VirtualBox, there are approximately 500 i8042 IRQs per
second when continuously moving the mouse. Each of these IRQs is handled
by an interrupt fibril, for which we need to create a stack. Now, with 4K
pages, 4B PTEs and 1024-entry page tables, a last-level page table maps
4M of virtual address space. With 8K stacks (in fact 12K, counting the
empty guard page), one last-level page table can map some 341 stacks, so
the kernel only has to allocate about 2 pages per second for the
last-level page tables.

The situation changes dramatically when the stacks are 1M large, because
a last-level page table can then map only 3 stacks, leaving space for the
4K guard pages in between. In order to keep up with the virtual memory
demand, the kernel needs to allocate 167 pages per second for the
last-level page tables to provide enough stacks for ~500 fibrils, which
is ~80 times more than with the 8K stacks.

Of course, the situation is slightly more complicated than this, because
the finishing interrupt fibrils will be releasing their stacks in
parallel with the new allocations, so some of the new allocations will be
able to reuse the emptied PTEs. But I think the above worst-case scenario
illustrates the problem well.

Also, with the 1M stacks containing only one or two valid pages,
page_mapping_remove() will spend a longer time figuring out whether it is
safe to deallocate the last-level page table, because the valid entries
will be farther away from each other than when there are only minimal
gaps, as is the case with 8K stacks. This particular inefficiency could
be improved, but I am not sure how much it contributes to the problem.
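For what it's worth, the arithmetic above can be written down as a small
standalone C program. This is only a back-of-the-envelope sketch: the
constants (4K pages, 4B PTEs, 1024-entry last-level page tables, one 4K
guard page per stack, ~500 interrupt fibrils per second) are the
assumptions from this mail rather than values taken from the HelenOS
sources, and the worst case assumes no reuse of partially filled page
tables.

    /*
     * Back-of-the-envelope sketch of the page table arithmetic above.
     * The constants are assumptions from this mail, not values read
     * from the HelenOS sources.
     */
    #include <stdio.h>

    #define PAGE_SIZE       (4UL * 1024)
    #define PTES_PER_TABLE  1024UL
    #define TABLE_SPAN      (PTES_PER_TABLE * PAGE_SIZE)  /* 4M per table */
    #define GUARD_SIZE      PAGE_SIZE
    #define FIBRILS_PER_SEC 500UL

    static void report(const char *label, unsigned long stack_size)
    {
        /* Each stack occupies stack_size plus one unmapped guard page. */
        unsigned long slot = stack_size + GUARD_SIZE;
        unsigned long stacks_per_table = TABLE_SPAN / slot;
        /* Worst case: no reuse of partially filled last-level tables. */
        unsigned long tables_per_sec =
            (FIBRILS_PER_SEC + stacks_per_table - 1) / stacks_per_table;

        printf("%s: %lu stacks per last-level table, ~%lu table pages/s\n",
            label, stacks_per_table, tables_per_sec);
    }

    int main(void)
    {
        report("8K stacks", 8UL * 1024);      /* ~341 stacks, ~2 pages/s */
        report("1M stacks", 1024UL * 1024);   /* ~3 stacks, ~167 pages/s */
        return 0;
    }

It reproduces the numbers quoted above: ~341 vs. 3 stacks per last-level
page table, and ~2 vs. ~167 page-table pages allocated per second.

Jakub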
