[lng-odp] [Bug 3774] Shmem validation test runs indefinitely with 1GB huge pages
https://bugs.linaro.org/show_bug.cgi?id=3774

Bill Fischofer changed:

           What    |Removed                     |Added
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #15 from Bill Fischofer ---
Fix has been merged.

--
You are receiving this mail because:
You are on the CC list for the bug.
[lng-odp] [Bug 3774] Shmem validation test runs indefinitely with 1GB huge pages
https://bugs.linaro.org/show_bug.cgi?id=3774

--- Comment #14 from Maxim Uvarov ---
https://github.com/Linaro/odp/commit/c46f54d8c708d6335b0288ff4a5aad3a3b93e41c
refs/heads/master
2018-09-12T17:36:54+03:00
Josep Puigdemont
josep.puigdem...@linaro.org

linux-gen: ishm: implement huge page cache

With this patch, ODP will pre-allocate several huge pages at init time. When memory is to be mapped into a huge page, one that was pre-allocated will be used, if available. This way ODP won't have to trap into the kernel to allocate huge pages.

The idea with this implementation is to trick ishm into thinking that a file descriptor where to map the memory was provided, so that it won't try to allocate one itself. This file descriptor is one of those previously allocated at init time. When the system is done with this file descriptor, instead of closing it, it is put back into the list of available huge pages, ready to be reused.

A collateral effect of this patch is that memory is not zeroed out when it is reused.

WARNING: This patch will not work when using process mode threads. For several reasons, this may not work when using ODP_ISHM_SINGLE_VA either, so when this flag is set, the list of pre-allocated files is not used.

By default ODP will not reserve any huge pages. To tell ODP to do that, update the ODP configuration file with something like this:

    shm: {
        num_cached_hp = 32
    }

Example usage:

    $ cat odp.conf
    odp_implementation = "linux-generic"
    config_file_version = "0.0.1"
    shm: {
        num_cached_hp = 32
    }

    $ ODP_CONFIG_FILE=odp.conf ./test/validation/api/shmem/shmem_main

This patch solves bug #3774:
https://bugs.linaro.org/show_bug.cgi?id=3774

Signed-off-by: Josep Puigdemont
Reviewed-and-tested-by: Matias Elo
Signed-off-by: Maxim Uvarov
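Editorial illustration (not part of the commit): the get/put cycle described in the commit message can be pictured as a small stack of file descriptors. The sketch below is a minimal model only. The names hp_cache_t, hp_cache_init, hp_cache_put, hp_cache_get and HP_CACHE_MAX are hypothetical and do not come from the ODP sources. It also leaves out the locking and the actual huge page allocation that a real implementation would need.

```c
#include <stddef.h>

/* Hypothetical capacity; chosen to match num_cached_hp = 32 in the example. */
#define HP_CACHE_MAX 32

/* Hypothetical cache of file descriptors for pre-allocated huge pages. */
typedef struct {
    int fds[HP_CACHE_MAX];
    size_t count;
} hp_cache_t;

/* Start with an empty cache; init code would then put() each pre-allocated fd. */
static void hp_cache_init(hp_cache_t *c)
{
    c->count = 0;
}

/* Return a descriptor to the cache instead of closing it.
 * Returns 0 on success, -1 if the cache is full (caller should close the fd). */
static int hp_cache_put(hp_cache_t *c, int fd)
{
    if (c->count >= HP_CACHE_MAX)
        return -1;
    c->fds[c->count++] = fd;
    return 0;
}

/* Take a cached descriptor, or -1 if none is available
 * (caller falls back to allocating a huge page from the kernel). */
static int hp_cache_get(hp_cache_t *c)
{
    if (c->count == 0)
        return -1;
    return c->fds[--c->count];
}
```

Because a descriptor taken from such a cache still maps the same physical page, its previous contents remain visible, which matches the note in the commit message that reused memory is not zeroed.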
[lng-odp] [Bug 3774] Shmem validation test runs indefinitely with 1GB huge pages
https://bugs.linaro.org/show_bug.cgi?id=3774

--- Comment #13 from Josep Puigdemont ---
A patch that fixes/mitigates this issue can be found here:
https://github.com/joseppc/odp/tree/fix/cache_huge_pages

This patch pre-allocates huge pages at init time and does not release them to the kernel until the application finishes. Instead, they are kept in a list ready to be reused, thus avoiding the time spent in the kernel zeroing out the memory.
[lng-odp] [Bug 3774] Shmem validation test runs indefinitely with 1GB huge pages
https://bugs.linaro.org/show_bug.cgi?id=3774

--- Comment #11 from Josep Puigdemont ---
The issue here seems to be the same as for bug 3867. When allocating a huge page, the kernel zeroes it out before handing it over to user space, and this is one of the causes of the delays. Another cause is that a single lock is taken when entering any of the shared memory module functions and released only on exit (roughly speaking), causing lock contention for other threads and CPU usage.

Due to the nature of this test application, which has many threads allocating and freeing shared memory in rather rapid succession, all of the above becomes a problem that results in the delays observed. Maybe this is not an issue for real world applications, which probably allocate memory once at start-up or, at least, not so often.

One initial idea to mitigate this problem was to keep the huge pages that were to be freed in a list of "available" pages rather than closing the file descriptor. However, this approach would create problems with "process mode" threads in ODP, as we would need to find a way to know when all threads have freed a given huge page (it might be possible to implement this functionality in fdserver, but it doesn't feel right).
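Editorial illustration (not from the thread): the bookkeeping problem mentioned above, knowing when all users have freed a given huge page, is essentially reference counting. The sketch below shows only the counting logic, using C11 atomics. The names hp_block_t, hp_block_init, hp_block_acquire and hp_block_release are hypothetical. For process mode the counter would additionally have to live somewhere all processes can reach (shared memory, or a helper such as fdserver), which this sketch does not attempt.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical descriptor for one huge page and its current users. */
typedef struct {
    int fd;            /* file descriptor backing the huge page */
    atomic_int users;  /* number of threads that still have it mapped */
} hp_block_t;

static void hp_block_init(hp_block_t *b, int fd)
{
    b->fd = fd;
    atomic_init(&b->users, 0);
}

/* Called by each thread that maps the page. */
static void hp_block_acquire(hp_block_t *b)
{
    atomic_fetch_add(&b->users, 1);
}

/* Called by each thread that frees the page. Returns true only for the
 * last user, i.e. when the page may go back onto the "available" list. */
static bool hp_block_release(hp_block_t *b)
{
    /* atomic_fetch_sub returns the value held before the decrement. */
    return atomic_fetch_sub(&b->users, 1) == 1;
}
```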
[lng-odp] [Bug 3774] Shmem validation test runs indefinitely with 1GB huge pages
https://bugs.linaro.org/show_bug.cgi?id=3774

--- Comment #10 from Matias Elo ---
I timed shmem test runs with both 2MB and 1GB pages (28 thread system):

2MB pages -

Run Summary:    Type    Total      Ran   Passed Failed Inactive
              suites        1        1      n/a      0        0
               tests        5        5        5      0        0
             asserts  1812146  1812146  1812146      0      n/a

Elapsed time = 105.336 seconds

101.08user 4.85system 0:05.31elapsed 1995%CPU (0avgtext+0avgdata 15968maxresident)k
0inputs+0outputs (0major+17910minor)pagefaults 0swaps

1GB pages -

Run Summary:    Type    Total      Ran   Passed Failed Inactive
              suites        1        1      n/a      0        0
               tests        5        5        5      0        0
             asserts  1807525  1807525  1807525      0      n/a

Elapsed time = 16599.636 seconds

15850.52user 751.34system 12:31.74elapsed 2208%CPU (0avgtext+0avgdata 17880maxresident)k
5480inputs+0outputs (20major+46992minor)pagefaults 0swaps

Perf cycles with 1GB pages:

  97.37%  shmem_main  [.] odp_spinlock_lock
   2.44%  [kernel]    [k] clear_page_erms
   0.08%  [kernel]    [k] clear_huge_page
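Editorial illustration (not from the thread): to put the two runs side by side, the wall-clock fields reported by time (0:05.31 and 12:31.74) can be converted to seconds and divided, which gives a slowdown of roughly 141x for 1GB pages. The helper names elapsed_to_seconds and slowdown are hypothetical, and the parser only handles the "m:ss.ss" form seen in this comment.

```c
#include <stdio.h>

/* Convert an "m:ss.ss" elapsed field to seconds; returns -1.0 on parse error. */
static double elapsed_to_seconds(const char *s)
{
    int minutes;
    double seconds;

    if (sscanf(s, "%d:%lf", &minutes, &seconds) != 2)
        return -1.0;
    return minutes * 60.0 + seconds;
}

/* Ratio of the slow run's wall-clock time to the fast run's. */
static double slowdown(const char *slow, const char *fast)
{
    return elapsed_to_seconds(slow) / elapsed_to_seconds(fast);
}
```

For example, slowdown("12:31.74", "0:05.31") divides 751.74 s by 5.31 s.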
[lng-odp] [Bug 3774] Shmem validation test runs indefinitely with 1GB huge pages
https://bugs.linaro.org/show_bug.cgi?id=3774

--- Comment #9 from Josep Puigdemont ---
(In reply to Brian Brooks from comment #7)
> I cannot reproduce this issue. Instead I see it hanging on this test:
>
> make[3]: Entering directory '/home/brian/odp/example/l2fwd_simple'
> make[4]: Entering directory '/home/brian/odp/example/l2fwd_simple'

For this issue I opened bug #3879.
[lng-odp] [Bug 3774] Shmem validation test runs indefinitely with 1GB huge pages
https://bugs.linaro.org/show_bug.cgi?id=3774

--- Comment #8 from Bill Fischofer ---
Thanks, Brian. Were you testing on Arm? You may have noticed that Josep posted https://github.com/Linaro/odp/pull/609 as a fix for this. Can you verify that it works on Arm as well?
[lng-odp] [Bug 3774] Shmem validation test runs indefinitely with 1GB huge pages
https://bugs.linaro.org/show_bug.cgi?id=3774

--- Comment #7 from Brian Brooks ---
I cannot reproduce this issue. Instead I see it hanging on this test:

make[3]: Entering directory '/home/brian/odp/example/l2fwd_simple'
make[4]: Entering directory '/home/brian/odp/example/l2fwd_simple'

The perf output Matias shared is clearly related to the x86 Linux kernel.
[lng-odp] [Bug 3774] Shmem validation test runs indefinitely with 1GB huge pages
https://bugs.linaro.org/show_bug.cgi?id=3774

Josep Puigdemont changed:

           What    |Removed                     |Added
                 CC|                            |josep.puigdem...@linaro.org

--- Comment #6 from Josep Puigdemont ---
On my laptop, with just six 1G huge pages (what I had at hand), the issue is not reproducible but, like Matias, I did see the timer test failing due to the first timer being delayed. Maybe we should open another bug report for that:

  Test: timer_test_plain_queue ...odp_ishmphy.c:151:_odp_ishmphy_map():mmap failed:Cannot allocate memory
odp_ishmphy.c:151:_odp_ishmphy_map():mmap failed:Cannot allocate memory
timer.c:261:timer_test_queue_type(): Timer pool parameters:
timer.c:262:timer_test_queue_type(): res_ns 2000
timer.c:263:timer_test_queue_type(): min_tmo 1
timer.c:264:timer_test_queue_type(): max_tmo 1
timer.c:288:timer_test_queue_type(): period_ns 4
timer.c:289:timer_test_queue_type(): period_tick 20
timer.c:308:timer_test_queue_type():abs timer tick 20
timer.c:308:timer_test_queue_type():abs timer tick 40
timer.c:308:timer_test_queue_type():abs timer tick 60
timer.c:308:timer_test_queue_type():abs timer tick 80
timer.c:308:timer_test_queue_type():abs timer tick 100
timer.c:308:timer_test_queue_type():abs timer tick 120
timer.c:308:timer_test_queue_type():abs timer tick 140
timer.c:308:timer_test_queue_type():abs timer tick 160
timer.c:308:timer_test_queue_type():abs timer tick 180
timer.c:308:timer_test_queue_type():abs timer tick 200
odp_timer.c:883:timer_notify(): 3 ticks overrun on timer pool "timer_pool", timer resolution too high
timer.c:342:timer_test_queue_type():timeout tick 20, timeout period 488145691
timer.c:342:timer_test_queue_type():timeout tick 40, timeout period 371834630
timer.c:342:timer_test_queue_type():timeout tick 60, timeout period 36114
timer.c:342:timer_test_queue_type():timeout tick 80, timeout period 38216
timer.c:342:timer_test_queue_type():timeout tick 100, timeout period 35336
timer.c:342:timer_test_queue_type():timeout tick 120, timeout period 36924
timer.c:342:timer_test_queue_type():timeout tick 140, timeout period 36870
timer.c:342:timer_test_queue_type():timeout tick 160, timeout period 36823
timer.c:342:timer_test_queue_type():timeout tick 180, timeout period 36815
timer.c:342:timer_test_queue_type():timeout tick 200, timeout period 36778
timer.c:352:timer_test_queue_type():test period 4059954202
FAILED
    1. timer.c:338 - diff_period < (period_ns + (4 * res_ns))
[lng-odp] [Bug 3774] Shmem validation test runs indefinitely with 1GB huge pages
https://bugs.linaro.org/show_bug.cgi?id=3774

--- Comment #5 from Matias Elo ---
I ran the same test on odp-dpdk with 1GB pages without problems, so the issue is restricted to odp-linux.
[lng-odp] [Bug 3774] Shmem validation test runs indefinitely with 1GB huge pages
https://bugs.linaro.org/show_bug.cgi?id=3774

Bill Fischofer changed:

           What    |Removed                      |Added
           Assignee|christophe.mil...@linaro.org |brian.bro...@linaro.org
                 CC|                             |bill.fischo...@linaro.org

--- Comment #4 from Bill Fischofer ---
Brian will investigate. Matias will double check on odp-dpdk to see if issue is restricted to odp-linux or has wider scope.
[lng-odp] [Bug 3774] Shmem validation test runs indefinitely with 1GB huge pages
https://bugs.linaro.org/show_bug.cgi?id=3774

--- Comment #3 from Matias Elo ---
shmem_test_stress actually passes, it just takes a really long time. Perf shows that almost all of the time is spent in the kernel:

  94.17%  [kernel]  [k] clear_page_erms
   2.55%  [kernel]  [k] clear_huge_page
   2.15%  [kernel]  [k] _cond_resched
   0.88%  [kernel]  [k] rcu_all_qs
   0.01%  [kernel]  [k] _raw_spin_lock
   0.01%  [kernel]  [k] update_load_avg
[lng-odp] [Bug 3774] Shmem validation test runs indefinitely with 1GB huge pages
https://bugs.linaro.org/show_bug.cgi?id=3774

--- Comment #2 from Matias Elo ---
May be related. With 1GB huge pages the first timer is always delayed, which causes the timer validation test to fail:

  Test: timer_test_sched_queue ...timer.c:261:timer_test_queue_type(): Timer pool parameters:
timer.c:262:timer_test_queue_type(): res_ns 2000
timer.c:263:timer_test_queue_type(): min_tmo 1
timer.c:264:timer_test_queue_type(): max_tmo 1
timer.c:288:timer_test_queue_type(): period_ns 4
timer.c:289:timer_test_queue_type(): period_tick 20
timer.c:308:timer_test_queue_type():abs timer tick 20
timer.c:308:timer_test_queue_type():abs timer tick 40
timer.c:308:timer_test_queue_type():abs timer tick 60
timer.c:308:timer_test_queue_type():abs timer tick 80
timer.c:308:timer_test_queue_type():abs timer tick 100
timer.c:308:timer_test_queue_type():abs timer tick 120
timer.c:308:timer_test_queue_type():abs timer tick 140
timer.c:308:timer_test_queue_type():abs timer tick 160
timer.c:308:timer_test_queue_type():abs timer tick 180
timer.c:308:timer_test_queue_type():abs timer tick 200
odp_timer.c:880:timer_notify(): 9 ticks overrun on timer pool "timer_pool", timer resolution too high
timer.c:342:timer_test_queue_type():timeout tick 20, timeout period 604149774
timer.c:342:timer_test_queue_type():timeout tick 40, timeout period 375843228
timer.c:342:timer_test_queue_type():timeout tick 60, timeout period 38013
timer.c:342:timer_test_queue_type():timeout tick 80, timeout period 39128
timer.c:342:timer_test_queue_type():timeout tick 100, timeout period 38088
timer.c:342:timer_test_queue_type():timeout tick 120, timeout period 40630
timer.c:342:timer_test_queue_type():timeout tick 140, timeout period 38801
timer.c:342:timer_test_queue_type():timeout tick 160, timeout period 40252
timer.c:342:timer_test_queue_type():timeout tick 180, timeout period 37224
timer.c:342:timer_test_queue_type():timeout tick 200, timeout period 39459
timer.c:352:timer_test_queue_type():test period 4179984604
FAILED
    1. timer.c:338 - diff_period < (period_ns + (4 * res_ns))
[lng-odp] [Bug 3774] Shmem validation test runs indefinitely with 1GB huge pages
https://bugs.linaro.org/show_bug.cgi?id=3774

--- Comment #1 from Matias Elo ---
Log from shmem validation test getting stuck (shmem_test_stress):
https://pastebin.com/Pq2ieuwW