Yes, so far this has only been observed with VASP and one specific dataset. Thanks,
On Wed, Sep 5, 2012 at 4:52 AM, Yevgeny Kliteynik <klit...@dev.mellanox.co.il> wrote:
> On 9/4/2012 7:21 PM, Yong Qin wrote:
>> On Tue, Sep 4, 2012 at 5:42 AM, Yevgeny Kliteynik
>> <klit...@dev.mellanox.co.il> wrote:
>>> On 8/30/2012 10:28 PM, Yong Qin wrote:
>>>> On Thu, Aug 30, 2012 at 5:12 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
>>>>> On Aug 29, 2012, at 2:25 PM, Yong Qin wrote:
>>>>>
>>>>>> This issue has been observed on OMPI 1.6 and 1.6.1 with the openib btl,
>>>>>> but not on 1.4.5 (the tcp btl is always fine). The application is VASP,
>>>>>> and only one specific dataset was identified during testing; the OS
>>>>>> is SL 6.2 with kernel 2.6.32-220.23.1.el6.x86_64. The issue is that
>>>>>> when a certain type of load is put on OMPI 1.6.x, the khugepaged thread
>>>>>> always runs at 100% CPU, and it looks to me like OMPI is waiting for
>>>>>> some memory to become available and thus appears to hang. Reducing the
>>>>>> number of processes per node sometimes eases the problem a bit, but not
>>>>>> always. So I did some further testing by playing around with the
>>>>>> kernel's transparent hugepage support.
>>>>>>
>>>>>> 1. Disable transparent hugepage support completely (echo never >
>>>>>> /sys/kernel/mm/redhat_transparent_hugepage/enabled). This allows the
>>>>>> program to progress as normal (as in 1.4.5). Total run time for an
>>>>>> iteration is 3036.03 s.
>>>>>
>>>>> I'll admit that we have not tested using transparent hugepages. I wonder
>>>>> if there's some kind of bad interaction going on here...
>>>>
>>>> Transparent hugepages are "transparent", which means they are
>>>> automatically applied to all applications unless explicitly told
>>>> otherwise. I highly suspect that they are not working properly in this
>>>> case.
>>>
>>> Like Jeff said - I don't think we've ever tested OMPI with transparent
>>> huge pages.
>>>
>>
>> Thanks. But have you tested OMPI under RHEL 6 or its variants (CentOS
>> 6, SL 6)?
>> THP is on by default in RHEL 6, so whether you want it or
>> not, it's there.
>
> Interesting. Indeed, THP is on by default in RHEL 6.x.
> I run OMPI 1.6.x constantly on RHEL 6.2, and I've never seen this problem.
>
> I'm checking with the OFED folks, but I doubt that there are any dedicated
> tests for THP.
>
> So do you see it only with a specific application and only on a specific
> data set? I wonder if I can somehow reproduce it in-house...
>
> -- YK
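For anyone trying to reproduce or work around this, here is a minimal sketch of checking and toggling THP. The `redhat_transparent_hugepage` sysfs path is the RHEL 6 location quoted in the thread; mainline kernels expose `/sys/kernel/mm/transparent_hugepage/enabled` instead, so treat the exact path as an assumption for your kernel. The `thp_active` helper is purely illustrative.

```shell
#!/bin/sh
# RHEL 6-specific sysfs knob (mainline kernels use
# /sys/kernel/mm/transparent_hugepage/enabled instead).
THP=/sys/kernel/mm/redhat_transparent_hugepage/enabled

# The file reports all modes with the active one in brackets,
# e.g. "[always] madvise never". This helper extracts the
# bracketed (active) mode from such a string.
thp_active() {
  printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Usage on a live system (writing requires root):
#   thp_active "$(cat $THP)"   # show the active mode
#   echo never > $THP          # disable THP, as in the workaround above
```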