Hi Thomas,

Based on your last comment I found the root cause of the problem. A couple of weeks ago the server crashed a couple of times because of a kernel panic. Analyzing vmcore-dmesg.txt, I found a possible issue with the installed Lustre client 2.1, so I decided to compile and install Lustre client 2.5.3, but I didn't think about the dependency with robinhood. Today I was trying to downgrade the Lustre client to version 2.1 (the same as the server), but I ran into the dependency problem and realized where the problem was. I'm sorry I didn't think about this before.

I kept Lustre client 2.5.3 and compiled and installed robinhood again. After I restarted it, it worked very well at the beginning (GET_INFO_FS decreased from 1500 to 1.5 ms/op) and the load from the migration processes disappeared. Now, after about 2 hours, the migration processes have appeared again and GET_INFO_FS is slowly increasing (5.38 ms/op at the moment). I'm guessing this is caused by the number of Changelog entries (ca. 50 million), and for this reason we are planning to recreate the DB and perform a new scan during the weekend. I'm also wondering whether to keep the fix I did with numactl, or to remove it and increase the number of threads before the new scan. What do you think?

Carmelo

On Thu, 2015-04-23 at 15:47 +0200, LEIBOVICI Thomas wrote:
> top - 12:25:30 up 8 days, 21:57, 6 users, load average: 16.77, 17.89, 15.97
> Tasks: 2 total, 0 running, 2 sleeping, 0 stopped, 0 zombie
> Cpu(s): 0.0%us, 12.9%sy, 0.0%ni, 87.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 131999964k total, 125632212k used, 6367752k free, 207536k buffers
> Swap: 6291448k total, 16352k used, 6275096k free, 6655152k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  6915 mysql     20   0 34.2g 1.5g 5372 S 59.7  1.2   0:55.34 mysqld
>  7012 root      20   0 3189m 1.3g 1468 S 15.1  1.1   2:40.51
>
>
> I really think there is something wrong on your system related to these
> migration threads.
> You still have a load of 17 with CPU 87% idle... Strange.
> And even if they are more active now, mysql and robinhood only produce a
> load of 0.7.
>
> It sounds more like a driver or hardware issue, or an RT kernel mode...
> Do you run a specific kernel? Or with realtime options?
>
> If you have a spare node, it would be worthwhile to run robinhood on it
> and see if you have the same strange load.
>
> Regards
> Thomas.
>
> On 04/23/15 12:49, Carmelo Ponti (CSCS) wrote:
> > I divided the two processes between the two sockets and now I can see
> > them using some CPU from time to time:
> >
> > # top -p 7012,6915 -b
> >
> > top - 12:25:27 up 8 days, 21:57, 6 users, load average: 16.77, 17.89, 15.97
> > Tasks: 2 total, 1 running, 1 sleeping, 0 stopped, 0 zombie
> > Cpu(s): 0.0%us, 12.7%sy, 0.0%ni, 87.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Mem: 131999964k total, 125626996k used, 6372968k free, 207532k buffers
> > Swap: 6291448k total, 16352k used, 6275096k free, 6655084k cached
> >
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >  7012 root      20   0 3189m 1.3g 1468 S  1.6  1.1   2:40.02 robinhood
> >  6915 mysql     20   0 34.2g 1.5g 5372 R  0.0  1.2   0:53.40 mysqld
> >
> > top - 12:25:30 up 8 days, 21:57, 6 users, load average: 16.77, 17.89, 15.97
> > Tasks: 2 total, 0 running, 2 sleeping, 0 stopped, 0 zombie
> > Cpu(s): 0.0%us, 12.9%sy, 0.0%ni, 87.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Mem: 131999964k total, 125632212k used, 6367752k free, 207536k buffers
> > Swap: 6291448k total, 16352k used, 6275096k free, 6655152k cached
> >
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >  6915 mysql     20   0 34.2g 1.5g 5372 S 59.7  1.2   0:55.34 mysqld
> >  7012 root      20   0 3189m 1.3g 1468 S 15.1  1.1   2:40.51 robinhood
> >
> > top - 12:25:33 up 8 days, 21:57, 6 users, load average: 16.39, 17.79, 15.94
> > Tasks: 2 total, 0 running, 2 sleeping, 0 stopped, 0 zombie
> > Cpu(s): 0.0%us, 13.7%sy, 0.0%ni, 86.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> > Mem: 131999964k total, 125631972k used, 6367992k free, 207540k buffers
> > Swap: 6291448k total, 16352k used, 6275096k free, 6655116k cached
> >
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >  7012 root      20   0 3189m 1.3g 1468 S 21.3  1.1   2:41.17 robinhood
> >  6915 mysql     20   0 34.2g 1.5g 5372 S  0.0  1.2   0:55.34 mysqld
> >
> > At the moment we have 24 million Changelog lines, so I guess we need
> > some time to see if there is an improvement. For sure the load average
> > decreased a lot.
> >
> > Today I also noticed many messages like the following in dmesg and
> > in /var/log/messages:
> >
> > Lustre: 24416:0:(kernel_user_comm.c:201:libcfs_kkuc_msg_put()) message send failed (-32)
> > Lustre: 24416:0:(kernel_user_comm.c:201:libcfs_kkuc_msg_put()) Skipped 1 previous similar message
> >
> > I searched on Google and found an old request on robinhood-support
> > (http://sourceforge.net/p/robinhood/mailman/message/31162194/) which
> > explains the messages and how to fix them. Could these messages explain
> > part of the problem we have, or are they a consequence of the problem?
> >
> > Carmelo
> >
> > On Thu, 2015-04-23 at 10:00 +0200, LEIBOVICI Thomas wrote:
> >> On 04/22/15 16:26, Carmelo Ponti (CSCS) wrote:
> >>> I will wait until tomorrow to see if the situation gets better, but I
> >>> can immediately notice that the CPU usage of robinhood is now between
> >>> 40% and 100%. The load of mysql didn't change:
> >>>
> >>>  2847 root      20   0 3867m 1.7g 1528 S 63.9  1.4  47:41.73 robinhood
> >>>  3217 mysql     20   0 37.6g 1.3g 4500 S  0.0  1.0   1603:03 mysqld
> >> It may be a good sign that robinhood now does something :)
> >>
> >> mysqld should be much more active however.
> >> What about pinning it too?
> >>
> >> Thomas.
>

-- 
----------------------------------------------------------------------
Carmelo Ponti           System Engineer
CSCS                    Swiss Center for Scientific Computing
Via Trevano 131         Email: [email protected]
CH-6900 Lugano          http://www.cscs.ch
Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
----------------------------------------------------------------------

_______________________________________________
robinhood-support mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/robinhood-support
