Re: Post-XDR CLD cannot keep session up
On Tue, 09 Feb 2010 07:06:39 -0500 Jeff Garzik wrote:

> There is definitely something strange going on in the timer routines,
> that is causing session_timeout() not to run even though it re-adds
> itself to the timer list using cld_timer_add(). fprintf() debug output
> in cld_timer_add and cld_timers_run is yielding unexpected results.

Shoot, I think I know what this is, and it's my fault. The list is
"cached" improperly inside cld_timers_run. I remember that at some point
I added a mutex to every list and noticed that the list wasn't locked
correctly, so I fixed it. But then I dropped those mutexes because of
some recursion issues and undid the fix.

I'll retest and send a patch in a few.

-- Pete

--
To unsubscribe from this list: send the line "unsubscribe hail-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
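[Editor's illustration] The failure mode described above can be sketched in isolation. This is a hypothetical, self-contained example and not the actual CLD source: `struct timer`, `timer_add()`, `timers_run_cached()`, `timers_run()`, `rearm_cb()` and `count_fires()` are made-up names. It assumes that "cached" means the run loop works on a private copy of the list head and writes that copy back after the callbacks have run. Under that assumption, a callback that re-adds its own timer (as session_timeout() is said to do) has its new entry overwritten.

```c
#include <assert.h>
#include <stddef.h>

struct timer;

/* Hypothetical timer list, sorted by expiration, with a fake clock. */
struct timer_list {
	struct timer *head;
	long now;                       /* fake time, in seconds */
};

struct timer {
	struct timer *next;
	long expires;
	void (*cb)(struct timer_list *, struct timer *);
	int fired;                      /* how many times cb has run */
};

/* Insert an unlinked timer, keeping the list sorted by ->expires. */
void timer_add(struct timer_list *l, struct timer *t, long expires)
{
	struct timer **pp = &l->head;

	assert(l != NULL && t != NULL);
	t->expires = expires;
	while (*pp && (*pp)->expires <= expires)
		pp = &(*pp)->next;
	t->next = *pp;
	*pp = t;
}

/* Keepalive-style callback: count the call, then re-arm 5 seconds out. */
void rearm_cb(struct timer_list *l, struct timer *t)
{
	t->fired++;
	timer_add(l, t, l->now + 5);
}

/* BUGGY (assumed shape of the bug): run from a cached copy of the head. */
void timers_run_cached(struct timer_list *l)
{
	struct timer *head = l->head;   /* private, "cached" copy */

	l->head = NULL;
	while (head && head->expires <= l->now) {
		struct timer *t = head;

		head = t->next;
		t->next = NULL;
		t->cb(l, t);            /* re-adds t to l->head ... */
	}
	l->head = head;                 /* ... which is overwritten here */
}

/* Safer: always pop from the live list, so re-added timers survive. */
void timers_run(struct timer_list *l)
{
	while (l->head && l->head->expires <= l->now) {
		struct timer *t = l->head;

		l->head = t->next;      /* unlink before calling back */
		t->next = NULL;
		t->cb(l, t);
	}
}

/* Drive one self-re-arming timer for `seconds` ticks; return fire count. */
int count_fires(void (*run)(struct timer_list *), long seconds)
{
	struct timer_list l = { NULL, 0 };
	struct timer t = { NULL, 0, rearm_cb, 0 };

	timer_add(&l, &t, 5);
	for (l.now = 0; l.now <= seconds; l.now++)
		run(&l);
	return t.fired;
}
```

With these definitions, `count_fires(timers_run_cached, 30)` returns 1: the timer fires once, is re-added with a new expiration, and is never called again, which resembles the symptom reported in this thread. `count_fires(timers_run, 30)` returns 6. Whether the real cld_timers_run has exactly this shape is not established here.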
Re: Post-XDR CLD cannot keep session up
On 02/09/2010 05:34 AM, Jeff Garzik wrote:
> On 02/07/2010 02:00 AM, Pete Zaitcev wrote:
>> Hi, Jeff & Colin:
>>
>> It looks like you broke something in CLD, not sure if server or
>> client. There are two possibly related bugs. But first, here's the
>> messages (The chunkd is run with -D). Note that I have 2 servers
>> listed in DNS (both on port 4499), but only one is up.
>>
>> Feb 6 23:36:10 hitlain cld[1934]: databases up
>> Feb 6 23:36:10 hitlain cld[1934]: Listening on :: port 4499
>> Feb 6 23:36:10 hitlain cld[1934]: initialized: verbose 0
>> Feb 6 23:37:10 hitlain chunkd[1967]: Verbose debug output enabled
>> Feb 6 23:37:10 hitlain chunkd[1968]: cldc_saveaddr: found CLD host hitlain.zaitcev.lan prio 10 weight 50
>> Feb 6 23:37:10 hitlain chunkd[1968]: cldc_saveaddr: found CLD host elanor.zaitcev.lan prio 10 weight 50
>> Feb 6 23:37:10 hitlain chunkd[1968]: Selected CLD host hitlain.zaitcev.lan port 4499
>> Feb 6 23:37:10 hitlain chunkd[1968]: Listening on host :: port 8082
>> Feb 6 23:37:10 hitlain chunkd[1968]: initialized
>> Feb 6 23:37:10 hitlain chunkd[1968]: New CLD session created, sid 05B521BF4071EBA2
>> Feb 6 23:37:10 hitlain chunkd[1968]: CLD file "/chunk-default/2" created
>> Feb 6 23:37:10 hitlain chunkd[1968]: CLD file "/chunk-default/2" written
>> Feb 6 23:39:45 hitlain chunkd[1968]: Session failed, sid 05B521BF4071EBA2
>> Feb 6 23:39:45 hitlain chunkd[1968]: Selected CLD host elanor.zaitcev.lan port 4499
>> Feb 6 23:39:45 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:39:50 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:39:55 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:00 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:05 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:10 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:15 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:21 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:26 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:31 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:36 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:41 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:46 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:51 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:40:56 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:01 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:06 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:11 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:16 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:21 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:26 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:31 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:36 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:41 hitlain chunkd[1968]: cldc_udp_receive_pkt failed: -111
>> Feb 6 23:41:46 hitlain chunkd[1968]: New CLD session creation failed: 17
>> Feb 6 23:41:46 hitlain chunkd[1968]: Session failed, sid 6C5A5E5D4D8F2270
>> Feb 6 23:41:46 hitlain chunkd[1968]: Selected CLD host hitlain.zaitcev.lan port 4499
>> Feb 6 23:41:46 hitlain chunkd[1968]: New CLD session created, sid 4E2A8ED73878F038
>> Feb 6 23:41:46 hitlain chunkd[1968]: CLD file "/chunk-default/2" created
>> Feb 6 23:41:46 hitlain chunkd[1968]: CLD lock(/chunk-default/2) failed: 11
>>
>> So, first regression: session ALWAYS fails, for no reason I can see.
>> It takes 2 minutes 35 seconds, as you can observe from the
>> "Session failed" message.
>
> Well, session_timeout() is not being executed like it should be, by
> the core timer code. This could be memory corruption, a libtimer bug,
> or something else entirely. I can observe session_timeout() being
> updated to a new timer expiration, and then never being called again.

There is definitely something strange going on in the timer routines,
that is causing session_timeout() not to run even though it re-adds
itself to the timer list using cld_timer_add(). fprintf() debug output
in cld_timer_add and cld_timers_run is yielding unexpected results.

More debugging after sleep.

Jeff
Re: Post-XDR CLD cannot keep session up
On 02/07/2010 02:00 AM, Pete Zaitcev wrote:
> Hi, Jeff & Colin:
>
> It looks like you broke something in CLD, not sure if server or
> client. There are two possibly related bugs. But first, here's the
> messages (The chunkd is run with -D). Note that I have 2 servers
> listed in DNS (both on port 4499), but only one is up.
>
> [...]
> Feb 6 23:37:10 hitlain chunkd[1968]: New CLD session created, sid 05B521BF4071EBA2
> [...]
> Feb 6 23:39:45 hitlain chunkd[1968]: Session failed, sid 05B521BF4071EBA2
> [...]
> Feb 6 23:41:46 hitlain chunkd[1968]: CLD lock(/chunk-default/2) failed: 11
>
> So, first regression: session ALWAYS fails, for no reason I can see.
> It takes 2 minutes 35 seconds, as you can observe from the
> "Session failed" message.

Well, session_timeout() is not being executed like it should be, by the
core timer code. This could be memory corruption, a libtimer bug, or
something else entirely. I can observe session_timeout() being updated
to a new timer expiration, and then never being called again.

Off to run valgrind...

Jeff
Re: Post-XDR CLD cannot keep session up
On 02/07/2010 02:00 AM, Pete Zaitcev wrote:
> Hi, Jeff & Colin:
>
> It looks like you broke something in CLD, not sure if server or
> client. There are two possibly related bugs. But first, here's the
> messages (The chunkd is run with -D). Note that I have 2 servers
> listed in DNS (both on port 4499), but only one is up.
>
> [...]
> Feb 6 23:37:10 hitlain chunkd[1968]: New CLD session created, sid 05B521BF4071EBA2
> [...]
> Feb 6 23:39:45 hitlain chunkd[1968]: Session failed, sid 05B521BF4071EBA2
> [...]
> Feb 6 23:41:46 hitlain chunkd[1968]: CLD lock(/chunk-default/2) failed: 11
>
> So, first regression: session ALWAYS fails, for no reason I can see.
> It takes 2 minutes 35 seconds, as you can observe from the
> "Session failed" message.
>
> Second regression: locks of failed session are not removed (this is
> what code 11 is). Once the original session fails, CLD client cannot
> re-acquire the lock, ever, until the daemon is restarted.

Thanks for the report. That is definitely annoying... I wonder if it is
related to the ping_open bug I fixed...

> This definitely used to work before the XDR, and it only takes 3
> minutes to fail. Do you guys run and use chunkd or do you just do
> "make check" and consider it done? I thought we talked about having
> virtually permanent cells and long-living CLD clients, because this
> sort of thing keeps cropping up.

My local one (shamefully not using SRV, like I should) is pretty
outdated, back to the latest released tarballs, since I dislike having
to lose data on upgrade ;-)

Jeff