Re: CR 6860309 - solaris timing issue on thread startup
Alan, On 15/11/2011 11:26 PM, Alan Bateman wrote: On 15/11/2011 00:56, David Holmes wrote: : 25 * @bug 4820217 6860309 As per other emails and still waiting from confirmation from Alan. I don't think the @bug should be updated for any of these test fixes. The @bug tag is intended to list the bug numbers of bugs that are useful when you have a test failure and sometimes there can be really useful information in a bug that turns out to be a test bug. I'm not aware of any rule that says that the bug numbers shouldn't include bugs that were test only bugs but it's almost always bug numbers of bugs that caused the test to be extended to cover additional scenarios. For the tests being discussed here then I think we should just use our best judgment. Worse case it's easy to dig into the history with hg. That was somewhat non-committal :) To me @bug says these are the bugs that this test is checking the fix for hence not applicable in any of the recent timing/race test fixes. David -Alan.
Re: CR 6860309 - solaris timing issue on thread startup
Where can we find the definition of the tag contents? Whichever way this discussion goes, we should update that documentation with the conclusions. On 11/15/11 4:29 PM, David Holmes wrote: Alan, On 15/11/2011 11:26 PM, Alan Bateman wrote: On 15/11/2011 00:56, David Holmes wrote: : 25 * @bug 4820217 6860309 As per other emails and still waiting from confirmation from Alan. I don't think the @bug should be updated for any of these test fixes. The @bug tag is intended to list the bug numbers of bugs that are useful when you have a test failure and sometimes there can be really useful information in a bug that turns out to be a test bug. I'm not aware of any rule that says that the bug numbers shouldn't include bugs that were test only bugs but it's almost always bug numbers of bugs that caused the test to be extended to cover additional scenarios. For the tests being discussed here then I think we should just use our best judgment. Worse case it's easy to dig into the history with hg. That was somewhat non-committal :) To me @bug says these are the bugs that this test is checking the fix for hence not applicable in any of the recent timing/race test fixes. David -Alan.
Re: CR 6860309 - solaris timing issue on thread startup
On 15/11/2011 21:29, David Holmes wrote: That was somewhat non-committal :) To me @bug says these are the bugs that this test is checking the fix for hence not applicable in any of the recent timing/race test fixes. It's non-committal because I don't think this has come up before, it's really something for the developer's guide. In any case, I don't think this discussion should stop us pushing Gary's fixes as he can easily revert the @bug tags for now. -Alan.
RE: CR 6860309 - solaris timing issue on thread startup
Hi. The current practice may be different, but... The original intent was that every bug would either have a unit/regression test or a BugTraq keyword explaining why a test was not provided. See step 6 on this page for the list of valid keywords: http://openjdk.java.net/guide/changePlanning.html#bug In the case of bugs against regression tests I've seen two approaches: 1. Addition of the bugid to the @bug tag 2. Addition of the noreg-self keyword to bugtraq Both technically fulfill the original intent. At one point there were audits to enforce this. Knowing how the audits worked (I personally preformed them for a while), I tended to favor adding the bugid to @bug as that would minimize the number of BugTraq queries. Even when the network and BugTraq were functioning perfectly, a BugTraq query always took longer than just looking at the source (which we were already looking at). Again, the above may no longer be the current practice or recommendation. Thanks, iris -Original Message- From: Alan Bateman Sent: Tuesday, November 15, 2011 2:37 PM To: David Holmes Cc: Gary Adams; core-libs-dev@openjdk.java.net Subject: Re: CR 6860309 - solaris timing issue on thread startup On 15/11/2011 21:29, David Holmes wrote: That was somewhat non-committal :) To me @bug says these are the bugs that this test is checking the fix for hence not applicable in any of the recent timing/race test fixes. It's non-committal because I don't think this has come up before, it's really something for the developer's guide. In any case, I don't think this discussion should stop us pushing Gary's fixes as he can easily revert the @bug tags for now. -Alan.
Re: CR 6860309 - solaris timing issue on thread startup
Hi Iris, Still seems to me, based on the FAQ, http://openjdk.java.net/jtreg/faq.html that the intent is for @bug to refer to the bug that the test is testing. But as it is looking like this has been used in an ad-hoc manner anyway I'll shut up now. ;-) Cheers, David On 16/11/2011 10:47 AM, Iris Clark wrote: Hi. The current practice may be different, but... The original intent was that every bug would either have a unit/regression test or a BugTraq keyword explaining why a test was not provided. See step 6 on this page for the list of valid keywords: http://openjdk.java.net/guide/changePlanning.html#bug In the case of bugs against regression tests I've seen two approaches: 1. Addition of the bugid to the @bug tag 2. Addition of the noreg-self keyword to bugtraq Both technically fulfill the original intent. At one point there were audits to enforce this. Knowing how the audits worked (I personally preformed them for a while), I tended to favor adding the bugid to @bug as that would minimize the number of BugTraq queries. Even when the network and BugTraq were functioning perfectly, a BugTraq query always took longer than just looking at the source (which we were already looking at). Again, the above may no longer be the current practice or recommendation. Thanks, iris -Original Message- From: Alan Bateman Sent: Tuesday, November 15, 2011 2:37 PM To: David Holmes Cc: Gary Adams; core-libs-dev@openjdk.java.net Subject: Re: CR 6860309 - solaris timing issue on thread startup On 15/11/2011 21:29, David Holmes wrote: That was somewhat non-committal :) To me @bug says these are the bugs that this test is checking the fix for hence not applicable in any of the recent timing/race test fixes. It's non-committal because I don't think this has come up before, it's really something for the developer's guide. In any case, I don't think this discussion should stop us pushing Gary's fixes as he can easily revert the @bug tags for now. -Alan.
Re: CR 6860309 - solaris timing issue on thread startup
On 14/11/2011 00:42, David Holmes wrote: Will the exec'd process block until the copier threads read from its output streams? If not then the copier threads (well stdin anyway) could read their input and have terminated before the main thread even reaches the original sleep() call. I don't think this test can be written correctly as-is. Even using a CountDownLatch won't help because you have to sync with two copier threads, so the first could be finished before the second signals the latch. I would think we would need to exec our own process (a Java one of course) that assists with the synchronization issue - ie by not terminating until it receives an input token. At least that way we know the copier threads can not proceed passed the read() calls, even if we can't be 100% certain they are in the read at the time the process is destroyed. The test runs cat without any arguments so the Copier threads will block when they read from the stream. If we can get the main thread to wait until the Copier threads are just about to do the read, sleep for a bit, then the threads should be blocked in the read when we destroy the process. It's not ideal but I don't think there is any other way to test this. -Alan.
Re: CR 6860309 - solaris timing issue on thread startup
On 11/14/2011 05:09 PM, Gary Adams wrote: I've updated the webrev for CR#6860309 using a CountDownLatch. The main thread will wait til both worker threads are ready to block on the read() before the process is destroyed. http://cr.openjdk.java.net/~gadams/6860309/ Tested with -Xcomp, but still may need a older slow solaris/sparc machine to confirm the fix works for the original bug submitter. Hi Gary, I think it's better if a Copier takes the CountDownLatch as parameter of the constructor instead of being declared static. Rémi
Re: CR 6860309 - solaris timing issue on thread startup
Updated to move static latch to Copier constructor argument. On 11/14/11 11:09 AM, Gary Adams wrote: I've updated the webrev for CR#6860309 using a CountDownLatch. The main thread will wait til both worker threads are ready to block on the read() before the process is destroyed. http://cr.openjdk.java.net/~gadams/6860309/ Tested with -Xcomp, but still may need a older slow solaris/sparc machine to confirm the fix works for the original bug submitter.
Re: CR 6860309 - solaris timing issue on thread startup
On 14/11/2011 6:05 PM, Alan Bateman wrote: The test runs cat without any arguments so the Copier threads will block when they read from the stream. Thank's Alan that critical point had escaped my notice. David If we can get the main thread to wait until the Copier threads are just about to do the read, sleep for a bit, then the threads should be blocked in the read when we destroy the process. It's not ideal but I don't think there is any other way to test this. -Alan.
Re: CR 6860309 - solaris timing issue on thread startup
Alan, On 12/11/2011 9:58 PM, Alan Bateman wrote: On 11/11/2011 16:56, Gary Adams wrote: CR 6860309 - TEST_BUG: Insufficient sleep time in java/lang/Runtime/exec/StreamsSurviveDestroy.java A timing problem is reported for slow solaris systems for this test to start up a process and systematically torture the underlying threads processing data from the running process. On my fast solaris machine I can not reproduce the error, but it is reasonable to assume that on a slower machine there could be scheduling issues that could delay the thread startup past the designated 100 millisecond delay in the main thread. This webrev suggests gating the process destruction until both worker threads are alive. http://cr.openjdk.java.net/~gadams/6860309/ -Xcomp on a slow machine, always fun when testing the untestable. I agree with David but I don't think there is perfect solution. I would suggest using a CountDownLatch or other synchronization so that the main thread waits until the Copier thread is just about to do the read. Then do a sleep in the main thread before invoking the destroy method. I suspect that is the best that you can do as can't be guaranteed that the Copier thread is blocked in the underlying read. Will the exec'd process block until the copier threads read from its output streams? If not then the copier threads (well stdin anyway) could read their input and have terminated before the main thread even reaches the original sleep() call. I don't think this test can be written correctly as-is. Even using a CountDownLatch won't help because you have to sync with two copier threads, so the first could be finished before the second signals the latch. I would think we would need to exec our own process (a Java one of course) that assists with the synchronization issue - ie by not terminating until it receives an input token. At least that way we know the copier threads can not proceed passed the read() calls, even if we can't be 100% certain they are in the read at the time the process is destroyed. Gary: while fixing timing bugs is a worthwhile goal in terms of test stability etc it is rarely if ever low hanging fruit as you have found. David -Alan.
Re: CR 6860309 - solaris timing issue on thread startup
On 11/11/2011 16:56, Gary Adams wrote: CR 6860309 - TEST_BUG: Insufficient sleep time in java/lang/Runtime/exec/StreamsSurviveDestroy.java A timing problem is reported for slow solaris systems for this test to start up a process and systematically torture the underlying threads processing data from the running process. On my fast solaris machine I can not reproduce the error, but it is reasonable to assume that on a slower machine there could be scheduling issues that could delay the thread startup past the designated 100 millisecond delay in the main thread. This webrev suggests gating the process destruction until both worker threads are alive. http://cr.openjdk.java.net/~gadams/6860309/ -Xcomp on a slow machine, always fun when testing the untestable. I agree with David but I don't think there is perfect solution. I would suggest using a CountDownLatch or other synchronization so that the main thread waits until the Copier thread is just about to do the read. Then do a sleep in the main thread before invoking the destroy method. I suspect that is the best that you can do as can't be guaranteed that the Copier thread is blocked in the underlying read. -Alan.
Re: CR 6860309 - solaris timing issue on thread startup
On 11/12/11 6:58 AM, Alan Bateman wrote: On 11/11/2011 16:56, Gary Adams wrote: CR 6860309 - TEST_BUG: Insufficient sleep time in java/lang/Runtime/exec/StreamsSurviveDestroy.java A timing problem is reported for slow solaris systems for this test to start up a process and systematically torture the underlying threads processing data from the running process. On my fast solaris machine I can not reproduce the error, but it is reasonable to assume that on a slower machine there could be scheduling issues that could delay the thread startup past the designated 100 millisecond delay in the main thread. This webrev suggests gating the process destruction until both worker threads are alive. http://cr.openjdk.java.net/~gadams/6860309/ -Xcomp on a slow machine, always fun when testing the untestable. I agree with David but I don't think there is perfect solution. I would suggest using a CountDownLatch or other synchronization so that the main thread waits until the Copier thread is just about to do the read. Then do a sleep in the main thread before invoking the destroy method. I suspect that is the best that you can do as can't be guaranteed that the Copier thread is blocked in the underlying read. -Alan. I'll give it a try on Monday. Ideally, I could get a reproducible failure before trying to fix the problem. Unfortunately, the only slow machines I have on hand are linux/arm, which might have different issues from the solaris/sparc originally reported.
Re: CR 6860309 - solaris timing issue on thread startup
Hi Gary, On 12/11/2011 2:56 AM, Gary Adams wrote: CR 6860309 - TEST_BUG: Insufficient sleep time in java/lang/Runtime/exec/StreamsSurviveDestroy.java A timing problem is reported for slow solaris systems for this test to start up a process and systematically torture the underlying threads processing data from the running process. On my fast solaris machine I can not reproduce the error, but it is reasonable to assume that on a slower machine there could be scheduling issues that could delay the thread startup past the designated 100 millisecond delay in the main thread. This webrev suggests gating the process destruction until both worker threads are alive. http://cr.openjdk.java.net/~gadams/6860309/ This doesn't achieve anything I'm afraid. isAlive() will return true once start() has been invoked on the thread (it doesn't depend on the new thread getting scheduled to run), hence there will only be a single call to sleep - no different from today. If the issue is that c1 and c2 have not performed some action necessary before the process interaction then that action needs to be synchronized with explicitly. David -