To clarify the number of test failures by platform:

JDK6 Linux x64 - no test failures.
JDK7 Linux ARM - no test failures.
JDK7 Linux x64 - 1 occasional test failure.
JDK7 Windows Server 2008 x64 - 18 test failures, all but 1 related to ServiceDiscoveryManager (SDM).
JDK7 Windows 7 x64 - no test failures.
JDK7 Solaris x64 - 2 test failures.
Concurrency bugs are harder to flush out than Greg's comments suggest. I haven't experienced them recently on my hardware, nor have I been able to reproduce them despite my best attempts; does that mean they don't exist? I really do wish it were as simple as: I fix this bug, now I have 35 test failures to debug.

Regards,

Peter.

----- Original message -----
> ----- Original message -----
> >
> > @Patricia - I understand your interpretation, and I think I've misled you by including the words "concrete, observable" in my question. I'm not talking about concurrency bugs here. I'm perfectly OK with making changes based on analysis that shows a possible race condition. However, I spent most of yesterday reading the Java Language Specification and I'm now more convinced than ever that statements like "we're using final variables, therefore all our code has to change" (paraphrasing) are no substitute for reasoned analysis.
>
> You don't agree that final field usage should comply with the JMM?
>
> Please also read the Java Memory Model.
>
> > Basically, I'm asserting the very common professional view that changes to existing code should be traceable to a requirement or a problem. Further, those changes should be carefully considered and designed to minimize the chance of breaking anything else. I don't find the argument that "I fixed a concurrency problem, so it's expected to have 35 other failures" convincing. From the outside, that looks a lot like introducing failures in an attempt to fix a poorly understood or poorly stated problem.
>
> You're misinterpreting my development effort. I haven't yet fixed ServiceDiscoveryManager, but I have fixed other classes; until only a few commits ago ServiceDiscoveryManager was not failing (at least not often). Although ServiceDiscoveryManager hasn't changed since then, it is now causing multiple test failures.
>
> So while the other classes that use ExecutorService aren't failing, ServiceDiscoveryManager is failing precisely because we're using TaskManager's fundamentally broken runAfter method.
>
> We have previously focused on the problem that runAfter was designed to address and found it was not a good solution to the original problem.
>
> > I teach programming. I see this all the time. When people make changes based on what they "think" "might" be happening, it's always a disaster.
>
> You're assuming that FindBugs warnings, JMM non-compliance, and the common issues described in Java Concurrency in Practice, when clearly and easily identified, fall into the same category as someone who's learning to code for the first time?
>
> You're advocating a completely experiment-driven approach, like someone who's learning to program experiences. It is a very good approach for learning...
>
> Do you not teach your students how to safely use final fields? (There's a small sketch of what I mean further down.)
>
> > @Peter - You're asking the wrong question. The right question is whether we should have a review-then-commit policy for existing code that traces a well-defined (and ideally limited) package of changes to a JIRA issue,
>
> This would prevent a move from a purely experimental development approach to one that also includes a more theoretical approach. Test cases are still required to pass with a theoretical approach.
>
> > and includes the community in decisions around how to fix the problem that is stated.
>
> And when a simple discussion about a simple issue veers off topic with heated arguments, what then?
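> To make the final field point concrete, here's a minimal sketch (hypothetical class and field names, not River code) of what the JMM actually guarantees for final fields:
>
> // Hypothetical example for illustration only, not from the River codebase.
> public class Settings {
>     private final int timeout;   // final: any thread that obtains a reference
>                                  // to this Settings object is guaranteed to see
>                                  // 42 here, even if the reference was published
>                                  // without synchronization (JLS 17.5), provided
>                                  // 'this' doesn't escape the constructor.
>     private int retries;         // non-final: without synchronization or
>                                  // volatile, another thread may observe the
>                                  // default value 0 instead of 3.
>
>     public Settings() {
>         this.timeout = 42;
>         this.retries = 3;
>     }
>
>     public int timeout() { return timeout; }
>     public int retries() { return retries; }
> }
>
> That guarantee only holds when the fields really are final and objects are published safely, which is what I mean by final field usage complying with the JMM.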
> You propose I stop committing to skunk until each change is discussed at length on the list? But we can't even agree on final field usage?
>
> We need to make a decision on whether to include theory-driven development that includes compliance with standards like the JMM.
>
> Code has been committed to skunk for people to review, in public. They might misinterpret my suggestions on the mail list, but they will have a better understanding after reading the code; I welcome their participation.
>
> I am open to other solutions, but I don't like the suggestion that non-compliant code is preferred until a test case that proves failure can be found. What happens when the test case only fails on one platform, but is not easily repeatable? What if the fix just changes timing to mask the original problem?
>
> Testing doesn't prove the absence of errors. Reducing errors further requires additional tools like FindBugs and public, open peer review.
>
> Trunk development was halted because of unrelated test failures.
>
> If you try to fix those test failures it will cause a cascading effect of additional new test failures. I am still fixing those bugs in qa-refactor.
>
> If the outcome of this discussion is to exclude theory-driven development (FindBugs, JMM compliance), then the only option will be to revert trunk and exclude code that causes unrelated test failures.
>
> The Java platform is also undergoing development; this too will cause timing changes resulting in test failures. What do we do then, when we can't fix them because each fix causes other unrelated test failures?
>
> On that note, Java 7 support is not well tested: 2.2 has only been tested on Linux x64, no results for the jtreg tests have been posted, and I have previously seen discovery-event-related test failures in the 2.2 branch.
>
> The qa-refactor branch is relatively stable on Linux x64.
>
> The test failures we're discussing occur on Windows x64; perhaps they also occur with 2.2 on Windows. Should we test 2.2 on Windows too?
>
> If 2.2 fails on Windows, that gives us a provable test case, doesn't it? Wouldn't the test failures then be proven to be only indirectly related to changes in qa-refactor?
>
> Regards,
>
> Peter.
>
> > And I will suggest pre-emptively that a JIRA issue that says "concurrency problems exist" is not specific enough to drive development.
> >
> > Greg.
> >
> > On Jan 4, 2014, at 4:59 AM, Peter Firmstone <j...@zeus.net.au> wrote:
> >
> > > Please provide your thoughts on the following:
> > >
> > > How do we develop code for River?
> > >
> > > Do we only use experiment-based development, or do we allow theoretical development as well? Do we only fix bugs that can be demonstrated with a test case, or do we also fix bugs identified by FindBugs and manual code auditing?
> > >
> > > Should we allow theoretical development based on standards like the Java Memory Model, with visual auditing and static analysis with FindBugs, or should we prohibit fixing bugs that don't include a test case demonstrating the failure?
> > >
> > > Regards,
> > >
> > > Peter.
> > >
> > >
> > > On 4/01/2014 6:58 PM, Patricia Shanahan wrote:
> > > > Just before Christmas, you were discussing whether to fix concurrency problems based on theoretical analysis, or to only fix those problems for which there is experimental evidence.
> > > >
> > > > I believe the PMC will be at cross-purposes until you resolve that issue, and strongly advise discussing and voting on it.
> > > >
> > > > This is an example of a question whose answer would be obvious and non-controversial if you had agreement, either way, on that general issue. "When do you claim that this happens? And what currently happens now that is unacceptable? What is the concrete, observable problem that you're trying to solve, that justifies introducing failures that require further work?" is a valid, and important, set of questions if you are only going to fix concurrency bugs for which there is experimental evidence. It is irrelevant if you are going to fix concurrency bugs based on theoretical analysis.
> > > >
> > > > Patricia
> > > >
> > > > On 1/3/2014 10:14 PM, Greg Trasuk wrote:
> > > > >
> > > > > On Jan 4, 2014, at 12:52 AM, Peter Firmstone <j...@zeus.net.au> wrote:
> > > > >
> > > > > > On 4/01/2014 3:18 PM, Greg Trasuk wrote:
> > > > > > > I'll also point out Patricia's recent statement that TaskManager should be reasonably efficient for small task queues, but less efficient for larger task queues. We don't have solid evidence that the task queues ever get large. Hence, the assertion that "TaskManager doesn't scale" is meaningless.
> > > > > >
> > > > > > No, it's not about scalability; it's about the window of time when a task is removed from the queue in TaskManager for execution but fails and needs to be retried later. Task.runAfter doesn't contain the task that "should have executed", so dependent tasks proceed before their dependencies.
> > > > > >
> > > > > > This code comment from ServiceDiscoveryManager might help:
> > > > > >
> > > > > > /** This task class, when executed, first registers to receive
> > > > > >  * ServiceEvents from the given ServiceRegistrar. If the registration
> > > > > >  * process succeeds (no RemoteExceptions), it then executes the
> > > > > >  * LookupTask to query the given ServiceRegistrar for a "snapshot"
> > > > > >  * of its current state with respect to services that match the
> > > > > >  * given template.
> > > > > >  *
> > > > > >  * Note that the order of execution of the two tasks is important.
> > > > > >  * That is, the LookupTask must be executed only after registration
> > > > > >  * for events has completed. This is because when an entity registers
> > > > > >  * with the event mechanism of a ServiceRegistrar, the entity will
> > > > > >  * only receive notification of events that occur "in the future",
> > > > > >  * after the registration is made. The entity will not receive events
> > > > > >  * about changes to the state of the ServiceRegistrar that may have
> > > > > >  * occurred before or during the registration process.
> > > > > >  *
> > > > > >  * Thus, if the order of these tasks were reversed and the LookupTask
> > > > > >  * were to be executed prior to the RegisterListenerTask, then the
> > > > > >  * possibility exists for the occurrence of a change in the
> > > > > >  * ServiceRegistrar's state between the time the LookupTask retrieves
> > > > > >  * a snapshot of that state, and the time the event registration
> > > > > >  * process has completed, resulting in an incorrect view of the
> > > > > >  * current state of the ServiceRegistrar.
> > > > > >  */
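> > > > > >
> > > > > > To illustrate how that ordering can be enforced without runAfter, here is a rough sketch (hypothetical names, not the actual SDM code) in which the dependent lookup is only submitted to an ExecutorService after the registration it depends on has actually completed:
> > > > > >
> > > > > > import java.util.concurrent.ExecutorService;
> > > > > > import java.util.concurrent.Executors;
> > > > > >
> > > > > > // Hypothetical sketch for illustration; the Runnables stand in for
> > > > > > // RegisterListenerTask and LookupTask, and this is not River code.
> > > > > > public class OrderedDiscovery {
> > > > > >     private final ExecutorService executor = Executors.newCachedThreadPool();
> > > > > >
> > > > > >     /** The lookup is submitted from inside the registration task, so it
> > > > > >      *  cannot run before registration has completed. There is no window,
> > > > > >      *  as with TaskManager.runAfter, where the dependency has been taken
> > > > > >      *  off the queue but has not actually succeeded. */
> > > > > >     public void startDiscovery(final Runnable registerListener, final Runnable lookup) {
> > > > > >         executor.submit(new Runnable() {
> > > > > >             public void run() {
> > > > > >                 try {
> > > > > >                     registerListener.run();   // register for ServiceEvents first
> > > > > >                     executor.submit(lookup);  // only now take the state snapshot
> > > > > >                 } catch (RuntimeException e) {
> > > > > >                     // Registration failed: schedule a retry of the whole
> > > > > >                     // sequence; the lookup still cannot overtake it.
> > > > > >                 }
> > > > > >             }
> > > > > >         });
> > > > > >     }
> > > > > > }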
> > > > >
> > > > > When do you claim that this happens? And what currently happens now that is unacceptable? What is the concrete, observable problem that you're trying to solve, that justifies introducing failures that require further work?