[ https://issues.apache.org/jira/browse/PROTON-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827294#comment-16827294 ]
Chuck Rolke commented on PROTON-2033:
-------------------------------------

Working through the analysis shows that post-0.27 proton did not 'break dispatch self tests'.
* Dispatch had a latent bug, and (presumably) some proton timing changes exposed it.
* Dispatch self tests were unprepared for the python receiver to get on_settled callbacks.

That said, before this issue is dismissed, the timing change and the on_settled callbacks deserve some discussion.

h3. Test Overview

For review, here is how the test works:
{noformat}
              +--------------------------+           +-------------------+
              |          INT.A           |           |        EA1        |
+----------+  +-------+          +-------+    edge   |           +-------+  +--------+
| receiver |<-+ 21001 |          | 21000 |<----------*           | 21002 |<-+ sender |
+----------+  +-------+          +-------+           |           +-------+  +--------+
              |         interior         |           |       edge        |
              +--------------------------+           +-------------------+
            ^                                  ^                          ^
          CONN1                              CONN2                      CONN3
{noformat}
From a single reactor container, a proton python receiver connects to interior router INT.A (port 21001) and a proton python sender connects to edge router EA1 (port 21002). Messages flow from the sender through EA1 and INT.A to the receiver.
* The test injects some number of messages at the sender and verifies that they are all accepted.
* The receiver link is then closed (its connection stays open) and the sender injects 50 more messages.
* The expectation is that all 50 messages are released.

h3. Timing Change

A significant change was measured in the interval between the receiver client calling receiver.close() and INT.A port 21001 receiving the corresponding detach.
||Time (ms)||0.27.x||master||
|run 1|3.2|14.5|
|run 2|3.3|17.3|

These timings are reflected in the self test behavior:
* The 0.27.x code is "so fast" that it detaches the receiver before any of the 50 messages in the storm get through INT.A. All of the messages are released by the router network; none are released or modified by the receiver.
* The master branch is "so slow" that INT.A sends one, or possibly up to a dozen, messages to the receiver before the detach arrives. The receiver is now handling its own close while messages are still arriving.
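Whether a run follows the 0.27.x or the master pattern above shows up only in the outcomes the sender gets back. A minimal sketch of a sender-side check, assuming a hypothetical `OutcomeTally` helper that is not part of proton or of the dispatch test suite, keeps the strict 'all 50 released' expectation but reports every AMQP outcome separately, so a failing run says which outcome actually appeared. Outcomes are plain strings here so the sketch runs without a broker; in a proton python sender they would have to be derived from `event.delivery.remote_state`. As far as I can tell, `MessagingHandler.on_released` in the python binding fires for both RELEASED and MODIFIED deliveries, so a handler that only counts `on_released` calls cannot tell the two apart.

```python
from collections import Counter

# AMQP 1.0 terminal outcomes a sender can see for a delivery.
OUTCOMES = ("accepted", "rejected", "released", "modified")


class OutcomeTally:
    """Hypothetical test helper: counts the outcomes the sender sees."""

    def __init__(self):
        self.counts = Counter()

    def record(self, outcome):
        # Assumption: a real proton sender maps event.delivery.remote_state
        # to one of the strings above before calling this.
        if outcome not in OUTCOMES:
            raise ValueError("unknown outcome: %r" % (outcome,))
        self.counts[outcome] += 1

    def all_released(self, expected):
        # Strict expectation from the test: every message comes back 'released'.
        total = sum(self.counts.values())
        return total == expected and self.counts["released"] == expected

    def report(self):
        # Name every outcome so a failure shows which one actually appeared.
        return ", ".join("%s=%d" % (name, self.counts[name]) for name in OUTCOMES)


if __name__ == "__main__":
    # Made-up split resembling the slow-detach case: a few messages reached
    # the closing receiver and came back 'modified' instead of 'released'.
    tally = OutcomeTally()
    for outcome in ["released"] * 47 + ["modified"] * 3:
        tally.record(outcome)
    print(tally.all_released(50))  # False
    print(tally.report())  # accepted=0, rejected=0, released=47, modified=3
```

The 47/3 split in the `__main__` block is invented for illustration and is not a measured result.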
With the master branch, the sender therefore sees some messages as 'modified' and not just 'released'.

Is there an explanation for why the client's response to receiver.close() is so much slower than with 0.27.x?

h3. Receiver on_settled callbacks

These callbacks appear when the receiver is closing while messages are still arriving. This may not be a new issue; the dispatch self tests simply never had to deal with them before. If it *is* a new issue, some consideration must be given to how existing clients (such as the dispatch self tests) must be modified to handle them.

> qpid-proton changes 0.27.x to master(9ff0a) break qpid-dispatch self test
> -------------------------------------------------------------------------
>
> Key: PROTON-2033
> URL: https://issues.apache.org/jira/browse/PROTON-2033
> Project: Qpid Proton
> Issue Type: Bug
> Components: proton-c, python-binding
> Affects Versions: proton-c-0.28.0
> Environment: Fedora 29, python 3.7, proton and dispatch debug builds
> Reporter: Chuck Rolke
> Priority: Major
>
> While cleaning up qpid-dispatch for a 1.6 follow-on release an intermittent
> failure in one of the tests was exposed. Analysis of the failure is in
> https://issues.apache.org/jira/browse/DISPATCH-1322 in an attached text
> document
> https://issues.apache.org/jira/secure/attachment/12965891/DISPATCH-1322-analysis.txt
> Another part of the cleanup was fixing the failing test to better report what
> went wrong. See https://issues.apache.org/jira/browse/DISPATCH-1318
> These issues happened using qpid-proton master @9ff0a. Temporarily the
> qpid-proton version was reverted to branch 0.27.x @560ba. Using proton 0.27.x
> branch the qpid-dispatch self test does not fail. It passed several thousand
> times in a loop. Moving proton back to the master branch brought the dispatch
> failures back.
> From the qpid-dispatch perspective this issue is serious. If a receiver
> detaches a link while a sender is sending to it then the sender may lose a
> disposition.
> No attempt yet has been made to bisect proton to see where the different
> behavior starts showing up. That may be the next best avenue of research.
> Another theory for the error is that qpid-dispatch has been mishandling links
> all along.
> To help expose the problem one may clone
> [https://github.com/ChugR/qpid-dispatch] , or add it as a remote, and then
> check out branch DISPATCH-1318-cwip. Test system_tests_edge_router has been
> gutted to run only test_12 in a loop and to print details of the error(s).
> {{cd build}}
> {{make}}
> {{ctest -VV -R edge}}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)