[ https://issues.apache.org/jira/browse/DISPATCH-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337405#comment-17337405 ]
Charles E. Rolke commented on DISPATCH-2081:
--------------------------------------------

After some off-line discussion the proposal is to implement option #1:
{code:java}
Track client credits and when the credit drops to zero then let that satisfy the ongoing drain cycle. Do this even without receiving a flow with drain=true.
{code}
The AMQP spec does not specify that a client MUST return a flow with drain=true after it has consumed the remaining credits in response to receiving a drain=true.

Any suggestions about how to force the router to send a flow with drain=true, especially in the existing test environment, are welcome.

> Fallback test fail - router not detecting drained link
> ------------------------------------------------------
>
>                 Key: DISPATCH-2081
>                 URL: https://issues.apache.org/jira/browse/DISPATCH-2081
>             Project: Qpid Dispatch
>          Issue Type: Bug
>          Components: Routing Engine
>    Affects Versions: 1.15.0
>         Environment: h4.
>            Reporter: Charles E. Rolke
>            Priority: Major
>
> h3. History
> The fallback dest tests, particularly the SwitchoverTest subclasses, have had
> a long history of persistent, intermittent failures. See DISPATCH-1361 and
> DISPATCH-1786. CI tests running on Ubuntu xenial fail more frequently than
> on any other platform.
> h3. Recreating the failure
> The only way to get any clue at all is to get access to the router logs after
> a test failure. On the CI systems this is not an option.
> A reproducer was created that usually fails before 1000 switchover tests have
> run. This is an Ubuntu xenial docker image that is run with *--cpus=0.8*. This
> makes a slow system slower still, which gets the internal scheduling just
> right. Then loop on *ctest -VV -R fallback_dest*. After the test finally
> fails, get the log files out of the docker image.
> h3. Analyzing the logs
> h4. Get the Scraper web page
> Run command
> {{ scraper -f I*.log E*.log > fallback_dest.html}}
> Then view the resulting web page.
> h4. Navigating the web page
> Nice web page. Now what?
> The tests are designed to help you a little here.
> The failing case was test_35. This test uses router address *dest.35* for
> link sources and targets, making the test pretty easy to isolate in the more
> than 1,000,000 lines of web page. The address appears early on in lists of
> addresses and then happens for real in an attach launched by the self test.
> h3. What happened?
> * This test sets up a sender to INTA, a primary receiver to INTB and a
> fallback receiver in EA1.
> * Surprisingly, the fallback receiver connects before the primary receiver
> despite the order in the test source code. Not to worry.
> * Then the test sends 300 messages that are received and accepted by the
> primary receiver.
> * The primary receiver closes.
> * The sender starts sending 300 messages to the fallback receiver.
> * These messages go into INTA and get forwarded to INTB. INTB has no
> destination for them so they are released.
> * When the sender gets the released status it sends more.
> * Pretty soon the sender has sent 1,700 messages.
> * Somewhere along the way INTB deletes address M0dest.35.
> * Eventually router INTA sends a DRAIN to the sender.
> * The test sender sends enough messages to consume the remaining credit.
> * Then all message traffic stops.
> * The test sits there for a minute and then times out.
> h3. What went wrong
> It looks like the router started a drain cycle with the sender but the sender
> never sent a FLOW back with drain=true.
> Proton python does not spontaneously send a flow with drain=true. It is up to
> the application, in this case the fallback_dest self test code, to do that.
> Furthermore, if the application has consumed all the credit then proton will
> not send a flow with drain=true even if sender.drained() is called. Proton
> python sends the flow only if the drained function consumed any credits
> outside of message flow.
> If the router is waiting for a flow then with this test setup it will never
> come.
> Note: Now that the issue is known to be drain related, the web page helps
> find the drain. In the Table of Contents click on the link for Noteworthy Log
> Lines. There was one 'Flow with drain set' entry. Clicking on the lozenge
> shows the line number link. Clicking on that link takes you to the flow
> performative for the router issuing it.
> h3. What's the fix?
> # Track client credits and when the credit drops to zero then let that
> satisfy the ongoing drain cycle. Do this even without receiving the flow with
> drain=true.
> # Don't send a drain to begin with. Come up with another way of dealing
> with the client's stream of messages internally that does not involve a drain.
> # The test client could be gimmicked to detect when it has consumed all but
> one credit. Then it could call drained() so proton python could consume the
> last credit via a drain cycle and send the AMQP flow with drain=true. This
> may work to get the test to pass but it won't help people in the real world
> who use the proton python client.
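
The credit-tracking idea in option #1 can be sketched as a small standalone Python model. This is only an illustration under stated assumptions: `DrainTracker`, `simulate`, and the `credit_zero_completes` switch are hypothetical names made up for this sketch, not qpid-dispatch or Proton APIs, and the model ignores everything about a link except its credit count. It mimics the sender described above, which consumes all of its credit by sending messages and never replies with a flow carrying drain=true.

```python
from dataclasses import dataclass


@dataclass
class DrainTracker:
    """Hypothetical per-link bookkeeping on the router side (illustration only)."""
    credit_zero_completes: bool  # True models option #1, False models the stall
    credit: int = 0              # credit the sender is believed to still hold
    draining: bool = False       # True while a drain cycle is outstanding

    def issue_credit(self, amount: int) -> None:
        self.credit += amount

    def start_drain(self) -> None:
        # The router has sent a flow with drain=true to the sender.
        self.draining = True
        self._maybe_complete()

    def on_transfer(self) -> None:
        # Each message received from the sender consumes one credit.
        if self.credit > 0:
            self.credit -= 1
        self._maybe_complete()

    def on_drained_flow(self) -> None:
        # Conventional completion: the sender replied with flow drain=true.
        self.credit = 0
        self.draining = False

    def _maybe_complete(self) -> None:
        # Option #1: exhausted credit satisfies the drain without any flow.
        if self.credit_zero_completes and self.draining and self.credit == 0:
            self.draining = False


def simulate(credit_zero_completes: bool, credit: int = 5) -> bool:
    """Return True if the drain cycle is still pending after the sender
    has used all of its credit without sending a drain=true flow."""
    tracker = DrainTracker(credit_zero_completes)
    tracker.issue_credit(credit)
    tracker.start_drain()
    for _ in range(credit):
        tracker.on_transfer()
    return tracker.draining


if __name__ == "__main__":
    print("still draining without option #1:", simulate(False))
    print("still draining with option #1:", simulate(True))
```

In this model the drain cycle stays pending forever when only an explicit drained flow can end it, which matches the timeout seen in the logs. Letting zero credit complete the cycle ends it as soon as the last credit is used.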