[ https://issues.apache.org/jira/browse/NIFI-5581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Kawamura reopened NIFI-5581:
---------------------------------
Assignee: Koji Kawamura  (was: Mark Payne)

I've researched this failure and found the root cause. We should avoid using 1xx responses, as that range is reserved for the HTTP protocol itself, not for the application layer. Reopening this JIRA. I'll submit a PR shortly. Please see the following findings for detail:

h2. How status 100 works per the HTTP spec

1. The client sends a PUT without a body, with an 'Expect: 100-continue' header
2. The server checks its availability; if available, it responds with status 100 and waits for more bytes to come
3. The client sends the body, using the same connection
4. The server reads the body, then returns the final response, status 200

h2. NiFi cluster replication as 2-phase commit

1. The end user (browser) sends a PUT to a NiFi server (NiFi-A)
2. NiFi-A (client) sends a PUT with a body to all nodes including itself (NiFi-A'), expecting status 150
3. NiFi-A' (server) validates the request; if successful, it returns status 150
4. NiFi-A (client) confirms status 150, and sends another PUT with X-Execution-Continue: true
5. NiFi-A' (server) continues execution
6. NiFi-A returns the response to the original request
7. The end user (browser) receives 200

Here, NiFi uses 150 as a custom protocol, but it differs from the official 100 behavior: while the official status 100 exchange finishes within a single HTTP transaction, NiFi splits the entire 2-phase-commit protocol into 2 PUT requests.

Some Jetty code takes different code paths depending on the response status code, and this affects how Jetty shuts down a connection after it processes a request.
[https://github.com/eclipse/jetty.project/blob/jetty-9.4.x/jetty-http/src/main/java/org/eclipse/jetty/http/HttpParser.java#L1175]

h2. What went wrong?

- As Jetty sees that NiFi returns a 150 response code, it keeps the connection open
- The 2nd PUT request is then read by Jetty as a continuing payload of the previous PUT request
- This violates Jetty's parser state and produces undefined results, such as never returning a response, which leads to the SocketTimeoutException
- Turning off connection pooling ensures that each PUT request uses a different connection, so Jetty treats the 2nd PUT request as a separate request and processes it as expected

h2. Why had it been working before upgrading Jetty?

This Jetty PR changed how it closes connections.
[https://github.com/eclipse/jetty.project/pull/2338]

Specifically, the commit removed this block. If I bring this block back, the OkHttp replicator works even if it uses the connection pool.
[https://github.com/eclipse/jetty.project/pull/2338/files#diff-0d18b8e1bcedaef338f6ac601fcf5e6bL255]

h2. How should we address this?

I think we should use 202 Accepted instead of 150.

> Seeing timeouts when trying to replicate requests across the cluster
> --------------------------------------------------------------------
>
>                 Key: NIFI-5581
>                 URL: https://issues.apache.org/jira/browse/NIFI-5581
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.8.0
>            Reporter: Mark Payne
>            Assignee: Koji Kawamura
>            Priority: Blocker
>             Fix For: 1.8.0
>
>
> When trying to replicate requests across the cluster on the current master
> branch, I see everything go smoothly for GET requests, but all mutable
> requests time out.
> This issue appears to have been introduced by the upgrade to a new version of
> Jetty.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
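As a footnote to the analysis above: the reason 202 Accepted is safe where 150 is not is that any 1xx status is informational and non-final, whereas 2xx statuses are final and cleanly terminate the HTTP exchange, so a pooling client can safely reuse the connection afterwards. A minimal sketch of that classification (a hypothetical helper for illustration, not NiFi or Jetty code):

```java
public class StatusCodes {

    // 1xx responses are informational: they do not terminate the exchange,
    // so an HTTP stack may legitimately keep reading on the same connection.
    static boolean isInformational(int status) {
        return status >= 100 && status < 200;
    }

    // 2xx-5xx statuses are final: the request/response cycle is complete and
    // a connection-pooling client can reuse the connection for the next request.
    static boolean isFinal(int status) {
        return status >= 200 && status < 600;
    }

    public static void main(String[] args) {
        // NiFi's custom 150 falls in the informational range.
        System.out.println("150 informational=" + isInformational(150)
                + " final=" + isFinal(150));
        // 202 Accepted is a final status.
        System.out.println("202 informational=" + isInformational(202)
                + " final=" + isFinal(202));
    }
}
```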