[ https://issues.apache.org/jira/browse/NIFI-5581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Kawamura reopened NIFI-5581:
---------------------------------
Assignee: Koji Kawamura  (was: Mark Payne)

I've researched this failure and found the root cause. We should avoid using 1xx responses, as that range is reserved for the HTTP protocol itself, not for the application layer. Reopening this JIRA. I'll submit a PR shortly. Please see the following findings for detail:

h2. How status 100 works per the HTTP spec

1. The client sends a PUT without a body, with an 'Expect: 100-continue' header
2. The server checks its availability; if available, it responds with status 100 and waits for more bytes to come
3. The client sends the body, using the same connection
4. The server reads the body, then returns the final response, status 200

h2. NiFi cluster replication as 2-phase commit

1. The end user (browser) sends a PUT to a NiFi server (NiFi-A)
2. NiFi-A (client) sends a PUT with a body to all nodes including itself (NiFi-A'), expecting status 150
3. NiFi-A' (server) validates the request; if successful, it returns status 150
4. NiFi-A (client) confirms status 150, and sends another PUT with X-Execution-Continue: true
5. NiFi-A' (server) continues execution
6. NiFi-A returns the response to the original request
7. The end user (browser) receives 200

Here, NiFi uses 150 as a custom protocol, but it differs from the official 100 behavior: while the official status 100 exchange finishes within a single HTTP transaction, NiFi splits the entire 2-phase-commit protocol into 2 PUT requests.

Some Jetty code takes different code paths depending on the response status code, and this affects how Jetty shuts down a connection after it processes a request.
[https://github.com/eclipse/jetty.project/blob/jetty-9.4.x/jetty-http/src/main/java/org/eclipse/jetty/http/HttpParser.java#L1175]

h2. What went wrong?

- As Jetty sees that NiFi returns a 150 response code, it keeps the connection open
- The 2nd PUT request is then read by Jetty as a continuing payload of the previous PUT request
- This violates Jetty's parser state and produces undefined results, such as never returning a response, which leads to the SocketTimeoutException
- Turning off connection pooling ensures that each PUT request uses a different connection, so Jetty treats the 2nd PUT request as a separate request and processes it as expected

h2. Why had it been working before upgrading Jetty?

This Jetty PR changed how it closes connections.
[https://github.com/eclipse/jetty.project/pull/2338]

Specifically, the commit removed this block. If I bring this block back, the OkHttp replicator works even if it uses the connection pool.
[https://github.com/eclipse/jetty.project/pull/2338/files#diff-0d18b8e1bcedaef338f6ac601fcf5e6bL255]

h2. How should we address this?

I think we should use 202 Accepted instead of 150.

> Seeing timeouts when trying to replicate requests across the cluster
> --------------------------------------------------------------------
>
>                 Key: NIFI-5581
>                 URL: https://issues.apache.org/jira/browse/NIFI-5581
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.8.0
>            Reporter: Mark Payne
>            Assignee: Koji Kawamura
>            Priority: Blocker
>             Fix For: 1.8.0
>
>
> When trying to replicate requests across the cluster on the current master
> branch, I see everything go smoothly for GET requests, but all mutable
> requests time out.
> This issue appears to have been introduced by the upgrade to a new version of
> Jetty.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
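As a footnote to the analysis above: the reason 202 Accepted is safe where 150 is not is that any 1xx status is informational and non-final, whereas 2xx statuses are final and cleanly terminate the HTTP exchange, so a pooling client can safely reuse the connection afterwards. A minimal sketch of that classification (a hypothetical helper for illustration, not NiFi or Jetty code):

```java
public class StatusCodes {

    // 1xx responses are informational: they do not terminate the exchange,
    // so an HTTP stack may legitimately keep reading on the same connection.
    static boolean isInformational(int status) {
        return status >= 100 && status < 200;
    }

    // 2xx-5xx statuses are final: the request/response cycle is complete and
    // a connection-pooling client can reuse the connection for the next request.
    static boolean isFinal(int status) {
        return status >= 200 && status < 600;
    }

    public static void main(String[] args) {
        // NiFi's custom 150 falls in the informational range.
        System.out.println("150 informational=" + isInformational(150)
                + " final=" + isFinal(150));
        // 202 Accepted is a final status.
        System.out.println("202 informational=" + isInformational(202)
                + " final=" + isFinal(202));
    }
}
```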