[ https://issues.apache.org/jira/browse/TINKERPOP-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569894#comment-17569894 ]
ASF GitHub Bot commented on TINKERPOP-2769: ------------------------------------------- jroimartin commented on PR #1761: URL: https://github.com/apache/tinkerpop/pull/1761#issuecomment-1192356654 @kenhuuu I tested the PR and I can confirm that both the gremlin-go and the gremlin-python reproducers are not able to trigger the issue anymore. Thanks! > gremlin-server does not reply with a timeout response to all timed out > requests > ------------------------------------------------------------------------------- > > Key: TINKERPOP-2769 > URL: https://issues.apache.org/jira/browse/TINKERPOP-2769 > Project: TinkerPop > Issue Type: Bug > Affects Versions: 3.6.0, 3.5.3, 3.6.1, 3.5.4 > Reporter: Roi Martin > Priority: Major > > The following reproducer hangs because not all timed out requests are > reported by gremlin-server. > *main.go* > {code:go} > package main > import ( > "log" > "sync" > "sync/atomic" > gremlingo "github.com/apache/tinkerpop/gremlin-go/v3/driver" > ) > const ( > gremlinServerEndpoint = "ws://localhost:8182/gremlin" > nWorkers = 250 > ) > func main() { > conn, err := gremlingo.NewDriverRemoteConnection(gremlinServerEndpoint, > func(settings *gremlingo.DriverRemoteConnectionSettings) { > settings.LogVerbosity = gremlingo.Off > }) > if err != nil { > log.Fatalf("could not create remote connection: %v", err) > } > defer conn.Close() > g := gremlingo.Traversal_().WithRemote(conn) > var ( > wg sync.WaitGroup > nPendingRequests int32 > ) > for i := 0; i < nWorkers; i++ { > wg.Add(1) > go func(id int) { > defer wg.Done() > atomic.AddInt32(&nPendingRequests, 1) > result, err := g. > Inject(1). > > SideEffect(&gremlingo.Lambda{"Thread.sleep(5_000)", "gremlin-groovy"}). > Next() > npr := atomic.AddInt32(&nPendingRequests, -1) > log.Printf("[%v] result=%v err=%v nPendingRequests=%v", > id, result, err, npr) > }(i) > } > log.Print("waiting") > wg.Wait() > log.Print("done") > } > {code} > Steps to reproduce the issue: > 1. Run gremlin-server: > {noformat} > docker run --rm -ti -p 8182:8182 tinkerpop/gremlin-server:3.5.4-SNAPSHOT 2>&1 > | tee log > {noformat} > 2. Execute the reproducer: > {noformat} > go run main.go > {noformat} > 3. Wait ~30s for the reproducer to hang. > All these tests has been performed against the default gremlin-server > configuration. It is important to note that in this machine the gremlin pool > is set to 8. Higher values may require to adjust the number of workers in the > reproducer to trigger the issue. > After following these steps, the output of the reproducer should look like > this: > {noformat} > 2022/06/29 16:08:47 waiting > 2022/06/29 16:08:52 [22] result=result{object=1 class=int64} err=<nil> > nPendingRequests=249 > ... > 2022/06/29 16:09:17 [154] result=result{object=1 class=int64} err=<nil> > nPendingRequests=208 > 2022/06/29 16:09:17 [243] result=result{object=1 class=int64} err=<nil> > nPendingRequests=207 > 2022/06/29 16:09:17 [226] result=<nil> err=E0903: there are no results left > nPendingRequests=206 > 2022/06/29 16:09:17 [164] result=<nil> err=E0903: there are no results left > nPendingRequests=204 > ... > 2022/06/29 16:09:17 [131] result=<nil> err=E0903: there are no results left > nPendingRequests=113 > {noformat} > On the other side, the gremlin-server logs should look like this: > {noformat} > [INFO] ? - Channel started at port 8182. > [WARN] ? - A timeout occurred during traversal evaluation of > [RequestMessage{, requestId=9d7ccc61-0e7e-4105-ae35-eb81a3895170, > op='bytecode', processor='traversal', args={gremlin=[[], [inject(1), > sideEffect(lambda[Thread.sleep(5_000)])]], aliases={g=g}}}] - consider > increasing the limit given to evaluationTimeout > [WARN] ? - A timeout occurred during traversal evaluation of > [RequestMessage{, requestId=6f2d939a-31c9-48fd-bb54-b2bfa1fbd4de, > op='bytecode', processor='traversal', args={gremlin=[[], [inject(1), > sideEffect(lambda[Thread.sleep(5_000)])]], aliases={g=g}}}] - consider > increasing the limit given to evaluationTimeout > ... > {noformat} > In this specific case, > - 43 requests finished successfully (250-207 = 43) > - 94 requests timed out (207-113 = 94) > - 113 requests never finished (making the reproducer hang) > If we check the gremlin-server logs, the number of reported timeouts matches > the output of the reproducer, which makes me think that gremlin-server is the > culprit. > {noformat} > $ grep 'A timeout occurred' log | wc -l > 94 > {noformat} > Inspecting the hanging goroutines in the reproducer shows that they are IO > waiting: > {noformat} > (dlv) grs > ... > Goroutine 373 - User: /home/n/goroot/src/net/fd_posix.go:55 > net.(*netFD).Read (0x684273) [IO wait] > ... > (dlv) gr 373 > Switched from 0 to 373 (thread 709858) > (dlv) bt > 0 0x000000000043d212 in runtime.gopark > at /home/n/goroot/src/runtime/proc.go:362 > 1 0x0000000000435a4a in runtime.netpollblock > at /home/n/goroot/src/runtime/netpoll.go:522 > 2 0x0000000000465e05 in internal/poll.runtime_pollWait > at /home/n/goroot/src/runtime/netpoll.go:302 > 3 0x00000000004ff2c8 in internal/poll.(*pollDesc).wait > at /home/n/goroot/src/internal/poll/fd_poll_runtime.go:83 > 4 0x00000000004ff377 in internal/poll.(*pollDesc).waitRead > at /home/n/goroot/src/internal/poll/fd_poll_runtime.go:88 > 5 0x0000000000500f8f in internal/poll.(*FD).Read > at /home/n/goroot/src/internal/poll/fd_unix.go:167 > 6 0x0000000000684273 in net.(*netFD).Read > at /home/n/goroot/src/net/fd_posix.go:55 > 7 0x000000000069ce49 in net.(*conn).Read > at /home/n/goroot/src/net/net.go:183 > 8 0x00000000005d9eb4 in bufio.(*Reader).fill > at /home/n/goroot/src/bufio/bufio.go:106 > 9 0x00000000005da278 in bufio.(*Reader).Peek > at /home/n/goroot/src/bufio/bufio.go:144 > 10 0x00000000008298a9 in github.com/gorilla/websocket.(*Conn).read > at /home/n/go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:371 > 11 0x000000000082d5fd in github.com/gorilla/websocket.(*Conn).advanceFrame > at /home/n/go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:809 > 12 0x000000000082f1d1 in github.com/gorilla/websocket.(*Conn).NextReader > at /home/n/go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:1009 > 13 0x000000000082fccb in github.com/gorilla/websocket.(*Conn).ReadMessage > at /home/n/go/pkg/mod/github.com/gorilla/websocket@v1.5.0/conn.go:1093 > 14 0x00000000008418fb in > github.com/apache/tinkerpop/gremlin-go/v3/driver.(*gorillaTransporter).Read > at > /home/n/go/pkg/mod/github.com/apache/tinkerpop/gremlin-go/v3@v3.5.4-rc4.0.20220628161437-04f47d242010/driver/gorillaTransporter.go:116 > 15 0x000000000085a302 in > github.com/apache/tinkerpop/gremlin-go/v3/driver.(*gremlinServerWSProtocol).readLoop > at > /home/n/go/pkg/mod/github.com/apache/tinkerpop/gremlin-go/v3@v3.5.4-rc4.0.20220628161437-04f47d242010/driver/protocol.go:57 > 16 0x000000000085c2e7 in > github.com/apache/tinkerpop/gremlin-go/v3/driver.newGremlinServerWSProtocol.func1 > at > /home/n/go/pkg/mod/github.com/apache/tinkerpop/gremlin-go/v3@v3.5.4-rc4.0.20220628161437-04f47d242010/driver/protocol.go:196 > 17 0x000000000046a841 in runtime.goexit > at /home/n/goroot/src/runtime/asm_amd64.s:1571 > {noformat} > The reason why the go program does not finish is that gremlin-server keeps > responding to pings, therefore gremlin-go's pong handler keeps extending the > read deadline of the connection. > */tinkerpop/gremlin-go/driver/gorillaTransporter.go* > {code:go} > transporter.connection.SetPongHandler(func(string) error { > err := transporter.connection.SetReadDeadline(time.Now().Add(2 > * transporter.connSettings.keepAliveInterval)) > if err != nil { > return err > } > return nil > }) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)