[ https://issues.apache.org/jira/browse/TINKERPOP-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564378#comment-17564378 ]

Roi Martin edited comment on TINKERPOP-2769 at 7/8/22 5:51 PM:
---------------------------------------------------------------

Please find below a gremlin-python reproducer:

 

*stress.bash*
{code:sh}
#!/bin/bash

# Launch 250 concurrent reproducer processes.
for i in $(seq 1 250); do
        python3 repro.py &
        pids[${i}]=$!
done

wait ${pids[*]}
{code}
 

*repro.py*
{code:python}
#!/usr/bin/env python3

from datetime import datetime

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

g = traversal().with_remote(DriverRemoteConnection('ws://localhost:8182/gremlin', 'g'))

try:
    start = datetime.now()
    # The lambda is evaluated server-side as Groovy, so each request holds a
    # gremlin pool thread for 5 seconds.
    result = g.inject(1).sideEffect(lambda: "Thread.sleep(5000)").next()
    print(f"result={result} t={datetime.now() - start}")
except Exception as e:
    print(e)
{code}
 

Steps to reproduce the issue:

1. Run gremlin-server:
{noformat}
docker run --rm -ti -p 8182:8182 tinkerpop/gremlin-server:3.5.4-SNAPSHOT 2>&1 | tee log
{noformat}
2. Execute the reproducer:
{noformat}
bash stress.bash
{noformat}
3. Wait ~30s and check that some python3 processes hang. The number of hanging python3 processes should be:
{noformat}
# hanging processes = # total processes - # processes ending successfully - # timeouts in gremlin-server logs
{noformat}
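
As a worked example, plugging in the counts reported in the Go run quoted below (250 total, 43 successful, 94 timeouts logged):
{noformat}
113 hanging processes = 250 - 43 - 94
{noformat}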
 

Please note that I was not able to reproduce the issue reliably using threads in Python. Also, on this machine the gremlin pool is set to 8; higher values may require adjusting the number of processes in the reproducer to trigger the issue.
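
For intuition, a rough back-of-the-envelope check (assuming the default evaluationTimeout of 30s, which matches the ~30s wait in step 3): with a gremlin pool of 8 threads and a 5s sleep per request, the server can complete roughly
{noformat}
8 threads * (30s / 5s per request) = 48 requests
{noformat}
before the queueing delay of the remaining requests exceeds the timeout. This is close to the 43 successful requests observed in the Go run quoted below; every remaining request should then receive a timeout response, but only some of them do.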



> gremlin-server does not reply with a timeout response to all timed out requests
> -------------------------------------------------------------------------------
>
>                 Key: TINKERPOP-2769
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2769
>             Project: TinkerPop
>          Issue Type: Bug
>    Affects Versions: 3.6.0, 3.5.3, 3.6.1, 3.5.4
>            Reporter: Roi Martin
>            Priority: Major
>
> The following reproducer hangs because not all timed-out requests are reported by gremlin-server.
> *main.go*
> {code:go}
> package main
>
> import (
> 	"log"
> 	"sync"
> 	"sync/atomic"
>
> 	gremlingo "github.com/apache/tinkerpop/gremlin-go/v3/driver"
> )
>
> const (
> 	gremlinServerEndpoint = "ws://localhost:8182/gremlin"
> 	nWorkers              = 250
> )
>
> func main() {
> 	conn, err := gremlingo.NewDriverRemoteConnection(gremlinServerEndpoint, func(settings *gremlingo.DriverRemoteConnectionSettings) {
> 		settings.LogVerbosity = gremlingo.Off
> 	})
> 	if err != nil {
> 		log.Fatalf("could not create remote connection: %v", err)
> 	}
> 	defer conn.Close()
>
> 	g := gremlingo.Traversal_().WithRemote(conn)
>
> 	var (
> 		wg               sync.WaitGroup
> 		nPendingRequests int32
> 	)
> 	for i := 0; i < nWorkers; i++ {
> 		wg.Add(1)
> 		go func(id int) {
> 			defer wg.Done()
> 			atomic.AddInt32(&nPendingRequests, 1)
> 			result, err := g.
> 				Inject(1).
> 				SideEffect(&gremlingo.Lambda{"Thread.sleep(5_000)", "gremlin-groovy"}).
> 				Next()
> 			npr := atomic.AddInt32(&nPendingRequests, -1)
> 			log.Printf("[%v] result=%v err=%v nPendingRequests=%v", id, result, err, npr)
> 		}(i)
> 	}
>
> 	log.Print("waiting")
> 	wg.Wait()
> 	log.Print("done")
> }
> {code}
> Steps to reproduce the issue:
> 1. Run gremlin-server:
> {noformat}
> docker run --rm -ti -p 8182:8182 tinkerpop/gremlin-server:3.5.4-SNAPSHOT 2>&1 | tee log
> {noformat}
> 2. Execute the reproducer:
> {noformat}
> go run main.go
> {noformat}
> 3. Wait ~30s for the reproducer to hang.
> All these tests have been performed against the default gremlin-server configuration. It is important to note that on this machine the gremlin pool is set to 8; higher values may require adjusting the number of workers in the reproducer to trigger the issue.
> After following these steps, the output of the reproducer should look like this:
> {noformat}
> 2022/06/29 16:08:47 waiting
> 2022/06/29 16:08:52 [22] result=result{object=1 class=int64} err=<nil> nPendingRequests=249
> ...
> 2022/06/29 16:09:17 [154] result=result{object=1 class=int64} err=<nil> nPendingRequests=208
> 2022/06/29 16:09:17 [243] result=result{object=1 class=int64} err=<nil> nPendingRequests=207
> 2022/06/29 16:09:17 [226] result=<nil> err=E0903: there are no results left nPendingRequests=206
> 2022/06/29 16:09:17 [164] result=<nil> err=E0903: there are no results left nPendingRequests=204
> ...
> 2022/06/29 16:09:17 [131] result=<nil> err=E0903: there are no results left nPendingRequests=113
> {noformat}
> On the server side, the gremlin-server logs should look like this:
> {noformat}
> [INFO] ? - Channel started at port 8182.
> [WARN] ? - A timeout occurred during traversal evaluation of [RequestMessage{, requestId=9d7ccc61-0e7e-4105-ae35-eb81a3895170, op='bytecode', processor='traversal', args={gremlin=[[], [inject(1), sideEffect(lambda[Thread.sleep(5_000)])]], aliases={g=g}}}] - consider increasing the limit given to evaluationTimeout
> [WARN] ? - A timeout occurred during traversal evaluation of [RequestMessage{, requestId=6f2d939a-31c9-48fd-bb54-b2bfa1fbd4de, op='bytecode', processor='traversal', args={gremlin=[[], [inject(1), sideEffect(lambda[Thread.sleep(5_000)])]], aliases={g=g}}}] - consider increasing the limit given to evaluationTimeout
> ...
> {noformat}
> In this specific case,
>  - 43 requests finished successfully (250-207 = 43)
>  - 94 requests timed out (207-113 = 94)
>  - 113 requests never finished (making the reproducer hang)
> If we check the gremlin-server logs, the number of reported timeouts matches the output of the reproducer, which makes me think that gremlin-server is the culprit.
> {noformat}
> $ grep 'A timeout occurred' log | wc -l
> 94
> {noformat}
> Inspecting the hanging goroutines in the reproducer shows that they are waiting on I/O:
> {noformat}
> (dlv) grs
> ...
>   Goroutine 373 - User: /home/n/goroot/src/net/fd_posix.go:55 net.(*netFD).Read (0x684273) [IO wait]
> ...
> (dlv) gr 373
> Switched from 0 to 373 (thread 709858)
> (dlv) bt
>  0  0x000000000043d212 in runtime.gopark
>     at /home/n/goroot/src/runtime/proc.go:362
>  1  0x0000000000435a4a in runtime.netpollblock
>     at /home/n/goroot/src/runtime/netpoll.go:522
>  2  0x0000000000465e05 in internal/poll.runtime_pollWait
>     at /home/n/goroot/src/runtime/netpoll.go:302
>  3  0x00000000004ff2c8 in internal/poll.(*pollDesc).wait
>     at /home/n/goroot/src/internal/poll/fd_poll_runtime.go:83
>  4  0x00000000004ff377 in internal/poll.(*pollDesc).waitRead
>     at /home/n/goroot/src/internal/poll/fd_poll_runtime.go:88
>  5  0x0000000000500f8f in internal/poll.(*FD).Read
>     at /home/n/goroot/src/internal/poll/fd_unix.go:167
>  6  0x0000000000684273 in net.(*netFD).Read
>     at /home/n/goroot/src/net/fd_posix.go:55
>  7  0x000000000069ce49 in net.(*conn).Read
>     at /home/n/goroot/src/net/net.go:183
>  8  0x00000000005d9eb4 in bufio.(*Reader).fill
>     at /home/n/goroot/src/bufio/bufio.go:106
>  9  0x00000000005da278 in bufio.(*Reader).Peek
>     at /home/n/goroot/src/bufio/bufio.go:144
> 10  0x00000000008298a9 in github.com/gorilla/websocket.(*Conn).read
>     at /home/n/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:371
> 11  0x000000000082d5fd in github.com/gorilla/websocket.(*Conn).advanceFrame
>     at /home/n/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:809
> 12  0x000000000082f1d1 in github.com/gorilla/websocket.(*Conn).NextReader
>     at /home/n/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:1009
> 13  0x000000000082fccb in github.com/gorilla/websocket.(*Conn).ReadMessage
>     at /home/n/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:1093
> 14  0x00000000008418fb in github.com/apache/tinkerpop/gremlin-go/v3/driver.(*gorillaTransporter).Read
>     at /home/n/go/pkg/mod/github.com/apache/tinkerpop/gremlin-go/[email protected]/driver/gorillaTransporter.go:116
> 15  0x000000000085a302 in github.com/apache/tinkerpop/gremlin-go/v3/driver.(*gremlinServerWSProtocol).readLoop
>     at /home/n/go/pkg/mod/github.com/apache/tinkerpop/gremlin-go/[email protected]/driver/protocol.go:57
> 16  0x000000000085c2e7 in github.com/apache/tinkerpop/gremlin-go/v3/driver.newGremlinServerWSProtocol.func1
>     at /home/n/go/pkg/mod/github.com/apache/tinkerpop/gremlin-go/[email protected]/driver/protocol.go:196
> 17  0x000000000046a841 in runtime.goexit
>     at /home/n/goroot/src/runtime/asm_amd64.s:1571
> {noformat}
> The reason why the Go program does not finish is that gremlin-server keeps responding to pings, so gremlin-go's pong handler keeps extending the read deadline of the connection.
> */tinkerpop/gremlin-go/driver/gorillaTransporter.go*
> {code:go}
>       transporter.connection.SetPongHandler(func(string) error {
>               err := transporter.connection.SetReadDeadline(time.Now().Add(2 * transporter.connSettings.keepAliveInterval))
>               if err != nil {
>                       return err
>               }
>               return nil
>       })
> {code}
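>
> A minimal standalone sketch of this mechanism (plain Go against an in-memory net.Pipe connection, not gremlin-go code; the 1s keep-alive interval is an assumption for the demo): a blocked Read whose deadline is pushed forward on every keep-alive never times out, even though no response will ever arrive.
> {code:go}
> package main
>
> import (
> 	"fmt"
> 	"net"
> 	"time"
> )
>
> func main() {
> 	// One end of an in-memory connection; the other end never writes,
> 	// mimicking a server that never sends the timeout response.
> 	client, _ := net.Pipe()
>
> 	keepAliveInterval := 1 * time.Second
> 	client.SetReadDeadline(time.Now().Add(2 * keepAliveInterval))
>
> 	// Stand-in for the pong handler: on every keep-alive tick, extend the
> 	// read deadline by another 2*keepAliveInterval.
> 	go func() {
> 		for range time.Tick(keepAliveInterval) {
> 			client.SetReadDeadline(time.Now().Add(2 * keepAliveInterval))
> 		}
> 	}()
>
> 	// Blocks forever: the deadline keeps moving and no data ever arrives.
> 	_, err := client.Read(make([]byte, 1))
> 	fmt.Println(err) // never reached
> }
> {code}
> Removing the extension goroutine makes the Read above fail after 2s with a deadline-exceeded error, i.e. without the unconditional extension the hanging clients would at least unblock with an i/o timeout.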


