[jira] [Created] (TINKERPOP-2820) gremlin-python _close_session race condition/FD leak

Alex Hamilton (Jira) Fri, 04 Nov 2022 12:34:06 -0700

Alex Hamilton created TINKERPOP-2820:
----------------------------------------


             Summary: gremlin-python _close_session race condition/FD leak
                 Key: TINKERPOP-2820
                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2820
             Project: TinkerPop
          Issue Type: Bug
          Components: python
    Affects Versions: 3.6.1
            Reporter: Alex Hamilton


There is a race condition in gremlin-python when closing session-based 
connections. The event loop opened for the session's connection is left open, 
so every session leaks that loop's file descriptors. After enough transactions 
this ends in `OSError: [Errno 24] Too many open files`.
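
One way to watch this kind of leak is to count the process's open file 
descriptors around an event loop's lifetime. The sketch below is not part of 
gremlin-python: `open_fd_count` is a hypothetical helper, it assumes a Linux or 
macOS host where `/proc/self/fd` or `/dev/fd` lists the process's descriptors, 
and it assumes the leaked loops behave like ordinary `asyncio` loops.

```py
import asyncio
import os


def open_fd_count():
    """Count this process's open file descriptors (hypothetical helper)."""
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            return len(os.listdir(fd_dir))
    raise RuntimeError("no per-process fd directory on this platform")


before = open_fd_count()
loop = asyncio.new_event_loop()  # allocates a selector and a wakeup socket pair
while_open = open_fd_count()
loop.close()                     # the step that never happens for an abandoned conn
after_close = open_fd_count()

print(before, while_open, after_close)
```

The count goes up while the loop exists and returns to the starting value only 
after `loop.close()`. Calling `open_fd_count()` before and after a batch of 
transactions should show whether the count keeps growing.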

The problem is completely contained in these two methods from the 
`gremlin_python.driver.client` module:

```py
def close(self):
    # prevent the Client from being closed more than once. it raises errors if
    # new jobby jobs get submitted to the executor when it is shutdown
    if self._closed:
        return

    if self._session_enabled:
        self._close_session()  # 1. (see below)
    log.info("Closing Client with url '%s'", self._url)
    while not self._pool.empty():  # 3. (see below)
        conn = self._pool.get(True)
        conn.close()
    self._executor.shutdown()
    self._closed = True

def _close_session(self):
    message = request.RequestMessage(
        processor='session', op='close',
        args={'session': str(self._session)})
    conn = self._pool.get(True)
    return conn.write(message).result()    # 2. (see below)
```

1. `close()` calls `_close_session()`.
2. `.result()` waits for the _write_ to finish, but does __not__ wait for the 
_read_ to finish. `conn` is not put back into `self._pool` until AFTER the read 
finishes (in `gremlin_python.driver.connection.Connection._receive()`), yet 
`_close_session()` has already returned and `close()` moves on to 3.
3. The pool is still empty at this point, so the while loop is never entered 
and no connection gets closed. This leaves the conn's event loop running, never 
to be closed.


I was able to solve this by modifying `_close_session` as follows:

```py
def _close_session(self):
    message = request.RequestMessage(
        processor='session', op='close',
        args={'session': str(self._session)})
    conn = self._pool.get(True)
    try:
        write_result_set = conn.write(message).result()
        return write_result_set.all().result() # wait for _receive() to finish
    except protocol.GremlinServerError:
        pass
```

I'm not sure if this is the correct solution, but wanted to point out the bug.
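
To make the ordering concrete, here is a toy model of the race and of the fix. 
It does not use gremlin-python at all: `ToyConnection`, `server_replied` and 
`conn_closed_after_close` are made-up names, the "read" is just a task that 
blocks on an event, and the only property carried over from the real driver is 
that the connection goes back into the pool when the read finishes.

```py
import queue
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class ToyConnection:
    """Stand-in for a pooled connection (not the gremlin-python class)."""

    def __init__(self, pool, executor):
        self._pool = pool
        self._executor = executor
        self.closed = False
        self.server_replied = threading.Event()

    def write(self, message):
        # The returned future resolves as soon as the write is done; its
        # result is a second future that tracks the pending read.
        read_done = self._executor.submit(self._receive)
        write_done = Future()
        write_done.set_result(read_done)
        return write_done

    def _receive(self):
        self.server_replied.wait()  # simulate waiting for the server's reply
        self._pool.put(self)        # conn re-enters the pool only after the read

    def close(self):
        self.closed = True


def conn_closed_after_close(wait_for_read):
    pool = queue.Queue()
    with ThreadPoolExecutor(max_workers=1) as executor:
        conn = ToyConnection(pool, executor)
        pool.put(conn)

        # Mirrors _close_session(): take the conn and send the close message.
        conn = pool.get(True)
        read_done = conn.write("close session").result()  # write only
        if wait_for_read:
            conn.server_replied.set()  # the reply arrives ...
            read_done.result()         # ... and we wait for the read to finish

        # Mirrors the drain loop in close().
        while not pool.empty():
            pool.get(True).close()

        conn.server_replied.set()  # let a still-pending read finish
    return conn.closed


print(conn_closed_after_close(wait_for_read=False))
print(conn_closed_after_close(wait_for_read=True))
```

This prints `False` for the original ordering (the drain loop sees an empty 
pool and the connection is never closed) and `True` once the read is awaited 
before draining.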


In the meantime, however, I wrote a context manager to handle this cleanup for 
me:

```py
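import contextlib
import logging

logger = logging.getLogger(__name__)


@contextlib.contextmanager
def transaction():
    # `g` is an existing remote traversal source, defined elsewhere
    tx = g.tx()
    gtx = tx.begin()

    try:
        yield tx, gtx
        tx.commit()
    except Exception:
        tx.rollback()
        raise  # re-raise so the caller still sees the failure
    finally:
        # close any connection left behind in the session client's pool
        while not tx._session_based_connection._client._pool.empty():
            conn = tx._session_based_connection._client._pool.get(True)
            conn.close()
            logger.info("Closed abandoned session connection")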


with transaction() as (tx, gtx):
    foo = gtx.some_traversal().to_list()
    # do something with foo
    gtx.some_other_traversal().iterate()
```

Cheers



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
