Alex Hamilton created TINKERPOP-2820:
----------------------------------------
Summary: gremlin-python _close_session race condition/FD leak
Key: TINKERPOP-2820
URL: https://issues.apache.org/jira/browse/TINKERPOP-2820
Project: TinkerPop
Issue Type: Bug
Components: python
Affects Versions: 3.6.1
Reporter: Alex Hamilton
There is a race condition in gremlin-python when closing session-based
connections that results in leaking file descriptors for event loops -
eventually leading to an `OSError [Errno 24] too many open files` error after
enough transactions occur.
The problem stems from a race condition when closing session based connections
that causes the event loop opened for the session's connection to be left open.
The problem is completely contained in these two methods from
`gremlin_python.driver.client.py`
```py
def close(self):
# prevent the Client from being closed more than once. it raises errors if
new jobby jobs
# get submitted to the executor when it is shutdown
if self._closed:
return
if self._session_enabled:
self._close_session() # 1. (see below)
log.info("Closing Client with url '%s'", self._url)
while not self._pool.empty(): # 3. (see below)
conn = self._pool.get(True)
conn.close()
self._executor.shutdown()
self._closed = True
def _close_session(self):
message = request.RequestMessage(
processor='session', op='close',
args={'session': str(self._session)})
conn = self._pool.get(True)
return conn.write(message).result() # 2. (see below)
```
1. `_close_session()` called
2. `.result()` waits for the _write_ to finish, but does __not__ wait for the
_read_ to finish. `conn` does not get put back into `self._pool` until AFTER
the read finishes (`gremlin_python.driver.connection.Connection._receive()`).
However, this method returns early and goes to 3.
3. this while loop is not entered to close out the connections. This leaves the
conn's event loop running, never to be closed.
I was able to solve this by modifying `_close_session` as follows:
```py
def _close_session(self):
message = request.RequestMessage(
processor='session', op='close',
args={'session': str(self._session)})
conn = self._pool.get(True)
try:
write_result_set = conn.write(message).result()
return write_result_set.all().result() # wait for _receive() to finish
except protocol.GremlinServerError:
pass
```
I'm not sure if this is the correct solution, but wanted to point out the bug.
In the meantime however, I wrote a context manager to handle this cleanup for me
```py
@contextlib.contextmanager
def transaction():
tx = g.tx()
gtx = tx.begin()
try:
yield tx, gtx
tx.commit()
except Exception as e:
tx.rollback()
finally:
while not tx._session_based_connection._client._pool.empty():
conn = tx._session_based_connection._client._pool.get(True)
conn.close()
logger.info("Closed abandoned session connection")
with transaction() as (tx, gtx):
foo = gtx.some_traversal().to_list()
# do something with foo
gtx.some_other_traversal().iterate()
```
Cheers
--
This message was sent by Atlassian Jira
(v8.20.10#820010)