Pete Fawcett created PROTON-2903:
------------------------------------

             Summary: Python Fatal Error in void2py ffi.from_handle
                 Key: PROTON-2903
                 URL: https://issues.apache.org/jira/browse/PROTON-2903
             Project: Qpid Proton
          Issue Type: Bug
          Components: python-binding
    Affects Versions: proton-c-0.39.0, proton-c-future
         Environment: qpid-proton 0.41.0-dev (current main branch)
Rocky Linux 9.5
Python 3.12.5 (CPython)
cffi (pip package) 1.17.1
            Reporter: Pete Fawcett
Experiencing sporadic, but not infrequent, "Python Fatal Error" crashes in the proton client. It seems to happen more often when the client is under load. I am investigating using the current main branch, but the relevant code appears to be unchanged since v0.39.0.

The error occurs in the call to {{ffi.from_handle}} within {{cproton.void2py}}. {{from_handle()}} is calling {{Py_FatalError()}} because the handle passed in, whilst being a {{<cdata void *>}} object, is not pointing to an object of the expected type ({{CDataOwningGC_Type}}).

I made a custom version of {{cffi}} that includes, in the error message, the type actually being pointed to. The results varied, but they all seemed to be valid types, i.e. not garbage. This suggests to me that the original object has been freed and its memory reused by a new object.

I am only beginning to learn about the use of CFFI handles, but from what I understand, when Python code creates a handle to a Python object using {{ffi.new_handle()}}, it is responsible for keeping the handle and the original object alive by maintaining a reference to the handle. This appears to be the intended function of {{cproton.retained_objects}}, but I think the current implementation is too simple. Some things that I think are contributing to the problem (illustrated in the first sketch below) are:

* {{retained_objects}} is a set, so it effectively provides a reference count of at most 1.
* {{retained_objects}} is added to on the 'Python side' following calls to {{new_handle}}, but also, indirectly, from the 'C side' via {{pn_pyref_incref}}.
* Different {{<cdata>}} instances can hash to the same value and so are treated as being the same within {{retained_objects}}. (I think the hash is based on the content, i.e. where a {{<cdata void *>}} is pointing.)

One potentially problematic call sequence I have seen is in {{cproton.pn_collector_put_pyref}}:

# A new handle ({{d}}) is created and added to {{retained_objects}}.
# {{d}} is passed to {{lib.pn_collector_put_py}}.
# Within that call, a callback is made to {{pn_pyref_incref}} with parameter {{obj}}.
# {{obj}} is a different instance to {{d}} (the {{id()}} differs) but it has the same hash, so the call to {{retained_objects.add(obj)}} doesn't add a new item.

So, if at some point {{pn_pyref_decref(obj)}} is called, it will release the original {{d}} object.

Whilst I think I can see why the problem might be happening, I am not sure how to fix it. In particular, I haven't yet worked out what would be the correct point to release the reference in {{retained_objects}}. Not releasing them at all would stop the fatal error but would effectively be a "memory leak", with the resulting problems. It might be that better reference counting is required, rather than just the single level afforded by {{retained_objects}} being a {{set}} (a sketch of this idea is included below) - but then there might not be "enough" {{decref}} calls to free the object at all. Perhaps releasing the handle within {{void2py}} would work, but that feels too risky without a better understanding of the intended lifetime.

I would appreciate some guidance on the original design thinking behind the current implementation, particularly around knowing when a handle is no longer required.

My current client program is non-trivial. It both receives and sends messages and also makes use of {{ApplicationEvent}}. I am looking at how to make a simpler client that can still produce the error; a rough outline of the shape of such a client is sketched at the end.
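To illustrate the hashing point from the bullet list above, here is a minimal standalone sketch (plain {{cffi}}, no proton involved; the {{Payload}} class is just an example) showing how two distinct cdata instances carrying the same address collapse to a single entry in a set, so a single {{discard}} can drop the only keep-alive reference:

{code:python}
import cffi

ffi = cffi.FFI()

class Payload:
    pass

obj = Payload()
d = ffi.new_handle(obj)       # owning cdata: keeps obj alive while d is referenced
p = ffi.cast("void *", d)     # distinct cdata instance, same address as d

retained_objects = set()
retained_objects.add(d)
retained_objects.add(p)       # no-op: p == d and hash(p) == hash(d)

print(id(d) == id(p))         # False - two different Python objects
print(len(retained_objects))  # 1     - the set treats them as one

retained_objects.discard(p)   # removes d, the only keep-alive reference
# If nothing else references d, the handle's memory can now be reclaimed,
# and a later ffi.from_handle() on that address is undefined behaviour.
{code}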
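On the reference-counting idea mentioned above: purely as an untested sketch (the names {{retain_pyref}} and {{release_pyref}} are made up for illustration, not proton API), the set could be replaced with an explicit count keyed on the address-equal handle, with the value keeping the owning cdata alive:

{code:python}
# Untested sketch of counted retention in place of the current set.
# retain_pyref / release_pyref are illustrative names only.
retained_objects = {}  # address-equal cdata key -> [owning handle, count]

def retain_pyref(handle):
    entry = retained_objects.get(handle)
    if entry is None:
        # Keep the handle itself in the value so the owning cdata stays
        # referenced even when a later key is a non-owning cast.
        retained_objects[handle] = [handle, 1]
    else:
        entry[1] += 1

def release_pyref(handle):
    entry = retained_objects.get(handle)
    if entry is not None:
        entry[1] -= 1
        if entry[1] == 0:
            del retained_objects[handle]
{code}

This still depends on the {{incref}}/{{decref}} calls balancing, which is exactly the uncertainty raised above about there not being "enough" {{decref}} calls.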
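For reference, a rough, untested skeleton of the shape such a reduced client might take (the URL, address, event name, and timing are placeholders; this is not the actual program):

{code:python}
# Rough, untested skeleton: receives, sends, and injects
# ApplicationEvents from another thread. Placeholders throughout.
import threading
import time

from proton.handlers import MessagingHandler
from proton.reactor import ApplicationEvent, Container, EventInjector

class Client(MessagingHandler):
    def __init__(self, url):
        super().__init__()
        self.url = url
        self.injector = EventInjector()

    def on_start(self, event):
        event.container.selectable(self.injector)
        self.receiver = event.container.create_receiver(self.url)
        self.sender = event.container.create_sender(self.url)

    def on_message(self, event):
        # Echo messages back; the error seems more frequent under load.
        self.sender.send(event.message)

    def on_custom(self, event):
        pass  # work driven by events injected from the other thread

def pump(injector):
    while True:
        injector.trigger(ApplicationEvent("custom"))
        time.sleep(0.001)

client = Client("amqp://localhost:5672/examples")
threading.Thread(target=pump, args=(client.injector,), daemon=True).start()
Container(client).run()
{code}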