On 6/6/2025 11:43 AM, Daniel P. Berrangé wrote:
On Fri, Jun 06, 2025 at 10:37:28AM -0500, JAEHOON KIM wrote:
On 6/6/2025 10:12 AM, Steven Sistare wrote:
On 6/6/2025 11:06 AM, JAEHOON KIM wrote:
On 6/6/2025 9:14 AM, Steven Sistare wrote:
On 6/6/2025 9:53 AM, Daniel P. Berrangé wrote:
On Thu, Jun 05, 2025 at 06:08:08PM -0500, Jaehoon Kim wrote:
When the source VM attempts to connect to the destination VM's Unix
domain socket(cpr.sock) during CPR transfer, the socket
file might not
yet be exist if the destination side hasn't completed the bind
operation. This can lead to connection failures when
running tests with
the qtest framework.
This sounds like a flawed test impl to me - whatever is initiating
the cpr operation on the source has done so prematurely - it should
ensure the dest is ready before starting the operation.
To address this, add cpr_validate_socket_path(), which wait for the
socket file to appear. This avoids intermittent qtest
failures caused by
early connection attempts.
IMHO it is dubious to special case cpr in this way.
I agree with Daniel, and unfortunately it is not just a test issue;
every management framework that supports cpr-transfer must add logic to
wait for the cpr socket to appear in the target before proceeding.
This is analogous to waiting for the monitor socket to appear before
connecting to it.
- Steve
Thank you very much for your valuable review and feedback.
Just to clarify, the added cpr_validate_socket_path() function is
not limited to the test framework.
It is part of the actual CPR implementation and is intended to
ensure correct behavior in all cases, including outside of tests.
I mentioned the qtest failure simply as an example where this issue
became apparent.
Yes, I understand that you understand :)
Are you willing to move your fix to the qtest?
- Steve
Thank you for your question and feedback.
I agree that the issue could be addressed within the qtest framework to
improve stability.
However, this socket readiness check is a fundamental part of CPR transfer
process,
and I believe that incorporating cpr_validate_socket_path() directly into
the CPR implementation helps ensure more reliable behavior
across all environments - not only during testing.
Just from my perspective, adding this logic to the CPR code does not
introduce significant overhead or side effects.
I would appreciate if you could share more details about your concerns, so I
can better address them.
Requiring a busy wait like this is a sign of a design problem.
There needs to be a way to setup the incoming socket listener
without resorting to busy waiting - that's showing a lack of
synchronization.
How is this a design problem? If I start a program that creates a listening
unix
domain socket, I cannot attempt to connect to it until the socket is created and
listening. Clients face the same issue when starting qemu and connecting to the
monitor socket.
- Steve