cyb70289 commented on pull request #12442: URL: https://github.com/apache/arrow/pull/12442#issuecomment-1085377998
Regarding the "unknown address 0" UCX error: it looks like it's related to the RDMA network devices installed in my test machine. When I spawn a clean VM for testing, there's no such error.

I set a breakpoint where the error is printed: https://github.com/openucx/ucx/blob/v1.12.0/src/ucs/sys/sock.c#L660
Interestingly, when the breakpoint fires and I print `addr->sa_family`, the value is 2 (AF_INET). That is logically impossible, since the error message itself reports family 0. It looks like `addr` points to volatile memory that is being changed in parallel by other threads or by hardware.

`addr` is obtained by calling `rdma_get_local_addr` at https://github.com/openucx/ucx/blob/v1.12.0/src/uct/ib/rdmacm/rdmacm_cm_ep.c#L176
According to the man page (https://linux.die.net/man/3/rdma_get_local_addr), `rdma_get_local_addr` returns an all-zero address if the RDMA NIC is not bound to an address. I do have some RDMA NICs disabled.

The error looks harmless, though this doesn't explain the strange behaviour seen in the debugger.
