[ https://issues.apache.org/jira/browse/THRIFT-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106454#comment-17106454 ]
Max commented on THRIFT-5186:
-----------------------------

Keeping today's findings posted.

Still not sure if this is to be fixed with further patches, or if one would claim it's a [pervasively common] misconfiguration (e.g. the Docker default). Probably the former.

TServerSocket::listen() has this piece, which clears IPV6_V6ONLY (sets it to 0) on AF_INET6 sockets, so that they also accept IPv4-mapped connections:
{code:java}
#ifdef IPV6_V6ONLY
    if (path_.empty() && res->ai_family == AF_INET6) {
      int zero = 0;
      if (-1 == setsockopt(serverSocket_, IPPROTO_IPV6, IPV6_V6ONLY,
                           cast_sockopt(&zero), sizeof(zero))) {
        GlobalOutput.perror("TServerSocket::listen() IPV6_V6ONLY ", THRIFT_GET_SOCKET_ERROR);
      }
    }
#endif // #ifdef IPV6_V6ONLY
{code}
More importantly, this is how {{getaddrinfo()}} results are processed in TServerSocket::listen():
{code:java}
    // Pick the ipv6 address first since ipv4 addresses can be mapped
    // into ipv6 space.
    for (res = info.res(); res; res = res->ai_next) {
      if (res->ai_family == AF_INET6 || res->ai_next == nullptr)
        break;
    }
{code}
I.e. IPv6 results are unconditionally preferred.

This, together with the {{::1 localhost}} entry in {{/etc/hosts}} and the removed {{AI_ADDRCONFIG}} hint, leads to a funny result: {{localhost}} resolves to something which you can't connect() to, at least in Docker containers with the default v4-only bridge network.

The issue goes away if I configure IPv6 in Docker. [https://docs.docker.com/config/daemon/ipv6/]

The issue goes away if I comment out the {{::1 localhost}} entry in the container's /etc/hosts.

The issue also goes away if I bring back the {{AI_ADDRCONFIG}} hint. But then, I get "getaddrinfo() <Host: 127.0.0.1 Port: 1302>Address family for hostname not supported" with a loopback-only network. Hmmm.

Conclusion at this point: in that do-while bind()-retry loop, TServerSocket should also loop over the individual {{getaddrinfo}} results. That way, it would work around this (seemingly standard and OK!) situation:
{code:java}
[root@04dd07b70038 /]# ping -6 localhost
ping: connect: Cannot assign requested address
{code}

> AI_ADDRCONFIG: Thrift libraries crash with localhost-only network.
> ------------------------------------------------------------------
>
>                 Key: THRIFT-5186
>                 URL: https://issues.apache.org/jira/browse/THRIFT-5186
>             Project: Thrift
>          Issue Type: Bug
>          Components: C++ - Library, Delphi - Library, Python - Library
>    Affects Versions: 0.13.0
>         Environment: Red Hat Enterprise Linux 8.0
>            Reporter: Max
>            Assignee: Max
>            Priority: Major
>              Labels: getaddrinfo, localhost, sockets
>             Fix For: 0.14.0
>
>         Attachments: 0001-THRIFT-5186-Dont-pass-AI_ADDRCONFIG-to-getaddrinfo.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> THRIFT-2539 has been reported and fixed, but for win32 only, for no apparent reason. The exact same problem reproduces on POSIX.
> Namely, when no network interfaces besides {{lo}} (the 127.0.0.1 loopback interface) are up, C++ and Python apps linked with Thrift-generated code, both clients and servers, *crash by throwing an exception*, even when the intention is exactly to run them on localhost only.
> This happens because the Thrift library code for TSocket, TServerSocket and TNonblockingServerSocket calls [{{getaddrinfo()}}|http://man7.org/linux/man-pages/man3/getaddrinfo.3.html] to resolve the target hostname to connect to or listen on into a concrete IP address (v4 or v6, whichever the system is configured for). To that call, it *passes the {{AI_ADDRCONFIG}} hint*, which effectively turns a localhost-only situation into:
> {quote}{{Could not resolve host for client socket.}}
> {quote}
> and into this (server-side):
> {code:java}
> гру 23 13:52:13 localhost.localdomain systemd[1]: db_cache.service: Main process exited, code=dumped, status=6/ABRT
> гру 23 13:52:13 localhost.localdomain systemd[1]: db_cache.service: Failed with result 'core-dump'.
> гру 23 13:52:17 localhost.localdomain db_cache[12912]: Thrift: Mon Dec 23 13:52:15 2019 TSocket::open() getaddrinfo() <Host: 127.0.0.1 Port: 1302>Address family for hostname not supported
> гру 23 13:52:17 localhost.localdomain db_cache[12912]: Thrift: Mon Dec 23 13:52:15 2019 TSocket::open() getaddrinfo() <Host: 127.0.0.1 Port: 8345>Address family for hostname not supported
> гру 23 13:52:17 localhost.localdomain db_cache[12912]: Thrift: Mon Dec 23 13:52:15 2019 TNonblocking: using dedicated listener thread, io threads: 16
> гру 23 13:52:17 localhost.localdomain db_cache[12912]: Thrift: Mon Dec 23 13:52:15 2019 getaddrinfo -9: Address family for hostname not supported
> гру 23 13:52:17 localhost.localdomain db_cache[12912]: terminate called after throwing an instance of 'apache::thrift::transport::TTransportException'
> гру 23 13:52:17 localhost.localdomain db_cache[12912]: what(): Could not resolve host for server socket.
> {code}
> I fail to understand the original reason to pass that {{AI_ADDRCONFIG}} hint. It shouldn't be there as I see it.
> Further, since Thrift 0.9.2, Windows builds of Thrift apps don't pass that hint anymore (see THRIFT-2539), and it seems to be okay.
> For illustration, I'm attaching a sample patch to remove {{AI_ADDRCONFIG}} from {{lib/cpp}} and {{lib/py}}. The main change will be landing via GitHub, per Thrift's contribution process, so please follow there too.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
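
The conclusion in the comment (loop over the individual {{getaddrinfo}} results instead of committing to the first IPv6 one) could look roughly like the sketch below. It is plain POSIX code, not Thrift code: {{bind_first_usable}} is a hypothetical helper name, the backlog of 16 is an arbitrary assumption, and the IPv6-first ordering only mirrors the preference described in the comment. It does not reproduce TServerSocket's actual retry loop or error reporting.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper (not part of Thrift): resolve host/port and return a
// listening TCP socket for the first result that can actually be bound,
// or -1 if no result works.
int bind_first_usable(const char* host, const char* port) {
  assert(port != nullptr);

  addrinfo hints;
  std::memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_UNSPEC;      // accept both IPv4 and IPv6 results
  hints.ai_socktype = SOCK_STREAM;
  hints.ai_flags = AI_PASSIVE;      // deliberately no AI_ADDRCONFIG

  addrinfo* results = nullptr;
  int rc = getaddrinfo(host, port, &hints, &results);
  if (rc != 0) {
    std::fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
    return -1;
  }

  int fd = -1;
  // Pass 0 tries IPv6 results (keeping the IPv6-first preference),
  // pass 1 falls back to everything else.
  for (int pass = 0; pass < 2 && fd == -1; ++pass) {
    for (addrinfo* res = results; res != nullptr; res = res->ai_next) {
      const bool is_v6 = (res->ai_family == AF_INET6);
      if ((pass == 0) != is_v6) continue;

      int s = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
      if (s == -1) continue;  // e.g. address family not supported here

#ifdef IPV6_V6ONLY
      if (is_v6) {
        int zero = 0;  // allow IPv4-mapped connections on the IPv6 socket
        setsockopt(s, IPPROTO_IPV6, IPV6_V6ONLY, &zero, sizeof(zero));
      }
#endif
      if (bind(s, res->ai_addr, res->ai_addrlen) == 0 && listen(s, 16) == 0) {
        fd = s;  // first usable address wins
        break;
      }
      close(s);  // e.g. EADDRNOTAVAIL for ::1 in a v4-only container
    }
  }

  freeaddrinfo(results);
  return fd;
}
```

With a helper along these lines, a {{::1}} result that cannot be bound (the {{ping -6 localhost}} situation) is skipped and the next result, typically 127.0.0.1, is tried, without needing {{AI_ADDRCONFIG}}.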