Parth Chandra created DRILL-5050:
------------------------------------
Summary: C++ client library has symbol resolution issues when
loaded by a process that already uses boost::asio
Key: DRILL-5050
URL: https://issues.apache.org/jira/browse/DRILL-5050
Project: Apache Drill
Issue Type: Bug
Components: Client - C++
Affects Versions: 1.6.0
Environment: MacOs
Reporter: Parth Chandra
Assignee: Parth Chandra
Fix For: 2.0.0
h4. Summary
On MacOS, the Drill ODBC driver hangs when loaded by any process that also
uses {{boost::asio}}. This was observed when connecting to Drill via the ODBC
driver from Tableau.
h4. Analysis
The problem is seen in the Drill client library on MacOS. In the method
{code}
DrillClientImpl::recvHandshake
.
.
    m_io_service.reset();
    if (DrillClientConfig::getHandshakeTimeout() > 0){
        m_deadlineTimer.expires_from_now(boost::posix_time::seconds(DrillClientConfig::getHandshakeTimeout()));
        m_deadlineTimer.async_wait(boost::bind(
            &DrillClientImpl::handleHShakeReadTimeout,
            this,
            boost::asio::placeholders::error
            ));
        DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Started new handshake wait timer with "
            << DrillClientConfig::getHandshakeTimeout() << " seconds." << std::endl;)
    }
    async_read(
        this->m_socket,
        boost::asio::buffer(m_rbuf, LEN_PREFIX_BUFLEN),
        boost::bind(
            &DrillClientImpl::handleHandshake,
            this,
            m_rbuf,
            boost::asio::placeholders::error,
            boost::asio::placeholders::bytes_transferred)
        );
    DRILL_MT_LOG(DRILL_LOG(LOG_DEBUG) << "DrillClientImpl::recvHandshake: async read waiting for server handshake response.\n";)
    m_io_service.run();
.
.
{code}
The call to {{io_service::run}} returns without invoking any of the handlers
that have been registered. The {{io_service}} object has two tasks in its
queue, the timer task, and the socket read task. However, in the run method,
the state of the {{io_service}} object appears to change and the number of
outstanding tasks becomes zero. The run method therefore returns immediately.
Subsequently, any query request sent to the server hangs as data is never
pulled off the socket.
This is bizarre behaviour and typically points to build problems.
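The early return described above can be modelled without boost. The sketch below is a toy, not the {{boost::asio}} implementation: {{MiniIoService}}, {{post}} and {{clobber_work_count}} are hypothetical names, and the only assumption is that a run loop exits as soon as its count of outstanding work reads zero. It shows how queued handlers are then never invoked.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for an io_service; not Boost.Asio's implementation.
class MiniIoService {
public:
    // Queue a handler and count it as outstanding work.
    void post(std::function<void()> handler) {
        handlers_.push_back(std::move(handler));
        ++outstanding_work_;
    }

    // Simulate the symptom from this report: the work counter reads zero
    // even though handlers are still queued.
    void clobber_work_count() { outstanding_work_ = 0; }

    // Run handlers while outstanding work remains; return how many ran.
    std::size_t run() {
        std::size_t executed = 0;
        while (outstanding_work_ > 0 && !handlers_.empty()) {
            std::function<void()> handler = std::move(handlers_.front());
            handlers_.pop_front();
            --outstanding_work_;
            handler();
            ++executed;
        }
        return executed;
    }

private:
    std::deque<std::function<void()>> handlers_;
    std::size_t outstanding_work_ = 0;
};

// Usage: two handlers are queued (think timer and socket read), but run()
// returns immediately once the counter reads zero, and neither is invoked.
inline void demo_early_return() {
    MiniIoService service;
    int calls = 0;
    service.post([&calls] { ++calls; });
    service.post([&calls] { ++calls; });
    service.clobber_work_count();
    assert(service.run() == 0);
    assert(calls == 0);
}
```

In the real client the handlers that never run are the handshake timeout and the socket read, which matches the hang seen on later queries.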
Further investigation revealed something more interesting. {{boost::asio}} is a
header-only library. In other words, there is no actual library
{{libboost_asio}}. All the code is compiled into the binary that includes the
headers of {{boost::asio}}. It so happens that the Tableau process has a
library (libtabquery) that uses {{boost::asio}} so the code for {{boost::asio}}
is already loaded into process memory. When the drill client library (via the
ODBC driver) is loaded by the loader, the drill client library loads its own
copy of the {{boost::asio}} code. At runtime, the drill client code jumps to an
address that resolves to an address inside the libtabquery copy of
{{boost::asio}}. And that code returns incorrectly.
Really? How is that even allowed? Two copies of {{boost::asio}} in the same
process? Even if that is allowed, since the code is included at compile time,
calls to the {{boost::asio}} library should be resolved using internal linkage.
And if the call to {{boost::asio}} is not resolved statically, the dynamic
loader would encounter two symbols with the same name and would give us an
error. And even if the linker picks one of the symbols, as long as the code is
the same (for example if both libraries use the same version of boost) can that
cause a problem? Even more importantly, how do we fix that?
h4. Some assembly required
The disassembled libdrillClient shows this code inside recvHandshake
{code}
000000000003dd8f movq -0xb0(%rbp), %rdi
000000000003dd96 addq $0xc0, %rdi
000000000003dd9d callq 0x1bff42 ## symbol stub for: __ZN5boost4asio10io_service3runEv
000000000003dda2 movq -0xb0(%rbp), %rdi
000000000003dda9 cmpq $0x0, 0x190(%rdi)
000000000003ddb4 movq %rax, -0x158(%rbp)
{code}
and later in the code
{code}
0000000000057216 retq
0000000000057217 nopw (%rax,%rax)
__ZN5boost4asio10io_service3runEv: ## definition of io_service::run
0000000000057220 pushq %rbp
0000000000057221 movq %rsp, %rbp
0000000000057224 subq $0x30, %rsp
0000000000057228 leaq -0x18(%rbp), %rax
000000000005722c movq %rdi, -0x8(%rbp)
0000000000057230 movq -0x8(%rbp), %rdi
0000000000057234 movq %rdi, -0x28(%rbp)
{code}
Note that in recvHandshake the call instruction targets the offset 0x1bff42.
This offset happens to be beyond the end of the library. It certainly isn't the
offset at which the io_service::run method is defined (0x57220).
The linker is definitely not resolving the address statically, but we had
already guessed that. It is, in fact, jumping to a stub method and at runtime
this address is being resolved to the address of the {{io_service::run}} method
in libtabquery.
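The effect of binding a stub by name at runtime can be sketched as a toy model. This is not the MacOS loader: {{ToyBinder}}, {{load}} and {{bind}} are hypothetical, and the rule that the first definition loaded under a name wins is an assumption chosen only to reproduce what is observed here.

```cpp
#include <cassert>
#include <map>
#include <string>

// One definition of a symbol, as supplied by some loaded image.
struct Definition {
    std::string image;  // library that supplied the code
    int (*fn)();        // the code itself
};

// Hypothetical process-wide symbol table; a toy, not the MacOS loader.
class ToyBinder {
public:
    // Record a definition exported by an image. std::map::insert keeps the
    // existing entry, so a later duplicate of the same name is ignored.
    void load(const std::string& image, const std::string& symbol, int (*fn)()) {
        table_.insert({symbol, Definition{image, fn}});
    }

    // Resolve a stub by name only; the caller's own image is not consulted.
    const Definition& bind(const std::string& symbol) const {
        return table_.at(symbol);
    }

private:
    std::map<std::string, Definition> table_;
};

// Two different builds of "the same" function.
inline int run_built_for_host() { return 1; }
inline int run_built_for_client() { return 2; }

// Usage: the client's stub ends up bound to the host's copy.
inline void demo_first_definition_wins() {
    const std::string run = "__ZN5boost4asio10io_service3runEv";
    ToyBinder binder;
    binder.load("libtabquery.dylib", run, &run_built_for_host);       // already in the process
    binder.load("libdrillClient.dylib", run, &run_built_for_client);  // loaded later
    const Definition& bound = binder.bind(run);
    assert(bound.image == "libtabquery.dylib");
    assert(bound.fn() == 1);
}
```

In this model the client never reaches its own build of the function, even though that build is present in its own image.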
Just to check, in the debugger, we can see the following two implementations of
{{io_service::run}} in the process
{code}
libtabquery.dylib`boost::asio::io_service::run():
0x10d597a10: pushq %rbp
0x10d597a11: movq %rsp, %rbp
0x10d597a14: pushq %rbx
0x10d597a15: subq $0x18, %rsp
0x10d597a19: movq %rdi, %rbx
0x10d597a1c: movl $0x0, -0x18(%rbp)
0x10d597a23: callq 0x10d5b73a4 ; symbol stub for: boost::system::system_category()
0x10d597a28: movq %rax, -0x10(%rbp)
0x10d597a2c: movq 0x8(%rbx), %rdi
0x10d597a30: leaq -0x18(%rbp), %rsi
0x10d597a34: callq 0x10d5b71e2 ; symbol stub for: boost::asio::detail::task_io_service::run(boost::system::error_code&)
0x10d597a39: cmpl $0x0, -0x18(%rbp)
0x10d597a3d: jne 0x10d597a46 ; boost::asio::io_service::run() + 54
0x10d597a3f: addq $0x18, %rsp
0x10d597a43: popq %rbx
0x10d597a44: popq %rbp
0x10d597a45: retq
0x10d597a46: leaq -0x18(%rbp), %rdi
0x10d597a4a: callq 0x10d5b71a6 ; symbol stub for: boost::asio::detail::do_throw_error(boost::system::error_code const&)
0x10d597a4f: nop
libdrillClient.dylib`boost::asio::io_service::run() at io_service.ipp:57:
0x11f158300: pushq %rbp
0x11f158301: movq %rsp, %rbp
0x11f158304: subq $0x30, %rsp
0x11f158308: leaq -0x18(%rbp), %rax
0x11f15830c: movq %rdi, -0x8(%rbp)
0x11f158310: movq -0x8(%rbp), %rdi
0x11f158314: movq %rdi, -0x28(%rbp)
0x11f158318: movq %rax, %rdi
0x11f15831b: callq 0x11f2c210c ; symbol stub for: boost::system::error_code::error_code()
0x11f158320: leaq -0x18(%rbp), %rsi
0x11f158324: movq -0x28(%rbp), %rax
0x11f158328: movq 0x8(%rax), %rdi
0x11f15832c: callq 0x11f2c3516 ; symbol stub for: boost::asio::detail::task_io_service::run(boost::system::error_code&)
0x11f158331: leaq -0x18(%rbp), %rdi
0x11f158335: movq %rax, -0x20(%rbp)
0x11f158339: callq 0x11f2c1bf6 ; symbol stub for: boost::asio::detail::throw_error(boost::system::error_code const&)
0x11f15833e: movq -0x20(%rbp), %rax
0x11f158342: addq $0x30, %rsp
0x11f158346: popq %rbp
0x11f158347: retq
{code}
As suspected, the code for the two versions of {{io_service::run}} is
different, so if the code is executing the wrong version, then the behaviour
will be, expectedly, unexpected.
h4. What does not work
Linking statically with boost has no effect. The code is inlined in the first
place and is effectively part of the dynamic library already.
Changing the load order of the libraries (by specifying
LD_LIBRARY_PATH/DYLD_LIBRARY_PATH) does not help. This is because the
application library is already loaded into the process.
The linker -prebind flag does not help. The prebind flag is intended to tell
the linker to resolve all addresses at link time. Why this did not work is not
clear.
Both libtabquery.dylib and libdrillClient.dylib contain symbols (functions)
from the {{boost::asio}} package. At runtime, the MacOS loader binds the calls
made by the drillClient library to the functions defined in libtabquery. This
causes the code to behave unpredictably and eventually the ODBC driver 'hangs'
waiting for data from the server.
Because the symbol linkage is determined at runtime, changing the linker
settings in the Drill client build has no effect. This is true even if you
build with static linkage (a remarkable feature of MacOS!). Also, the boost
builds used by libtabquery and libdrillClient differ: even with the same boost
version, the compiled code is different. This is a critical part of the
problem, because if the compiled code were the same, it would not matter
whether a call ran the libtabquery copy or the libdrillClient copy.
h4. Solution
The only way to resolve this is to use a 'shaded' version of boost in the drill
client library. Luckily for us, C++ namespaces, boost's bcp tool, and CMake
together provide a way to rename the boost namespace to any name we like and
use it in the drill client code. Every boost symbol then lives under the new
namespace name, so the symbol name conflict does not arise.
Using this build of boost, and using static linking (just to make sure) in the
Drill client library, one is able to connect to and run queries against Drill
from Tableau.
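The idea behind the renamed namespace can be sketched in a few lines. This is an illustration only, not the actual Drill build: the {{boost}} namespace below is a stand-in defined locally (no boost headers are included), and {{drill_boost}}, {{client}} and {{handshake}} are hypothetical names. Because the enclosing namespace is part of a symbol's mangled name, the two {{io_service::run}} definitions no longer share a name.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Stand-in for the copy of boost already present in the host process.
namespace boost { namespace asio {
struct io_service {
    std::string run() { return "host build"; }
};
}}

// Stand-in for a renamed ("shaded") copy; drill_boost is a hypothetical name.
// Under Itanium-style mangling its run() would be
// _ZN11drill_boost4asio10io_service3runEv instead of
// _ZN5boost4asio10io_service3runEv.
namespace drill_boost { namespace asio {
struct io_service {
    std::string run() { return "drill build"; }
};
}}

// Client code reaches the shaded copy through a namespace alias, so call
// sites keep a short, familiar spelling.
namespace client {
namespace asio = drill_boost::asio;

inline std::string handshake() {
    asio::io_service service;
    return service.run();
}
}

// The two io_service types are unrelated as far as the compiler is concerned.
static_assert(!std::is_same<boost::asio::io_service,
                            drill_boost::asio::io_service>::value,
              "shaded and host types must be distinct");

// Usage: each caller gets the build that lives in its own namespace.
inline void demo_shaded_namespace() {
    boost::asio::io_service host;
    assert(host.run() == "host build");
    assert(client::handshake() == "drill build");
}
```

With distinct names, a by-name lookup at runtime has nothing to confuse, whichever library is loaded first.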