[ https://issues.apache.org/jira/browse/DRILL-5050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chunhui Shi updated DRILL-5050:
-------------------------------
    Assignee: Parth Chandra  (was: Chunhui Shi)

> C++ client library has symbol resolution issues when loaded by a process that already uses boost::asio
> ------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5050
>                 URL: https://issues.apache.org/jira/browse/DRILL-5050
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Client - C++
>    Affects Versions: 1.6.0
>         Environment: MacOs
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>             Fix For: 2.0.0
>
>
> h4. Summary
> On MacOS, the Drill ODBC driver hangs when loaded by any process that might
> also be using {{boost::asio}}. This is observed in trying to connect to Drill
> via the ODBC driver using Tableau.
> h4. Analysis
> The problem is seen in the Drill client library on MacOS. In the method
> {code}
>  DrillClientImpl::recvHandshake
> .
> .
>     m_io_service.reset();
>     if (DrillClientConfig::getHandshakeTimeout() > 0){
>         m_deadlineTimer.expires_from_now(boost::posix_time::seconds(DrillClientConfig::getHandshakeTimeout()));
>         m_deadlineTimer.async_wait(boost::bind(
>                     &DrillClientImpl::handleHShakeReadTimeout,
>                     this,
>                     boost::asio::placeholders::error
>                     ));
>         DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Started new handshake wait timer with "
>                 << DrillClientConfig::getHandshakeTimeout() << " seconds." << std::endl;)
>     }
>     async_read(
>             this->m_socket,
>             boost::asio::buffer(m_rbuf, LEN_PREFIX_BUFLEN),
>             boost::bind(
>                 &DrillClientImpl::handleHandshake,
>                 this,
>                 m_rbuf,
>                 boost::asio::placeholders::error,
>                 boost::asio::placeholders::bytes_transferred)
>             );
>     DRILL_MT_LOG(DRILL_LOG(LOG_DEBUG) << "DrillClientImpl::recvHandshake: async read waiting for server handshake response.\n";)
>     m_io_service.run();
> .
> .
> {code}
> The call to {{io_service::run}} returns without invoking any of the handlers
> that have been registered.
> The {{io_service}} object has two tasks in its queue, the timer task and the
> socket read task. However, in the run method, the state of the
> {{io_service}} object appears to change and the number of outstanding tasks
> becomes zero. The run method therefore returns immediately.
> Subsequently, any query request sent to the server hangs as data is never
> pulled off the socket.
> This is bizarre behaviour and typically points to build problems.
> Further investigation revealed something more interesting. {{boost::asio}}
> is a header-only library. In other words, there is no actual library
> {{libboost_asio}}. All the code is compiled into whichever binary includes
> the headers of {{boost::asio}}. It so happens that the Tableau process has a
> library (libtabquery) that uses {{boost::asio}}, so the code for
> {{boost::asio}} is already loaded into process memory. When the drill client
> library (via the ODBC driver) is loaded by the loader, the drill client
> library brings in its own copy of the {{boost::asio}} code. At runtime, the
> drill client code jumps to an address that resolves to an address inside the
> libtabquery copy of {{boost::asio}}. And that code returns incorrectly.
> Really? How is that even allowed? Two copies of {{boost::asio}} in the same
> process? Even if that is allowed, since the code is included at compile time,
> calls to the {{boost::asio}} library should be resolved using internal
> linkage. And if the call to {{boost::asio}} is not resolved statically, the
> dynamic loader would encounter two symbols with the same name and would give
> us an error. And even if the linker picks one of the symbols, as long as the
> code is the same (for example if both libraries use the same version of
> boost) can that cause a problem? Even more importantly, how do we fix that?
> h4. Some assembly required
> The disassembled libdrillClient shows this code inside recvHandshake
> {code}
> 000000000003dd8f    movq    -0xb0(%rbp), %rdi
> 000000000003dd96    addq    $0xc0, %rdi
> 000000000003dd9d    callq   0x1bff42    ## symbol stub for: __ZN5boost4asio10io_service3runEv
> 000000000003dda2    movq    -0xb0(%rbp), %rdi
> 000000000003dda9    cmpq    $0x0, 0x190(%rdi)
> 000000000003ddb4    movq    %rax, -0x158(%rbp)
> {code}
> and later in the code
> {code}
> 0000000000057216    retq
> 0000000000057217    nopw    (%rax,%rax)
> __ZN5boost4asio10io_service3runEv:    ## definition of io_service::run
> 0000000000057220    pushq   %rbp
> 0000000000057221    movq    %rsp, %rbp
> 0000000000057224    subq    $0x30, %rsp
> 0000000000057228    leaq    -0x18(%rbp), %rax
> 000000000005722c    movq    %rdi, -0x8(%rbp)
> 0000000000057230    movq    -0x8(%rbp), %rdi
> 0000000000057234    movq    %rdi, -0x28(%rbp)
> {code}
> Note that in recvHandshake the call instruction jumps to an address that is
> an offset (0x1bff42). This offset happens to be beyond the end of the
> library. It certainly isn't the offset at which the io_service::run method is
> defined (0x57220).
> The linker is definitely not resolving the address statically, but we had
> already guessed that. It is, in fact, jumping to a stub method and at
> runtime this address is being resolved to the address of the
> {{io_service::run}} method in libtabquery.
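A side note on the stub's target name: the symbol `__ZN5boost4asio10io_service3runEv` encodes only the namespaces, the class, and the function. Nothing in it identifies the library that defines it, which is why two libraries can export the same name. The sketch below decodes such a name. It assumes a GCC or Clang toolchain (for `abi::__cxa_demangle`), and `demangle_macho` is a hypothetical helper written for illustration, not part of the Drill client.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (GCC and Clang only)
#include <cstdlib>    // std::free
#include <string>

// Hypothetical helper, not part of Drill. Demangles a C++ symbol as printed
// by macOS tools, which prepend one extra underscore to the Itanium ABI name.
std::string demangle_macho(const std::string& symbol) {
    std::string name = symbol;
    // Strip the extra leading underscore ("__Z..." becomes "_Z...").
    if (name.compare(0, 3, "__Z") == 0) {
        name.erase(0, 1);
    }
    int status = 0;
    char* out = abi::__cxa_demangle(name.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return symbol;  // not a mangled C++ name; return the input unchanged
    }
    std::string result(out);
    std::free(out);     // the demangler allocates the buffer with malloc
    return result;
}
```

Passing the stub's symbol, `demangle_macho("__ZN5boost4asio10io_service3runEv")`, should yield `boost::asio::io_service::run()`.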
> Just to check, in the debugger, we can see the following two implementations
> of {{io_service::run}} in the process
> {code}
> libtabquery.dylib`boost::asio::io_service::run():
> 0x10d597a10:  pushq  %rbp
> 0x10d597a11:  movq   %rsp, %rbp
> 0x10d597a14:  pushq  %rbx
> 0x10d597a15:  subq   $0x18, %rsp
> 0x10d597a19:  movq   %rdi, %rbx
> 0x10d597a1c:  movl   $0x0, -0x18(%rbp)
> 0x10d597a23:  callq  0x10d5b73a4    ; symbol stub for: boost::system::system_category()
> 0x10d597a28:  movq   %rax, -0x10(%rbp)
> 0x10d597a2c:  movq   0x8(%rbx), %rdi
> 0x10d597a30:  leaq   -0x18(%rbp), %rsi
> 0x10d597a34:  callq  0x10d5b71e2    ; symbol stub for: boost::asio::detail::task_io_service::run(boost::system::error_code&)
> 0x10d597a39:  cmpl   $0x0, -0x18(%rbp)
> 0x10d597a3d:  jne    0x10d597a46    ; boost::asio::io_service::run() + 54
> 0x10d597a3f:  addq   $0x18, %rsp
> 0x10d597a43:  popq   %rbx
> 0x10d597a44:  popq   %rbp
> 0x10d597a45:  retq
> 0x10d597a46:  leaq   -0x18(%rbp), %rdi
> 0x10d597a4a:  callq  0x10d5b71a6    ; symbol stub for: boost::asio::detail::do_throw_error(boost::system::error_code const&)
> 0x10d597a4f:  nop
> libdrillClient.dylib`boost::asio::io_service::run() at io_service.ipp:57:
> 0x11f158300:  pushq  %rbp
> 0x11f158301:  movq   %rsp, %rbp
> 0x11f158304:  subq   $0x30, %rsp
> 0x11f158308:  leaq   -0x18(%rbp), %rax
> 0x11f15830c:  movq   %rdi, -0x8(%rbp)
> 0x11f158310:  movq   -0x8(%rbp), %rdi
> 0x11f158314:  movq   %rdi, -0x28(%rbp)
> 0x11f158318:  movq   %rax, %rdi
> 0x11f15831b:  callq  0x11f2c210c    ; symbol stub for: boost::system::error_code::error_code()
> 0x11f158320:  leaq   -0x18(%rbp), %rsi
> 0x11f158324:  movq   -0x28(%rbp), %rax
> 0x11f158328:  movq   0x8(%rax), %rdi
> 0x11f15832c:  callq  0x11f2c3516    ; symbol stub for: boost::asio::detail::task_io_service::run(boost::system::error_code&)
> 0x11f158331:  leaq   -0x18(%rbp), %rdi
> 0x11f158335:  movq   %rax, -0x20(%rbp)
> 0x11f158339:  callq  0x11f2c1bf6    ; symbol stub for: boost::asio::detail::throw_error(boost::system::error_code const&)
> 0x11f15833e:  movq   -0x20(%rbp), %rax
> 0x11f158342:  addq   $0x30, %rsp
> 0x11f158346:  popq   %rbp
> 0x11f158347:  retq
> {code}
> As suspected, the code for the two versions of {{io_service::run}} is
> different, so if the code is executing the wrong version, then the behaviour
> will be, expectedly, unexpected.
> h4. What does not work
> Linking statically with boost has no effect. The code is inlined in the first
> place and is effectively part of the dynamic library already.
> Changing the load order of the libraries (by specifying
> LD_LIBRARY_PATH/DYLD_LIBRARY_PATH) does not help. This is because the
> application library is already loaded into the process.
> The linker -prebind flag does not help. The prebind flag is intended to tell
> the linker to resolve all addresses at link time. Why this did not work is
> not clear.
>
> Both libtabquery.dylib and libdrillClient.dylib contain symbols (functions)
> from the {{boost::asio}} package. At runtime, the MacOS loader binds the
> drillClient library's calls to the functions defined in libtabquery. This
> causes the code to behave unpredictably and eventually the ODBC driver
> 'hangs' waiting for data from the server.
>
> Because the symbol linkage is being determined at runtime, changing the
> linker settings in the Drill client build has no effect. This is true even if
> you build with static linkage (a remarkable feature of MacOS!). Also, the
> boost builds used by libtabquery and libdrillClient differ: even if we use
> the same boost version, the compiled code is different. This is a critical
> part of the problem, because if the compiled code were the same there would
> be no problem if the code was called using the libtabquery version instead
> of the libdrillClient version.
>
> h4. Solution
> The only way to resolve this is to use a 'shaded' version of boost in the
> drill client library.
> Luckily for us, C++ namespaces, boost's bcp tool, and CMake together provide
> a way to rename the boost namespace to any name we like and use it in the
> drill client code. This effectively renames every symbol from boost to a
> different name using a new namespace name, so the symbol name conflict does
> not arise.
> Using this build of boost, and using static linking (just to make sure) in
> the Drill client library, one is able to connect to and run queries against
> Drill from Tableau.
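To illustrate why renaming the namespace removes the conflict, here is a minimal sketch with two stand-in classes. The namespace name `drill_boost` and the `run()` bodies are assumptions made up for this example; they are not the names or code used in the actual Drill build. With GCC or Clang, `typeid(...).name()` returns the mangled type name, so the sketch shows that the two copies end up with different symbol names and the loader can no longer bind one library's calls to the other's code.

```cpp
#include <string>
#include <typeinfo>

// Stand-in for the copy compiled into the host application (illustrative only).
namespace boost { namespace asio {
struct io_service {
    int run() { return 1; }  // pretend this is the host's build
};
}}

// Stand-in for the renamed ("shaded") copy; "drill_boost" is a hypothetical name.
namespace drill_boost { namespace asio {
struct io_service {
    int run() { return 2; }  // pretend this is the client library's build
};
}}

// The mangled name embeds the enclosing namespace, so the two types (and
// therefore their member functions) get distinct symbol names.
std::string host_type_name() {
    return typeid(boost::asio::io_service).name();
}

std::string shaded_type_name() {
    return typeid(drill_boost::asio::io_service).name();
}
```

Because each call site names its own namespace, `drill_boost::asio::io_service().run()` can only reach the shaded definition, whatever else is loaded in the process.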