[ 
https://issues.apache.org/jira/browse/DRILL-5050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunhui Shi updated DRILL-5050:
-------------------------------
    Assignee: Parth Chandra  (was: Chunhui Shi)

> C++ client library has symbol resolution issues when loaded by a process that 
> already uses boost::asio
> ------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5050
>                 URL: https://issues.apache.org/jira/browse/DRILL-5050
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Client - C++
>    Affects Versions: 1.6.0
>         Environment: MacOs
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>             Fix For: 2.0.0
>
>
> h4. Summary
> On MacOS, the Drill ODBC driver hangs when loaded by any process that might 
> also be using {{boost::asio}}. This was observed when connecting to Drill 
> from Tableau via the ODBC driver.
> h4. Analysis
> The problem is seen in the Drill client library on MacOS. In the method 
> {code}
>  DrillClientImpl::recvHandshake
> .
> .
>     m_io_service.reset();
>     if (DrillClientConfig::getHandshakeTimeout() > 0){
>         m_deadlineTimer.expires_from_now(boost::posix_time::seconds(DrillClientConfig::getHandshakeTimeout()));
>         m_deadlineTimer.async_wait(boost::bind(
>                     &DrillClientImpl::handleHShakeReadTimeout,
>                     this,
>                     boost::asio::placeholders::error
>                     ));
>         DRILL_MT_LOG(DRILL_LOG(LOG_TRACE) << "Started new handshake wait timer with "
>                 << DrillClientConfig::getHandshakeTimeout() << " seconds." << std::endl;)
>     }
>     async_read(
>             this->m_socket,
>             boost::asio::buffer(m_rbuf, LEN_PREFIX_BUFLEN),
>             boost::bind(
>                 &DrillClientImpl::handleHandshake,
>                 this,
>                 m_rbuf,
>                 boost::asio::placeholders::error,
>                 boost::asio::placeholders::bytes_transferred)
>             );
>     DRILL_MT_LOG(DRILL_LOG(LOG_DEBUG) << "DrillClientImpl::recvHandshake: async read waiting for server handshake response.\n";)
>     m_io_service.run();
> .
> .
> {code}
> The call to {{io_service::run}} returns without invoking any of the handlers 
> that have been registered. The {{io_service}} object has two tasks in its 
> queue, the timer task, and the socket read task. However, in the run method, 
> the state of the {{io_service}} object appears to change and the number of 
> outstanding tasks becomes zero. The run method therefore returns immediately. 
> Subsequently, any query request sent to the server hangs as data is never 
> pulled off the socket.
> This is bizarre behaviour and typically points to build problems. 
> Further investigation revealed something more interesting. {{boost::asio}} is 
> a header-only library. In other words, there is no actual library 
> {{libboost_asio}}. All the code is compiled into any binary that includes the 
> {{boost::asio}} headers. It so happens that the Tableau process has a 
> library (libtabquery) that uses {{boost::asio}}, so the code for 
> {{boost::asio}} is already loaded into process memory. When the Drill client 
> library is loaded (via the ODBC driver), it brings its own copy of the 
> {{boost::asio}} code. At runtime, the Drill client code jumps to an address 
> that resolves to an address inside the libtabquery copy of {{boost::asio}}, 
> and that code returns incorrectly.
> Really? How is that even allowed? Two copies of {{boost::asio}} in the same 
> process? Even if that is allowed, since the code is included at compile time, 
> calls to the {{boost::asio}} library should be resolved using internal 
> linkage. And if the call to {{boost::asio}} is not resolved statically, the 
> dynamic loader would encounter two symbols with the same name and would give 
> us an error. And even if the linker picks one of the symbols, as long as the 
> code is the same (for example if both libraries use the same version of 
> boost) can that cause a problem? Even more importantly, how do we fix that?
> h4. Some assembly required
> The disassembled libdrillClient shows this code inside recvHandshake
> {code}
> 000000000003dd8f    movq    -0xb0(%rbp), %rdi       
> 000000000003dd96    addq    $0xc0, %rdi
> 000000000003dd9d    callq   0x1bff42                ## symbol stub for: __ZN5boost4asio10io_service3runEv
> 000000000003dda2    movq    -0xb0(%rbp), %rdi
> 000000000003dda9    cmpq    $0x0, 0x190(%rdi)
> 000000000003ddb4    movq    %rax, -0x158(%rbp)
> {code}
> and later in the code 
> {code}
> 0000000000057216    retq    
> 0000000000057217    nopw    (%rax,%rax)
> __ZN5boost4asio10io_service3runEv:                 ## definition of io_service::run
> 0000000000057220    pushq   %rbp
> 0000000000057221    movq    %rsp, %rbp
> 0000000000057224    subq    $0x30, %rsp
> 0000000000057228    leaq    -0x18(%rbp), %rax
> 000000000005722c    movq    %rdi, -0x8(%rbp)        
> 0000000000057230    movq    -0x8(%rbp), %rdi
> 0000000000057234    movq    %rdi, -0x28(%rbp)
> {code}
> Note that in recvHandshake the call instruction jumps to an address that is 
> an offset (0x1bff42). This offset happens to be beyond the end of the 
> library. It certainly isn't the offset at which the io_service::run method is 
> defined (0x57220).
> The linker is definitely not resolving the address statically, but we had 
> already guessed that. The call is, in fact, going to a symbol stub, and at 
> runtime the stub is being resolved to the address of the 
> {{io_service::run}} method in libtabquery.
> Just to check, in the debugger, we can see the following two implementations 
> of {{io_service::run}} in the process
> {code}
> libtabquery.dylib`boost::asio::io_service::run():
>    0x10d597a10:  pushq  %rbp
>    0x10d597a11:  movq   %rsp, %rbp
>    0x10d597a14:  pushq  %rbx
>    0x10d597a15:  subq   $0x18, %rsp
>    0x10d597a19:  movq   %rdi, %rbx
>    0x10d597a1c:  movl   $0x0, -0x18(%rbp)
>    0x10d597a23:  callq  0x10d5b73a4               ; symbol stub for: boost::system::system_category()
>    0x10d597a28:  movq   %rax, -0x10(%rbp) 
>    0x10d597a2c:  movq   0x8(%rbx), %rdi             
>    0x10d597a30:  leaq   -0x18(%rbp), %rsi
>    0x10d597a34:  callq  0x10d5b71e2               ; symbol stub for: boost::asio::detail::task_io_service::run(boost::system::error_code&)
>    0x10d597a39:  cmpl   $0x0, -0x18(%rbp)
>    0x10d597a3d:  jne    0x10d597a46               ; boost::asio::io_service::run() + 54
>    0x10d597a3f:  addq   $0x18, %rsp
>    0x10d597a43:  popq   %rbx
>    0x10d597a44:  popq   %rbp
>    0x10d597a45:  retq   
>    0x10d597a46:  leaq   -0x18(%rbp), %rdi
>    0x10d597a4a:  callq  0x10d5b71a6               ; symbol stub for: boost::asio::detail::do_throw_error(boost::system::error_code const&)
>    0x10d597a4f:  nop        
> libdrillClient.dylib`boost::asio::io_service::run() at io_service.ipp:57:
>    0x11f158300:  pushq  %rbp
>    0x11f158301:  movq   %rsp, %rbp
>    0x11f158304:  subq   $0x30, %rsp
>    0x11f158308:  leaq   -0x18(%rbp), %rax
>    0x11f15830c:  movq   %rdi, -0x8(%rbp)
>    0x11f158310:  movq   -0x8(%rbp), %rdi
>    0x11f158314:  movq   %rdi, -0x28(%rbp)
>    0x11f158318:  movq   %rax, %rdi
>    0x11f15831b:  callq  0x11f2c210c               ; symbol stub for: boost::system::error_code::error_code()
>    0x11f158320:  leaq   -0x18(%rbp), %rsi
>    0x11f158324:  movq   -0x28(%rbp), %rax           
>    0x11f158328:  movq   0x8(%rax), %rdi
>    0x11f15832c:  callq  0x11f2c3516               ; symbol stub for: boost::asio::detail::task_io_service::run(boost::system::error_code&)
>    0x11f158331:  leaq   -0x18(%rbp), %rdi
>    0x11f158335:  movq   %rax, -0x20(%rbp)
>    0x11f158339:  callq  0x11f2c1bf6               ; symbol stub for: boost::asio::detail::throw_error(boost::system::error_code const&)
>    0x11f15833e:  movq   -0x20(%rbp), %rax
>    0x11f158342:  addq   $0x30, %rsp
>    0x11f158346:  popq   %rbp
>    0x11f158347:  retq   
> {code}
> As suspected, the code for the two versions of {{io_service::run}} is 
> different, so if the code is executing the wrong version, then the behaviour 
> will be, expectedly, unexpected.
> h4. What does not work
> Linking statically with boost has no effect. The code is inlined in the first 
> place and is effectively part of the dynamic library already. 
> Changing the load order of the libraries (by specifying 
> LD_LIBRARY_PATH/DYLD_LIBRARY_PATH) does not help. This is because the 
> application library is already loaded into the process.
> The linker -prebind flag does not help. The prebind flag is intended to tell 
> the linker to resolve all addresses at link time. Why this did not work is 
> not clear.
>  
> Both libtabquery.dylib and libdrillClient.dylib contain symbols (functions) 
> from the {{boost::asio}} package. At runtime, the MacOS loader binds the 
> drillClient library's calls to the functions defined in libtabquery. This 
> causes the code to behave unpredictably and eventually the ODBC driver 
> 'hangs' waiting for data from the server.
>  
> Because the symbol linkage is being determined at runtime, changing the 
> linker settings in the Drill client build has no effect. This is true even if 
> you build with static linkage (a remarkable feature of MacOS!). Also, the 
> boost builds between libtabquery and libdrillClient are different even if we 
> use the same boost version; the compiled code is different. This is a 
> critical part of the problem because if the compiled code were the same there 
> would be no problem if the code was called using the libtabquery version 
> instead of the libdrillClient version.
>  
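The binding behaviour described above can be sketched as a toy model. This is only an illustration, not how the MacOS loader is actually implemented: it assumes that a symbol is bound to the first definition found in load order, which is what the debugger output suggests. The `Image` struct, `call_symbol` and `make_process` are hypothetical names made up for this sketch; only the library names and the mangled symbol come from the issue.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Toy model of a loaded library: exported symbol name -> implementation.
// Each implementation just reports which image it lives in.
struct Image {
    std::string name;
    std::map<std::string, std::function<std::string()>> exports;
};

// ASSUMPTION: binding searches the images in load order and uses the first
// definition found, no matter which image made the call.
inline std::string call_symbol(const std::vector<Image>& load_order,
                               const std::string& symbol) {
    for (const Image& image : load_order) {
        auto it = image.exports.find(symbol);
        if (it != image.exports.end()) {
            return it->second();
        }
    }
    return "unresolved";
}

// Build the process from the issue: libtabquery is already loaded when the
// Drill client arrives. drill_symbol is the name under which the Drill
// client exports its own copy of io_service::run.
inline std::vector<Image> make_process(const std::string& drill_symbol) {
    assert(!drill_symbol.empty());
    const std::string boost_run = "__ZN5boost4asio10io_service3runEv";

    Image tab{"libtabquery.dylib", {}};
    tab.exports[boost_run] = [] { return std::string("libtabquery"); };

    Image drill{"libdrillClient.dylib", {}};
    drill.exports[drill_symbol] = [] { return std::string("libdrillClient"); };

    return {tab, drill};
}
```

In this model, when both images export `__ZN5boost4asio10io_service3runEv`, a call from the Drill client lands in the libtabquery copy. If the Drill client exports its copy under any other name, the call reaches the Drill client's own code.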
> h4. Solution
> The only way to resolve this is to use a 'shaded' version of boost in the 
> Drill client library. Luckily for us, C++ namespaces, boost's bcp tool, and 
> CMake together provide a way to rename the boost namespace to any name we 
> like and use it in the Drill client code. This effectively renames every 
> boost symbol by placing it in the new namespace, so the symbol name 
> conflict does not arise.
> Using this build of boost, and using static linking (just to make sure) in 
> the Drill client library, one is able to connect to and run queries against 
> Drill from Tableau.
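To see why renaming the namespace changes the symbol name, here is a minimal sketch of the Itanium C++ ABI mangling rule for the one case that matters here: a non-const member function with no arguments. The helper names and the `drill_boost` namespace are hypothetical, chosen only for illustration; the issue does not say which namespace name the shaded build uses. Templates, substitutions and every other mangling case are left out.

```cpp
#include <string>
#include <vector>

// Mangle a nested name such as boost::asio::io_service::run() as
// _ZN <len><name> ... E v, where the trailing "v" means "no parameters".
// Only this simple case is handled.
inline std::string mangle_nullary_member(const std::vector<std::string>& scope) {
    std::string out = "_ZN";
    for (const std::string& part : scope) {
        out += std::to_string(part.size()) + part;
    }
    return out + "Ev";
}

// Mach-O symbol names carry one extra leading underscore, which is why the
// disassembly in the issue shows two underscores.
inline std::string macho_symbol(const std::vector<std::string>& scope) {
    return "_" + mangle_nullary_member(scope);
}
```

With these helpers, `macho_symbol({"boost", "asio", "io_service", "run"})` gives `__ZN5boost4asio10io_service3runEv`, the name seen in the disassembly. `macho_symbol({"drill_boost", "asio", "io_service", "run"})` gives `__ZN11drill_boost4asio10io_service3runEv`. The two names differ, so the loader can no longer match the Drill client's call to the copy in libtabquery.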



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
