[ 
https://issues.apache.org/jira/browse/THRIFT-4451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292204#comment-17292204
 ] 

Allen George commented on THRIFT-4451:
--------------------------------------

I made some time recently to work on the Rust Thrift implementation again and 
decided to look at this with fresh eyes.

h3. What's failing

I started off by evaluating the state of the world. There have been quite a few 
changes/bugfixes to the Rust implementation over the past two years, and I 
wanted to see where things were.

The first interesting thing I noticed was that the Rust multi _server_ 
implementation was solid; every client was able to communicate with it without 
issues. The second interesting thing I noticed was that the Rust multi _client_ 
implementation only failed with some languages, not others. Notably, 
communication with Java multi servers was totally fine.
{noformat}
root@3cec70e4dd34:/thrift/src# test/test.py --client rs
Apache Thrift - Integration Test Suite
Mon Feb 22 20:19:10 2021
===============================================================================
server-client:          protocol:         transport:               result:
c_glib-rs               multi-binary      buffered-ip              success
c_glib-rs               multic-compact    buffered-ip              success
c_glib-rs               multi-binary      framed-ip                success
c_glib-rs               multic-compact    framed-ip                success
c_glib-rs               multi-binary      buffered-ip              success
c_glib-rs               multi-binary      framed-ip                success
c_glib-rs               multi             framed-ip                
failure(timeout)
c_glib-rs               multi             buffered-ip              
failure(timeout)
c_glib-rs               multic-compact    buffered-ip              success
c_glib-rs               multic-compact    framed-ip                success
cl-rs                   multi-binary      buffered-ip              failure(1)
cl-rs                   multi-binary      framed-ip                failure(1)
cl-rs                   binary            buffered-ip              failure(1)
cl-rs                   binary            framed-ip                failure(1)
c_glib-rs               multic            framed-ip                
failure(timeout)
c_glib-rs               multic            buffered-ip              
failure(timeout)
cl-rs                   multi             framed-ip                
failure(timeout)
cl-rs                   multi             buffered-ip              
failure(timeout)
java-rs                 multi             buffered-ip              success
java-rs                 multi             fastframed-framed-ip     success
java-rs                 multi             framed-ip                success
java-rs                 multi-binary      buffered-ip              success
java-rs                 multi-binary      fastframed-framed-ip     success
java-rs                 multi-binary      framed-ip                success
java-rs                 multic-compact    buffered-ip              success
java-rs                 multic-compact    fastframed-framed-ip     success
java-rs                 multic-compact    framed-ip                success
java-rs                 multi-binary      buffered-ip              success
java-rs                 multi-binary      fastframed-framed-ip     success
java-rs                 multi-binary      framed-ip                success
java-rs                 multic            fastframed-framed-ip     success
java-rs                 multic            buffered-ip              success
java-rs                 multic            framed-ip                success
java-rs                 multic-compact    buffered-ip              success
java-rs                 multic-compact    fastframed-framed-ip     success
java-rs                 multic-compact    framed-ip                success
py-rs                   multi             framed-ip                failure(65)
py-rs                   multi             buffered-ip              failure(65)
py-rs                   multi-binary      framed-ip                failure(65)
py-rs                   multi-binary      buffered-ip              failure(65)
py-rs                   binary            framed-ip                failure(65)
py-rs                   binary            buffered-ip              failure(65)
py-rs                   compact           framed-ip                failure(65)
py-rs                   compact           buffered-ip              failure(65)
py-rs                   multic            framed-ip                failure(65)
py-rs                   multic            buffered-ip              failure(65)
py-rs                   multic-compact    framed-ip                failure(65)
py-rs                   multic-compact    buffered-ip              failure(65)
py-rs                   accelc-compact    framed-ip                failure(65)
py-rs                   accelc-compact    buffered-ip              failure(65)
py-rs                   accel-binary      framed-ip                failure(65)
py-rs                   accel-binary      buffered-ip              failure(65)
py-rs                   multiac-multic    framed-ip                failure(65)
py-rs                   multiac-multic    buffered-ip              failure(65)
py-rs                   multiac-compact   framed-ip                failure(65)
py-rs                   multiac-compact   buffered-ip              failure(65)
py-rs                   multia-multi      framed-ip                failure(65)
py-rs                   multia-binary     buffered-ip              failure(65)
py-rs                   multia-multi      buffered-ip              failure(65)
py-rs                   multia-binary     framed-ip                failure(65)
py3-rs                  multi-binary      buffered-ip              success
py3-rs                  multi-binary      framed-ip                success
py3-rs                  multic-compact    buffered-ip              success
py3-rs                  multic-compact    framed-ip                success
py3-rs                  multi             framed-ip                
failure(timeout)
py3-rs                  multi             buffered-ip              
failure(timeout)
py3-rs                  multiac-compact   buffered-ip              success
py3-rs                  multic            framed-ip                
failure(timeout)
py3-rs                  multiac-compact   framed-ip                success
py3-rs                  multic            buffered-ip              
failure(timeout)
py3-rs                  multia-binary     buffered-ip              success
py3-rs                  multia-binary     framed-ip                success
py3-rs                  multiac-multic    framed-ip                
failure(timeout)
py3-rs                  multiac-multic    buffered-ip              
failure(timeout)
cpp-rs                  multi-binary      buffered-ip              success
cpp-rs                  multic-compact    buffered-ip              success
cpp-rs                  multi-binary      framed-ip                success
py3-rs                  multia-multi      framed-ip                
failure(timeout)
cpp-rs                  multic-compact    framed-ip                success
cpp-rs                  multi-binary      buffered-ip              success
py3-rs                  multia-multi      buffered-ip              
failure(timeout)
cpp-rs                  multi             buffered-ip              
failure(timeout)
cpp-rs                  multi-binary      framed-ip                success
cpp-rs                  multic-compact    buffered-ip              success
cpp-rs                  multi             framed-ip                
failure(timeout)
cpp-rs                  multic-compact    framed-ip                success
cpp-rs                  multic            buffered-ip              
failure(timeout)
cpp-rs                  multic            framed-ip                
failure(timeout)
perl-rs                 multi-binary      buffered-ip              success
perl-rs                 multi-binary      buffered-ip              success
perl-rs                 multi-binary      framed-ip                success
perl-rs                 multi-binary      framed-ip                success
perl-rs                 multi             framed-ip                
failure(timeout)
perl-rs                 multi             buffered-ip              
failure(timeout)
rs-rs                   multi             buffered-ip              success
rs-rs                   multi-binary      buffered-ip              success
rs-rs                   multic-compact    buffered-ip              success
rs-rs                   multi             framed-ip                success
rs-rs                   multi-binary      framed-ip                success
rs-rs                   multic-compact    framed-ip                success
rs-rs                   multi-binary      buffered-ip              success
rs-rs                   multi-binary      framed-ip                success
rs-rs                   multic            buffered-ip              success
rs-rs                   multic-compact    buffered-ip              success
rs-rs                   multic            framed-ip                success
rs-rs                   multic-compact    framed-ip                success
{noformat}
These failures were independent of protocol (compact, binary) OR transport 
(buffered, framed). They _also_ only occurred if the client was started with 
the {{multi}} or {{multic}} argument; if a regular client connected to a 
{{multi}} server there were no issues.

I decided to focus on the interaction between the Rust client and the 
Python/CPP servers. I chose Python/CPP because those Thrift implementations 
appear heavily-used and receive a lot of community contributions.

h3. What does the cross-test do?

The Rust client implements {{multi}} and {{multic}} support as follows:
# It creates a TCP connection for the first client ({{ThriftTest}}), opens a 
connection to the remote server, splits this connection into a read half/write 
half, and then wraps the halves in the appropriate {{T*Transport}} and 
{{T*Protocol}}
# It creates a _new_ TCP connection for the second client ({{SecondService}}), 
opens a connection to the remote server, splits this new connection into a read 
half/write half, and then wraps the halves in the appropriate {{T*Transport}} 
and {{T*Protocol}}

When {{multi}} or {{multic}} is passed to the Rust client it makes a series of 
service calls to {{ThriftTest}}, followed by a single call to 
{{SecondService}}, and then finally finishes up with a few calls to 
{{ThriftTest}} again. The client always hangs waiting for a response to its 
{{SecondService}} call.

A typical interaction between a {{multi}} Rust client and a {{multiprocessing}} 
server looks like this:
{noformat}
Tue Feb 23 02:14:20 2021
Executing: python3 TestServer.py --verbose --genpydir=gen-py --protocol=multi 
--transport=buffered --port=33047 TSimpleServer
Directory: /thrift/src/test/py
config:delay: 5
config:timeout: 6
===============================================================================
INFO:root:testVoid()
INFO:root:testString(thing)
INFO:root:testBool(true)
INFO:root:testBool(false)
INFO:root:testByte(42)
INFO:root:testI32(1159348374)
INFO:root:testI64(-9223372036854775808)
INFO:root:testDouble(42.420000)
INFO:root:testTypedef(2348)
INFO:root:testEnum(2)
INFO:root:testBinary()
INFO:root:testStruct({foo, 12, 219129, 12938492818})
INFO:root:testNest(Xtruct2(byte_thing=32, 
struct_thing=Xtruct(string_thing='foo', byte_thing=1, i32_thing=324382098, 
i64_thing=12938492818), i32_thing=293481098))
===============================================================================
Return code: -1 (negative values indicate kill by signal)
Test execution took 6.3 seconds.
Tue Feb 23 02:14:27 2021
{noformat}

h3. Debugging

I started off by verifying that indeed, the {{SecondService}} call was being 
made successfully through some {{println!}} usage and {{tcpdump}} ({{tcpdump -i 
any -tttt -XX -vv tcp port 33047 -w capture_0.pcap}}).

!Screen Shot 2021-02-27 at 13.31.21.png!

The two connections initiations are for the {{ThriftTest}} client (source port: 
44834) and the {{SecondService}} client (source port: 44836). We can see 
immediately that the {{SecondService}} client {{CALL}} bytes are going over the 
wire, but the remote server (whether CPP or PY3) was returning anything. (Note 
that I'd made a mistake in my analysis two years ago: the {{FIN}} I'd seen was 
not caused by lifetime issues or {{Drop}} etc. - it was simply because I'd shut 
the client down. Doh.)

This behaviour was very puzzling: it implied that there was a problem on the 
_server_ side, not the _client_ side - contrary to what I expected. I also 
noticed that this behaviour happened _regardless_ of order:
# {{ThriftTest}} calls _before_ the {{SecondService}} call
# {{SecondService}} call, followed by {{ThriftService}} calls

I started to suspect that the server was actually blocking waiting for bytes on 
the {{ThriftTest}} connection and _flipped_ the order in which the 
{{ThriftTest}} and {{SecondService}} connection were created. When I flipped 
the order of connection creation, immediately {{SecondService}} calls completed 
successfully and it was {{ThriftTest}} calls that blocked. This was a real 
facepalm moment for me, because, looking at the way the CPP/PY3 servers were 
created, they were both _unthreaded_, _blocking_ servers. So, of course, the 
_first_ connection the server received would be the one it would `recv()` on 
and service. A quick {{strace}} proved this:
{noformat}
bind(3, {sa_family=AF_INET6, sin6_port=htons(33047), inet_pton(AF_INET6, "::", 
&sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
listen(3, 128)                          = 0
accept4(3, {sa_family=AF_INET6, sin6_port=htons(44834), inet_pton(AF_INET6, 
"::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 
[28], SOCK_CLOEXEC) = 4
recvfrom(4, 0x1b699d0, 4096, 0, NULL, NULL) = ? ERESTARTSYS (To be restarted if 
SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
recvfrom(4, 0x1b699d0, 4096, 0, NULL, NULL) = ? ERESTARTSYS (To be restarted if 
SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
{noformat}

Indeed, changing the CPP server to use a {{threaded}} server type worked 
successfully with the Rust client:
{noformat}
root@3cec70e4dd34:/thrift/src# /thrift/src/test/cpp/TestServer --protocol=multi 
--transport=buffered --port=45849 --server-type threaded
Starting "threaded" server (buffered/multi) listen on: 45849
secondTestString({"thing"})
testList({29384, 238, 32498})
{noformat}

Looking at the PY3 client it was immediately obvious how this wasn't an issue 
for it (or other clients). It's because, unlike the Rust client, the underlying 
connection is _shared_ between all Thrift client/protocol/transport instances 
(from {{test/py/TestClient.py}}):
{noformat}
 self.transport = TTransport.TBufferedTransport(socket)
            if options.trans == 'framed':
                self.transport = TTransport.TFramedTransport(socket)
            elif options.trans == 'buffered':
                self.transport = TTransport.TBufferedTransport(socket)
            elif options.trans == '':
                raise AssertionError('Unknown --transport option: %s' % 
options.trans)
            if options.zlib:
                self.transport = TZlibTransport.TZlibTransport(self.transport, 
9)
        self.transport.open()

        # IMPORTANT: <<<---- notice that self.transport is used for _both_ 
ThriftTest.Client and SecondService.Client

        protocol = self.get_protocol(self.transport)
        self.client = ThriftTest.Client(protocol)
        # for multiplexed services:
        protocol2 = self.get_protocol2(self.transport)
        self.client2 = SecondService.Client(protocol2) if protocol2 is not None 
else None
{noformat}

h3. Summary

In summary, the problem occurs because:
# The Rust cross-test client creates _two_ connections: one for the 
{{ThriftTest}} service and a second for the {{SecondService}} service
# The Rust cross-test client is connecting to _unthreaded, nonblocking_ 
servers. With these servers, only the first {{accept()}}ed connection will be 
{{recv()}}ed from and processed

The solution is to somehow modify the Rust clients to allow the underlying 
connections (not the transports) to be shared.

_In hindsight I should have quickly looked at how the Java/CPP cross-test 
servers were being invoked and I would have noticed immediately that the Java 
server was multithreaded and the CPP one was not._

> Rust cross test client fails to communicate with multiplexed servers
> --------------------------------------------------------------------
>
>                 Key: THRIFT-4451
>                 URL: https://issues.apache.org/jira/browse/THRIFT-4451
>             Project: Thrift
>          Issue Type: Bug
>          Components: Rust - Library
>            Reporter: Allen George
>            Assignee: Allen George
>            Priority: Major
>         Attachments: Screen Shot 2018-11-09 at 07.03.39.png, Screen Shot 
> 2021-02-27 at 13.31.21.png
>
>
> As stated in description. Minimal case is to comment out everything in the 
> Rust {{test_client}} leaving only the {{SecondService}} call behind.
> From what I can tell the Rust socket isn't getting *any* bytes at all for the 
> response (i.e. it can't even get the first 4 bytes of the message header). 
> There is a {{flush()}} call on the remote side - so that's puzzling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to