[ https://issues.apache.org/jira/browse/THRIFT-4451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292204#comment-17292204 ]
Allen George commented on THRIFT-4451:
--------------------------------------
I made some time recently to work on the Rust Thrift implementation again and
decided to look at this with fresh eyes.
h3. What's failing
I started off by evaluating the state of the world. There have been quite a few
changes/bugfixes to the Rust implementation over the past two years, and I
wanted to see where things were.
The first interesting thing I noticed was that the Rust multi _server_
implementation was solid; every client was able to communicate with it without
issues. The second interesting thing I noticed was that the Rust multi _client_
implementation only failed with some languages, not others. Notably,
communication with Java multi servers was totally fine.
{noformat}
root@3cec70e4dd34:/thrift/src# test/test.py --client rs
Apache Thrift - Integration Test Suite
Mon Feb 22 20:19:10 2021
===============================================================================
server-client: protocol: transport: result:
c_glib-rs multi-binary buffered-ip success
c_glib-rs multic-compact buffered-ip success
c_glib-rs multi-binary framed-ip success
c_glib-rs multic-compact framed-ip success
c_glib-rs multi-binary buffered-ip success
c_glib-rs multi-binary framed-ip success
c_glib-rs multi framed-ip failure(timeout)
c_glib-rs multi buffered-ip failure(timeout)
c_glib-rs multic-compact buffered-ip success
c_glib-rs multic-compact framed-ip success
cl-rs multi-binary buffered-ip failure(1)
cl-rs multi-binary framed-ip failure(1)
cl-rs binary buffered-ip failure(1)
cl-rs binary framed-ip failure(1)
c_glib-rs multic framed-ip failure(timeout)
c_glib-rs multic buffered-ip failure(timeout)
cl-rs multi framed-ip failure(timeout)
cl-rs multi buffered-ip failure(timeout)
java-rs multi buffered-ip success
java-rs multi fastframed-framed-ip success
java-rs multi framed-ip success
java-rs multi-binary buffered-ip success
java-rs multi-binary fastframed-framed-ip success
java-rs multi-binary framed-ip success
java-rs multic-compact buffered-ip success
java-rs multic-compact fastframed-framed-ip success
java-rs multic-compact framed-ip success
java-rs multi-binary buffered-ip success
java-rs multi-binary fastframed-framed-ip success
java-rs multi-binary framed-ip success
java-rs multic fastframed-framed-ip success
java-rs multic buffered-ip success
java-rs multic framed-ip success
java-rs multic-compact buffered-ip success
java-rs multic-compact fastframed-framed-ip success
java-rs multic-compact framed-ip success
py-rs multi framed-ip failure(65)
py-rs multi buffered-ip failure(65)
py-rs multi-binary framed-ip failure(65)
py-rs multi-binary buffered-ip failure(65)
py-rs binary framed-ip failure(65)
py-rs binary buffered-ip failure(65)
py-rs compact framed-ip failure(65)
py-rs compact buffered-ip failure(65)
py-rs multic framed-ip failure(65)
py-rs multic buffered-ip failure(65)
py-rs multic-compact framed-ip failure(65)
py-rs multic-compact buffered-ip failure(65)
py-rs accelc-compact framed-ip failure(65)
py-rs accelc-compact buffered-ip failure(65)
py-rs accel-binary framed-ip failure(65)
py-rs accel-binary buffered-ip failure(65)
py-rs multiac-multic framed-ip failure(65)
py-rs multiac-multic buffered-ip failure(65)
py-rs multiac-compact framed-ip failure(65)
py-rs multiac-compact buffered-ip failure(65)
py-rs multia-multi framed-ip failure(65)
py-rs multia-binary buffered-ip failure(65)
py-rs multia-multi buffered-ip failure(65)
py-rs multia-binary framed-ip failure(65)
py3-rs multi-binary buffered-ip success
py3-rs multi-binary framed-ip success
py3-rs multic-compact buffered-ip success
py3-rs multic-compact framed-ip success
py3-rs multi framed-ip failure(timeout)
py3-rs multi buffered-ip failure(timeout)
py3-rs multiac-compact buffered-ip success
py3-rs multic framed-ip failure(timeout)
py3-rs multiac-compact framed-ip success
py3-rs multic buffered-ip failure(timeout)
py3-rs multia-binary buffered-ip success
py3-rs multia-binary framed-ip success
py3-rs multiac-multic framed-ip failure(timeout)
py3-rs multiac-multic buffered-ip failure(timeout)
cpp-rs multi-binary buffered-ip success
cpp-rs multic-compact buffered-ip success
cpp-rs multi-binary framed-ip success
py3-rs multia-multi framed-ip failure(timeout)
cpp-rs multic-compact framed-ip success
cpp-rs multi-binary buffered-ip success
py3-rs multia-multi buffered-ip failure(timeout)
cpp-rs multi buffered-ip failure(timeout)
cpp-rs multi-binary framed-ip success
cpp-rs multic-compact buffered-ip success
cpp-rs multi framed-ip failure(timeout)
cpp-rs multic-compact framed-ip success
cpp-rs multic buffered-ip failure(timeout)
cpp-rs multic framed-ip failure(timeout)
perl-rs multi-binary buffered-ip success
perl-rs multi-binary buffered-ip success
perl-rs multi-binary framed-ip success
perl-rs multi-binary framed-ip success
perl-rs multi framed-ip failure(timeout)
perl-rs multi buffered-ip failure(timeout)
rs-rs multi buffered-ip success
rs-rs multi-binary buffered-ip success
rs-rs multic-compact buffered-ip success
rs-rs multi framed-ip success
rs-rs multi-binary framed-ip success
rs-rs multic-compact framed-ip success
rs-rs multi-binary buffered-ip success
rs-rs multi-binary framed-ip success
rs-rs multic buffered-ip success
rs-rs multic-compact buffered-ip success
rs-rs multic framed-ip success
rs-rs multic-compact framed-ip success
{noformat}
These failures were independent of protocol (compact, binary) or transport
(buffered, framed). They _also_ only occurred if the client was started with
the {{multi}} or {{multic}} argument; if a regular client connected to a
{{multi}} server there were no issues.
I decided to focus on the interaction between the Rust client and the
Python/CPP servers. I chose Python/CPP because those Thrift implementations
appear to be heavily used and receive a lot of community contributions.
h3. What does the cross-test do?
The Rust client implements {{multi}} and {{multic}} support as follows (see
the sketch after this list):
# It opens a TCP connection to the remote server for the first client
({{ThriftTest}}), splits this connection into a read half and a write half,
and then wraps the halves in the appropriate {{T*Transport}} and
{{T*Protocol}}
# It opens a _new_ TCP connection to the remote server for the second client
({{SecondService}}), splits this new connection into a read half and a write
half, and then wraps the halves in the appropriate {{T*Transport}} and
{{T*Protocol}}
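In code, that setup looks roughly like the following. This is a minimal sketch
using the {{thrift}} crate's synchronous API - the generated client type names,
protocol/transport choices, and address are illustrative, and error handling is
elided:
{noformat}
use thrift::protocol::{TCompactInputProtocol, TCompactOutputProtocol, TMultiplexedOutputProtocol};
use thrift::transport::{TBufferedReadTransport, TBufferedWriteTransport, TIoChannel, TTcpChannel};

// Connection 1: used exclusively by the ThriftTest client.
let mut channel_1 = TTcpChannel::new();
channel_1.open("127.0.0.1:9090")?;
let (read_1, write_1) = channel_1.split()?;
let thrift_test = ThriftTestSyncClient::new(
    TCompactInputProtocol::new(TBufferedReadTransport::new(read_1)),
    TMultiplexedOutputProtocol::new(
        "ThriftTest",
        TCompactOutputProtocol::new(TBufferedWriteTransport::new(write_1)),
    ),
);

// Connection 2: a *separate* TCP connection, used exclusively by the
// SecondService client.
let mut channel_2 = TTcpChannel::new();
channel_2.open("127.0.0.1:9090")?;
let (read_2, write_2) = channel_2.split()?;
let second_service = SecondServiceSyncClient::new(
    TCompactInputProtocol::new(TBufferedReadTransport::new(read_2)),
    TMultiplexedOutputProtocol::new(
        "SecondService",
        TCompactOutputProtocol::new(TBufferedWriteTransport::new(write_2)),
    ),
);
{noformat}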
When {{multi}} or {{multic}} is passed to the Rust client, it makes a series
of service calls to {{ThriftTest}}, followed by a single call to
{{SecondService}}, and then finishes up with a few more calls to
{{ThriftTest}}. The client always hangs waiting for a response to its
{{SecondService}} call.
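In sketch form (method names follow the Rust generator's snake_case
convention; the real test client makes many more calls), the failing sequence
is:
{noformat}
// a series of ThriftTest calls, all of which succeed...
thrift_test.test_void()?;
thrift_test.test_string("thing".to_owned())?;
// ...

// ...then a single SecondService call; the client blocks here forever,
// waiting for response bytes that never arrive
second_service.secondtest_string("thing".to_owned())?;

// ...and a few final ThriftTest calls that are never reached
thrift_test.test_list(vec![29384, 238, 32498])?;
{noformat}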
A typical interaction between a {{multi}} Rust client and a Python
{{TSimpleServer}} looks like this:
{noformat}
Tue Feb 23 02:14:20 2021
Executing: python3 TestServer.py --verbose --genpydir=gen-py --protocol=multi --transport=buffered --port=33047 TSimpleServer
Directory: /thrift/src/test/py
config:delay: 5
config:timeout: 6
===============================================================================
INFO:root:testVoid()
INFO:root:testString(thing)
INFO:root:testBool(true)
INFO:root:testBool(false)
INFO:root:testByte(42)
INFO:root:testI32(1159348374)
INFO:root:testI64(-9223372036854775808)
INFO:root:testDouble(42.420000)
INFO:root:testTypedef(2348)
INFO:root:testEnum(2)
INFO:root:testBinary()
INFO:root:testStruct({foo, 12, 219129, 12938492818})
INFO:root:testNest(Xtruct2(byte_thing=32, struct_thing=Xtruct(string_thing='foo', byte_thing=1, i32_thing=324382098, i64_thing=12938492818), i32_thing=293481098))
===============================================================================
Return code: -1 (negative values indicate kill by signal)
Test execution took 6.3 seconds.
Tue Feb 23 02:14:27 2021
{noformat}
h3. Debugging
I started off by verifying, via some {{println!}} statements and {{tcpdump}}
({{tcpdump -i any -tttt -XX -vv tcp port 33047 -w capture_0.pcap}}), that the
{{SecondService}} call was indeed being made.
!Screen Shot 2021-02-27 at 13.31.21.png!
The two connection initiations are for the {{ThriftTest}} client (source port:
44834) and the {{SecondService}} client (source port: 44836). We can see
immediately that the {{SecondService}} client {{CALL}} bytes are going over the
wire, but the remote server (whether CPP or PY3) wasn't returning anything.
(Note that I'd made a mistake in my analysis two years ago: the {{FIN}} I'd
seen was not caused by lifetime issues or {{Drop}} etc. - it was simply because
I'd shut the client down. Doh.)
This behaviour was very puzzling: it implied that there was a problem on the
_server_ side, not the _client_ side - contrary to what I expected. I also
noticed that this behaviour happened _regardless_ of call order:
# {{ThriftTest}} calls _before_ the {{SecondService}} call
# The {{SecondService}} call first, followed by {{ThriftTest}} calls
I started to suspect that the server was actually blocking, waiting for bytes
on the {{ThriftTest}} connection, so I _flipped_ the order in which the
{{ThriftTest}} and {{SecondService}} connections were created. With the order
flipped, {{SecondService}} calls immediately completed successfully and it was
{{ThriftTest}} calls that blocked. This was a real facepalm moment for me
because, looking at the way the CPP/PY3 servers were created, they were both
_unthreaded_, _blocking_ servers. So, of course, the _first_ connection the
server received would be the one it would {{recv()}} on and service. A quick
{{strace}} proved this:
{noformat}
bind(3, {sa_family=AF_INET6, sin6_port=htons(33047), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
listen(3, 128) = 0
accept4(3, {sa_family=AF_INET6, sin6_port=htons(44834), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [28], SOCK_CLOEXEC) = 4
recvfrom(4, 0x1b699d0, 4096, 0, NULL, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
recvfrom(4, 0x1b699d0, 4096, 0, NULL, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
{noformat}
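This is exactly the failure mode of a single-threaded, blocking accept loop.
Schematically (a hypothetical sketch, not the actual CPP/PY3 server code -
{{serve_until_closed}} is a made-up name):
{noformat}
use std::net::TcpListener;

let listener = TcpListener::bind("[::]:33047")?;
loop {
    // accepts connection 1 (the ThriftTest client, source port 44834)...
    let (connection, _) = listener.accept()?;

    // ...then services it until it closes. recv() blocks in here, so
    // connection 2 (the SecondService client, source port 44836) sits in
    // the kernel's accept queue and is never read from.
    serve_until_closed(connection)?;
}
{noformat}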
Indeed, after changing the CPP server to use a {{threaded}} server type, it
worked successfully with the Rust client:
{noformat}
root@3cec70e4dd34:/thrift/src# /thrift/src/test/cpp/TestServer --protocol=multi --transport=buffered --port=45849 --server-type threaded
Starting "threaded" server (buffered/multi) listen on: 45849
secondTestString({"thing"})
testList({29384, 238, 32498})
{noformat}
Looking at the PY3 client, it was immediately obvious why this wasn't an issue
for it (or for other clients): unlike the Rust client, the underlying
connection is _shared_ between all Thrift client/protocol/transport instances
(from {{test/py/TestClient.py}}):
{noformat}
self.transport = TTransport.TBufferedTransport(socket)
if options.trans == 'framed':
    self.transport = TTransport.TFramedTransport(socket)
elif options.trans == 'buffered':
    self.transport = TTransport.TBufferedTransport(socket)
elif options.trans == '':
    raise AssertionError('Unknown --transport option: %s' % options.trans)
if options.zlib:
    self.transport = TZlibTransport.TZlibTransport(self.transport, 9)
self.transport.open()
# IMPORTANT: <<<---- notice that self.transport is used for _both_ ThriftTest.Client and SecondService.Client
protocol = self.get_protocol(self.transport)
self.client = ThriftTest.Client(protocol)
# for multiplexed services:
protocol2 = self.get_protocol2(self.transport)
self.client2 = SecondService.Client(protocol2) if protocol2 is not None else None
{noformat}
h3. Summary
In summary, the problem occurs because:
# The Rust cross-test client creates _two_ connections: one for the
{{ThriftTest}} service and a second for the {{SecondService}} service
# The Rust cross-test client is connecting to _unthreaded_, _blocking_
servers. With these servers, only the first {{accept()}}ed connection will be
{{recv()}}ed from and processed
The solution is to somehow modify the Rust client to allow the underlying
connection (not the transports) to be shared between the two service clients,
as sketched below.
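One possible shape for that fix, sketched with std's {{TcpStream::try_clone}}
so that both service clients hold handles to the _same_ socket (the client
type names are illustrative; this also leans on the cross-test's strictly
sequential request/response pattern so the two buffered read transports never
compete for bytes):
{noformat}
use std::net::TcpStream;
use thrift::protocol::{TCompactInputProtocol, TCompactOutputProtocol, TMultiplexedOutputProtocol};
use thrift::transport::{TBufferedReadTransport, TBufferedWriteTransport};

// One TCP connection, shared by both service clients.
let stream = TcpStream::connect("127.0.0.1:9090")?;

let thrift_test = ThriftTestSyncClient::new(
    TCompactInputProtocol::new(TBufferedReadTransport::new(stream.try_clone()?)),
    TMultiplexedOutputProtocol::new(
        "ThriftTest",
        TCompactOutputProtocol::new(TBufferedWriteTransport::new(stream.try_clone()?)),
    ),
);

// Same underlying socket, different multiplexed service name.
let second_service = SecondServiceSyncClient::new(
    TCompactInputProtocol::new(TBufferedReadTransport::new(stream.try_clone()?)),
    TMultiplexedOutputProtocol::new(
        "SecondService",
        TCompactOutputProtocol::new(TBufferedWriteTransport::new(stream)),
    ),
);
{noformat}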
_In hindsight, I should have looked at how the Java/CPP cross-test servers
were being invoked; I would have noticed immediately that the Java server was
multithreaded and the CPP one was not._
> Rust cross test client fails to communicate with multiplexed servers
> --------------------------------------------------------------------
>
> Key: THRIFT-4451
> URL: https://issues.apache.org/jira/browse/THRIFT-4451
> Project: Thrift
> Issue Type: Bug
> Components: Rust - Library
> Reporter: Allen George
> Assignee: Allen George
> Priority: Major
> Attachments: Screen Shot 2018-11-09 at 07.03.39.png, Screen Shot
> 2021-02-27 at 13.31.21.png
>
>
> As stated in description. Minimal case is to comment out everything in the
> Rust {{test_client}} leaving only the {{SecondService}} call behind.
> From what I can tell the Rust socket isn't getting *any* bytes at all for the
> response (i.e. it can't even get the first 4 bytes of the message header).
> There is a {{flush()}} call on the remote side - so that's puzzling.