Binary transport break up by overlapping of TFrame headers in scribe.AsyncClient
--------------------------------------------------------------------------------

                 Key: THRIFT-1423
                 URL: https://issues.apache.org/jira/browse/THRIFT-1423
             Project: Thrift
          Issue Type: Bug
          Components: Java - Library
    Affects Versions: 0.7, 0.6.1, 0.6, 0.5
         Environment: Oracle Java JDK 1.6.0_27-b07, Ubuntu Server, Amazon EC2 
large.instance
Scribe server on local host using libthrift 0.5, TNonBlockingServer, 
TFramedTransport
Java client on localhost with libthrift 0.5, scribe.AsyncClient, 1 long-lived 
connection. Client is multithreaded and has persistent connection to scribe.
            Reporter: Ivan Prisyazhniy


I have got strange behavior of async client in libthrift. Building up described 
configuration on ec2, i have got a lot of moments when thrift server says that 
it has broken transport. I thought it is scribe problem. But digging inside of 
tcpdump i have found one strange thing with thrift async client 
(scribe.AsyncClient).

There is no doubt that async client support for multi-threaded access. Yep, it 
is implemented as single threaded fst. But it has concurrent incoming pending 
calls queue as you know.

So, what i have found and that is the source of connections transports breaking 
off:

17:10:17.815425 IP (tos 0x0, ttl 64, id 38164, offset 0, flags [DF], proto TCP 
(6), length 56)
localhost.39412 > localhost.1463: Flags [P.], cksum 0xfe2c (incorrect -> 
0x1371), seq 1110:1114, ack 337, win 257, options [nop,nop,TS val 34123260 ecr 
34122685], length 4
0x0000: 4500 0038 9514 4000 4006 a7a9 7f00 0001
0x0010: 7f00 0001 99f4 05b7 7417 1499 747b 6961
0x0020: 8018 0101 fe2c 0000 0101 080a 0208 adfc
0x0030: 0208 abbd 0000 003a
17:10:17.815534 IP (tos 0x0, ttl 64, id 38165, offset 0, flags [DF], proto TCP 
(6), length 56)
localhost.39412 > localhost.1463: Flags [P.], cksum 0xfe2c (incorrect -> 
0x134b), seq 1114:1118, ack 337, win 257, options [nop,nop,TS val 34123260 ecr 
34122685], length 4
0x0000: 4500 0038 9515 4000 4006 a7a8 7f00 0001
0x0010: 7f00 0001 99f4 05b7 7417 149d 747b 6961
0x0020: 8018 0101 fe2c 0000 0101 080a 0208 adfc
0x0030: 0208 abbd 0000 005c
17:10:17.815648 IP (tos 0x0, ttl 64, id 60631, offset 0, flags [DF], proto TCP 
(6), length 52)
localhost.1463 > localhost.39412: Flags [.], cksum 0xfe28 (incorrect -> 
0x1171), seq 337, ack 1118, win 256, options [nop,nop,TS val 34123260 ecr 
34123260], length 0
0x0000: 4500 0034 ecd7 4000 4006 4fea 7f00 0001
0x0010: 7f00 0001 05b7 99f4 747b 6961 7417 14a1
0x0020: 8010 0100 fe28 0000 0101 080a 0208 adfc
0x0030: 0208 adfc
17:10:17.815658 IP (tos 0x0, ttl 64, id 38166, offset 0, flags [DF], proto TCP 
(6), length 144)
localhost.39412 > localhost.1463: Flags [P.], cksum 0xfe84 (incorrect -> 
0xe7af), seq 1118:1210, ack 337, win 257, options [nop,nop,TS val 34123260 ecr 
34123260], length 92
0x0000: 4500 0090 9516 4000 4006 a74f 7f00 0001
0x0010: 7f00 0001 99f4 05b7 7417 14a1 747b 6961
0x0020: 8018 0101 fe84 0000 0101 080a 0208 adfc
0x0030: 0208 adfc 8001 0001 0000 0003 4c6f 6700
0x0040: 0000 000f 0001 0c00 0000 010b 0001 0000
0x0050: 0005 6974 656d 730b 0002 0000 0030 2863
0x0060: 6c75 622e 7573 652c 2036 3031 322c 2030
0x0070: 2c20 302c 2032 3830 2c20 302c 2074 7275
0x0080: 652c 2031 3332 3035 3838 3631 3729 0000
17:10:17.815971 IP (tos 0x0, ttl 64, id 60632, offset 0, flags [DF], proto TCP 
(6), length 52)
localhost.1463 > localhost.39412: Flags [R.], cksum 0xfe28 (incorrect -> 
0x1111), seq 337, ack 1210, win 256, options [nop,nop,TS val 34123260 ecr 
34123260], length 0
0x0000: 4500 0034 ecd8 4000 4006 4fe9 7f00 0001
0x0010: 7f00 0001 05b7 99f4 747b 6961 7417 14fd
0x0020: 8014 0100 fe28 0000 0101 080a 0208 adfc
0x0030: 0208 adfc

There are two sequentional tframedtransport headers/packets in the dump. then 
ack from scribe. then psh of Log_call. it is clearly ok that scribe said it is 
broken transport caused by tframes are mixed. 1 packet is TFrame data size for 
message = 3a, 2 packet is TFrame data size = 5c. Third is ACK. Forth is 
Log_call length of 5c. Fifth packed is ACK with RST from scribe cause it found 
error in binary stream. There is no Log_call packet of length 3a between two 
first packets. Two sequentional tframe headers are in separate packets from 
it's data only because of TCP_NODELAY and separated 
channel.write(TFrameSizeBuffer) and channel.write(TFrameDataBuffer).

Why there are separated states and buffers to write BODY_SIZE and BODY_DATA in 
the client (TAsyncMethodCall)? Why there are separated states for READS? The 
only thing the BODY_SIZE is for, is emulation of TFramedTransport, but why it 
writes it's header asynchronously and separatelly from BODY_DATA? Sync version 
of client writes TFrameHeader and it's data in one single TCP packet, not as 
async client. Any way, this follows transport to mess it's data.

Am i right this is bug of implementation of tframed support in async client?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to