Binary transport break up by overlapping of TFrame headers in scribe.AsyncClient --------------------------------------------------------------------------------
Key: THRIFT-1423 URL: https://issues.apache.org/jira/browse/THRIFT-1423 Project: Thrift Issue Type: Bug Components: Java - Library Affects Versions: 0.7, 0.6.1, 0.6, 0.5 Environment: Oracle Java JDK 1.6.0_27-b07, Ubuntu Server, Amazon EC2 large.instance Scribe server on local host using libthrift 0.5, TNonBlockingServer, TFramedTransport Java client on localhost with libthrift 0.5, scribe.AsyncClient, 1 long-lived connection. Client is multithreaded and has persistent connection to scribe. Reporter: Ivan Prisyazhniy I have got strange behavior of async client in libthrift. Building up described configuration on ec2, i have got a lot of moments when thrift server says that it has broken transport. I thought it is scribe problem. But digging inside of tcpdump i have found one strange thing with thrift async client (scribe.AsyncClient). There is no doubt that async client support for multi-threaded access. Yep, it is implemented as single threaded fst. But it has concurrent incoming pending calls queue as you know. So, what i have found and that is the source of connections transports breaking off: 17:10:17.815425 IP (tos 0x0, ttl 64, id 38164, offset 0, flags [DF], proto TCP (6), length 56) localhost.39412 > localhost.1463: Flags [P.], cksum 0xfe2c (incorrect -> 0x1371), seq 1110:1114, ack 337, win 257, options [nop,nop,TS val 34123260 ecr 34122685], length 4 0x0000: 4500 0038 9514 4000 4006 a7a9 7f00 0001 0x0010: 7f00 0001 99f4 05b7 7417 1499 747b 6961 0x0020: 8018 0101 fe2c 0000 0101 080a 0208 adfc 0x0030: 0208 abbd 0000 003a 17:10:17.815534 IP (tos 0x0, ttl 64, id 38165, offset 0, flags [DF], proto TCP (6), length 56) localhost.39412 > localhost.1463: Flags [P.], cksum 0xfe2c (incorrect -> 0x134b), seq 1114:1118, ack 337, win 257, options [nop,nop,TS val 34123260 ecr 34122685], length 4 0x0000: 4500 0038 9515 4000 4006 a7a8 7f00 0001 0x0010: 7f00 0001 99f4 05b7 7417 149d 747b 6961 0x0020: 8018 0101 fe2c 0000 0101 080a 0208 adfc 0x0030: 0208 abbd 0000 005c 17:10:17.815648 IP (tos 0x0, ttl 64, id 60631, offset 0, flags [DF], proto TCP (6), length 52) localhost.1463 > localhost.39412: Flags [.], cksum 0xfe28 (incorrect -> 0x1171), seq 337, ack 1118, win 256, options [nop,nop,TS val 34123260 ecr 34123260], length 0 0x0000: 4500 0034 ecd7 4000 4006 4fea 7f00 0001 0x0010: 7f00 0001 05b7 99f4 747b 6961 7417 14a1 0x0020: 8010 0100 fe28 0000 0101 080a 0208 adfc 0x0030: 0208 adfc 17:10:17.815658 IP (tos 0x0, ttl 64, id 38166, offset 0, flags [DF], proto TCP (6), length 144) localhost.39412 > localhost.1463: Flags [P.], cksum 0xfe84 (incorrect -> 0xe7af), seq 1118:1210, ack 337, win 257, options [nop,nop,TS val 34123260 ecr 34123260], length 92 0x0000: 4500 0090 9516 4000 4006 a74f 7f00 0001 0x0010: 7f00 0001 99f4 05b7 7417 14a1 747b 6961 0x0020: 8018 0101 fe84 0000 0101 080a 0208 adfc 0x0030: 0208 adfc 8001 0001 0000 0003 4c6f 6700 0x0040: 0000 000f 0001 0c00 0000 010b 0001 0000 0x0050: 0005 6974 656d 730b 0002 0000 0030 2863 0x0060: 6c75 622e 7573 652c 2036 3031 322c 2030 0x0070: 2c20 302c 2032 3830 2c20 302c 2074 7275 0x0080: 652c 2031 3332 3035 3838 3631 3729 0000 17:10:17.815971 IP (tos 0x0, ttl 64, id 60632, offset 0, flags [DF], proto TCP (6), length 52) localhost.1463 > localhost.39412: Flags [R.], cksum 0xfe28 (incorrect -> 0x1111), seq 337, ack 1210, win 256, options [nop,nop,TS val 34123260 ecr 34123260], length 0 0x0000: 4500 0034 ecd8 4000 4006 4fe9 7f00 0001 0x0010: 7f00 0001 05b7 99f4 747b 6961 7417 14fd 0x0020: 8014 0100 fe28 0000 0101 080a 0208 adfc 0x0030: 0208 adfc There are two sequentional tframedtransport headers/packets in the dump. then ack from scribe. then psh of Log_call. it is clearly ok that scribe said it is broken transport caused by tframes are mixed. 1 packet is TFrame data size for message = 3a, 2 packet is TFrame data size = 5c. Third is ACK. Forth is Log_call length of 5c. Fifth packed is ACK with RST from scribe cause it found error in binary stream. There is no Log_call packet of length 3a between two first packets. Two sequentional tframe headers are in separate packets from it's data only because of TCP_NODELAY and separated channel.write(TFrameSizeBuffer) and channel.write(TFrameDataBuffer). Why there are separated states and buffers to write BODY_SIZE and BODY_DATA in the client (TAsyncMethodCall)? Why there are separated states for READS? The only thing the BODY_SIZE is for, is emulation of TFramedTransport, but why it writes it's header asynchronously and separatelly from BODY_DATA? Sync version of client writes TFrameHeader and it's data in one single TCP packet, not as async client. Any way, this follows transport to mess it's data. Am i right this is bug of implementation of tframed support in async client? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira