[ https://issues.apache.org/jira/browse/THRIFT-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15363041#comment-15363041 ]
ASF GitHub Bot commented on THRIFT-3867:
----------------------------------------

Github user erikvanoosten commented on a diff in the pull request:

    https://github.com/apache/thrift/pull/1036#discussion_r69620426

--- Diff: doc/specs/thrift-binary-protocol-encoding.md ---
@@ -0,0 +1,467 @@
+Thrift Protocol Encoding for BinaryProtocol and CompactProtocol
+====================================================================
+
+Last Modified: 2016-Jun-29
+
+! WARNING !
+
+This document is _work in progress_ and should not (yet) be seen as an authoritative source of information.
+
+This text is submitted to the Thrift community for review and improvements.
+
+--------------------------------------------------------------------
+
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+
+--------------------------------------------------------------------
+
+There are many ways to encode Thrift on the wire. This document focuses on the wire encoding for service calls
+(encoding and semantics) in Thrift's older *binary protocol* (which has not been documented before) and the
+*compact protocol*. Both the regular socket transport (unframed) and the framed transport are described.
+
+Note that no effort is made to group descriptions of behavior of the Thrift server and the encodings used in the
+Thrift wire format. The order in which things are described is such that you can read the document from top to bottom.
+
+The information here is mostly based on the Java implementation in the Apache Thrift library (version 0.9.1) and
+[THRIFT-110 A more compact format](https://issues.apache.org/jira/browse/THRIFT-110). Other implementations, however,
+should behave the same.
+
+## Message exchange
+
+Both the binary protocol and the compact protocol assume a transport layer that exposes a bi-directional byte stream,
+for example a TCP socket. Both use the following message exchange:
+
+1. Client sends a `TMessage` (type `Call`). The TMessage contains some metadata and the name of the method to invoke.
+2. Client sends method arguments (a struct defined by the generated code).
+3. Server sends a `TMessage` (type `Reply` or `Exception`) to start the response.
+4. Server completes the response with a struct (a predefined struct or one defined by generated code).
+
+The pattern is a simple half duplex protocol where the parties alternate in sending a `TMessage` followed by a struct.
+What these are is described below.
+
+Although the standard Apache Thrift Java clients do not support pipelining (sending multiple requests without waiting
+for a response), the standard Apache Thrift Java servers do support it.
+
+## TMessage
+
+A *TMessage* contains the following information:
+
+* _Message type_, one of `Call`, `Reply`, `Exception` and `Oneway`.
+* _Sequence id_, an int32 integer.
+* _Name_, a string (can be empty).
+
+The *sequence id* is a simple message id assigned by the client. The server will use the same sequence id in the
+TMessage of the response. The client uses this number to detect out of order responses. Each client has an int32 field
+which is increased for each message.
+The sequence id simply wraps around when it overflows.
+
+The *name* indicates the service method name to invoke. The server uses the same name in the TMessage of the response.
+
+When the *multiplexed protocol* is used, the name contains the service name, a colon `:` and the method name. The
+multiplexed protocol is not compatible with other protocols.
+
+The *message type* indicates what kind of message is sent.
+
+Clients send requests with TMessages of type `Call` or `Oneway` (step 1 in the protocol exchange). Servers send
+responses with TMessages of type `Exception` or `Reply`.
+
+### Oneway
+
+Type `Oneway` is only used starting from Apache Thrift 0.9.3. Earlier versions do _not_ send TMessages of type `Oneway`,
+even for service methods defined with the `oneway` modifier.
+
+When a client sends a request with type `Oneway`, the server must _not_ send a response (steps 3 and 4 are skipped).
+Strangely enough (in the Java code generated by Apache Thrift 0.9.1 up to 0.9.3), only responses of type `Reply` are
+skipped. Responses of type `Exception` are always sent. There is no correct way to handle this situation from the client
+perspective; you either wait for a response or not, you can't do both. Luckily this has been fixed _after_ Apache Thrift
+0.9.3 (THRIFT-3479). My advice is to avoid oneway methods unless you know exactly what behavior your stack has.
+
+## Integer encoding
+
+In the _binary protocol_ integers are encoded with the most significant byte first (big endian byte order, aka network
+order). An `int8` needs 1 byte, an `int16` 2, an `int32` 4 and an `int64` needs 8 bytes.
+
+The C++ version has the option to use the binary protocol with little endian order. Little endian gives a small but
+noticeable performance boost because contemporary CPUs use little endian when storing integers to RAM.
+
+The _compact protocol_ uses two encodings for ints: the _zigzag int_ and the _var int_.
+
+Values of type `int32` and `int64` are first transformed to a *zigzag int*. A zigzag int folds positive and negative
+numbers into the positive number space. When we read 0, 1, 2, 3 or 4 from the wire, this is translated to 0, -1, 1,
+-2 or 2 respectively. Here are the (Scala) formulas to convert from int32/int64 to a zigzag int and back:
+
+```scala
+def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
+def zigzagToInt(n: Int): Int = (n >>> 1) ^ - (n & 1)
+def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63)
+def zigzagToLong(n: Long): Long = (n >>> 1) ^ - (n & 1)
+```
+
+The zigzag int is then encoded as a *var int*. Var ints take 1 to 5 bytes (int32) or 1 to 10 bytes (int64). The most
+significant bit of each byte indicates if more bytes follow. The concatenation of the least significant 7 bits from each
+byte forms the number, where the first byte has the least significant bits (so they are in little endian order).
+
+Var ints are sometimes used directly inside the compact protocol to represent positive numbers.
+
+To encode an `int16` as zigzag int, it is first converted to an `int32` and then encoded as such. The type `int8` simply
+uses a single byte as in the binary protocol.
+
+## Enum encoding
+
+The generated code encodes `Enum`s by taking the ordinal value and then encoding that as an int32.
+
+## String encoding
+
+*String*s are first encoded to UTF-8, and then sent as Binary. Binary encoding is described later.
+
+## Double encoding
+
+Values of type `double` are first converted to an int64 according to the IEEE 754 floating-point "double format" bit
+layout. Most run-times provide a library to make this conversion. Both the binary protocol and the compact protocol then
+encode the int64 in 8 bytes in big endian order.
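The compact-protocol int32 encoding described under "Integer encoding" (zigzag first, then var int) can be sketched as follows. This is a minimal illustration and not the library's code: the object name `CompactInts` and its method names are hypothetical, and the byte order assumed here is least significant 7-bit group first, which is how the Java `TCompactProtocol` writes var ints.

```scala
object CompactInts {
  // Zigzag formulas as given in the document.
  def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
  def zigZagToInt(n: Int): Int = (n >>> 1) ^ -(n & 1)

  // Var int: 7 payload bits per byte; the high bit is set while more bytes
  // follow. Assumption: the least significant group is written first.
  def writeVarInt(value: Int): Array[Byte] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[Byte]
    var n = value
    while ((n & ~0x7F) != 0) {
      out += ((n & 0x7F) | 0x80).toByte
      n >>>= 7
    }
    out += n.toByte
    out.toArray
  }

  // Reads one var int from the start of `bytes`.
  def readVarInt(bytes: Array[Byte]): Int = {
    var result = 0
    var shift = 0
    var i = 0
    var more = true
    while (more) {
      val b = bytes(i)
      result |= (b & 0x7F) << shift
      more = (b & 0x80) != 0
      shift += 7
      i += 1
    }
    result
  }

  def encodeInt32(n: Int): Array[Byte] = writeVarInt(intToZigZag(n))
  def decodeInt32(bytes: Array[Byte]): Int = zigZagToInt(readVarInt(bytes))
}
```

Under these assumptions `CompactInts.encodeInt32(-1)` yields the single byte `0x01`, and `CompactInts.encodeInt32(300)` yields `0xD8 0x04` (zigzag value 600, low 7 bits first).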
+
+## TMessage encoding
+
+A `TMessage` on the wire looks as follows:
+
+```
+Binary protocol (strict, 12+ bytes):
++--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
+|1vvvvvvv|vvvvvvvv|unused  |00000mmm| name length                       | name                | seq id                            |
++--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
+
+Binary protocol (old, 9+ bytes):
++--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
+| name length                       | name                |00000mmm| seq id                            |
++--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
+```
+
+Where:
+
+* `vvvvvvvvvvvvvvv` is the version, an unsigned 15 bit number fixed to `1`. The leading bit must be `1`.
--- End diff --

I don't understand. https://github.com/apache/thrift/blob/master/lib/cpp/src/thrift/protocol/TBinaryProtocol.h#L40 indicates that the version mask takes the first 2 bytes = 16 bits. The first bit, which must be `1`, is used to distinguish the format and is therefore not part of the version number.

Would it be clearer if I say that the version is fixed to `000 0000 0000 0001`?

> Specify BinaryProtocol and CompactProtocol
> ------------------------------------------
>
> Key: THRIFT-3867
> URL: https://issues.apache.org/jira/browse/THRIFT-3867
> Project: Thrift
> Issue Type: Documentation
> Components: Documentation
> Reporter: Erik van Oosten
>
> It would be nice if the protocol(s) were specified somewhere. This
> should improve communication between developers, and also open the way for
> alternative implementations so that Thrift can thrive even better.
> I have a fairly complete description of the BinaryProtocol and
> CompactProtocol which I will submit as a patch for further review and
> discussion.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)