[
https://issues.apache.org/jira/browse/THRIFT-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357737#comment-15357737
]
ASF GitHub Bot commented on THRIFT-3867:
----------------------------------------
Github user Jens-G commented on a diff in the pull request:
https://github.com/apache/thrift/pull/1036#discussion_r69198205
--- Diff: doc/specs/thrift-binary-protocol-encoding.md ---
@@ -0,0 +1,467 @@
+Thrift Protocol Encoding for BinaryProtocol and CompactProtocol
+====================================================================
+
+Last Modified: 2016-Jun-29
+
+! WARNING !
+
+This document is _work in progress_ and should not (yet) be seen as an
authoritative source of information.
+
+This text is submitted to the Thrift community for review and improvements.
+
+--------------------------------------------------------------------
+
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+
+--------------------------------------------------------------------
+
+There are many ways to encode Thrift on the wire. This document focuses on the wire encoding for service calls
+(encoding and semantics) in the older Thrift *binary protocol* (which has not been documented before) and the
+*compact protocol*. Both the regular socket transport (unframed) and the framed transport are described.
+
+Note that no effort is made to group descriptions of behavior of the
Thrift server and the encodings used in the
+Thrift wire format. The order in which things are described is such that
you can read the document from top to bottom.
+
+The information here is mostly based on the Java implementation in the Apache Thrift library (version 0.9.1) and
+[THRIFT-110 A more compact format](https://issues.apache.org/jira/browse/THRIFT-110). Other implementations, however,
+should behave the same.
+
+## Message exchange
+
+Both the binary protocol and the compact protocol assume a transport layer
that exposes a bi-directional byte stream,
+for example a TCP socket. Both use the following message exchange:
+
+1. Client sends a `TMessage` (type `Call`). The TMessage contains some metadata and the name of the method to invoke.
+2. Client sends method arguments (a struct defined by the generated code).
+3. Server sends a `TMessage` (type `Reply` or `Exception`) to start the response.
+4. Server completes the response with a struct (a predefined struct or one defined by generated code).
+
+The pattern is a simple half duplex protocol where the parties alternate
in sending a `TMessage` followed by a struct.
+What these are is described below.
+
+Although the standard Apache Thrift Java clients do not support pipelining (sending multiple requests without waiting
+for a response), the standard Apache Thrift Java servers do support it.
+
+## TMessage
+
+A *TMessage* contains the following information:
+
+* _Message type_, one of `Call`, `Reply`, `Exception` and `Oneway`.
+* _Sequence id_, an int32 integer.
+* _Name_, a string (can be empty).
+
+The *sequence id* is a simple message id assigned by the client. The server will use the same sequence id in the
+TMessage of the response. The client uses this number to detect out-of-order responses. Each client has an int32 field
+which is incremented for each message. The sequence id simply wraps around when it overflows.
+
+The *name* indicates the service method name to invoke. The server uses
the same name in the TMessage of the response.
+
+When the *multiplexed protocol* is used, the name contains the service
name, a colon `:` and the method name. The
+multiplexed protocol is not compatible with other protocols.
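
As a rough illustration of the `service:method` naming described above, a receiver could split the name at the first colon. This is only a sketch: `MultiplexedName` and `split` are hypothetical names, not part of the Thrift library, and splitting at the first colon is an assumption.

```scala
object MultiplexedName {
  // Splits "Service:method" into (Some("Service"), "method").
  // A name without a colon is treated as a plain method name.
  def split(name: String): (Option[String], String) = {
    val idx = name.indexOf(':')
    if (idx < 0) (None, name)
    else (Some(name.substring(0, idx)), name.substring(idx + 1))
  }
}
```

For example, `MultiplexedName.split("Calculator:add")` yields the service name `Calculator` and the method name `add`.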
+
+The *message type* indicates what kind of message is sent.
+
+Clients send requests with TMessages of type `Call` or `Oneway` (step 1 in
the protocol exchange). Servers send
+responses with TMessages of type `Exception` or `Reply`.
+
+### Oneway
+
+Type `Oneway` is only used starting from Apache Thrift 0.9.3. Earlier
versions do _not_ send TMessages of type `Oneway`,
+even for service methods defined with the `oneway` modifier.
+
+When a client sends a request with type `Oneway`, the server must _not_ send a response (steps 3 and 4 are skipped).
+Strangely enough (in the Java code generated by Apache Thrift 0.9.1 up to 0.9.3), only responses of type `Reply` are
+skipped. Responses of type `Exception` are always sent. There is no correct way to handle this situation from the client
+perspective; you either wait for a response or not, you can't do both. Luckily this has been fixed _after_ Apache Thrift
+0.9.3 (THRIFT-3479). My advice is to avoid oneway methods unless you know exactly what behavior your stack has.
+
+## Integer encoding
+
+In the _binary protocol_ integers are encoded with the most significant
byte first (big endian byte order, aka network
+order). An `int8` needs 1 byte, an `int16` 2, an `int32` 4 and an `int64`
needs 8 bytes.
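
The big endian layout can be sketched with `java.nio.ByteBuffer`, whose default byte order is big endian. The object and method names below are hypothetical and only serve the illustration.

```scala
import java.nio.ByteBuffer

object BigEndianInts {
  // ByteBuffer writes multi-byte values in big endian (network) order by default.
  def int16(v: Short): Array[Byte] = ByteBuffer.allocate(2).putShort(v).array()
  def int32(v: Int): Array[Byte]   = ByteBuffer.allocate(4).putInt(v).array()
  def int64(v: Long): Array[Byte]  = ByteBuffer.allocate(8).putLong(v).array()
}
```

With this sketch the `int32` value 1 becomes the four bytes `00 00 00 01`.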
+
+The C++ implementation has the option to use the binary protocol with little endian order. Little endian gives a small but
+noticeable performance boost because contemporary CPUs use little endian when storing integers to RAM.
+
+The _compact protocol_ uses multiple encodings for ints: the _zigzag int_,
and the _var int_.
+
+Values of type `int32` and `int64` are first transformed to a *zigzag int*. A zigzag int folds positive and negative
+numbers into the positive number space. When we read 0, 1, 2, 3, 4 or 5 from the wire, this is translated to 0, -1, 1,
+-2, 2 or -3 respectively. Here are the (Scala) formulas to convert from int32/int64 to a zigzag int and back:
+
+```scala
+def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
+def zigzagToInt(n: Int): Int = (n >>> 1) ^ - (n & 1)
+def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63)
+def zigzagToLong(n: Long): Long = (n >>> 1) ^ - (n & 1)
+```
+
+The zigzag int is then encoded as a *var int*. Var ints take 1 to 5 bytes (int32) or 1 to 10 bytes (int64). The most
+significant bit of each byte indicates if more bytes follow. The concatenation of the least significant 7 bits from each
+byte forms the number, where the first byte holds the least significant bits (so the 7-bit groups are in little endian order).
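
A minimal sketch of var int encoding and decoding follows. Note that it emits the least significant 7-bit group first, which is what the Java `TCompactProtocol` does when writing var ints. `VarInt`, `encode` and `decode` are hypothetical names; the input is treated as an unsigned bit pattern, so an int32 should be zigzag-encoded (or masked to 32 bits) before calling `encode`.

```scala
object VarInt {
  // Encodes n as a var int: 7 payload bits per byte, low-order group first.
  // The top bit of every byte except the last is set to signal "more bytes follow".
  def encode(n: Long): Array[Byte] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[Byte]
    var rest = n
    while ((rest & ~0x7FL) != 0) {
      out += ((rest & 0x7F) | 0x80).toByte
      rest >>>= 7
    }
    out += rest.toByte
    out.toArray
  }

  // Decodes a var int from the start of bytes; assumes the input is complete.
  def decode(bytes: Array[Byte]): Long = {
    var result = 0L
    var shift = 0
    var i = 0
    var more = true
    while (more) {
      val b = bytes(i)
      result |= (b & 0x7FL) << shift
      more = (b & 0x80) != 0
      shift += 7
      i += 1
    }
    result
  }
}
```

For example, the value 300 is written as the two bytes `AC 02`.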
+
+Var ints are sometimes used directly inside the compact protocol to
represent positive numbers.
+
+To encode an `int16` as zigzag int, it is first converted to an `int32`
and then encoded as such. The type `int8` simply
+uses a single byte as in the binary protocol.
+
+## Enum encoding
+
+The generated code encodes `Enum`s by taking the ordinal value and then
encoding that as an int32.
+
+## String encoding
+
+*String*s are first encoded to UTF-8, and then sent as binary. Binary encoding is described later.
+
+## Double encoding
+
+Values of type `double` are first converted to an int64 according to the IEEE 754 floating-point "double format" bit
+layout. Most run-times provide a library to make this conversion. The binary protocol then encodes the int64 in 8 bytes
+in big endian order. The compact protocol, at least in the Java implementation, writes the same 8 bytes in little endian order.
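
For the binary protocol, the conversion could look roughly like this on the JVM, using `java.lang.Double.doubleToLongBits` for the IEEE 754 bit layout. `DoubleEncoding` and its methods are hypothetical names used only for this sketch.

```scala
import java.nio.ByteBuffer

object DoubleEncoding {
  // Binary protocol: IEEE 754 bits as an int64, written in big endian order.
  def encodeBinary(d: Double): Array[Byte] =
    ByteBuffer.allocate(8).putLong(java.lang.Double.doubleToLongBits(d)).array()

  // Reverse direction: 8 big endian bytes back to a double.
  def decodeBinary(bytes: Array[Byte]): Double =
    java.lang.Double.longBitsToDouble(ByteBuffer.wrap(bytes).getLong())
}
```

The value `1.0` has the bit pattern `0x3FF0000000000000`, so it is sent as `3F F0 00 00 00 00 00 00`.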
+
+## TMessage encoding
+
+A `TMessage` on the wire looks as follows:
+
+```
+Binary protocol (strict, 12+ bytes):
++--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
+|1vvvvvvv|vvvvvvvv|unused  |00000mmm| name length                       | name                | seq id                            |
++--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
+
+Binary protocol (old, 9+ bytes):
++--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
+| name length                       | name                |00000mmm| seq id                            |
++--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
+```
+
+Where:
+
+* `vvvvvvvvvvvvvvv` is the version, an unsigned 15 bit number fixed to
`1`. The leading bit must be `1`.
+* `unused` is an ignored byte.
+* `mmm` is the message type, an unsigned 3 bit integer. The 5 leading bits must be `0` as some clients (checked for Java in 0.9.1) take the whole byte.
+* `name length` is the byte length of the name field, a signed 32 bit
integer encoded in network (big endian) order (must be >= 0).
+* `name` is the method name to invoke, a UTF-8 encoded string.
+* `seq id` is the sequence id, a signed 32 bit integer encoded in network
(big endian) order.
+
+Because name length must not be negative (so its first bit is always `0`), the first bit allows the receiver to see
+whether the strict format or the old format is used. Therefore a server and client using different variants of the
+binary protocol can transparently talk with each other. However, when strict mode is enforced, the old format is
+rejected.
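
The two binary layouts, and the first-bit check that tells them apart, can be sketched as follows. This is an illustration under assumptions, not the library code: `BinaryMessageHeader`, `Header` and the method names are hypothetical, and the decoder assumes the whole header is already in memory.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

object BinaryMessageHeader {
  final case class Header(name: String, messageType: Int, seqId: Int)

  // Strict format: 1vvvvvvv vvvvvvvv unused 00000mmm, name length, name, seq id.
  def encodeStrict(name: String, messageType: Int, seqId: Int): Array[Byte] = {
    val nameBytes = name.getBytes(UTF_8)
    ByteBuffer.allocate(12 + nameBytes.length)
      .putInt(0x80010000 | (messageType & 0x07)) // version 1 with the leading bit set
      .putInt(nameBytes.length)
      .put(nameBytes)
      .putInt(seqId)
      .array()
  }

  // Old format: name length, name, 00000mmm, seq id.
  def encodeOld(name: String, messageType: Int, seqId: Int): Array[Byte] = {
    val nameBytes = name.getBytes(UTF_8)
    ByteBuffer.allocate(9 + nameBytes.length)
      .putInt(nameBytes.length)
      .put(nameBytes)
      .put((messageType & 0x07).toByte)
      .putInt(seqId)
      .array()
  }

  private def readName(buf: ByteBuffer, length: Int): String = {
    val raw = new Array[Byte](length)
    buf.get(raw)
    new String(raw, UTF_8)
  }

  // Accepts either format: a negative first int32 (leading bit 1) means strict.
  def decode(bytes: Array[Byte]): Header = {
    val buf = ByteBuffer.wrap(bytes)
    val first = buf.getInt()
    if (first < 0) {
      require((first & 0xFFFF0000) == 0x80010000, "unsupported version")
      val name = readName(buf, buf.getInt())
      Header(name, first & 0x07, buf.getInt())
    } else {
      val name = readName(buf, first)
      val messageType = buf.get() & 0x07
      Header(name, messageType, buf.getInt())
    }
  }
}
```

A strict-mode receiver would reject the `else` branch instead of parsing it.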
+
+```
+Compact protocol (4+ bytes):
++--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
+|pppppppp|mmmvvvvv| seq id              | name length         | name                |
++--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
+```
+
+Where:
+
+* `pppppppp` is the protocol id, fixed to `1000 0010`, 0x82
+* `vvvvv` is the version, an unsigned 5 bit integer, fixed to `00001`, 0x1
+* `mmm` is the message type, an unsigned 3 bit integer
+* `seq id` is the sequence id, a signed 32 bit integer encoded as a var
int.
+* `name length` is the byte length of the name field, a signed 32 bit
integer encoded as a var int (must be >= 0).
+* `name` is the method name to invoke, a UTF-8 encoded string.
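
The compact layout above could be produced along these lines. This is a sketch with hypothetical names (`CompactMessageHeader`, `encode`, `writeVarInt`); the var int helper writes the low-order 7-bit group first and treats the int32 as an unsigned bit pattern, mirroring what the Java implementation does for the sequence id and name length.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets.UTF_8

object CompactMessageHeader {
  // Var int: 7 payload bits per byte, low-order group first, top bit = "more follows".
  private def writeVarInt(out: ByteArrayOutputStream, value: Int): Unit = {
    var rest = value
    while ((rest & ~0x7F) != 0) {
      out.write((rest & 0x7F) | 0x80)
      rest >>>= 7
    }
    out.write(rest)
  }

  def encode(name: String, messageType: Int, seqId: Int): Array[Byte] = {
    val nameBytes = name.getBytes(UTF_8)
    val out = new ByteArrayOutputStream()
    out.write(0x82)                               // pppppppp: protocol id
    out.write(((messageType & 0x07) << 5) | 0x01) // mmmvvvvv: message type and version 1
    writeVarInt(out, seqId)
    writeVarInt(out, nameBytes.length)
    out.write(nameBytes, 0, nameBytes.length)
    out.toByteArray
  }
}
```

A `Call` to `ping` with sequence id 7 comes out as `82 21 07 04` followed by the four name bytes.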
+
+A server could automatically determine whether a client talks the binary protocol or the compact protocol by
+inspecting the first byte. If the value is `1000 0000` (strict) or `0000 0000` (old, assuming a name shorter than ±16 MB) it is the
+binary protocol. When the value is `1000 0010` it is the compact protocol.
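
A first-byte check of this kind might be sketched as below. The names are hypothetical. The sketch relies on the strict binary version word `0x8001` being written big endian, which puts `0x80` in the first byte, and on the old binary format starting with the high byte of a small name length.

```scala
object ProtocolSniffer {
  sealed trait Wire
  case object BinaryStrict extends Wire
  case object BinaryOld extends Wire
  case object Compact extends Wire
  case object Unknown extends Wire

  // Classifies a connection by the first byte of its first message.
  def detect(firstByte: Byte): Wire = (firstByte & 0xFF) match {
    case 0x80 => BinaryStrict // high byte of the version word 0x8001
    case 0x00 => BinaryOld    // high byte of a name length below 16 MB
    case 0x82 => Compact      // compact protocol id
    case _    => Unknown
  }
}
```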
+
+Message types are encoded with the following values:
+
+* _Call_: 1
+* _Reply_: 2
+* _Exception_: 3
+* _Oneway_: 4
+
+For a method name of 32 bytes, the binary protocol (strict) needs 44 bytes
and the compact protocol needs 36 to 40 bytes.
+
+## Method arguments, return types and exceptions
+
+TODO: method arguments are encoded as a struct
+
+TODO: return value are encoded as a ?
+
--- End diff --
... struct. They may also contain exceptions, which are technically just a
special kind of struct.
> Specify BinaryProtocol and CompactProtocol
> ------------------------------------------
>
> Key: THRIFT-3867
> URL: https://issues.apache.org/jira/browse/THRIFT-3867
> Project: Thrift
> Issue Type: Documentation
> Components: Documentation
> Reporter: Erik van Oosten
>
> It would be nice if the protocol(s) were specified somewhere. This
> should improve communication between developers, and it also opens the way for
> alternative implementations so that Thrift can thrive even better.
> I have a fairly complete description of the BinaryProtocol and
> CompactProtocol which I will submit as a patch for further review and
> discussion.