[GitHub] thrift pull request #1036: THRIFT-3867 Specify BinaryProtocol and CompactPro...

erikvanoosten Tue, 05 Jul 2016 12:18:53 -0700

Github user erikvanoosten commented on a diff in the pull request:

    https://github.com/apache/thrift/pull/1036#discussion_r69621679
  
    --- Diff: doc/specs/thrift-binary-protocol-encoding.md ---
    @@ -0,0 +1,467 @@
    +Thrift Protocol Encoding for BinaryProtocol and CompactProtocol
    +====================================================================
    +
    +Last Modified: 2016-Jun-29
    +
    +! WARNING !
    +
    +This document is _work in progress_ and should not (yet) be seen as an 
authoritative source of information.
    +
    +This text is submitted to the Thrift community for review and improvements.
    +
    +--------------------------------------------------------------------
    +
    +Licensed to the Apache Software Foundation (ASF) under one
    +or more contributor license agreements. See the NOTICE file
    +distributed with this work for additional information
    +regarding copyright ownership. The ASF licenses this file
    +to you under the Apache License, Version 2.0 (the
    +"License"); you may not use this file except in compliance
    +with the License. You may obtain a copy of the License at
    +
    +  http://www.apache.org/licenses/LICENSE-2.0
    +
    +Unless required by applicable law or agreed to in writing,
    +software distributed under the License is distributed on an
    +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +KIND, either express or implied. See the License for the
    +specific language governing permissions and limitations
    +under the License.
    +
    +--------------------------------------------------------------------
    +
    +There are many ways to encode Thrift on the wire. This document focuses 
on the wire encoding for service calls
    +(encoding and semantics) in Thrift's older *binary protocol* (which has 
not been documented before) and the
    +*compact protocol*. Both the regular socket transport (unframed) and the 
framed transport are described.
    +
    +Note that no effort is made to group descriptions of behavior of the 
Thrift server and the encodings used in the
    +Thrift wire format. The order in which things are described is such that 
you can read the document from top to bottom.
    +
    +The information here is mostly based on the Java implementation in the 
Apache thrift library (version 0.9.1) and
    +[THRIFT-110 A more compact 
format](https://issues.apache.org/jira/browse/THRIFT-110). Other 
implementations, however,
    +should behave the same.
    +
    +## Message exchange
    +
    +Both the binary protocol and the compact protocol assume a transport layer 
that exposes a bi-directional byte stream,
    +for example a TCP socket. Both use the following message exchange:
    +
    +1. Client sends a `TMessage` (type `Call`). The TMessage contains some 
metadata and the name of the method to invoke.
    +2. Client sends method arguments (a struct defined by the generated code).
    +3. Server sends a `TMessage` (type `Reply` or `Exception`) to start the 
response.
    +4. Server completes the response with a struct (a predefined struct or 
one defined by generated code).
    +
    +The pattern is a simple half duplex protocol where the parties alternate 
in sending a `TMessage` followed by a struct.
    +What these are is described below.
    +
    +Although the standard Apache Thrift Java clients do not support pipelining 
(sending multiple requests without waiting
    +for a response), the standard Apache Thrift Java servers do support it.
    +
    +## TMessage
    +
    +A *TMessage* contains the following information:
    +
    +* _Message type_, a message type, one of `Call`, `Reply`, `Exception` and 
`Oneway`.
    +* _Sequence id_, an int32 integer.
    +* _Name_, a string (can be empty).
    +
    +The *sequence id* is a simple message id assigned by the client. The 
server will use the same sequence id in the
    +TMessage of the response. The client uses this number to detect out of 
order responses. Each client has an int32 field
    +which is increased for each message. The sequence id simply wraps around 
when it overflows.
    +
    +The *name* indicates the service method name to invoke. The server uses 
the same name in the TMessage of the response.
    +
    +When the *multiplexed protocol* is used, the name contains the service 
name, a colon `:` and the method name. The
    +multiplexed protocol is not compatible with other protocols.
    +
    +The *message type* indicates what kind of message is sent.
    +
    +Clients send requests with TMessages of type `Call` or `Oneway` (step 1 in 
the protocol exchange). Servers send
    +responses with TMessages of type `Exception` or `Reply`.
    +
    +### Oneway
    +
    +Type `Oneway` is only used starting from Apache Thrift 0.9.3. Earlier 
versions do _not_ send TMessages of type `Oneway`,
    +even for service methods defined with the `oneway` modifier.
    +
    +When a client sends a request with type `Oneway`, the server must _not_ send 
a response (steps 3 and 4 are skipped).
    +Strangely enough (in the Java code generated by Apache Thrift 0.9.1 up to 
0.9.3), only responses of type `Reply` are
    +skipped. Responses of type `Exception` are always sent. There is no 
correct way to handle this situation from the client
    +perspective; you either wait for a response or not, you can't do both. 
Luckily this has been fixed _after_ Apache Thrift
    +0.9.3 (THRIFT-3479). My advice is to avoid oneway methods unless you know 
exactly what behavior your stack has.
    +
    +## Integer encoding
    +
    +In the _binary protocol_ integers are encoded with the most significant 
byte first (big endian byte order, aka network
    +order). An `int8` needs 1 byte, an `int16` 2, an `int32` 4 and an `int64` 
needs 8 bytes.
    +
    +The CPP version has the option to use the binary protocol with little 
endian order. Little endian gives a small but
    +noticeable performance boost because contemporary CPUs use little endian 
when storing integers to RAM.
    +
    +The _compact protocol_ uses multiple encodings for ints: the _zigzag int_, 
and the _var int_.
    +
    +Values of type `int32` and `int64` are first transformed to a *zigzag 
int*. A zigzag int folds positive and negative
    +numbers into the positive number space. When we read 0, 1, 2, 3 or 4 
from the wire, this is translated to 0, -1, 1,
    +-2 or 2 respectively. Here are the (Scala) formulas to convert from 
int32/int64 to a zigzag int and back:
    +
    +```scala
    +def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
    +def zigzagToInt(n: Int): Int = (n >>> 1) ^ - (n & 1)
    +def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63)
    +def zigzagToLong(n: Long): Long = (n >>> 1) ^ - (n & 1)
    +```
    +
    +The zigzag int is then encoded as a *var int*. Var ints take 1 to 5 bytes 
(int32) or 1 to 10 bytes (int64). The most
    +significant bit of each byte indicates if more bytes follow. The 
concatenation of the least significant 7 bits from each
    +byte forms the number, where the first byte holds the least significant 
bits (the Java implementation writes the low 7-bit group first, so the groups are in little endian order).
    +
    +Var ints are sometimes used directly inside the compact protocol to 
represent positive numbers.
    +
    +To encode an `int16` as zigzag int, it is first converted to an `int32` 
and then encoded as such. The type `int8` simply
    +uses a single byte as in the binary protocol.
    +
    +## Enum encoding
    +
    +The generated code encodes `Enum`s by taking the ordinal value and then 
encoding that as an int32.
    +
    +## String encoding
    +
    +*String*s are first encoded to UTF-8, and then sent as Binary. Binary 
encoding is described later.
    +
    +## Double encoding
    +
    +Values of type `double` are first converted to an int64 according to the 
IEEE 754 floating-point "double format" bit
    +layout. Most run-times provide a library to make this conversion. The 
binary protocol then encodes the int64 in 8 bytes
    +in big endian order; the Java compact protocol writes the same 8 bytes in little endian order.
    +
    +## TMessage encoding
    +
    +A `TMessage` on the wire looks as follows:
    +
    +```
    +Binary protocol (strict, 12+ bytes):
    ++--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
    +|1vvvvvvv|vvvvvvvv|unused  |00000mmm| name length                       | name                | seq id                            |
    ++--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
    +
    +Binary protocol (old, 9+ bytes):
    ++--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
    +| name length                       | name                |00000mmm| seq id                            |
    ++--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
    +```
    +
    +Where:
    +
    +* `vvvvvvvvvvvvvvv` is the version, an unsigned 15 bit number fixed to 
`1`. The leading bit must be `1`.
    +* `unused` is an ignored byte.
    +* `mmm` is the message type, an unsigned 3 bit integer. The 5 leading bits 
must be `0` as some clients (checked for Java in 0.9.1) take the whole byte.
    +* `name length` is the byte length of the name field, a signed 32 bit 
integer encoded in network (big endian) order (must be >= 0).
    +* `name` is the method name to invoke, a UTF-8 encoded string.
    +* `seq id` is the sequence id, a signed 32 bit integer encoded in network 
(big endian) order.
    +
    +Because name length must not be negative (therefore the first bit is always 
`0`), the first bit allows the receiver to see
    +whether the strict format or the old format is used. Therefore a server 
and client using the different variants of the
    +binary protocol can transparently talk with each other. However, when 
strict mode is enforced, the old format is
    +rejected.
    +
    +```
    +Compact protocol (4+ bytes):
    ++--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
    +|pppppppp|mmmvvvvv| seq id              | name length         | name                |
    ++--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
    +```
    +
    +Where:
    +
    +* `pppppppp` is the protocol id, fixed to `1000 0010`, 0x82
    +* `vvvvv` is the version, an unsigned 5 bit integer, fixed to `00001`, 0x1
    +* `mmm` is the message type, an unsigned 3 bit integer
    +* `seq id` is the sequence id, a signed 32 bit integer encoded as a var 
int.
    +* `name length` is the byte length of the name field, a signed 32 bit 
integer encoded as a var int (must be >= 0).
    +* `name` is the method name to invoke, a UTF-8 encoded string.
    +
    +A server could automatically determine whether a client talks the binary 
protocol or the compact protocol by
    +investigating the first byte. If the value is `1000 0001` or `0000 0000` 
(assuming a name shorter than ±16 MB) it is the
    +binary protocol. When the value is `1000 0010` it is talking the compact 
protocol.
    +
    +Message types are encoded with the following values:
    +
    +* _Call_: 1
    +* _Reply_: 2
    +* _Exception_: 3
    +* _Oneway_: 4
    +
    +For a method name of 32 bytes, the binary protocol (strict) needs 44 bytes 
and the compact protocol needs 36 to 40 bytes.
    +
    +## Method arguments, return types and exceptions
    +
    +TODO: method arguments are encoded as a struct
    +
    +TODO: return values are encoded as a ?
    +
    +Both the binary protocol and compact protocol encode the result the same way. 
The result is encoded as a struct with a
    +single field. The field-id of that field is `0`. The type of the field 
corresponds to the type of the service method
    +return type. When the return value is `null`, an empty struct is encoded.
    +
    +TODO: describe TupleProtocol also?
    +
    +TODO: exceptions are encoded as a struct with what field?
    +
    +### Structs and Unions
    +
    +Both the binary protocol and the compact protocol encode structs as a 
sequence of fields, followed by a stop field. Each
    +field starts with a field header and is followed by the encoded field 
value. The encoding follows this BNF (`*` means 0
    +or more times, parentheses are used for grouping):
    +
    +```
    +struct        => ( field-header field-value )* stop-field
    +field-header  => field-type field-id
    +```
    +
    +Because each field header contains the field-id (as defined by the IDL), 
the fields can be encoded in any order.
    +Thrift's type system is not extensible; you can only encode the primitive 
types and structs. Therefore it is also possible
    +to handle unknown fields while decoding; these are simply ignored. While 
decoding the field type can be used to
    +determine how to decode the field value.
    +
    +In the binary protocol the field header is always 3 bytes long. In the 
compact protocol the field header packs a lot
    +more cleverness. In most cases it is 1 byte long. In special cases it can 
grow to 2 or 3 bytes. When your field-ids are
    +very large it can even grow to 4 bytes.
    --- End diff --
    
    Field headers are more clever than that: they have a short and long form. 
In the short form they contain a field-id delta.
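
A minimal Python sketch of the short and long form, offered as an illustration rather than as the reference behavior. It assumes the layout used by the Java `TCompactProtocol` as far as I understand it: the short form packs a field-id delta of 1 to 15 and the type into one byte, and the long form sends the type byte followed by the field-id as a zigzag var int (low 7-bit group first). The helper names and the type code `5` for `i32` are assumptions made for this example and do not come from the diff above.

```python
I32 = 5  # assumed compact-protocol type code for i32 (illustrative)


def zigzag32(n: int) -> int:
    """Fold a signed 32-bit int into an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def varint(n: int) -> bytes:
    """Encode a non-negative int in 7-bit groups, least significant group first."""
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)


def field_header(field_type: int, field_id: int, last_field_id: int) -> bytes:
    """Build a compact field header relative to the previously written field-id."""
    delta = field_id - last_field_id
    if 0 < delta <= 15:
        # Short form: one byte, ddddtttt (delta in the high nibble, type in the low).
        return bytes([(delta << 4) | field_type])
    # Long form: 0000tttt, then the field-id itself as a zigzag var int.
    return bytes([field_type]) + varint(zigzag32(field_id))


if __name__ == "__main__":
    print(field_header(I32, 1, 0).hex())     # short form, 1 byte
    print(field_header(I32, 20, 1).hex())    # long form, 2 bytes
    print(field_header(I32, 1000, 0).hex())  # long form, 3 bytes
```

With consecutive field-ids the header stays at 1 byte; a jump of more than 15 (or a lower id than the previous one) falls back to the long form, which is where the 2 to 4 byte sizes mentioned in the diff would come from.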



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
