[GitHub] thrift pull request #1036: THRIFT-3867 Specify BinaryProtocol and CompactPro...

erikvanoosten Tue, 05 Jul 2016 12:10:06 -0700

Github user erikvanoosten commented on a diff in the pull request:

    https://github.com/apache/thrift/pull/1036#discussion_r69620426
  
    --- Diff: doc/specs/thrift-binary-protocol-encoding.md ---
    @@ -0,0 +1,467 @@
    +Thrift Protocol Encoding for BinaryProtocol and CompactProtocol
    +====================================================================
    +
    +Last Modified: 2016-Jun-29
    +
    +! WARNING !
    +
    +This document is _work in progress_ and should not (yet) be seen as an authoritative source of information.
    +
    +This text is submitted to the Thrift community for review and improvements.
    +
    +--------------------------------------------------------------------
    +
    +Licensed to the Apache Software Foundation (ASF) under one
    +or more contributor license agreements. See the NOTICE file
    +distributed with this work for additional information
    +regarding copyright ownership. The ASF licenses this file
    +to you under the Apache License, Version 2.0 (the
    +"License"); you may not use this file except in compliance
    +with the License. You may obtain a copy of the License at
    +
    +  http://www.apache.org/licenses/LICENSE-2.0
    +
    +Unless required by applicable law or agreed to in writing,
    +software distributed under the License is distributed on an
    +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +KIND, either express or implied. See the License for the
    +specific language governing permissions and limitations
    +under the License.
    +
    +--------------------------------------------------------------------
    +
    +There are many ways to encode Thrift on the wire. This document focuses on the wire encoding for service calls
    +(encoding and semantics) in Thrift's older *binary protocol* (which has not been documented before) and the
    +*compact protocol*. Both the regular socket transport (unframed) and the framed transport are described.
    +
    +Note that no effort is made to group descriptions of behavior of the Thrift server and the encodings used in the
    +Thrift wire format. The order in which things are described is such that you can read the document from top to bottom.
    +
    +The information here is mostly based on the Java implementation in the Apache thrift library (version 0.9.1) and
    +[THRIFT-110 A more compact format](https://issues.apache.org/jira/browse/THRIFT-110). Other implementations, however,
    +should behave the same.
    +
    +## Message exchange
    +
    +Both the binary protocol and the compact protocol assume a transport layer that exposes a bi-directional byte stream,
    +for example a TCP socket. Both use the following message exchange:
    +
    +1. Client sends a `TMessage` (type `Call`). The TMessage contains some metadata and the name of the method to invoke.
    +2. Client sends method arguments (a struct defined by the generated code).
    +3. Server sends a `TMessage` (type `Reply` or `Exception`) to start the response.
    +4. Server completes the response with a struct (a predefined struct or one defined by generated code).
    +
    +The pattern is a simple half duplex protocol where the parties alternate in sending a `TMessage` followed by a struct.
    +What these are is described below.
    +
    +Although the standard Apache Thrift Java clients do not support pipelining (sending multiple requests without waiting
    +for a response), the standard Apache Thrift Java servers do support it.
    +
    +## TMessage
    +
    +A *TMessage* contains the following information:
    +
    +* _Message type_, one of `Call`, `Reply`, `Exception` and `Oneway`.
    +* _Sequence id_, an int32 integer.
    +* _Name_, a string (can be empty).
    +
    +The *sequence id* is a simple message id assigned by the client. The server will use the same sequence id in the
    +TMessage of the response. The client uses this number to detect out of order responses. Each client has an int32 field
    +which is increased for each message. The sequence id simply wraps around when it overflows.
    +
    +The *name* indicates the service method name to invoke. The server uses the same name in the TMessage of the response.
    +
    +When the *multiplexed protocol* is used, the name contains the service name, a colon `:` and the method name. The
    +multiplexed protocol is not compatible with other protocols.
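
As a rough illustration of the naming rule above, the following Scala sketch joins and splits a multiplexed name. `MultiplexedName`, `join` and `split` are hypothetical helper names, not part of the Thrift library, and splitting on the first colon is an assumption rather than something the text specifies.

```scala
object MultiplexedName {
  // Builds "<service>:<method>" as described for the multiplexed protocol.
  def join(service: String, method: String): String = s"$service:$method"

  // Splits on the first colon (assumption); returns None when there is no
  // colon, which is what a non-multiplexed client would send.
  def split(name: String): Option[(String, String)] = {
    val i = name.indexOf(':')
    if (i < 0) None
    else Some((name.substring(0, i), name.substring(i + 1)))
  }
}
```

A plain method name such as `add` yields `None`, which hints at why a multiplexed server cannot serve clients that use the other protocols unchanged.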
    +
    +The *message type* indicates what kind of message is sent.
    +
    +Clients send requests with TMessages of type `Call` or `Oneway` (step 1 in the protocol exchange). Servers send
    +responses with TMessages of type `Exception` or `Reply`.
    +
    +### Oneway
    +
    +Type `Oneway` is only used starting from Apache Thrift 0.9.3. Earlier versions do _not_ send TMessages of type `Oneway`,
    +even for service methods defined with the `oneway` modifier.
    +
    +When a client sends a request with type `Oneway`, the server must _not_ send a response (steps 3 and 4 are skipped).
    +Strangely enough (in the Java code generated by Apache Thrift 0.9.1 up to 0.9.3), only responses of type `Reply` are
    +skipped. Responses of type `Exception` are always sent. There is no correct way to handle this situation from the client
    +perspective; you either wait for a response or not, you can't do both. Luckily this has been fixed _after_ Apache Thrift
    +0.9.3 (THRIFT-3479). My advice is to avoid oneway methods unless you know exactly what behavior your stack has.
    +
    +## Integer encoding
    +
    +In the _binary protocol_ integers are encoded with the most significant byte first (big endian byte order, aka network
    +order). An `int8` needs 1 byte, an `int16` 2, an `int32` 4 and an `int64` needs 8 bytes.
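
A minimal sketch of the fixed-width encoding described above, using `java.nio.ByteBuffer`, whose default byte order is big endian. The object and method names (`BinaryInts`, `encodeInt32`, and so on) are made up for illustration and do not appear in the Thrift library.

```scala
import java.nio.ByteBuffer

object BinaryInts {
  // ByteBuffer writes multi-byte values in big endian order by default,
  // which matches the byte order described for the binary protocol.
  def encodeInt16(n: Short): Array[Byte] = ByteBuffer.allocate(2).putShort(n).array()
  def encodeInt32(n: Int): Array[Byte] = ByteBuffer.allocate(4).putInt(n).array()
  def encodeInt64(n: Long): Array[Byte] = ByteBuffer.allocate(8).putLong(n).array()

  // Reads the first 4 bytes back as a big endian int32.
  def decodeInt32(bytes: Array[Byte]): Int = ByteBuffer.wrap(bytes).getInt()
}
```

For example, `encodeInt32(1)` gives the bytes `00 00 00 01`, and `encodeInt16(258)` gives `01 02`.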
    +
    +The C++ implementation has the option to use the binary protocol with little endian order. Little endian gives a small but
    +noticeable performance boost because contemporary CPUs use little endian when storing integers to RAM.
    +
    +The _compact protocol_ uses two encodings for ints: the _zigzag int_ and the _var int_.
    +
    +Values of type `int32` and `int64` are first transformed to a *zigzag int*. A zigzag int folds positive and negative
    +numbers into the positive number space. When we read 0, 1, 2, 3, 4 or 5 from the wire, this is translated to 0, -1, 1,
    +-2, 2 or -3 respectively. Here are the (Scala) formulas to convert from int32/int64 to a zigzag int and back:
    +
    +```scala
    +def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
    +def zigzagToInt(n: Int): Int = (n >>> 1) ^ - (n & 1)
    +def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63)
    +def zigzagToLong(n: Long): Long = (n >>> 1) ^ - (n & 1)
    +```
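
One way to check the formulas is to run them over the first few wire values and over the extremes of the int32 range. The sketch below repeats the two int32 formulas so that it is self-contained; `ZigZagCheck`, `decoded` and `roundTrips` are illustrative names only.

```scala
object ZigZagCheck {
  def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
  def zigzagToInt(n: Int): Int = (n >>> 1) ^ -(n & 1)

  // Values as read from the wire, decoded back to signed ints.
  val decoded: Seq[Int] = Seq(0, 1, 2, 3, 4, 5).map(zigzagToInt)

  // True when encoding followed by decoding returns the original value.
  def roundTrips(n: Int): Boolean = zigzagToInt(intToZigZag(n)) == n
}
```

`ZigZagCheck.decoded` contains 0, -1, 1, -2, 2, -3, and `roundTrips` also holds for `Int.MinValue` and `Int.MaxValue`.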
    +
    +The zigzag int is then encoded as a *var int*. Var ints take 1 to 5 bytes (int32) or 1 to 10 bytes (int64). The most
    +significant bit of each byte indicates if more bytes follow. The concatenation of the least significant 7 bits from each
    +byte forms the number, where the first byte has the least significant bits (so the 7-bit groups are in little endian order).
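
The following Scala sketch shows one possible var int encoder and decoder. It assumes the group order used by the Java `TCompactProtocol` writer, which emits the least significant 7-bit group first; `VarInt`, `encode` and `decode` are hypothetical names, and the input is treated as an unsigned 64-bit value.

```scala
import scala.collection.mutable.ArrayBuffer

object VarInt {
  // Emits 7 bits per byte, least significant group first (assumption stated
  // in the text above). The high bit of a byte is set when more bytes follow.
  def encode(value: Long): Array[Byte] = {
    val out = ArrayBuffer.empty[Byte]
    var n = value
    while ((n & ~0x7FL) != 0) {
      out += ((n & 0x7F) | 0x80).toByte
      n >>>= 7 // logical shift, so negative inputs terminate after 10 bytes
    }
    out += n.toByte
    out.toArray
  }

  // Reads bytes until one has its high bit clear and reassembles the groups.
  def decode(bytes: Array[Byte]): Long = {
    var result = 0L
    var shift = 0
    var i = 0
    var more = true
    while (more) {
      val b = bytes(i)
      result |= (b & 0x7FL) << shift
      more = (b & 0x80) != 0
      shift += 7
      i += 1
    }
    result
  }
}
```

With this sketch, 300 is encoded as the two bytes `AC 02`, and any value below 128 fits in a single byte.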
    +
    +Var ints are sometimes used directly inside the compact protocol to represent positive numbers.
    +
    +To encode an `int16` as zigzag int, it is first converted to an `int32` and then encoded as such. The type `int8` simply
    +uses a single byte as in the binary protocol.
    +
    +## Enum encoding
    +
    +The generated code encodes `Enum`s by taking the ordinal value and then encoding that as an int32.
    +
    +## String encoding
    +
    +*String*s are first encoded to UTF-8, and then sent as Binary. Binary encoding is described later.
    +
    +## Double encoding
    +
    +Values of type `double` are first converted to an int64 according to the IEEE 754 floating-point "double format" bit
    +layout. Most run-times provide a library to make this conversion. The binary protocol then encodes the int64 in 8 bytes
    +in big endian order; the Java compact protocol implementation writes the same 8 bytes in little endian order.
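
A minimal sketch of the conversion for the binary protocol only, using the standard `java.lang.Double.doubleToLongBits` and a big endian `ByteBuffer`. `DoubleEncoding` and its methods are illustrative names, not Thrift API.

```scala
import java.nio.ByteBuffer

object DoubleEncoding {
  // Reinterprets the double as its IEEE 754 bit pattern and writes the
  // resulting int64 in big endian order (ByteBuffer's default).
  def encode(d: Double): Array[Byte] = {
    val bits: Long = java.lang.Double.doubleToLongBits(d)
    ByteBuffer.allocate(8).putLong(bits).array()
  }

  // Reverses the steps: read a big endian int64, reinterpret as a double.
  def decode(bytes: Array[Byte]): Double =
    java.lang.Double.longBitsToDouble(ByteBuffer.wrap(bytes).getLong())
}
```

For example, `1.0` has the bit pattern `0x3FF0000000000000`, so it is written as `3F F0 00 00 00 00 00 00`.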
    +
    +## TMessage encoding
    +
    +A `TMessage` on the wire looks as follows:
    +
    +```
    +Binary protocol (strict, 12+ bytes):
    ++--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
    +|1vvvvvvv|vvvvvvvv|unused  |00000mmm| name length                       | name                | seq id                            |
    ++--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
    +
    +Binary protocol (old, 9+ bytes):
    ++--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
    +| name length                       | name                |00000mmm| seq id                            |
    ++--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
    +```
    +
    +Where:
    +
    +* `vvvvvvvvvvvvvvv` is the version, an unsigned 15 bit number fixed to `1`. The leading bit must be `1`.
    --- End diff --
    
    I don't understand. https://github.com/apache/thrift/blob/master/lib/cpp/src/thrift/protocol/TBinaryProtocol.h#L40 indicates that the version mask takes the first 2 bytes = 16 bits. The first bit, which must be `1`, is used to distinguish the format and is therefore not part of the version number.
    
    Would it be clearer if I say that the version is fixed to `000 0000 0000 0001`?
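
To make the point about the mask concrete, here is a rough Scala sketch of how the first 4 bytes of a strict message could be built and checked. It assumes the constants `VERSION_MASK = 0xffff0000` and `VERSION_1 = 0x80010000` from `TBinaryProtocol`, the 3-bit `mmm` message type field from the diagram, and the usual `TMessageType` numbering in which `Call` is 1. `StrictHeader`, `firstWord` and `parse` are hypothetical names.

```scala
object StrictHeader {
  val VersionMask: Int = 0xffff0000 // upper 16 bits: format bit + 15-bit version
  val Version1: Int = 0x80010000    // leading bit 1, version 000 0000 0000 0001

  // First int32 of a strict message: version word plus the message type.
  def firstWord(messageType: Int): Int = Version1 | (messageType & 0x07)

  // Returns the message type when the word is a strict version 1 header.
  // A non-negative word has its leading bit clear, so in the old format it
  // would be the name length instead.
  def parse(word: Int): Option[Int] =
    if (word < 0 && (word & VersionMask) == Version1) Some(word & 0x07)
    else None
}
```

Under these assumptions a `Call` starts with the bytes `80 01 00 01`, while a word such as `00 00 00 05` is read as an old-format name length of 5.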


