[ 
https://issues.apache.org/jira/browse/THRIFT-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15363053#comment-15363053
 ] 

ASF GitHub Bot commented on THRIFT-3867:
----------------------------------------

Github user erikvanoosten commented on a diff in the pull request:

    https://github.com/apache/thrift/pull/1036#discussion_r69621679
  
    --- Diff: doc/specs/thrift-binary-protocol-encoding.md ---
    @@ -0,0 +1,467 @@
    +Thrift Protocol Encoding for BinaryProtocol and CompactProtocol
    +====================================================================
    +
    +Last Modified: 2016-Jun-29
    +
    +! WARNING !
    +
    +This document is _work in progress_ and should not (yet) be seen as an 
authoritative source of information.
    +
    +This text is submitted to the Thrift community for review and improvements.
    +
    +--------------------------------------------------------------------
    +
    +Licensed to the Apache Software Foundation (ASF) under one
    +or more contributor license agreements. See the NOTICE file
    +distributed with this work for additional information
    +regarding copyright ownership. The ASF licenses this file
    +to you under the Apache License, Version 2.0 (the
    +"License"); you may not use this file except in compliance
    +with the License. You may obtain a copy of the License at
    +
    +  http://www.apache.org/licenses/LICENSE-2.0
    +
    +Unless required by applicable law or agreed to in writing,
    +software distributed under the License is distributed on an
    +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +KIND, either express or implied. See the License for the
    +specific language governing permissions and limitations
    +under the License.
    +
    +--------------------------------------------------------------------
    +
    +There are many ways to encode Thrift on the wire. This document focuses on the wire encoding for service calls
    +(encoding and semantics) in the older Thrift *binary protocol* (which has not been documented before) and the
    +*compact protocol*. Both the regular socket transport (unframed) and the framed transport are described.
    +
    +Note that no effort is made to group descriptions of behavior of the 
Thrift server and the encodings used in the
    +Thrift wire format. The order in which things are described is such that 
you can read the document from top to bottom.
    +
    +The information here is mostly based on the Java implementation in the Apache Thrift library (version 0.9.1) and
    +[THRIFT-110 A more compact format](https://issues.apache.org/jira/browse/THRIFT-110). Other implementations, however,
    +should behave the same.
    +
    +## Message exchange
    +
    +Both the binary protocol and the compact protocol assume a transport layer 
that exposes a bi-directional byte stream,
    +for example a TCP socket. Both use the following message exchange:
    +
    +1. Client sends a `TMessage` (type `Call`). The TMessage contains some 
metadata and the name of the method to invoke.
    +2. Client sends method arguments (a struct defined by the generated code).
    +3. Server sends a `TMessage` (type `Reply` or `Exception`) to start the response.
    +4. Server completes the response with a struct (a predefined struct or one defined by generated code).
    +
    +The pattern is a simple half duplex protocol where the parties alternate 
in sending a `TMessage` followed by a struct.
    +What these are is described below.
    +
    +Although the standard Apache Thrift Java clients do not support pipelining (sending multiple requests without waiting
    +for a response), the standard Apache Thrift Java servers do support it.
    +
    +## TMessage
    +
    +A *TMessage* contains the following information:
    +
    +* _Message type_, one of `Call`, `Reply`, `Exception` and `Oneway`.
    +* _Sequence id_, an int32 integer.
    +* _Name_, a string (can be empty).
    +
    +The *sequence id* is a simple message id assigned by the client. The server will use the same sequence id in the
    +TMessage of the response. The client uses this number to detect out-of-order responses. Each client has an int32 field
    +which is increased for each message. The sequence id simply wraps around when it overflows.
    +
    +The *name* indicates the service method name to invoke. The server uses 
the same name in the TMessage of the response.
    +
    +When the *multiplexed protocol* is used, the name contains the service 
name, a colon `:` and the method name. The
    +multiplexed protocol is not compatible with other protocols.
    +
    +The *message type* indicates what kind of message is sent.
    +
    +Clients send requests with TMessages of type `Call` or `Oneway` (step 1 in 
the protocol exchange). Servers send
    +responses with TMessages of type `Exception` or `Reply`.
    +
    +### Oneway
    +
    +Type `Oneway` is only used starting from Apache Thrift 0.9.3. Earlier 
versions do _not_ send TMessages of type `Oneway`,
    +even for service methods defined with the `oneway` modifier.
    +
    +When a client sends a request with type `Oneway`, the server must _not_ send a response (steps 3 and 4 are skipped).
    +Strangely enough (in the Java code generated by Apache Thrift 0.9.1 up to 0.9.3), only responses of type `Reply` are
    +skipped. Responses of type `Exception` are always sent. There is no correct way to handle this situation from the client
    +perspective; you either wait for a response or not, you can't do both. Luckily this has been fixed _after_ Apache Thrift
    +0.9.3 (THRIFT-3479). My advice is to avoid oneway methods unless you know exactly what behavior your stack has.
    +
    +## Integer encoding
    +
    +In the _binary protocol_ integers are encoded with the most significant 
byte first (big endian byte order, aka network
    +order). An `int8` needs 1 byte, an `int16` 2, an `int32` 4 and an `int64` 
needs 8 bytes.
    +
    +The C++ version has the option to use the binary protocol with little endian order. Little endian gives a small but
    +noticeable performance boost because most contemporary CPUs store integers in little endian order.
    +
    +The _compact protocol_ uses multiple encodings for ints: the _zigzag int_, 
and the _var int_.
    +
    +Values of type `int32` and `int64` are first transformed to a *zigzag int*. A zigzag int folds positive and negative
    +numbers into the positive number space. When we read 0, 1, 2, 3, 4 or 5 from the wire, this is translated to 0, -1, 1,
    +-2, 2 or -3 respectively. Here are the (Scala) formulas to convert from int32/int64 to a zigzag int and back:
    +
    +```scala
    +def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
    +def zigzagToInt(n: Int): Int = (n >>> 1) ^ - (n & 1)
    +def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63)
    +def zigzagToLong(n: Long): Long = (n >>> 1) ^ - (n & 1)
    +```
    +
    +The zigzag int is then encoded as a *var int*. Var ints take 1 to 5 bytes (int32) or 1 to 10 bytes (int64). The most
    +significant bit of each byte indicates if more bytes follow. The concatenation of the least significant 7 bits from each
    +byte forms the number, where the first byte has the least significant bits (so the 7-bit groups are in little endian order).
    +
    +Var ints are sometimes used directly inside the compact protocol to 
represent positive numbers.
    +
    +To encode an `int16` as zigzag int, it is first converted to an `int32` 
and then encoded as such. The type `int8` simply
    +uses a single byte as in the binary protocol.
    +
    +## Enum encoding
    +
    +The generated code encodes `Enum`s by taking the ordinal value and then 
encoding that as an int32.
    +
    +## String encoding
    +
    +*String*s are first encoded to UTF-8, and then sent as Binary. Binary encoding is described later.
    +
    +## Double encoding
    +
    +Values of type `double` are first converted to an int64 according to the IEEE 754 floating-point "double format" bit
    +layout. Most run-times provide a library to make this conversion. Both the binary protocol and the compact protocol then
    +encode the int64 in 8 bytes in big endian order.
    +
    +## TMessage encoding
    +
    +A `TMessage` on the wire looks as follows:
    +
    +```
    +Binary protocol (strict, 12+ bytes):
    ++--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
    +|1vvvvvvv|vvvvvvvv|unused  |00000mmm| name length                       | name                | seq id                            |
    ++--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
    +
    +Binary protocol (old, 9+ bytes):
    ++--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
    +| name length                       | name                |00000mmm| seq id                            |
    ++--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
    +```
    +
    +Where:
    +
    +* `vvvvvvvvvvvvvvv` is the version, an unsigned 15 bit number fixed to `1`. The leading bit must be `1`.
    +* `unused` is an ignored byte.
    +* `mmm` is the message type, an unsigned 3 bit integer. The 5 leading bits must be `0` as some clients (checked for Java in 0.9.1) take the whole byte.
    +* `name length` is the byte length of the name field, a signed 32 bit integer encoded in network (big endian) order (must be >= 0).
    +* `name` is the method name to invoke, a UTF-8 encoded string.
    +* `seq id` is the sequence id, a signed 32 bit integer encoded in network (big endian) order.
    +
    +Because name length must be non-negative (therefore the first bit is always `0`), the first bit allows the receiver to see
    +whether the strict format or the old format is used. Therefore a server and client using different variants of the
    +binary protocol can transparently talk with each other. However, when strict mode is enforced, the old format is
    +rejected.
    +
    +```
    +Compact protocol (4+ bytes):
    ++--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
    +|pppppppp|mmmvvvvv| seq id              | name length         | name                |
    ++--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
    +```
    +
    +Where:
    +
    +* `pppppppp` is the protocol id, fixed to `1000 0010`, 0x82
    +* `vvvvv` is the version, an unsigned 5 bit integer, fixed to `00001`, 0x1
    +* `mmm` is the message type, an unsigned 3 bit integer
    +* `seq id` is the sequence id, a signed 32 bit integer encoded as a var int.
    +* `name length` is the byte length of the name field, a signed 32 bit integer encoded as a var int (must be >= 0).
    +* `name` is the method name to invoke, a UTF-8 encoded string.
    +
    +A server could automatically determine whether a client talks the binary protocol or the compact protocol by
    +investigating the first byte. If the value is `1000 0001` or `0000 0000` (assuming a name shorter than ±16 MB) it is the
    +binary protocol. When the value is `1000 0010` it is the compact protocol.
    +
    +Message types are encoded with the following values:
    +
    +* _Call_: 1
    +* _Reply_: 2
    +* _Exception_: 3
    +* _Oneway_: 4
    +
    +For a method name of 32 bytes, the binary protocol (strict) needs 44 bytes 
and the compact protocol needs 36 to 40 bytes.
    +
    +## Method arguments, return types and exceptions
    +
    +TODO: method arguments are encoded as a struct
    +
    +TODO: return values are encoded as a ?
    +
    +Both the binary protocol and compact protocol encode the result the same. 
The result is encoded as a struct with a
    +single field. The field-id of that field is `0`. The type of the field 
corresponds to the type of the service method
    +return type. When the return value is `null`, an empty struct is encoded.
    +
    +TODO: describe TupleProtocol also?
    +
    +TODO: exceptions are encoded as a struct with what field?
    +
    +### Structs and Unions
    +
    +Both the binary protocol and the compact protocol encode structs as a sequence of fields, followed by a stop field. Each
    +field starts with a field header and is followed by the encoded field value. The encoding follows this BNF (`*` means 0
    +or more times, parentheses are used for grouping):
    +
    +```
    +struct        => ( field-header field-value )* stop-field
    +field-header  => field-type field-id
    +```
    +
    +Because each field header contains the field-id (as defined by the IDL), the fields can be encoded in any order.
    +Thrift's type system is not extensible; you can only encode the primitive types and structs. Therefore it is also possible
    +to handle unknown fields while decoding; these are simply ignored. While decoding, the field type can be used to
    +determine how to decode the field value.
    +
    +In the binary protocol the field header is always 3 bytes long. In the 
compact protocol the field header packs a lot
    +more cleverness. In most cases it is 1 byte long. In special cases it can 
grow to 2 or 3 bytes. When your field-ids are
    +very large it can even grow to 4 bytes.
    --- End diff --
    
    Field headers are more clever than that: they have a short and a long form. In the short form they contain a field-id delta.
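
To make the short and long form concrete, here is a rough Scala sketch. It is not taken from the pull request. It assumes the layout that the Java `TCompactProtocol` appears to use: when the new field-id is 1 to 15 higher than the previous one, the header is a single byte with the delta in the upper 4 bits and the compact type id in the lower 4 bits; otherwise the header is a type byte with a zero delta, followed by the field-id as a zigzag var int (least significant 7-bit group first). The object name `CompactFieldHeader`, the `encode` signature and the use of `Seq[Int]` to hold byte values are illustrative only and not part of any Thrift API.

```scala
object CompactFieldHeader {
  // Zigzag folding, same formula as in the quoted document.
  def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)

  // Var int: 7 payload bits per byte, least significant group first;
  // the high bit of a byte is set when more bytes follow.
  def varInt(n: Int): Seq[Int] = {
    var v = n
    val out = Seq.newBuilder[Int]
    while ((v & ~0x7f) != 0) {
      out += ((v & 0x7f) | 0x80)
      v >>>= 7
    }
    out += v
    out.result()
  }

  // compactType: assumed 4-bit type id of the field (for example 5 for an i32).
  // fieldId:     field-id from the IDL.
  // lastFieldId: field-id of the previously written field in this struct (0 at the start).
  def encode(compactType: Int, fieldId: Int, lastFieldId: Int): Seq[Int] = {
    val delta = fieldId - lastFieldId
    if (delta > 0 && delta <= 15)
      Seq((delta << 4) | compactType)                  // short form: ddddtttt
    else
      compactType +: varInt(intToZigZag(fieldId))      // long form: 0000tttt + field-id
  }
}
```

Under these assumptions `CompactFieldHeader.encode(5, 1, 0)` yields the single byte `0x15`, while `CompactFieldHeader.encode(5, 100, 0)` falls back to the long form and yields three bytes. This would explain why the header is usually 1 byte and only grows for large or non-ascending field-ids.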


> Specify BinaryProtocol and CompactProtocol
> ------------------------------------------
>
>                 Key: THRIFT-3867
>                 URL: https://issues.apache.org/jira/browse/THRIFT-3867
>             Project: Thrift
>          Issue Type: Documentation
>          Components: Documentation
>            Reporter: Erik van Oosten
>
> It would be nice if the protocol(s) were specified somewhere. This 
> should improve communication between developers, and also open the way for 
> alternative implementations so that Thrift can thrive even better.
> I have a fairly complete description of the BinaryProtocol and 
> CompactProtocol which I will submit as a patch for further review and 
> discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
