Re: connecting C to thrift

Mayan Moudgill Sun, 14 Feb 2010 13:11:07 -0800


Mayan Moudgill wrote:

I was working on getting a C port of the Thrift backend working, but after looking at some of the issues involved, I decided that it made more sense to ditch the Thrift IDL and use a different approach to access the Thrift marshalling and RPC infrastructure.
I've described the approach in the attached document.
I wrote a prototype that implements about half of the features described in the document; I can post the source code and some examples a little later if there is interest.
I'm focusing on the client-side stubs. I suspect that the stubs generated by the prototype are several times faster, and may even be orders of magnitude faster, than the equivalent stubs generated from the Thrift IDL.

Hmmm... looks like mailing lists don't like attachments. OK, so I've inlined the document below (it's been exported as a MediaWiki .txt file).


= Using Thrift with C =
= Mayan Moudgill =
= Introduction =

Thrift is an RPC framework developed at Facebook, and then open-sourced as a project in the Apache Software Foundation Incubator. It consists of a code generation engine and libraries to allow services to be built that can work seamlessly across a variety of different languages, including C++ & Python. One of the languages that is conspicuously not supported is C.

During an attempt to port the Thrift framework to C, it became fairly clear that, while Thrift allowed services to be specified and built quickly by auto-generating the inter-process communication code and most of the other code needed for building services, the resulting code was not very efficient.

In this paper, we describe an alternative approach that leverages the existing Thrift infrastructure using C as the specification and implementation language. This allows us to develop client stubs that have extremely low overhead.


== Overview ==

Thrift is a framework primarily for developing client-server style services. It has several components, including:

* A method for describing data-structures that are to be communicated between the client & server.
* A method for describing the (RPC) functions that are to be called from the client side and executed on the server side.

Note that these descriptions are type declarations; in the case of the RPCs, the behavior of the function is not defined, and it is up to the server-programmer to write the actual body.

Thrift also provides a method for specifying other behaviors, including auto-generating a server template, but for now we shall ignore those components.

Also, the framework does not necessarily need to be used for traditional client-server applications; one application is to store & retrieve the “RPC messages” on disk.


== Client/Server Communication ==

The Thrift compiler takes the specification of the RPCs and the data structures, and generates code that will execute the RPC. Thus, given a specification for an RPC <tt>foo()</tt>, the Thrift framework will generate code so that a client can call a function <tt>foo()</tt> with some arguments and have the function <tt>foo()</tt> evaluated on the server with those arguments. The generated code consists of stubs on the client & server that:


* On the client side, package the function & the arguments
* Send it over some transport link to the server side
* On the server side, un-package the function and the arguments
* Invoke the function in the context of some server
* Package up the results of the function
* Send it over the transport link to the client side
* On the client side, un-package the function result and return it as the result of the client-side stub.


The Thrift framework provides several layers of abstraction, including:

* The stub: generated by the compiler, it contains calls to the protocol layer to package up the primitive elements of the arguments being passed.
* The protocol: the methods for packaging & un-packaging the primitives to be sent over the transport layer.
* The transport: the means for shipping the bytes from the client to the server and back.

This layering allows for a fairly large amount of flexibility: in particular, it is possible at run-time to pick different combinations of protocols (i.e. different ways of packaging the data) and transports (i.e. different ways of shipping the data). For instance, given a Binary protocol and a Text protocol and a TCP-based transport and a UDP-based transport, it is possible to select any of 4 combinations, {Binary over TCP, Text over TCP, Binary over UDP, Text over UDP}, and to do this selection at run-time.
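
To make the layering concrete, the following is a minimal sketch of how a stub written against the two layers might look in C. Everything here is an assumption made for illustration: the <tt>struct protocol</tt> and <tt>struct transport</tt> tables, the decimal "Text" encoding, and the use of <tt>write()</tt> and <tt>send()</tt> to stand in for the TCP and UDP transports are not taken from the Thrift libraries.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Protocol layer (hypothetical): how one i32 is packaged into bytes. */
struct protocol {
    const char *name;
    size_t (*put_i32)(unsigned char *out, size_t cap, int32_t v);
};

/* Transport layer (hypothetical): how packaged bytes are shipped. */
struct transport {
    const char *name;
    ssize_t (*ship)(int fd, const void *buf, size_t len);
};

/* Binary packaging: 4 bytes, big-endian. */
static size_t binary_put_i32(unsigned char *out, size_t cap, int32_t v)
{
    uint32_t u = (uint32_t)v;
    if (cap < 4) return 0;
    out[0] = (unsigned char)(u >> 24);
    out[1] = (unsigned char)(u >> 16);
    out[2] = (unsigned char)(u >> 8);
    out[3] = (unsigned char)u;
    return 4;
}

/* Illustrative text packaging: decimal digits, no terminator sent. */
static size_t text_put_i32(unsigned char *out, size_t cap, int32_t v)
{
    int n = snprintf((char *)out, cap, "%ld", (long)v);
    return (n < 0 || (size_t)n >= cap) ? 0 : (size_t)n;
}

static ssize_t stream_ship(int fd, const void *buf, size_t len)
{
    return write(fd, buf, len);
}

static ssize_t dgram_ship(int fd, const void *buf, size_t len)
{
    return send(fd, buf, len, 0);
}

static const struct protocol PROTOCOLS[] = {
    { "Binary", binary_put_i32 }, { "Text", text_put_i32 },
};
static const struct transport TRANSPORTS[] = {
    { "TCP", stream_ship }, { "UDP", dgram_ship },
};

/* A layered stub: one indirect call per layer for every argument sent. */
static ssize_t send_i32(const struct protocol *p, const struct transport *t,
                        int fd, int32_t v)
{
    unsigned char buf[16];
    size_t n;

    assert(p != NULL && t != NULL);
    n = p->put_i32(buf, sizeof buf, v);
    return n ? t->ship(fd, buf, n) : -1;
}
```

Any of the four combinations can be chosen at run-time by indexing the two tables, e.g. <tt>send_i32(&PROTOCOLS[1], &TRANSPORTS[0], fd, 42)</tt> for Text over TCP. The cost is the pair of indirect calls on every primitive.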


== Monolithic Implementation ==

Having multiple layers of abstraction adds flexibility; however, as usual, flexibility comes at the price of performance. If we picked a particular protocol & transport, then the Thrift compiler could generate code for the stubs that would package the arguments and invoke the transport layer without going through the intervening abstraction layers.

In the specific case of using the Thrift TBinaryProtocol over the TSocket transport, the client stub could, for instance, in certain circumstances be generated to use a fixed-size buffer with most of the values already filled in. The client stub would then insert arguments where necessary, and issue a single write call to send it over a socket.

For an extreme example, consider the case of the <tt>void ping(void)</tt> RPC defined in tutorial.thrift. An optimized implementation<ref name="ftn1">This is just an example; the actual implementation would have additional error checking code, as well as support for retry on EINTR etc.</ref> of the client-side stub in C would be:


 <nowiki>#include <unistd.h></nowiki>
 void
 ping( int socket_fd )
 {
   <nowiki>static unsigned char send_buf[] = { </nowiki>
     0x80, 0x01, 0x00, 0x01,                     /* version 1, T_CALL */
     0x00, 0x00, 0x00, 0x04, 'p', 'i', 'n', 'g', /* name length, name */
     0x00, 0x00, 0x00, 0x00, 0x00                /* sequence id, T_STOP */
   };
   <nowiki>unsigned char recv_buf[17];</nowiki>  /* the reply has the same size */

   write(socket_fd, send_buf, sizeof(send_buf));
   read(socket_fd, recv_buf, sizeof(recv_buf));
 }

As might be expected, this has considerably lower overhead than the C++ implementation currently being generated by the cpp option of the Thrift compiler.
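
For comparison, here is a sketch of what the same stub might look like with the error checking and EINTR retry that the footnote mentions. The helper names <tt>write_full</tt> and <tt>read_full</tt>, the name <tt>ping_checked</tt> and the return codes are assumptions made for this illustration; the sketch also assumes that the server answers with the 17-byte T_REPLY message, so an exception reply of a different length is not handled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on EINTR and on short writes.
 * Returns 0 on success, -1 on error. */
static int write_full(int fd, const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes, retrying on EINTR and on short reads.
 * Returns 0 on success, -1 on error or premature end of stream. */
static int read_full(int fd, unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = read(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        if (n == 0) return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Returns 0 on a well-formed reply, -1 on an I/O error,
 * -2 if the reply is not the expected T_REPLY message. */
int ping_checked(int socket_fd)
{
    static const unsigned char send_buf[] = {
        0x80, 0x01, 0x00, 0x01,                     /* version 1, T_CALL */
        0x00, 0x00, 0x00, 0x04, 'p', 'i', 'n', 'g', /* name length, name */
        0x00, 0x00, 0x00, 0x00,                     /* sequence id */
        0x00                                        /* T_STOP */
    };
    unsigned char recv_buf[17];

    if (write_full(socket_fd, send_buf, sizeof send_buf) != 0) return -1;
    if (read_full(socket_fd, recv_buf, sizeof recv_buf) != 0) return -1;

    /* Expect version 1 with message type 2 (T_REPLY), ending in T_STOP. */
    if (recv_buf[0] != 0x80 || recv_buf[1] != 0x01 ||
        recv_buf[3] != 0x02 || recv_buf[16] != 0x00)
        return -2;
    return 0;
}
```

Even with the checks, the stub remains one buffer, one write loop and one read loop.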


== C-compatible specification ==

The Thrift compiler starts off with a description of the data and RPCs written in a Thrift-specific IDL syntax, and converts them into data-structure & stub headers and files that can be called by the rest of the client program. However, there is no reason why it has to be that particular syntax. In fact, one could just as easily use C syntax, with Thrift-specific syntax buried in comments.

Hypothetically, one could rewrite the data-structure defined in tutorial.thrift using the Thrift IDL syntax:


 struct Work {
   1: i32 num1,
   2: i32 num2,
   3: Operation op,
   4: optional string comment,
 }

using a C compatible description syntax:

 struct Work {
   int num1; /* @thrift: 1 */
   int num2; /* @thrift: 2 */
   enum Operation op; /* @thrift: 3 */
   char * comment; /* @thrift: 4 optional */
 };

= C-Based Definition =

In this section, we shall describe a proposed set of annotations to C structure and function declarations that will allow a stub-compiler to produce stubs similar to those produced by the existing Thrift infrastructure. The initial focus shall be on producing efficient code for the TBinaryProtocol over the TSocket transport, but at first glance, it appears likely that the same set of annotations will allow a stub-compiler to generate code for other protocol/transport combinations.


== Preliminaries ==

The Thrift annotations shall be embedded inside comments in the C code. The Thrift annotations will use the syntax:


 /* @thrift: … */

The section of code containing declarations for the stub-compiler to process shall be demarcated with <tt>begin</tt> and <tt>end</tt> annotations, shown below. The stub-compiler shall skip all other parts of the input. There shall be no variable definitions or declarations between the begin and end.


 /* stub-compiler ignores this code */
 /* '''@thrift: begin''' */
   /* code for the stub-compiler to process */
   …
 /* '''@thrift: end''' */

Annotations can be global, or associated with structures, fields, functions or arguments. The annotations shall appear after the closing '<nowiki>;</nowiki>' or '<tt>,</tt>' for that syntax element<ref name="ftn2">For the last argument in a function, it shall appear between the name of the argument and the closing '<tt>)</tt>'.</ref>. Thus,


 /* @thrift: begin */
   /* @thrift: default-field-init 1 */ /* global annotation */
   struct Work {
      …
      char * comment; /* @thrift: optional */ /* field annotation */
   };

   struct InvalidOperation {
      …
   }; /* @thrift: exception */ /* structure annotation */

   int add(
         int socket_fd,
         int num1, /* @thrift: 1 */ /* argument annotations */
         int num2  /* @thrift: 2 */
         );
   void zip(int sfd); /* @thrift: oneway */
 /* @thrift: end */

Multiple annotations can be grouped in the same comment.

The first argument to a function will be the socket file descriptor. It shall be used only for reading & writing, and obviously is not counted as an argument field.


== Basic types ==

There is a straightforward equivalence between C types, the base types defined by the Thrift IDL and the TType defined in TProtocol. They are summarized by the table below:



{| class="prettytable"
! <center>C</center>
! <center>IDL</center>
! <center>TType</center>

|-

| bool<ref name="ftn3"><tt>bool</tt> is introduced as a type in C99. If the stub-compiler is supporting C99, then it should probably recognize all the stdint.h types as well.</ref>

| bool

| T_BOOL<ref name="ftn4">In version 0.2.0, the Python libraries generate T_I08 instead of T_BOOL.</ref>


|-
| signed char, char, unsigned char
| byte
| T_I08

|-
| signed short, short, unsigned short
| i16
| T_I16

|-
| signed int, int, unsigned int, signed long, long, unsigned long
| i32
| T_I32

|-
| signed long long, long long, unsigned long long
| i64
| T_I64

|-
| float, double
| double
| T_DOUBLE

|}

The annotations bool, byte, i16, i32 and i64 allow this equivalence to be overridden. Thus, to declare a function as returning a boolean, use the following annotation:


 int foo(...); /* @thrift: bool */
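
As an illustration of what the table implies on the wire, the sketch below writes one <tt>i32</tt> field the way TBinaryProtocol lays it out: the TType byte, the 16-bit field identifier and the 32-bit value, all big-endian. The function name <tt>put_i32_field</tt> is an assumption made for this example; the numeric TType codes are those used by TBinaryProtocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* TType codes for the base types in the table above. */
enum ttype {
    T_BOOL = 2, T_I08 = 3, T_DOUBLE = 4, T_I16 = 6, T_I32 = 8, T_I64 = 10
};

/* Write the field header (type byte, 16-bit identifier) followed by the
 * 32-bit value, all big-endian. The caller must supply at least 7 bytes.
 * Returns the number of bytes written. */
size_t put_i32_field(unsigned char *out, int16_t field_id, int32_t value)
{
    uint16_t id = (uint16_t)field_id;
    uint32_t v = (uint32_t)value;

    assert(out != NULL);
    out[0] = T_I32;
    out[1] = (unsigned char)(id >> 8);
    out[2] = (unsigned char)id;
    out[3] = (unsigned char)(v >> 24);
    out[4] = (unsigned char)(v >> 16);
    out[5] = (unsigned char)(v >> 8);
    out[6] = (unsigned char)v;
    return 7;
}
```

An annotation such as <tt>/* @thrift: bool */</tt> would change only the type byte and the width of the value that follows.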

== String ==

C implements strings as NULL-terminated arrays of <tt>char</tt>s. Unless overridden, <tt>char *</tt> fields, arguments, and function returns will be treated as equivalent to the Thrift <tt>string</tt> type. They shall be transmitted using the corresponding TBinaryProtocol; i.e. using <tt>T_STRING</tt> followed by the 4-byte length of the string followed by the characters not including the terminating NULL.

The stub for an RPC function returning a string shall allocate space for the string plus the terminating NULL, copy the received bytes into the allocated memory and add the terminating NULL.
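
A minimal sketch of the two directions follows. It covers only the string value itself (the 4-byte big-endian length and the characters); the <tt>T_STRING</tt> type byte and the field identifier are left out. The names <tt>put_string</tt> and <tt>get_string</tt> and the buffer-based interface are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Append the 4-byte big-endian length and the characters of s, without
 * the terminator, to out. Returns the number of bytes written, or 0 if
 * the buffer is too small. */
size_t put_string(unsigned char *out, size_t cap, const char *s)
{
    size_t len;

    assert(out != NULL && s != NULL);
    len = strlen(s);
    if (len > INT32_MAX || cap < 4 || cap - 4 < len) return 0;
    out[0] = (unsigned char)(len >> 24);
    out[1] = (unsigned char)(len >> 16);
    out[2] = (unsigned char)(len >> 8);
    out[3] = (unsigned char)len;
    memcpy(out + 4, s, len);
    return 4 + len;
}

/* Decode a string value: allocate length + 1 bytes, copy the received
 * characters and add the terminator. Returns NULL on malformed input or
 * allocation failure; on success *used holds the bytes consumed and the
 * caller owns the returned memory. */
char *get_string(const unsigned char *in, size_t avail, size_t *used)
{
    uint32_t len;
    char *s;

    assert(in != NULL && used != NULL);
    if (avail < 4) return NULL;
    len = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
          ((uint32_t)in[2] << 8) | (uint32_t)in[3];
    if (len > INT32_MAX || avail - 4 < len) return NULL;
    s = malloc((size_t)len + 1);
    if (s == NULL) return NULL;
    memcpy(s, in + 4, len);
    s[len] = '\0';
    *used = 4 + (size_t)len;
    return s;
}
```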


== Structures ==

There is, again, a fairly obvious equivalence between C structs and Thrift structs. We shall not allow anonymous C structs in the code the stub-compiler will process.

The C-struct fields can be annotated with a field identifier, using the syntax <tt>/* @thrift: 1 */</tt>. Where no identifier is given, the stub-compiler shall automatically number the fields in a structure, starting at 1 and incrementing by 1 for each field. The starting value and the increment can be overridden by the global annotations <tt>default-field-init</tt> and <tt>default-field-incr</tt>. If a field identifier is specified, then the current field is set to that value, and the next field will be the result of incrementing the specified identifier.
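
The numbering rule can be sketched as a small function. The name <tt>number_fields</tt> and the convention that 0 means "no explicit identifier" are assumptions made for this example, and the sketch does not model how skipped fields interact with the numbering.

```c
#include <assert.h>
#include <stddef.h>

/* Assign field identifiers to n fields. explicit_id[i] > 0 means field i
 * carries an explicit identifier annotation; 0 means it has none.
 * init and incr correspond to default-field-init and default-field-incr. */
void number_fields(const int *explicit_id, size_t n,
                   int init, int incr, int *out_id)
{
    int next = init;
    size_t i;

    assert(explicit_id != NULL && out_id != NULL);
    for (i = 0; i < n; i++) {
        if (explicit_id[i] > 0)
            next = explicit_id[i];  /* explicit identifier resets the counter */
        out_id[i] = next;
        next += incr;               /* the following field continues from here */
    }
}
```

For four fields of which only the third is annotated with 10, the defaults give the identifiers 1, 2, 10 and 11.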

Fields may be annotated as optional, with equivalent behavior to the Thrift optional. If the type of the optional field is a pointer, then a NULL pointer indicates that the field is not to be transmitted. The isset interface may also be used to indicate whether the field is to be transmitted or not.

Fields may be annotated as skip, which means that the field is not to be transmitted or received.


There are three isset interfaces defined.

* Bit-vector: A field in the struct is annotated as the isset field. It will be used as a bit vector indexed by the field identifiers, where 1 means the field is present and 0 means not present. Bit 0 of the bit vector corresponds to the smallest field identifier. There must be one bit for every value between the smallest and largest field identifier, even if some identifiers are not used. The field can either be a scalar or an array. The isset field will not be transmitted.
* Compressed bit-vector: Similar to bit-vector, except that there only needs to be a bit for each field, and unused identifiers do not use up bits.
* Functional: the user provides functions (or macros) to clear all the issets, set individual fields, and query individual fields. These will have the formats:


 <nowiki>void clear_<struct>(<struct> *);</nowiki>
 <nowiki>void set_<field>_<struct>(<struct> *);</nowiki>
 int  <nowiki>isset_<field>_<struct>(<struct> *);</nowiki>

The bit-vector interfaces are specified by annotating the isset field as isset or isset-compressed. The functional interface is specified by annotating the structure as isset-functional.

If one of the bit-vector based approaches is specified, then the stub-compiler can generate the functional macros as a convenience for the client program.
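
As a sketch of what such generated convenience macros might look like for the plain bit-vector interface, consider the <tt>Work</tt> structure with an added scalar isset field. The helper <tt>work_bit</tt>, the field name <tt>isset</tt> and the use of <tt>int</tt> in place of <tt>enum Operation</tt> are assumptions made to keep the example self-contained.

```c
#include <assert.h>

struct Work {
    int num1;        /* @thrift: 1 */
    int num2;        /* @thrift: 2 */
    int op;          /* @thrift: 3 */
    char *comment;   /* @thrift: 4 optional */
    unsigned isset;  /* @thrift: isset */
};

enum { WORK_MIN_ID = 1, WORK_MAX_ID = 4 };

/* Bit 0 corresponds to the smallest field identifier. */
static unsigned work_bit(int field_id)
{
    assert(field_id >= WORK_MIN_ID && field_id <= WORK_MAX_ID);
    return 1u << (field_id - WORK_MIN_ID);
}

/* Macros in the clear_<struct>, set_<field>_<struct> and
 * isset_<field>_<struct> formats. */
#define clear_Work(w)          ((void)((w)->isset = 0u))
#define set_num1_Work(w)       ((void)((w)->isset |= work_bit(1)))
#define isset_num1_Work(w)     (((w)->isset & work_bit(1)) != 0)
#define set_comment_Work(w)    ((void)((w)->isset |= work_bit(4)))
#define isset_comment_Work(w)  (((w)->isset & work_bit(4)) != 0)
```

With the compressed variant, the shift amount would be the position of the field among the fields actually declared, rather than its identifier minus the smallest identifier.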


== Pointers ==

In C, structures are generally passed by address, not by value. Further, in Thrift, we pass around values, not references to objects. Therefore, when serializing values, all pointers will be dereferenced, and the objects that they point to will be serialized in turn. Certain pointers will be treated specially:

* String pointers (i.e. char * pointers that are not otherwise annotated) will be transmitted as strings, as described above

* NULL pointers may be treated as optional pointers and not transmitted.

* Array pointers (i.e. pointers to a sequence of objects) will be treated as lists and transmitted as described below

A pointer that is part of the type being returned by a function implies that the stub is responsible for allocating memory, setting the pointer to point to this allocated memory, and using the allocated memory as part of the deserialization process.


== Other types ==

It is possible to describe types as being equivalent to basic types, either by having the stub-compiler parse C typedefs, or by using annotations. Thus, one could use either of:


 typedef int MyInteger;
 /* @thrift: typedef i32 MyInteger */

Another use is to introduce a type as opaque. In this case, the stub-compiler will accept it as a type. This is generally useful only for specifying types for skipped fields and arguments. For example,


 /* @thrift: typedef opaque FILE */

== List as array ==

The list collection type in the TBinaryProtocol consists of a T_LIST<ref name="ftn5">And the argument/field identifier, of course</ref> and another TType byte depending on the type of list. This is followed by the number of elements in the collection and the serialized representations of the elements of the collection.

There are several different ways of converting between a list and its C representation. For instance, one could represent a list of 4 integers as


 <nowiki>int x[4];</nowiki>

Because of the interchangeability between pointers and arrays in C, it might be represented as:


 int * x;
 int x_count;

where <tt>x_count</tt> is the number of elements in the array pointed to by <tt>x</tt>.

Note that an array declared with some particular size may not have all of it in use, and so one may not want to send the entire array. Consider:


 <nowiki>int stack[MAX_STACK];</nowiki>
 int stack_count;

In this example, we may wish to transmit only a list of the first <tt>stack_count</tt> elements of the array <tt>stack</tt>.

Note that a pointer in C does not necessarily need to point to a collection of elements; it could be pointing to a singleton element. Our annotations need to be able to distinguish between those two cases.

Currently, a pointer or array appearing as a field in a structure or an argument to a function can be annotated by the dimension annotation to indicate that it is to be treated as a list. The dimension annotation consists of the <tt>dim</tt> keyword followed by a token indicating the field/argument containing the size. Thus,


 struct stack {
   <nowiki>int stack[MAX_STACK]; /* @thrift: dim stack_count */</nowiki>
   int stack_count; /* @thrift: skip */
 };

 int sum(
    int socket_fd,
    int nvals, /* @thrift: skip */
    int * vals /* @thrift: dim nvals */
 );

Note the use of the skip annotation for the variable providing the size of the array. In general, we do not want to also transmit the size of the array.

When returning a list, the value returned may be a structure with a pointer/array and a dimension variable. In that case the dimension variable will get set to the list count determined during the deserialization process. If the function is set up to return multiple values (described below), then the list count may be assigned to one of the multiple values.
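
As a sketch of the sending side, the function below serializes the first <tt>nvals</tt> elements of an <tt>int</tt> array as the body of a list in TBinaryProtocol: the element TType byte, the 4-byte count and then the elements. The T_LIST byte and the field identifier that precede it are omitted. The name <tt>put_i32_list</tt> and the buffer-based interface are assumptions made for this illustration, and <tt>int</tt> is assumed to be 32 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { T_I32 = 8 };

static void put_be32(unsigned char *out, uint32_t u)
{
    out[0] = (unsigned char)(u >> 24);
    out[1] = (unsigned char)(u >> 16);
    out[2] = (unsigned char)(u >> 8);
    out[3] = (unsigned char)u;
}

/* Serialize vals[0..nvals-1] as: element type, element count, elements.
 * nvals plays the role of the variable named by the dim annotation; it
 * is sent only as the list count, never as a field of its own.
 * Returns the number of bytes written, or 0 if the buffer is too small. */
size_t put_i32_list(unsigned char *out, size_t cap, const int *vals, int nvals)
{
    size_t need;
    int i;

    assert(out != NULL && nvals >= 0 && (vals != NULL || nvals == 0));
    need = 1 + 4 + 4 * (size_t)nvals;
    if (cap < need) return 0;
    out[0] = T_I32;
    put_be32(out + 1, (uint32_t)nvals);
    for (i = 0; i < nvals; i++)
        put_be32(out + 5 + 4 * (size_t)i, (uint32_t)vals[i]);
    return need;
}
```

For the <tt>stack</tt> example, calling this with <tt>stack</tt> and <tt>stack_count</tt> sends only the elements in use, however large <tt>MAX_STACK</tt> is.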


== List as lists ==

An alternative implementation of the list type in C would be as a true list; i.e. as a data-structure with a next pointer. The value of each list cell could be inlined, or we could use a pointer to point to the list element. Consider:


 /* value inlined in the cell */
 struct Foo_cell {
   struct Foo_cell * foo_in_next; /* @thrift: list-next */
   struct Foo value;
 };
 /* alternatively, value held by pointer */
 struct Foo_cell {
   struct Foo_cell * next; /* @thrift: list-next */
   struct Foo * value;
 };

In both of these examples, a (pointer to a) Foo_cell will be serialized as a list of Foo.

If the structure contains more than one non-skipped field, then it is assumed that (unless otherwise annotated) there is no separate singleton data-structure type, and that the type of the structure is synonymous with the type of the list of the structure. Consider:


 struct hash_string {
   struct hash_string * next; /* @thrift: list-next */
   char * str;
   int hash;
 };

This is equivalent to the (illegal) Thrift IDL declaration:

 struct hash_string' {
   1: string str,
   2: i32 hash
 }
 <nowiki>typedef list<hash_string'> hash_string;</nowiki>

By default, the stub-compiler will treat instances of hash_string as though they were declared as lists; appropriate annotations will cause them to be serialized as though they were singletons.


 void send_1(
         int socket_fd,
         struct hash_string * val1 /* @thrift: single */
         );
 void send_n(
         int socket_fd,
         struct hash_string * valn
         );

In the first example, the annotation overrides the default list interpretation to transmit only one element, while in the second example, the transmit stub generated will walk the list using the next pointer.
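
One consequence of the list encoding is that the element count precedes the elements, so a stub for <tt>send_n</tt> has to walk the list once to count the cells before it can emit anything. A minimal sketch of that first pass follows; the names <tt>list_length</tt> and <tt>put_list_header</tt> are assumptions made for this example, and the per-element serialization that would follow in a second walk is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { T_STRUCT = 12 };

struct hash_string {
    struct hash_string *next;  /* @thrift: list-next */
    char *str;
    int hash;
};

/* First pass: count the cells by following the list-next pointer. */
size_t list_length(const struct hash_string *head)
{
    size_t n = 0;

    for (; head != NULL; head = head->next)
        n++;
    return n;
}

/* Emit the list header: element type T_STRUCT, then the 4-byte
 * big-endian count. Returns 5, or 0 if the header does not fit. */
size_t put_list_header(unsigned char *out, size_t cap,
                       const struct hash_string *head)
{
    size_t n;

    assert(out != NULL);
    n = list_length(head);
    if (cap < 5 || n > INT32_MAX) return 0;
    out[0] = T_STRUCT;
    out[1] = (unsigned char)(n >> 24);
    out[2] = (unsigned char)(n >> 16);
    out[3] = (unsigned char)(n >> 8);
    out[4] = (unsigned char)n;
    return 5;
}
```

Under the <tt>single</tt> annotation no header is needed at all, since the value is serialized as one struct.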


== Multiple returns ==

C does not permit more than one value to be returned from a function. This causes problems when multiple values are expected to be returned. One case that we have identified above is the ability to return the number of elements returned and the elements themselves as two separate entities. The traditional method of doing this in C is to pass multiple addresses of variables that will then be filled in with the multiple returned values. We shall use the return annotation to indicate such variables. Thus,


 void return_array_and_count(
         …
         int * x, /* @thrift: return dim x_count */
         int * x_count /* @thrift: return */
         );

In this declaration, x and x_count are treated as two return parameters.

Note that specifying a dimension that is not a return value specifies:

* the array is already allocated

* at most that many values can be received; if more are returned, then it is an error.

It is possible to specify both a return and a non-return dim variable for a return pointer variable. In that case the return dim is set to the actual count of values returned.

We can use the multiple return mechanism to deal with exception returns as well. In this case, we assume that functions return an integer to distinguish between normal (T_REPLY) returns and exceptional (T_EXCEPTION) returns. The returned value will be 0 for normal returns, and the specified field identifier for exceptional returns.

Functions that return exceptions are defined as returning a value and having at least one exception return parameter. From the example in tutorial.thrift,


 int calculate(
         int socket_fd,
         int logid,
         struct Work * w,
         int * result, /* @thrift: return */
         struct InvalidOperation * ouch /* @thrift: exception 1 */
         );

If a normal return (T_REPLY) is received, then the function returns 0; if the value returned is the exception (T_EXCEPTION), then it returns 1.


== Internal Errors ==

We can extend this notion to deal with internal errors as well, returning an error number for various forms of internal errors. The internal errors can be:


* out of memory (i.e. malloc returns 0)
* read errors (non-0 return)
* write errors (non-byte count return)

Other error codes can include:

* All fields of return structure not set
* Unspecified structure field returned
* Unspecified exception returned
* Return type mismatch
* Too large a count for pre-allocated memory

Each of these situations can be set to:

* cause an assert failure
* return a particular error code
* be ignored

The special case of an EINTR error on a read should be handled by retrying; however, it may also be specified to be an error.


The relevant annotations are:

* <nowiki>error write <number></nowiki>
* <nowiki>error read <number></nowiki>
* <nowiki>error malloc <number></nowiki>
* <nowiki>error decode <number></nowiki>

We can additionally define a return value to be the errno using the annotation return-errno. This shall capture the errno on an internal failure, and be set to a mask of error codes based on the kinds of decode errors.


== Other issues ==

There is no reason that the name of the C function has to be the same as the name of the function on the server. In particular, there may be multiple client-side functions which are invoked with different kinds of parameters that end up calling the same server-side function. The default name is, of course, the name of the C function. However, it is possible to override the name of the function called by using the <tt>calls</tt> function annotation:


 int ping_with_error_code(int sfd); /* @thrift: calls ping */

== Sockets & other transports ==

The current interface was designed to support blocking sockets. It uses write, writev, read and readv on the sockets to perform the actual transport.

Porting it to other transports may be fairly straightforward. There has been one parameter that we have passed to every function: the socket file descriptor. This may be replaced by an alternate type. Alternate read & write functions will need to be provided. Either readv & writev equivalents will be provided, or they must be replaced with sequences of reads and writes. The error return codes will have to be modified to handle the kinds of errors specific to the new transport.
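
As a sketch of the second option, the function below provides a writev equivalent built from a sequence of plain writes on a transport handle that takes the place of the socket file descriptor. The <tt>struct transport</tt> handle, <tt>transport_writev</tt> and the <tt>fd_write</tt> adapter are hypothetical names introduced for this illustration; the sketch also assumes that the alternate write function reports failures through a negative return and <tt>errno</tt>, as <tt>write()</tt> does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical handle replacing the bare socket_fd parameter. */
struct transport {
    void *ctx;
    ssize_t (*write)(void *ctx, const void *buf, size_t len);
};

/* writev equivalent built from a sequence of writes. Retries on EINTR
 * and on short writes. Returns the total bytes written, or -1 on error. */
ssize_t transport_writev(const struct transport *t,
                         const struct iovec *iov, int iovcnt)
{
    ssize_t total = 0;
    int i;

    assert(t != NULL && t->write != NULL);
    for (i = 0; i < iovcnt; i++) {
        const unsigned char *p = iov[i].iov_base;
        size_t left = iov[i].iov_len;

        while (left > 0) {
            ssize_t n = t->write(t->ctx, p, left);
            if (n < 0) {
                if (errno == EINTR) continue;
                return -1;
            }
            if (n == 0) return -1;  /* no progress: treat as an error */
            p += n;
            left -= (size_t)n;
            total += n;
        }
    }
    return total;
}

/* Adapter so that an ordinary file descriptor can serve as a transport;
 * ctx points to the int holding the descriptor. */
static ssize_t fd_write(void *ctx, const void *buf, size_t len)
{
    return write(*(int *)ctx, buf, len);
}
```

A readv equivalent would follow the same pattern, with the additional case of a premature end of stream.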





----
<references/>
