Repository: thrift
Updated Branches:
  refs/heads/master 6cf0ffcec -> 347a5ebb2


http://git-wip-us.apache.org/repos/asf/thrift/blob/347a5ebb/doc/thrift.tex
----------------------------------------------------------------------
diff --git a/doc/thrift.tex b/doc/thrift.tex
deleted file mode 100644
index a706fcb..0000000
--- a/doc/thrift.tex
+++ /dev/null
@@ -1,1057 +0,0 @@
-%-----------------------------------------------------------------------------
-%
-%               Thrift whitepaper
-%
-% Name:         thrift.tex
-%
-% Authors:      Mark Slee (mcslee@facebook.com)
-%
-% Created:      05 March 2007
-%
-% You will need a copy of sigplanconf.cls to format this document.
-% It is available at <http://www.sigplan.org/authorInformation.htm>.
-%
-%-----------------------------------------------------------------------------
-
-
-\documentclass[nocopyrightspace,blockstyle]{sigplanconf}
-
-\usepackage{amssymb}
-\usepackage{amsfonts}
-\usepackage{amsmath}
-\usepackage{url}
-
-\begin{document}
-
-% \conferenceinfo{WXYZ '05}{date, City.}
-% \copyrightyear{2007}
-% \copyrightdata{[to be supplied]}
-
-% \titlebanner{banner above paper title}        % These are ignored unless
-% \preprintfooter{short description of paper}   % 'preprint' option specified.
-
-\title{Thrift: Scalable Cross-Language Services Implementation}
-\subtitle{}
-
-\authorinfo{Mark Slee, Aditya Agarwal and Marc Kwiatkowski}
-           {Facebook, 156 University Ave, Palo Alto, CA}
-           {\{mcslee,aditya,marc\}@facebook.com}
-
-\maketitle
-
-\begin{abstract}
-Thrift is a software library and set of code-generation tools developed at
-Facebook to expedite development and implementation of efficient and scalable
-backend services. Its primary goal is to enable efficient and reliable
-communication across programming languages by abstracting the portions of each
-language that tend to require the most customization into a common library
-that is implemented in each language. Specifically, Thrift allows developers to
-define datatypes and service interfaces in a single language-neutral file
-and generate all the necessary code to build RPC clients and servers.
-
-This paper details the motivations and design choices we made in Thrift, as
-well as some of the more interesting implementation details. It is not
-intended to be taken as research, but rather it is an exposition on what we did
-and why.
-\end{abstract}
-
-% \category{D.3.3}{Programming Languages}{Language constructs and features}
-
-%\terms
-%Languages, serialization, remote procedure call
-
-%\keywords
-%Data description language, interface definition language, remote procedure call
-
-\section{Introduction}
-As Facebook's traffic and network structure have scaled, the resource
-demands of many operations on the site (e.g. search,
-ad selection and delivery, event logging) have presented technical requirements
-drastically outside the scope of the LAMP framework. In our implementation of
-these services, various programming languages have been selected to
-optimize for the right combination of performance, ease and speed of
-development, availability of existing libraries, etc. By and large,
-Facebook's engineering culture has tended towards choosing the best
-tools and implementations available over standardizing on any one
-programming language and begrudgingly accepting its inherent limitations.
-
-Given this design choice, we were presented with the challenge of building
-a transparent, high-performance bridge across many programming languages.
-We found that most available solutions were either too limited, did not offer
-sufficient datatype freedom, or suffered from subpar performance.
-\footnote{See Appendix A for a discussion of alternative systems.}
-
-The solution that we have implemented combines a language-neutral software
-stack implemented across numerous programming languages and an associated code
-generation engine that transforms a simple interface and data definition
-language into client and server remote procedure call libraries.
-Choosing static code generation over a dynamic system allows us to create
-validated code that can be run without the need for
-any advanced introspective run-time type checking. It is also designed to
-be as simple as possible for the developer, who can typically define all
-the necessary data structures and interfaces for a complex service in a single
-short file.
-
-Surprised that a robust open solution to these relatively common problems
-did not yet exist, we committed early on to making the Thrift implementation
-open source.
-
-In evaluating the challenges of cross-language interaction in a networked
-environment, some key components were identified:
-
-\textit{Types.} A common type system must exist across programming languages
-without requiring that the application developer use custom Thrift datatypes
-or write their own serialization code. That is,
-a C++ programmer should be able to transparently exchange a strongly typed
-STL map for a dynamic Python dictionary. Neither
-programmer should be forced to write any code below the application layer
-to achieve this. Section 2 details the Thrift type system.
-
-\textit{Transport.} Each language must have a common interface to
-bidirectional raw data transport. The specifics of how a given
-transport is implemented should not matter to the service developer.
-The same application code should be able to run against TCP stream sockets,
-raw data in memory, or files on disk. Section 3 details the Thrift Transport
-layer.
-
-\textit{Protocol.} Datatypes must have some way of using the Transport
-layer to encode and decode themselves. Again, the application
-developer need not be concerned by this layer. Whether the service uses
-an XML or binary protocol is immaterial to the application code.
-All that matters is that the data can be read and written in a consistent,
-deterministic manner. Section 4 details the Thrift Protocol layer.
-
-\textit{Versioning.} For robust services, the involved datatypes must
-provide a mechanism for versioning themselves. Specifically,
-it should be possible to add or remove fields in an object or alter the
-argument list of a function without any interruption in service (or,
-worse yet, nasty segmentation faults). Section 5 details Thrift's versioning
-system.
-
-\textit{Processors.} Finally, we generate code capable of processing data
-streams to accomplish remote procedure calls. Section 6 details the generated
-code and TProcessor paradigm.
-
-Section 7 discusses implementation details, and Section 8 describes
-our conclusions.
-
-\section{Types}
-
-The goal of the Thrift type system is to enable programmers to develop using
-completely natively defined types, no matter what programming language they
-use. By design, the Thrift type system does not introduce any special dynamic
-types or wrapper objects. It also does not require that the developer write
-any code for object serialization or transport. The Thrift IDL (Interface
-Definition Language) file is
-logically a way for developers to annotate their data structures with the
-minimal amount of extra information necessary to tell a code generator
-how to safely transport the objects across languages.
-
-\subsection{Base Types}
-
-The type system rests upon a few base types. In considering which types to
-support, we aimed for clarity and simplicity over abundance, focusing
-on the key types available in all programming languages, omitting any
-niche types available only in specific languages.
-
-The base types supported by Thrift are:
-\begin{itemize}
-\item \texttt{bool} A boolean value, true or false
-\item \texttt{byte} A signed byte
-\item \texttt{i16} A 16-bit signed integer
-\item \texttt{i32} A 32-bit signed integer
-\item \texttt{i64} A 64-bit signed integer
-\item \texttt{double} A 64-bit floating point number
-\item \texttt{string} An encoding-agnostic text or binary string
-\item \texttt{binary} A byte array representation for blobs
-\end{itemize}
-
-Of particular note is the absence of unsigned integer types. Because these
-types have no direct translation to native primitive types in many languages,
-the advantages they afford are lost. Further, there is no way to prevent the
-application developer in a language like Python from assigning a negative value
-to an integer variable, leading to unpredictable behavior. From a design
-standpoint, we observed that unsigned integers were very rarely, if ever, used
-for arithmetic purposes, but in practice were much more often used as keys or
-identifiers. In this case, the sign is irrelevant. Signed integers serve this
-same purpose and can be safely cast to their unsigned counterparts (most
-commonly in C++) when absolutely necessary.
-
-\subsection{Structs}
-
-A Thrift struct defines a common object to be used across languages. A struct
-is essentially equivalent to a class in object oriented programming
-languages. A struct has a set of strongly typed fields, each with a unique
-name identifier. The basic syntax for defining a Thrift struct looks very
-similar to a C struct definition. Fields may be annotated with an integer field
-identifier (unique to the scope of that struct) and optional default values.
-Field identifiers will be automatically assigned if omitted, though they are
-strongly encouraged for versioning reasons discussed later.
-
-\subsection{Containers}
-
-Thrift containers are strongly typed containers that map to the most commonly
-used containers in common programming languages. They are annotated using
-the C++ template (or Java Generics) style. There are three types available:
-\begin{itemize}
-\item \texttt{list<type>} An ordered list of elements. Translates directly into
-an STL \texttt{vector}, Java \texttt{ArrayList}, or native array in scripting
-languages. May contain duplicates.
-\item \texttt{set<type>} An unordered set of unique elements. Translates into
-an STL \texttt{set}, Java \texttt{HashSet}, \texttt{set} in Python, or native
-dictionary in PHP/Ruby.
-\item \texttt{map<type1,type2>} A map of strictly unique keys to values.
-Translates into an STL \texttt{map}, Java \texttt{HashMap}, PHP associative
-array, or Python/Ruby dictionary.
-\end{itemize}
-
-While defaults are provided, the type mappings are not explicitly fixed. Custom
-code generator directives have been added to substitute custom types in
-destination languages (e.g.
-\texttt{hash\_map} or Google's sparse hash map can be used in C++). The
-only requirement is that the custom types support all the necessary iteration
-primitives. Container elements may be of any valid Thrift type, including other
-containers or structs.
-
-\begin{verbatim}
-struct Example {
-  1:i32 number=10,
-  2:i64 bigNumber,
-  3:double decimals,
-  4:string name="thrifty"
-}\end{verbatim}
-
-In the target language, each definition generates a type with two methods,
-\texttt{read} and \texttt{write}, which perform serialization and transport
-of the objects using a Thrift TProtocol object.
-
-\subsection{Exceptions}
-
-Exceptions are syntactically and functionally equivalent to structs except
-that they are declared using the \texttt{exception} keyword instead of the
-\texttt{struct} keyword.
-
-The generated objects inherit from an exception base class as appropriate
-in each target programming language, in order to seamlessly
-integrate with native exception handling in any given
-language. Again, the design emphasis is on making the code familiar to the
-application developer.
-
-\subsection{Services}
-
-Services are defined using Thrift types. Definition of a service is
-semantically equivalent to defining an interface (or a pure virtual abstract
-class) in object oriented
-programming. The Thrift compiler generates fully functional client and
-server stubs that implement the interface. Services are defined as follows:
-
-\begin{verbatim}
-service <name> {
-  <returntype> <name>(<arguments>)
-    [throws (<exceptions>)]
-  ...
-}\end{verbatim}
-
-An example:
-
-\begin{verbatim}
-service StringCache {
-  void set(1:i32 key, 2:string value),
-  string get(1:i32 key) throws (1:KeyNotFound knf),
-  void delete(1:i32 key)
-}
-\end{verbatim}
-
-Note that \texttt{void} is a valid type for a function return, in addition to
-all other defined Thrift types. Additionally, an \texttt{async} modifier
-keyword may be added to a \texttt{void} function, which will generate code that
-does not wait for a response from the server. Note that a pure \texttt{void}
-function will return a response to the client which guarantees that the
-operation has completed on the server side. With \texttt{async} method calls
-the client will only be guaranteed that the request succeeded at the
-transport layer. (In many transport scenarios this is inherently unreliable
-due to the Byzantine Generals' Problem. Therefore, application developers
-should take care only to use the async optimization in cases where dropped
-method calls are acceptable or the transport is known to be reliable.)
-
-Also of note is the fact that argument lists and exception lists for functions
-are implemented as Thrift structs. All three constructs are identical in both
-notation and behavior.
-
-\section{Transport}
-
-The transport layer is used by the generated code to facilitate data transfer.
-
-\subsection{Interface}
-
-A key design choice in the implementation of Thrift was to decouple the
-transport layer from the code generation layer. Though Thrift is typically
-used on top of the TCP/IP stack with streaming sockets as the base layer of
-communication, there was no compelling reason to build that constraint into
-the system. The performance tradeoff incurred by an abstracted I/O layer
-(roughly one virtual method lookup / function call per operation) was
-immaterial compared to the cost of actual I/O operations (typically invoking
-system calls).
-
-Fundamentally, generated Thrift code only needs to know how to read and
-write data. The origin and destination of the data are irrelevant; it may be a
-socket, a segment of shared memory, or a file on the local disk. The Thrift
-transport interface supports the following methods:
-
-\begin{itemize}
-\item \texttt{open} Opens the transport
-\item \texttt{close} Closes the transport
-\item \texttt{isOpen} Indicates whether the transport is open
-\item \texttt{read} Reads from the transport
-\item \texttt{write} Writes to the transport
-\item \texttt{flush} Forces any pending writes
-\end{itemize}
-
-There are a few additional methods not documented here which are used to aid
-in batching reads and optionally signaling the completion of a read or
-write operation from the generated code.
-
-In addition to the above
-\texttt{TTransport} interface, there is a\\
-\texttt{TServerTransport} interface
-used to accept or create primitive transport objects. Its interface is as
-follows:
-
-\begin{itemize}
-\item \texttt{open} Opens the transport
-\item \texttt{listen} Begins listening for connections
-\item \texttt{accept} Returns a new client transport
-\item \texttt{close} Closes the transport
-\end{itemize}
-
-\subsection{Implementation}
-
-The transport interface is designed for simple implementation in any
-programming language. New transport mechanisms can be easily defined as needed
-by application developers.
-
-\subsubsection{TSocket}
-
-The \texttt{TSocket} class is implemented across all target languages. It
-provides a common, simple interface to a TCP/IP stream socket.
-
-\subsubsection{TFileTransport}
-
-The \texttt{TFileTransport} is an abstraction of an on-disk file to a data
-stream. It can be used to write out a set of incoming Thrift requests to a file
-on disk. The on-disk data can then be replayed from the log, either for
-post-processing or for reproduction and/or simulation of past events.
-
-\subsubsection{Utilities}
-
-The Transport interface is designed to support easy extension using common
-OOP techniques, such as composition. Some simple utilities include the
-\texttt{TBufferedTransport}, which buffers the writes and reads on an
-underlying transport, the \texttt{TFramedTransport}, which transmits data with
-frame size headers for chunking optimization or nonblocking operation, and the
-\texttt{TMemoryBuffer}, which allows reading and writing directly from the heap
-or stack memory owned by the process.
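
The composition idea above can be sketched in Python. This is a minimal illustration under stated assumptions, not Thrift's own code: MemoryBuffer and FramedTransport are hypothetical stand-ins for the classes named in the text, and the 4-byte big-endian length header is an assumed frame format chosen for the example.

```python
import struct


class MemoryBuffer:
    """Hypothetical in-memory transport: writes append, reads consume."""

    def __init__(self):
        self._data = bytearray()

    def write(self, buf):
        self._data += buf

    def read(self, size):
        chunk = bytes(self._data[:size])
        del self._data[:size]
        return chunk

    def flush(self):
        pass  # nothing is buffered beyond the in-memory array itself


class FramedTransport:
    """Wraps another transport; each flush emits one length-prefixed frame."""

    def __init__(self, inner):
        self._inner = inner
        self._wbuf = bytearray()  # pending outgoing bytes
        self._rbuf = b""          # unread remainder of the current frame

    def write(self, buf):
        self._wbuf += buf

    def flush(self):
        # Assumed header: frame size as a 4-byte big-endian signed integer.
        self._inner.write(struct.pack("!i", len(self._wbuf)))
        self._inner.write(bytes(self._wbuf))
        self._inner.flush()
        self._wbuf = bytearray()

    def read(self, size):
        if not self._rbuf:
            # Start of a new frame: read its header, then the whole body.
            (length,) = struct.unpack("!i", self._inner.read(4))
            self._rbuf = self._inner.read(length)
        chunk, self._rbuf = self._rbuf[:size], self._rbuf[size:]
        return chunk


def roundtrip(payload):
    """Write one frame into a shared buffer and read it back."""
    wire = MemoryBuffer()
    writer = FramedTransport(wire)
    writer.write(payload)
    writer.flush()
    reader = FramedTransport(wire)
    return reader.read(len(payload))
```

Because FramedTransport only calls write, read, and flush on the wrapped object, the same wrapper would work over any transport exposing those methods, which is the point of the composition described above.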
-
-\section{Protocol}
-
-A second major abstraction in Thrift is the separation of data structure from
-transport representation. Thrift enforces a certain messaging structure when
-transporting data, but it is agnostic to the protocol encoding in use. That is,
-it does not matter whether data is encoded as XML, human-readable ASCII, or a
-dense binary format as long as the data supports a fixed set of operations
-that allow it to be deterministically read and written by generated code.
-
-\subsection{Interface}
-
-The Thrift Protocol interface is very straightforward. It fundamentally
-supports two things: 1) bidirectional sequenced messaging, and
-2) encoding of base types, containers, and structs.
-
-\begin{verbatim}
-writeMessageBegin(name, type, seq)
-writeMessageEnd()
-writeStructBegin(name)
-writeStructEnd()
-writeFieldBegin(name, type, id)
-writeFieldEnd()
-writeFieldStop()
-writeMapBegin(ktype, vtype, size)
-writeMapEnd()
-writeListBegin(etype, size)
-writeListEnd()
-writeSetBegin(etype, size)
-writeSetEnd()
-writeBool(bool)
-writeByte(byte)
-writeI16(i16)
-writeI32(i32)
-writeI64(i64)
-writeDouble(double)
-writeString(string)
-
-name, type, seq = readMessageBegin()
-                  readMessageEnd()
-name =            readStructBegin()
-                  readStructEnd()
-name, type, id =  readFieldBegin()
-                  readFieldEnd()
-k, v, size =      readMapBegin()
-                  readMapEnd()
-etype, size =     readListBegin()
-                  readListEnd()
-etype, size =     readSetBegin()
-                  readSetEnd()
-bool =            readBool()
-byte =            readByte()
-i16 =             readI16()
-i32 =             readI32()
-i64 =             readI64()
-double =          readDouble()
-string =          readString()
-\end{verbatim}
-
-Note that every \texttt{write} function has exactly one \texttt{read}
-counterpart, with
-the exception of \texttt{writeFieldStop()}. This is a special method
-that signals the end of a struct. The procedure for reading a struct is to
-\texttt{readFieldBegin()} until the stop field is encountered, and then to
-\texttt{readStructEnd()}.  The
-generated code relies upon this call sequence to ensure that everything
-written by
-a protocol encoder can be read by a matching protocol decoder. Further note
-that this set of functions is by design more robust than necessary.
-For example, \texttt{writeStructEnd()} is not strictly necessary, as the end of
-a struct may be implied by the stop field. This method is a convenience for
-verbose protocols in which it is cleaner to separate these calls (e.g. a
-closing
-\texttt{</struct>} tag in XML).
-
-\subsection{Structure}
-
-Thrift structures are designed to support encoding into a streaming
-protocol. The implementation should never need to frame or compute the
-entire data length of a structure prior to encoding it. This is critical to
-performance in many scenarios. Consider a long list of relatively large
-strings. If the protocol interface required reading or writing a list to be an
-atomic operation, then the implementation would need to perform a linear pass
-over the
-entire list before encoding any data. However, if the list can be written
-as iteration is performed, the corresponding read may begin in parallel,
-theoretically offering an end-to-end speedup of $(kN - C)$, where $N$ is the
-size of the list, $k$ the cost factor associated with serializing a single
-element, and $C$ is a fixed offset for the delay between data being written
-and becoming available to read.
-
-Similarly, structs do not encode their data lengths a priori. Instead, they are
-encoded as a sequence of fields, with each field having a type specifier and a
-unique field identifier. Note that the inclusion of type specifiers allows
-the protocol to be safely parsed and decoded without any generated code
-or access to the original IDL file. Structs are terminated by a field header
-with a special \texttt{STOP} type. Because all the basic types can be read
-deterministically, all structs (even those containing other structs) can be
-read deterministically. The Thrift protocol is self-delimiting without any
-framing and regardless of the encoding format.
-
-In situations where streaming is unnecessary or framing is advantageous, it
-can be very simply added into the transport layer, using the
-\texttt{TFramedTransport} abstraction.
-
-\subsection{Implementation}
-
-Facebook has implemented and deployed a space-efficient binary protocol which
-is used by most backend services. Essentially, it writes all data
-in a flat binary format. Integer types are converted to network byte order,
-strings are prepended with their byte length, and all message and field headers
-are written using the primitive integer serialization constructs. String names
-for fields are omitted; when using generated code, field identifiers are
-sufficient.
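
The flat binary format described above can be sketched with Python's struct module. The function names are hypothetical, and two details are assumptions made for illustration rather than statements about the deployed protocol: the string length prefix is taken to be a 4-byte signed integer, and text is encoded as UTF-8.

```python
import struct


def write_i32(value):
    """Encode a 32-bit signed integer in network (big-endian) byte order."""
    return struct.pack("!i", value)


def write_i64(value):
    """Encode a 64-bit signed integer in network byte order."""
    return struct.pack("!q", value)


def write_string(text):
    """Prefix the string's bytes with their byte length (assumed i32)."""
    raw = text.encode("utf-8")  # assumption: UTF-8 text
    return write_i32(len(raw)) + raw


def read_i32(buf, offset):
    """Decode an i32 at offset; return (value, next offset)."""
    (value,) = struct.unpack_from("!i", buf, offset)
    return value, offset + 4


def read_string(buf, offset):
    """Decode a length-prefixed string; return (text, next offset)."""
    length, offset = read_i32(buf, offset)
    raw = buf[offset:offset + length]
    return raw.decode("utf-8"), offset + length
```

Each reader returns the next offset, so values can be decoded back to back without any outer framing.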
-
-We decided against some extreme storage optimizations (e.g. packing
-small integers into ASCII or using a 7-bit continuation format) for the sake
-of simplicity and clarity in the code. These alterations can easily be made
-if and when we encounter a performance-critical use case that demands them.
-
-\section{Versioning}
-
-Thrift is robust in the face of versioning and data definition changes. This
-is critical to enable staged rollouts of changes to deployed services. The
-system must be able to support reading of old data from log files, as well as
-requests from out-of-date clients to new servers, and vice versa.
-
-\subsection{Field Identifiers}
-
-Versioning in Thrift is implemented via field identifiers. The field header
-for every member of a struct in Thrift is encoded with a unique field
-identifier. The combination of this field identifier and its type specifier
-is used to uniquely identify the field. The Thrift definition language
-supports automatic assignment of field identifiers, but it is good
-programming practice to always explicitly specify field identifiers.
-Identifiers are specified as follows:
-
-\begin{verbatim}
-struct Example {
-  1:i32 number=10,
-  2:i64 bigNumber,
-  3:double decimals,
-  4:string name="thrifty"
-}\end{verbatim}
-
-To avoid conflicts between manually and automatically assigned identifiers,
-fields with identifiers omitted are assigned identifiers
-decrementing from -1, and the language only supports the manual assignment of
-positive identifiers.
-
-When data is being deserialized, the generated code can use these identifiers
-to properly identify the field and determine whether it aligns with a field in
-its definition file. If a field identifier is not recognized, the generated
-code can use the type specifier to skip the unknown field without any error.
-Again, this is possible due to the fact that all datatypes are
-self-delimiting.
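
The skip-unknown-field behavior can be sketched as follows. This is a simplified model, not the generated code: the type codes are arbitrary illustration values rather than Thrift's wire constants, only two value types are handled, and the field header layout (1-byte type followed by an i16 identifier) is an assumption.

```python
import struct

# Arbitrary illustration values, not Thrift's actual wire constants.
T_STOP, T_I32, T_STRING = 0, 1, 2


def write_field(ftype, fid, value):
    header = struct.pack("!bh", ftype, fid)  # 1-byte type, i16 identifier
    if ftype == T_I32:
        return header + struct.pack("!i", value)
    raw = value.encode("utf-8")
    return header + struct.pack("!i", len(raw)) + raw


def write_struct(fields):
    """fields: iterable of (type, id, value); a STOP byte ends the struct."""
    body = b"".join(write_field(t, i, v) for t, i, v in fields)
    return body + struct.pack("!b", T_STOP)


def read_value(buf, pos, ftype):
    """Decode one value of the given type; return (value, new position)."""
    if ftype == T_I32:
        return struct.unpack_from("!i", buf, pos)[0], pos + 4
    if ftype == T_STRING:
        (length,) = struct.unpack_from("!i", buf, pos)
        start = pos + 4
        return buf[start:start + length].decode("utf-8"), start + length
    raise ValueError("unknown type %d" % ftype)


def read_struct(buf, known):
    """known maps field id -> (name, expected type); other fields are skipped."""
    result, pos = {}, 0
    while True:
        ftype = buf[pos]
        pos += 1
        if ftype == T_STOP:
            return result
        (fid,) = struct.unpack_from("!h", buf, pos)
        pos += 2
        # The type specifier alone is enough to step over the value.
        value, pos = read_value(buf, pos, ftype)
        if fid in known and known[fid][1] == ftype:
            result[known[fid][0]] = value
        # Otherwise the value is dropped without raising an error.
```

A reader built from an older definition that knows only fields 1 and 4 can consume a struct that also carries a field 5: it decodes the extra value by type, discards it, and continues to the stop field.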
-
-Field identifiers can (and should) also be specified in function argument
-lists. In fact, argument lists are not only represented as structs on the
-backend, but actually share the same code in the compiler frontend. This
-allows for version-safe modification of method parameters.
-
-\begin{verbatim}
-service StringCache {
-  void set(1:i32 key, 2:string value),
-  string get(1:i32 key) throws (1:KeyNotFound knf),
-  void delete(1:i32 key)
-}
-\end{verbatim}
-
-The syntax for specifying field identifiers was chosen to echo their structure.
-Structs can be thought of as a dictionary where the identifiers are keys, and
-the values are strongly-typed named fields.
-
-Field identifiers internally use the \texttt{i16} Thrift type. Note, however,
-that the \texttt{TProtocol} abstraction may encode identifiers in any format.
-
-\subsection{Isset}
-
-When an unexpected field is encountered, it can be safely ignored and
-discarded. When an expected field is not found, there must be some way to
-signal to the developer that it was not present. This is implemented via an
-inner \texttt{isset} structure inside the defined objects. (Isset functionality
-is implicit with a \texttt{null} value in PHP, \texttt{None} in Python
-and \texttt{nil} in Ruby.) Essentially,
-the inner \texttt{isset} object of each Thrift struct contains a boolean value
-for each field which denotes whether or not that field is present in the
-struct. When a reader receives a struct, it should check for a field being set
-before operating directly on it.
-
-\begin{verbatim}
-class Example {
- public:
-  Example() :
-    number(10),
-    bigNumber(0),
-    decimals(0),
-    name("thrifty") {}
-
-  int32_t number;
-  int64_t bigNumber;
-  double decimals;
-  std::string name;
-
-  struct __isset {
-    __isset() :
-      number(false),
-      bigNumber(false),
-      decimals(false),
-      name(false) {}
-    bool number;
-    bool bigNumber;
-    bool decimals;
-    bool name;
-  } __isset;
-...
-}
-\end{verbatim}
-
-\subsection{Case Analysis}
-
-There are four cases in which version mismatches may occur.
-
-\begin{enumerate}
-\item \textit{Added field, old client, new server.} In this case, the old
-client does not send the new field. The new server recognizes that the field
-is not set, and implements default behavior for out-of-date requests.
-\item \textit{Removed field, old client, new server.} In this case, the old
-client sends the removed field. The new server simply ignores it.
-\item \textit{Added field, new client, old server.} The new client sends a
-field that the old server does not recognize. The old server simply ignores
-it and processes as normal.
-\item \textit{Removed field, new client, old server.} This is the most
-dangerous case, as the old server is unlikely to have suitable default
-behavior implemented for the missing field. It is recommended that in this
-situation the new server be rolled out prior to the new clients.
-\end{enumerate}
-
-\subsection{Protocol/Transport Versioning}
-The \texttt{TProtocol} abstractions are also designed to give protocol
-implementations the freedom to version themselves in whatever manner they
-see fit. Specifically, any protocol implementation is free to send whatever
-it likes in the \texttt{writeMessageBegin()} call. It is entirely up to the
-implementor how to handle versioning at the protocol level. The key point is
-that protocol encoding changes are safely isolated from interface definition
-version changes.
-
-Note that the exact same is true of the \texttt{TTransport} interface. For
-example, if we wished to add some new checksumming or error detection to the
-\texttt{TFileTransport}, we could simply add a version header into the
-data it writes to the file in such a way that it would still accept old
-log files without the given header.
-
-\section{RPC Implementation}
-
-\subsection{TProcessor}
-
-The last core interface in the Thrift design is the \texttt{TProcessor},
-perhaps the simplest of the constructs. The interface is as follows:
-
-\begin{verbatim}
-interface TProcessor {
-  bool process(TProtocol in, TProtocol out)
-    throws TException
-}
-\end{verbatim}
-
-The key design idea here is that the complex systems we build can fundamentally
-be broken down into agents or services that operate on inputs and outputs. In
-most cases, there is actually just one input and output (an RPC client) that
-needs handling.
-
-\subsection{Generated Code}
-
-When a service is defined, we generate a
-\texttt{TProcessor} instance capable of handling RPC requests to that service,
-using a few helpers. The fundamental structure (illustrated in pseudo-C++) is
-as follows:
-
-\begin{verbatim}
-Service.thrift
- => Service.cpp
-     interface ServiceIf
-     class ServiceClient : virtual ServiceIf
-       TProtocol in
-       TProtocol out
-     class ServiceProcessor : TProcessor
-       ServiceIf handler
-
-ServiceHandler.cpp
- class ServiceHandler : virtual ServiceIf
-
-TServer.cpp
- TServer(TProcessor processor,
-         TServerTransport transport,
-         TTransportFactory tfactory,
-         TProtocolFactory pfactory)
- serve()
-\end{verbatim}
-
-From the Thrift definition file, we generate the virtual service interface.
-A client class is generated, which implements the interface and
-uses two \texttt{TProtocol} instances to perform the I/O operations. The
-generated processor implements the \texttt{TProcessor} interface. The generated
-code has all the logic to handle RPC invocations via the \texttt{process()}
-call, and takes as a parameter an instance of the service interface, as
-implemented by the application developer.
-
-The user provides an implementation of the application interface in separate,
-non-generated source code.
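
The split between handler and processor can be sketched in Python using the StringCache service from earlier. This is a simplified model under stated assumptions: the class names are hypothetical, requests and replies are plain tuples in lists rather than TProtocol objects, and the reply tuple layout is invented for the example.

```python
class KeyNotFound(Exception):
    pass


class StringCacheHandler:
    """Application code: implements the service interface."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key):
        if key not in self._store:
            raise KeyNotFound(key)
        return self._store[key]

    def delete(self, key):
        self._store.pop(key, None)


class StringCacheProcessor:
    """Stand-in for generated code: routes each request to the handler."""

    def __init__(self, handler):
        # Method name -> bound handler method, so dispatch is a dict lookup.
        self._methods = {
            "set": handler.set,
            "get": handler.get,
            "delete": handler.delete,
        }

    def process(self, inp, out):
        """Handle one request; inp and out are plain lists in this sketch."""
        name, seq, args = inp.pop(0)
        method = self._methods.get(name)
        if method is None:
            out.append((name, seq, "exception", "unknown method"))
            return False
        try:
            out.append((name, seq, "reply", method(*args)))
        except KeyNotFound as exc:
            out.append((name, seq, "exception", "KeyNotFound: %s" % exc.args[0]))
        return True
```

The processor holds no application logic of its own; swapping in a different handler object changes the service's behavior without touching the dispatch code.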
-
-\subsection{TServer}
-
-Finally, the Thrift core libraries provide a \texttt{TServer} abstraction.
-The \texttt{TServer} object generally works as follows.
-
-\begin{itemize}
-\item Use the \texttt{TServerTransport} to get a \texttt{TTransport}
-\item Use the \texttt{TTransportFactory} to optionally convert the primitive
-transport into a suitable application transport (typically the
-\texttt{TBufferedTransportFactory} is used here)
-\item Use the \texttt{TProtocolFactory} to create an input and output protocol
-for the \texttt{TTransport}
-\item Invoke the \texttt{process()} method of the \texttt{TProcessor} object
-\end{itemize}
-
-The layers are appropriately separated such that the server code needs to know
-nothing about any of the transports, encodings, or applications in play. The
-server encapsulates the logic around connection handling, threading, etc.
-while the processor deals with RPC. The only code written by the application
-developer lives in the definitional Thrift file and the interface
-implementation.
-
-Facebook has deployed multiple \texttt{TServer} implementations, including
-the single-threaded \texttt{TSimpleServer}, thread-per-connection
-\texttt{TThreadedServer}, and thread-pooling \texttt{TThreadPoolServer}.
-
-The \texttt{TProcessor} interface is very general by design. There is no
-requirement that a \texttt{TServer} take a generated \texttt{TProcessor}
-object. Thrift allows the application developer to easily write any type of
-server that operates on \texttt{TProtocol} objects (for instance, a server
-could simply stream a certain type of object without any actual RPC method
-invocation).
-
-\section{Implementation Details}
-\subsection{Target Languages}
-Thrift currently supports five target languages: C++, Java, Python, Ruby, and
-PHP. At Facebook, we have deployed servers predominantly in C++, Java, and
-Python. Thrift services implemented in PHP have also been embedded into the
-Apache web server, providing transparent backend access to many of our
-frontend constructs using a \texttt{THttpClient} implementation of the
-\texttt{TTransport} interface.
-
-Though Thrift was explicitly designed to be much more efficient and robust
-than typical web technologies, as we were designing our XML-based REST web
-services API we noticed that Thrift could be easily used to define our
-service interface. Though we do not currently employ SOAP envelopes (in the
-authors' opinions there is already far too much repetitive enterprise Java
-software to do that sort of thing), we were able to quickly extend Thrift to
-generate XML Schema Definition files for our service, as well as a framework
-for versioning different implementations of our web service. Though public
-web services are admittedly tangential to Thrift's core use case and design,
-Thrift facilitated rapid iteration and affords us the ability to quickly
-migrate our entire XML-based web service onto a higher performance system
-should the need arise.
-
-\subsection{Generated Structs}
-We made a conscious decision to make our generated structs as transparent as
-possible. All fields are publicly accessible; there are no \texttt{set()} and
-\texttt{get()} methods. Similarly, use of the \texttt{isset} object is not
-enforced. We do not include any \texttt{FieldNotSetException} construct.
-Developers have the option to use these fields to write more robust code, but
-the system is robust to the developer ignoring the \texttt{isset} construct
-entirely and will provide suitable default behavior in all cases.
-
-This choice was motivated by the desire to ease application development. Our
-stated goal is not to make developers learn a rich new library in their
-language of choice, but rather to generate code that allows them to work with
-the constructs that are most familiar in each language.
-
-We also made the \texttt{read()} and \texttt{write()} methods of the generated
-objects public so that the objects can be used outside of the context
-of RPC clients and servers. Thrift is a useful tool simply for generating
-objects that are easily serializable across programming languages.
-
-\subsection{RPC Method Identification}
-Method calls in RPC are implemented by sending the method name as a string. One
-issue with this approach is that longer method names require more bandwidth.
-We experimented with using fixed-size hashes to identify methods, but in the
-end concluded that the savings were not worth the headaches incurred. Reliably
-dealing with conflicts across versions of an interface definition file is
-impossible without a meta-storage system (i.e. to generate non-conflicting
-hashes for the current version of a file, we would have to know about all
-conflicts that ever existed in any previous version of the file).
-
-We wanted to avoid too many unnecessary string comparisons upon
-method invocation. To deal with this, we generate maps from strings to function
-pointers, so that invocation is effectively accomplished via a constant-time
-hash lookup in the common case. This requires the use of a couple of interesting
-code constructs. Because Java does not have function pointers, process
-functions are all private member classes implementing a common interface.
-
-\begin{verbatim}
-private class ping implements ProcessFunction {
-  public void process(int seqid,
-                      TProtocol iprot,
-                      TProtocol oprot)
-    throws TException
-  { ...}
-}
-
-HashMap<String,ProcessFunction> processMap_ =
-  new HashMap<String,ProcessFunction>();
-\end{verbatim}
-
-In C++, we use a relatively esoteric language construct: member function
-pointers.
-
-\begin{verbatim}
-std::map<std::string,
-  void (ExampleServiceProcessor::*)(int32_t,
-  facebook::thrift::protocol::TProtocol*,
-  facebook::thrift::protocol::TProtocol*)>
- processMap_;
-\end{verbatim}
-
-Using these techniques, the cost of string processing is minimized, and we
-reap the benefit of being able to easily debug corrupt or misunderstood data by
-inspecting it for known string method names.
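-
-As a rough illustration of the C++ dispatch described above, the following
-sketch maps method names to member function pointers. It is not generated
-Thrift code: \texttt{ExampleProcessor}, \texttt{dispatch}, and the handler
-names are hypothetical, and the handlers take only a sequence id rather than
-\texttt{TProtocol} arguments. Note that \texttt{std::map} is an ordered tree
-with logarithmic lookup; \texttt{std::unordered\_map} would be the hashed
-alternative.
-
```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a generated processor; not actual Thrift output.
class ExampleProcessor {
 public:
  ExampleProcessor() {
    // Register each handler under the method name sent on the wire.
    processMap_["ping"] = &ExampleProcessor::process_ping;
    processMap_["add"] = &ExampleProcessor::process_add;
  }

  // Looks up the handler by method name and invokes it through the
  // member function pointer. Returns false when the name is unknown.
  bool dispatch(const std::string& name, int seqid) {
    auto it = processMap_.find(name);
    if (it == processMap_.end()) return false;
    (this->*(it->second))(seqid);
    return true;
  }

  int pings = 0;       // number of ping calls handled so far
  int lastSeqid = -1;  // sequence id of the most recent handled call

 private:
  using ProcessFn = void (ExampleProcessor::*)(int);

  void process_ping(int seqid) { ++pings; lastSeqid = seqid; }
  void process_add(int seqid) { lastSeqid = seqid; }

  std::map<std::string, ProcessFn> processMap_;
};
```
-
-An unknown method name falls through to the \texttt{false} branch, which is
-where a real processor would report an error to the client.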
-
-\subsection{Servers and Multithreading}
-Thrift services require basic multithreading to handle simultaneous
-requests from multiple clients. For the Python and Java implementations of
-Thrift server logic, the standard threading libraries distributed with the
-languages provide adequate support. For the C++ implementation, no standard
-multithread runtime library exists. Specifically, robust, lightweight, and
-portable thread manager and timer class implementations do not exist. We
-investigated existing implementations, namely \texttt{boost::thread},
-\texttt{boost::threadpool}, \texttt{ACE\_Thread\_Manager} and
-\texttt{ACE\_Timer}.
-
-While \texttt{boost::thread}\cite{boost.threads} provides clean,
-lightweight and robust implementations of multi-thread primitives (mutexes,
-conditions, threads), it does not provide a thread manager or timer
-implementation.
-
-\texttt{boost::threadpool}\cite{boost.threadpool} also looked promising but
-was not far enough along for our purposes. We wanted to limit the dependency on
-third-party libraries as much as possible. Because\\
-\texttt{boost::threadpool} is
-not a pure template library and requires runtime libraries, and because it is
-not yet part of the official Boost distribution, we felt it was not ready for
-use in Thrift. As \texttt{boost::threadpool} evolves, and especially if it is
-added to the Boost distribution, we may reconsider our decision not to use it.
-
-ACE has both a thread manager and timer class in addition to multi-thread
-primitives. The biggest problem with ACE is that it is ACE. Unlike Boost, ACE
-API quality is poor. Everything in ACE has large numbers of dependencies on
-everything else in ACE, thus forcing developers to throw out standard
-classes, such as STL collections, in favor of ACE's homebrewed
-implementations. In addition, unlike Boost, ACE implementations demonstrate
-little understanding of the power and pitfalls of C++ programming and take no
-advantage of modern
-templating techniques to ensure compile time safety and reasonable compiler
-error messages. For all these reasons, ACE was rejected. Instead, we chose
-to implement our own library, described in the following sections.
-
-\subsection{Thread Primitives}
-
-The Thrift thread libraries are implemented in the namespace\\
-\texttt{facebook::thrift::concurrency} and have three components:
-\begin{itemize}
-\item primitives
-\item thread pool manager
-\item timer manager
-\end{itemize}
-
-As mentioned above, we were hesitant to introduce any additional dependencies
-into Thrift. We decided to use \texttt{boost::shared\_ptr} because it is so
-useful for multithreaded applications, it requires no link-time or
-runtime libraries (i.e., it is a pure template library), and it is due
-to become part of the C++0x standard.
-
-We implement standard \texttt{Mutex} and \texttt{Condition} classes, and a
-\texttt{Monitor} class. The latter is simply a combination of a mutex and
-condition variable and is analogous to the \texttt{Monitor} implementation
-provided for the Java \texttt{Object} class. This is also sometimes referred
-to as a barrier. We provide a \texttt{Synchronized} guard class to allow
-Java-like synchronized blocks. This is just a bit of syntactic sugar, but,
-like its Java counterpart, clearly delimits critical sections of code. Unlike
-its Java counterpart, we still have the ability to programmatically lock,
-unlock, block, and signal monitors.
-
-\begin{verbatim}
-void run() {
- {Synchronized s(manager->monitor);
-  if (manager->state == TimerManager::STARTING) {
-    manager->state = TimerManager::STARTED;
-    manager->monitor.notifyAll();
-  }
- }
-}
-\end{verbatim}
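-
-To make the relationship between the monitor and its guard concrete, here is
-a minimal sketch built on the standard library rather than on Thrift's own
-classes. It assumes a C++11 compiler; the member names mirror the usage shown
-above, but the bodies are illustrative and are not the Thrift implementation.
-
```cpp
#include <condition_variable>
#include <mutex>

// A mutex paired with a condition variable, in the spirit of a Java monitor.
class Monitor {
 public:
  void lock() { mutex_.lock(); }
  void unlock() { mutex_.unlock(); }

  // The caller must already hold the lock. The lock is released while
  // waiting and is held again by the caller when wait() returns.
  void wait() {
    std::unique_lock<std::mutex> held(mutex_, std::adopt_lock);
    cond_.wait(held);
    held.release();  // leave the mutex locked; the caller still owns it
  }

  void notifyAll() { cond_.notify_all(); }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
};

// Scope guard: locks the monitor on construction, unlocks on destruction.
class Synchronized {
 public:
  explicit Synchronized(Monitor& monitor) : monitor_(monitor) {
    monitor_.lock();
  }
  ~Synchronized() { monitor_.unlock(); }

  Synchronized(const Synchronized&) = delete;
  Synchronized& operator=(const Synchronized&) = delete;

 private:
  Monitor& monitor_;
};
```
-
-Because \texttt{lock}, \texttt{unlock}, \texttt{wait}, and \texttt{notifyAll}
-remain public, code can still drive the monitor directly when a scoped guard
-does not fit.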
-
-We again borrowed from Java the distinction between a thread and a runnable
-class. A \texttt{Thread} is the actual schedulable object. The
-\texttt{Runnable} is the logic to execute within the thread.
-The \texttt{Thread} implementation deals with all the platform-specific thread
-creation and destruction issues, while the \texttt{Runnable} implementation
-deals with the application-specific per-thread logic. The benefit of this
-approach is that developers can easily subclass the \texttt{Runnable} class
-without pulling in platform-specific super-classes.
-
-\subsection{Thread, Runnable, and shared\_ptr}
-We use \texttt{boost::shared\_ptr} throughout the \texttt{ThreadManager} and
-\texttt{TimerManager} implementations to guarantee cleanup of dead objects
-that can be accessed by multiple threads. For \texttt{Thread} class
-implementations,
-\texttt{boost::shared\_ptr} usage requires particular attention to make sure
-\texttt{Thread} objects are neither leaked nor dereferenced prematurely while
-creating and shutting down threads.
-
-Thread creation requires calling into a C library (in our case the POSIX
-thread library, \texttt{libpthread}, but the same would be true for WIN32
-threads). Typically, the OS makes few, if any, guarantees about when
-\texttt{ThreadMain}, a C thread's entry-point function, will be called.
-Therefore, it is possible that our thread create call,
-\texttt{ThreadFactory::newThread()}, could return to the caller
-well before that time. To ensure that the returned \texttt{Thread} object is
-not prematurely cleaned up if the caller gives up its reference prior to the
-\texttt{ThreadMain} call, the \texttt{Thread} object makes a weak reference to
-itself in its \texttt{start} method.
-
-With the weak reference in hand, the \texttt{ThreadMain} function can attempt
-to get a strong reference before entering the \texttt{Runnable::run} method of
-the \texttt{Runnable} object bound to the \texttt{Thread}. If no strong
-references to the thread are obtained between exiting \texttt{Thread::start}
-and entering \texttt{ThreadMain}, the weak reference returns \texttt{null} and
-the function exits immediately.
-
-The need for the \texttt{Thread} to make a weak reference to itself has a
-significant impact on the API. Since references are managed through the
-\texttt{boost::shared\_ptr} templates, the \texttt{Thread} object must have a
-reference to itself wrapped by the same \texttt{boost::shared\_ptr} envelope
-that is returned to the caller. This necessitated the use of the factory
-pattern. \texttt{ThreadFactory} creates the raw \texttt{Thread} object and a
-\texttt{boost::shared\_ptr} wrapper, and calls a private helper method of the
-class implementing the \texttt{Thread} interface (in this case,
-\texttt{PosixThread::weakRef}) to allow it to make a weak reference to itself
-through the \texttt{boost::shared\_ptr} envelope.
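-
-The factory and weak self-reference can be sketched as follows. This is an
-illustration only: it uses \texttt{std::shared\_ptr} and
-\texttt{std::weak\_ptr} as stand-ins for the Boost templates, no OS thread is
-actually created, and \texttt{SketchThread}, \texttt{CountingRunnable}, and
-the free function \texttt{newThread} are hypothetical names.
-
```cpp
#include <cassert>
#include <memory>
#include <utility>

class Runnable {
 public:
  virtual ~Runnable() = default;
  virtual void run() = 0;
};

// Hypothetical runnable that counts how many times it has been run.
class CountingRunnable : public Runnable {
 public:
  explicit CountingRunnable(int* counter) : counter_(counter) {}
  void run() override { ++*counter_; }

 private:
  int* counter_;
};

class SketchThread {
 public:
  explicit SketchThread(std::shared_ptr<Runnable> runnable)
      : runnable_(std::move(runnable)) {}

  // Called by the factory so the object can observe the same shared_ptr
  // envelope that is handed back to the caller.
  void weakRef(const std::shared_ptr<SketchThread>& self) {
    assert(self.get() == this);
    self_ = self;
  }

  std::weak_ptr<SketchThread> self() const { return self_; }

  // Stand-in for the C entry point: promote the weak reference and run,
  // or exit immediately if the object has already been destroyed.
  static bool threadMain(const std::weak_ptr<SketchThread>& weak) {
    std::shared_ptr<SketchThread> strong = weak.lock();
    if (!strong) return false;
    strong->runnable_->run();
    return true;
  }

 private:
  std::shared_ptr<Runnable> runnable_;
  std::weak_ptr<SketchThread> self_;  // weak, so it never keeps us alive
};

// Factory: creates the object and its envelope, then binds the weak
// self-reference through that same envelope.
std::shared_ptr<SketchThread> newThread(std::shared_ptr<Runnable> runnable) {
  std::shared_ptr<SketchThread> thread =
      std::make_shared<SketchThread>(std::move(runnable));
  thread->weakRef(thread);
  return thread;
}
```
-
-In current C++, \texttt{std::enable\_shared\_from\_this} covers the same need
-without an explicit helper method.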
-
-\texttt{Thread} and \texttt{Runnable} objects reference each other. A
-\texttt{Runnable} object may need to know about the thread in which it is
-executing, and a \texttt{Thread}, obviously, needs to know what
-\texttt{Runnable} object it is hosting. This interdependency is further
-complicated because the lifecycle of each object is independent of the
-other. An application may create a set of \texttt{Runnable} objects to be
-reused in different threads, or it may create and forget a \texttt{Runnable}
-object once a thread has been created and started for it.
-
-The \texttt{Thread} class takes a \texttt{boost::shared\_ptr} reference to the
-hosted \texttt{Runnable} object in its constructor, while the
-\texttt{Runnable} class has an explicit \texttt{thread} method to allow
-explicit binding of the hosted thread.
-\texttt{ThreadFactory::newThread} binds the objects to each other.
-
-\subsection{ThreadManager}
-
-\texttt{ThreadManager} creates a pool of worker threads and
-allows applications to schedule tasks for execution as free worker threads
-become available. The \texttt{ThreadManager} does not implement dynamic
-thread pool resizing, but provides primitives so that applications can add
-and remove threads based on load. This approach was chosen because
-implementing load metrics and thread pool sizing is very application-specific.
-For example, some applications may want to adjust pool size based on a
-running average of work arrival rates that are measured via polled
-samples. Others may simply wish to react immediately to work-queue
-depth high and low water marks. Rather than trying to create a complex
-API abstract enough to capture these different approaches, we
-simply leave it up to the particular application and provide the
-primitives to enact the desired policy and sample current status.
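-
-As one example of an application-side policy, the sketch below implements the
-high and low water mark approach mentioned above. Everything in it is
-hypothetical (\texttt{WaterMarks}, \texttt{workerDelta}, and the thresholds);
-it only decides whether to grow or shrink the pool and leaves the actual
-change to whatever add and remove primitives the \texttt{ThreadManager}
-exposes.
-
```cpp
#include <cassert>
#include <cstddef>

// Hypothetical application policy; these names are not part of Thrift.
struct WaterMarks {
  std::size_t low;         // at or below this queue depth, shed a worker
  std::size_t high;        // at or above this queue depth, add a worker
  std::size_t minWorkers;  // never shrink the pool below this size
  std::size_t maxWorkers;  // never grow the pool beyond this size
};

// Returns +1 to add a worker, -1 to remove one, and 0 to leave the pool alone.
int workerDelta(std::size_t queueDepth, std::size_t workers,
                const WaterMarks& policy) {
  assert(policy.low < policy.high);
  assert(policy.minWorkers <= policy.maxWorkers);
  if (queueDepth >= policy.high && workers < policy.maxWorkers) return +1;
  if (queueDepth <= policy.low && workers > policy.minWorkers) return -1;
  return 0;
}
```
-
-A running-average policy would keep the same return convention and differ
-only in how the decision is computed.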
-
-\subsection{TimerManager}
-
-\texttt{TimerManager} allows applications to schedule
-\texttt{Runnable} objects for execution at some point in the future. Its
-specific task is to allow applications to sample \texttt{ThreadManager} load
-at regular intervals and make changes to the thread pool size based on
-application policy. Of course, it can be used to generate any number of timer
-or alarm events.
-
-The default implementation of \texttt{TimerManager} uses a single thread to
-execute expired \texttt{Runnable} objects. Thus, if a timer operation needs to
-do a large amount of work and especially if it needs to do blocking I/O,
-that should be done in a separate thread.
-
-\subsection{Nonblocking Operation}
-Though the Thrift transport interfaces map more directly to a blocking I/O
-model, we have implemented a high performance \texttt{TNonBlockingServer}
-in C++ based on \texttt{libevent} and the \texttt{TFramedTransport}. We
-implemented this by moving all I/O into one tight event loop using a
-state machine. Essentially, the event loop reads framed requests into
-\texttt{TMemoryBuffer} objects. Once entire requests are ready, they are
-dispatched to the \texttt{TProcessor} object which can read directly from
-the data in memory.
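-
-The state machine at the heart of such an event loop can be sketched as a
-small frame assembler. This is an illustration under stated assumptions, not
-the server's code: it assumes each frame carries a 4-byte big-endian length
-prefix, it uses \texttt{std::string} where the server uses
-\texttt{TMemoryBuffer}, it omits \texttt{libevent} entirely, and
-\texttt{FrameAssembler} is a hypothetical name.
-
```cpp
#include <cstdint>
#include <string>
#include <vector>

// Accumulates bytes as they arrive and yields each payload once its frame
// is complete. Partial headers and partial bodies are carried across calls.
class FrameAssembler {
 public:
  std::vector<std::string> feed(const std::string& bytes) {
    std::vector<std::string> complete;
    for (unsigned char b : bytes) {
      if (state_ == State::ReadingHeader) {
        header_ = (header_ << 8) | b;  // big-endian accumulation
        if (++headerBytes_ == 4) {
          remaining_ = header_;
          header_ = 0;
          headerBytes_ = 0;
          if (remaining_ == 0) {
            complete.push_back(std::string());  // empty frame
          } else {
            state_ = State::ReadingBody;
          }
        }
      } else {
        body_.push_back(static_cast<char>(b));
        if (--remaining_ == 0) {
          complete.push_back(body_);  // frame is ready to dispatch
          body_.clear();
          state_ = State::ReadingHeader;
        }
      }
    }
    return complete;
  }

 private:
  enum class State { ReadingHeader, ReadingBody };

  State state_ = State::ReadingHeader;
  std::uint32_t header_ = 0;     // length prefix being assembled
  int headerBytes_ = 0;          // header bytes consumed so far
  std::uint32_t remaining_ = 0;  // body bytes still expected
  std::string body_;
};
```
-
-Each returned payload corresponds to one request that could be handed to a
-processor without any further blocking reads.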
-
-\subsection{Compiler}
-The Thrift compiler is implemented in C++ using standard
-\texttt{lex}/\texttt{yacc} lexing and parsing. Though it could have been
-implemented with fewer lines of code in another language (e.g., Python
-Lex-Yacc (PLY) or \texttt{ocamlyacc}), using C++
-forces explicit definition of the language constructs. Strongly typing the
-parse tree elements (debatably) makes the code more approachable for new
-developers.
-
-Code generation is done using two passes. The first pass looks only for
-include files and type definitions. Type definitions are not checked during
-this phase, since they may depend upon include files. All included files
-are sequentially scanned in a first pass. Once the include tree has been
-resolved, a second pass over all files is taken that inserts type definitions
-into the parse tree and raises an error on any undefined types. The program is
-then generated against the parse tree.
-
-Due to inherent complexities and potential for circular dependencies,
-we explicitly disallow forward declaration. Two Thrift structs cannot
-each contain an instance of the other. (Since we do not allow \texttt{null}
-struct instances in the generated C++ code, this would actually be impossible.)
-
-\subsection{TFileTransport}
-The \texttt{TFileTransport} logs Thrift requests/structs by
-framing incoming data with its length and writing it out to disk.
-Using a framed on-disk format allows for better error checking and
-helps with the processing of a finite number of discrete events. The\\
-\texttt{TFileWriterTransport} uses a system of swapping in-memory buffers
-to ensure good performance while logging large amounts of data.
-A Thrift log file is split up into chunks of a specified size; logged messages
-are not allowed to cross chunk boundaries. A message that would cross a chunk
-boundary causes padding to be added up to the end of the chunk, so that the
-first byte of the message is aligned to the beginning of the next chunk.
-Partitioning the file into chunks makes it possible to read and interpret data
-from a particular point in the file.
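-
-The chunk-boundary rule reduces to a small piece of arithmetic, sketched
-below. The function name \texttt{placeMessage} is hypothetical, and
-\texttt{size} is assumed to be the full on-disk size of the framed message
-(length header included) and no larger than one chunk; how the padding bytes
-themselves are written is not modeled.
-
```cpp
#include <cassert>
#include <cstdint>

// Returns the file offset at which a framed message of `size` bytes should
// begin, given the current write offset and the chunk size. If the message
// would straddle a chunk boundary, the result is the start of the next
// chunk; the gap between `offset` and the result is the padding.
std::uint64_t placeMessage(std::uint64_t offset, std::uint64_t size,
                           std::uint64_t chunkSize) {
  assert(chunkSize > 0 && size <= chunkSize);
  std::uint64_t usedInChunk = offset % chunkSize;
  if (usedInChunk + size <= chunkSize) {
    return offset;  // fits in the current chunk, no padding needed
  }
  return offset + (chunkSize - usedInChunk);  // skip to the next chunk
}
```
-
-A reader can likewise seek to any multiple of the chunk size and expect a
-message to start there, provided a message was written at or after that
-boundary.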
-
-\section{Facebook Thrift Services}
-Thrift has been employed in a large number of applications at Facebook,
-including search, logging, mobile, ads and the developer platform. Two
-specific usages are discussed below.
-
-\subsection{Search}
-Thrift is used as the underlying protocol and transport layer for the Facebook
-Search service. The multi-language code generation is well suited for search
-because it allows for application development in an efficient server-side
-language (C++) and allows the Facebook PHP-based web application to make calls
-to the search service using Thrift PHP libraries. There is also a large
-variety of search stats, deployment and testing functionality that is built on
-top of generated Python code. Additionally, the Thrift log file format is
-used as a redo log for providing real-time search index updates. Thrift has
-allowed the search team to leverage each language for its strengths and to
-develop code at a rapid pace.
-
-\subsection{Logging}
-The Thrift \texttt{TFileTransport} functionality is used for structured
-logging. Each service function definition along with its parameters can be
-considered to be a structured log entry identified by the function name. This
-log can then be used for a variety of purposes, including inline and offline
-processing, stats aggregation and as a redo log.
-
-\section{Conclusions}
-Thrift has enabled Facebook to build scalable backend
-services efficiently by enabling engineers to divide and conquer. Application
-developers can focus on application code without worrying about the
-sockets layer. We avoid duplicated work by writing buffering and I/O logic
-in one place, rather than interspersing it in each application.
-
-Thrift has been employed in a wide variety of applications at Facebook,
-including search, logging, mobile, ads, and the developer platform. We have
-found that the marginal performance cost incurred by an extra layer of
-software abstraction is far eclipsed by the gains in developer efficiency and
-systems reliability.
-
-\appendix
-
-\section{Similar Systems}
-The following are software systems similar to Thrift. Each is (very!) briefly
-described:
-
-\begin{itemize}
-\item \textit{SOAP.} XML-based. Designed for web services via HTTP, excessive
-XML parsing overhead.
-\item \textit{CORBA.} Relatively comprehensive, debatably overdesigned and
-heavyweight. Comparably cumbersome software installation.
-\item \textit{COM.} Embraced mainly in Windows client software. Not an entirely
-open solution.
-\item \textit{Pillar.} Lightweight and high-performance, but missing versioning
-and abstraction.
-\item \textit{Protocol Buffers.} Closed-source, owned by Google. Described in
-Sawzall paper.
-\end{itemize}
-
-\acks
-
-Many thanks for feedback on Thrift (and extreme trial by fire) are due to
-Martin Smith, Karl Voskuil and Yishan Wong.
-
-Thrift is a successor to Pillar, a similar system developed
-by Adam D'Angelo, first while at Caltech and continued later at Facebook.
-Thrift simply would not have happened without Adam's insights.
-
-\begin{thebibliography}{}
-
-\bibitem{boost.threads}
-Kempf, William,
-``Boost.Threads'',
-\url{http://www.boost.org/doc/html/threads.html}
-
-\bibitem{boost.threadpool}
-Henkel, Philipp,
-``threadpool'',
-\url{http://threadpool.sourceforge.net}
-
-\end{thebibliography}
-
-\end{document}
