Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "ProtocolBuffers" page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/ProtocolBuffers

Comment:
Write up protocol buffers

New page:
= ProtocolBuffers =

!ProtocolBuffers is Google's open source, platform-neutral and
language-neutral interprocess-communication (IPC) and serialization
framework. It has an Interface Definition Language (IDL) that is used to
describe the wire and file formats; this IDL is pre-compiled into source
code for the target languages (Python, Java and C++ included), and the
generated code is then used in the applications.
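
As a taste of the IDL, here is a minimal sketch of a hypothetical
{{{.proto}}} file; the package, message and field names are invented for
illustration and are not part of Hadoop:

{{{
// person.proto: a hypothetical example for illustration only.

package example;

// Names the generated Java wrapper class.
option java_outer_classname = "PersonProtos";

// Each field carries a unique tag number; the tag, not the field
// name, is what appears on the wire.
message Person {
  required string name  = 1;
  optional int32  id    = 2;
  repeated string email = 3;
}
}}}

Running this file through the protoc compiler produces source for each
target language, which is then compiled into the application.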

Hadoop 0.23+ requires the protocol buffers JAR to be on the classpath of
both clients and servers; the native protoc compiler binaries are required
to build this and later versions of Hadoop from source.

In comparison with previous IDLs (such as CORBA, DCOM and !SunOS RPC),
!ProtocolBuffers is designed to offer
 * Simple remote procedure calls (not Object-Oriented communication in the
style of CORBA).
 * Efficient binary serialization of raw data.
 * High efficiency in terms of bandwidth, serialization and deserialization.
In a large Hadoop cluster, network bandwidth, especially to and from the
NameNode, JobTracker and, in NextGenMapReduce, the ResourceManager, is
precious. An efficient wire format not only saves bandwidth to and from
these master nodes, it can reduce load and congestion on the main switching
fabric of a large cluster.
 * Excellent support for forward versioning, in which a remote service can
support older versions of a client.
 * Workable support for backward versioning, in which a remote service can
support newer versions of a client. This requires more careful programming
in the service code; a sketch of how fields evolve across versions follows
this list.
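
To make the versioning support concrete, here is a hedged sketch of a
second revision of the hypothetical Person message defined earlier. The
rules it follows (new fields are optional with fresh tag numbers, and tag
numbers are never reused) are what make both directions of compatibility
work:

{{{
// person.proto, version 2 of the hypothetical example.
// A reader built against version 1 silently skips the unknown tag 4;
// a reader built against version 2 sees the default value when parsing
// data written by a version-1 writer.
message Person {
  required string name  = 1;
  optional int32  id    = 2;
  repeated string email = 3;
  optional string phone = 4;  // new field: optional, with a fresh tag
}
}}}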

Its closest equivalent format is [[http://thrift.apache.org/|Apache
Thrift]].

The protocol is significantly different from the Web Services WS-* stack,
which has been criticised by [[SteveLoughran|Steve Loughran]] and Edmund
Smith in [[http://www.hpl.hp.com/techreports/2005/HPL-2005-83.pdf|Rethinking
the Java SOAP Stack]] and by Steve Vinoski in
[[http://steve.vinoski.net/pdf/IEEE-RPC_Under_Fire.pdf|RPC under fire]].
Their argument is that the WS-* language for describing data, XML Schema,
is not completely mappable to the Object-Oriented model of today's
languages, yet the WS-* stacks attempt to do so seamlessly, even across
languages. Loughran and Smith regard such an O/X mapping as being as
unsolvable as a perfect O/R mapping, and hence doomed.

Instead, SOAP stacks should embrace the XML nature of the documents and use
mechanisms such as !XPath to work with the XML content directly. No widely
used SOAP stack does this, as WS-* developers appear to prefer writing
implementation-first code in which the datatypes are written in their
native language and the interface specification is reverse-engineered from
them; everyone then hopes that this specification will be convertible into
usable datatypes in other languages, and stable across protocol versions.

!ProtocolBuffers and Thrift both require the IDL to be specified first, and
have a code generation stage that generates language-specific code from it.
Version support is explicitly handled: every field carries a tag number in
the IDL, and cross-version compatibility is defined in terms of those tags.
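
As an illustration of what the generated code looks like in use, here is a
minimal sketch of a serialization round trip in Java. It assumes the
hypothetical person.proto from the earlier examples has been run through
protoc; the class and package names are those protoc would derive from that
file, not anything shipped with Hadoop:

{{{
// Generate the Java source first, for example:
//   protoc --java_out=src/main/java person.proto

import example.PersonProtos.Person;

public class PersonRoundTrip {
  public static void main(String[] args) throws Exception {
    // Build a message through the generated builder API.
    Person original = Person.newBuilder()
        .setName("Ada Lovelace")
        .setId(42)
        .addEmail("ada@example.org")
        .build();

    // Serialize to the compact binary wire format...
    byte[] wire = original.toByteArray();

    // ...and parse it back. The reader must know the message type at
    // compile time; the bytes on the wire do not describe themselves.
    Person parsed = Person.parseFrom(wire);
    System.out.println(parsed.getName() + ": " + wire.length + " bytes");
  }
}
}}}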

One criticism of both !ProtocolBuffers and Thrift is that the content is
not self-describing: the reader is expected to know the specific datatypes
and interfaces at compile time, though possibly a different version of them.
[[http://avro.apache.org/|Apache Avro]] does include in-content type 
declarations and runtime parsing, which is why some organizations using Hadoop 
consider it a significantly better format for persistent data: it becomes 
possible to parse files without advance knowledge of their structure.
