Re: Thrift and Hadoop; especially: Java support in Thrift

Jeff Hammerbacher Sun, 26 Oct 2008 15:16:51 -0700

Hey Siamak,

I am not entirely clear on the issues they have, but having a robust
server implementation using NIO was a big one when we last spoke. They
also have concerns about using Thrift as a format for persisting data,
as the binary protocol is too verbose, but that issue has been
discussed to death on this list.

For more detail, a quote from Doug Cutting, one of the creators of
Hadoop, is posted below. You can reply to his comments on the
hadoop-dev list.

"""
I've been thinking about this, and here's where I've come to:

It's not just RPC.  We need a single, primary object serialization
system that's used for RPC and for most file-based application data.

Scripting languages are primary users of Hadoop.  We must thus make it
easy and natural for scripting languages to process data with Hadoop.

Data should be self-describing.  For example, a script should be able
to read a file without having to first generate code specific to the
records in that file.  Similarly, a script should be able to write
records without having to externally define their schema.

We need an efficient binary file format.  A file of records should not
repeat the record names with each record.  Rather, the record schema
used should be stored in the file once.  Programs should be able to
read the schema and efficiently produce instances from the file.

The schema language should support specification of required and
optional fields, so that class definitions may evolve.

For some languages (e.g., Java & C) one may wish to generate native
classes to represent a schema, and to read & write instances.

So, how well does Thrift meet these needs?  Thrift's IDL is a schema
language, and JSON is a self-describing data format.  But arbitrary
JSON data is not generally readable by any Thrift-based program.  And
Thrift's binary formats are not self-describing: they do not include
the IDL.  Nor does the Thrift runtime in each language permit one to
read an IDL specification and then use it to efficiently read and
write compact, self-describing data.

I wonder if we might instead use JSON schemas to describe data.

http://groups.google.com/group/json-schema/web/json-schema-proposal---second-draft

We'd implement, in each language, a codec that, given a schema, can
efficiently read and write instances of that schema.  (JSON schemas
are JSON data, so any language that supports JSON can already read and
write a JSON schema.)  The writer could either take a provided schema,
or automatically induce a schema from the records written.  Schemas
would be stored in data files, with the data.

JSON's not perfect.  It doesn't (yet) support binary data: that would
need to be fixed.  But I think Thrift's focus on code-generation makes
it less friendly to scripting languages, which are primary users of
Hadoop.  Code generation is possible given a schema, and may be useful
as an optimization in many cases, but it should be optional, not
central.

Folks should be able to process any file without external information
or external compilers.  A small runtime codec is all that should be
implemented in each language.  Even if that's not present, data could
be transparently and losslessly converted to and from textual JSON by,
e.g. C utility programs, since most languages already have JSON
codecs.
"""
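
To make the file layout in Doug's note a little more concrete, here is a
rough Python sketch of a data file that stores its schema once, as JSON, and
then writes each record as compact binary without repeating field names. The
type names ("int", "float", "string"), the length-prefixed header, and every
function name below are made up for illustration. This is not Thrift, not
the JSON schema proposal, and not any existing Hadoop format.

```python
import io
import json
import struct

# Hypothetical type names mapped to struct formats (illustration only).
_FIXED = {"int": "<q", "float": "<d"}
_TYPE_NAMES = {int: "int", float: "float", str: "string"}


def induce_schema(record):
    """Guess a flat schema (ordered [name, type] pairs) from one sample record."""
    return {"fields": [[k, _TYPE_NAMES[type(v)]] for k, v in record.items()]}


def _write_value(out, kind, value):
    if kind == "string":
        data = value.encode("utf-8")
        out.write(struct.pack("<I", len(data)))  # length prefix
        out.write(data)
    else:
        out.write(struct.pack(_FIXED[kind], value))


def _read_value(src, kind):
    if kind == "string":
        (n,) = struct.unpack("<I", src.read(4))
        return src.read(n).decode("utf-8")
    fmt = _FIXED[kind]
    (value,) = struct.unpack(fmt, src.read(struct.calcsize(fmt)))
    return value


def write_file(out, records, schema=None):
    """Write the schema once as a JSON header, then field values only."""
    records = list(records)
    if schema is None:
        if not records:
            raise ValueError("need a schema or at least one record")
        schema = induce_schema(records[0])
    header = json.dumps(schema).encode("utf-8")
    out.write(struct.pack("<I", len(header)))
    out.write(header)
    out.write(struct.pack("<I", len(records)))
    for rec in records:
        for name, kind in schema["fields"]:
            _write_value(out, kind, rec[name])


def read_file(src):
    """Read the embedded schema, then decode records with no generated code."""
    (n,) = struct.unpack("<I", src.read(4))
    schema = json.loads(src.read(n).decode("utf-8"))
    (count,) = struct.unpack("<I", src.read(4))
    records = [
        {name: _read_value(src, kind) for name, kind in schema["fields"]}
        for _ in range(count)
    ]
    return schema, records


if __name__ == "__main__":
    rows = [
        {"user": "ann", "clicks": 3, "score": 1.5},
        {"user": "bob", "clicks": 7, "score": 0.25},
    ]
    buf = io.BytesIO()
    write_file(buf, rows)  # schema is induced from the first record
    buf.seek(0)
    schema, decoded = read_file(buf)
    print(json.dumps(schema))
    print(decoded)
```

A reader needs only the small codec above plus a JSON parser; the field
names appear once, in the header, however many records follow.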

Regards,
Jeff

On Fri, Oct 24, 2008 at 2:15 PM, Siamak Haschemi
<[EMAIL PROTECTED]> wrote:
> Hello Jeff,
>
> is it possible that you give some hints about *where* and *what* is poorly
> supported?
>
>
> Kind regards,
>
> Siamak Haschemi
>
> Bryan Duxbury schrieb:
>> I've been doing lots of Java work on Thrift for a while now. Are there
>> particular things that need to be fixed, or are you just noting that the
>> Java library is poor in general?
>>
>> -Bryan
>>
>> On Oct 24, 2008, at 1:02 AM, Jeff Hammerbacher wrote:
>>
>>> Hey Thrift Users and Developers,
>>>
>>> The Apache Hadoop community is going through the process of hardening
>>> Hadoop in preparation for a 1.0 release. The process is being
>>> documented here: http://wiki.apache.org/hadoop/Release1.0Requirements.
>>>
>>> If you check out the "Multi-language serialization" part of the linked
>>> document, you'll see that there is a debate going on about which
>>> cross-language RPC framework to use for Hadoop going forward. The
>>> major contenders are Thrift, Protocol Buffers, Etch, and Hessian. One
>>> of the major reasons I pushed to get Thrift into Apache when at
>>> Facebook was the opportunity to replace Hadoop's RPC mechanisms with
>>> Thrift. I guess now is the moment of truth.
>>>
>>> If you'd like to see Hadoop adopt Thrift as its internal and external
>>> RPC framework, please voice your opinion on the Hadoop development
>>> list. If you want to go the extra mile, the biggest blocker to Thrift
>>> adoption within the Hadoop community is its poor support for Java. If
>>> you have some time available and you're a Java wizard, any code you
>>> can contribute to Thrift in the next few weeks will make a difference
>>> in the push to get Thrift adopted by the Hadoop community.
>>>
>>> Anyways, as a Thrift and Hadoop fanboy, I'm just trying to do some
>>> cheerleading to make the marriage happen.
>>>
>>> Regards,
>>> Jeff
>>
>
