Am joining the conversation late, but another option is Hadoop's own RecordIO. As with Thrift, you need compiler-generated stubs to read and write records, but it also supports schemas. You can de/serialize schemas separately from content, which gives you a lot of flexibility.
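For anyone who hasn't seen it: you describe records in a small DDL file and run it through Hadoop's record compiler (rcc) to generate the stubs. A rough sketch of what such a DDL looks like (the module and field names here are just illustrative):

```
module links {
    class Link {
        ustring URL;
        boolean isRelative;
        ustring anchorText;
    }
}
```

Something like `rcc --language java links.jr` then emits the Java classes you serialize with, so the schema lives alongside the data rather than being implicit in hand-written parsing code.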
> -----Original Message-----
> From: Bryan Duxbury [mailto:[EMAIL PROTECTED]
> Sent: Saturday, May 24, 2008 12:13 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Serialization format for structured data
>
> On May 23, 2008, at 9:51 AM, Ted Dunning wrote:
>
> > Relative to thrift, JSON has the advantage of not requiring a schema
> > as well as the disadvantage of not having a schema. The advantage is
> > that the data is more fluid and I don't have to generate code to
> > handle the records. The disadvantage is that I lose some data
> > completeness and typing guarantees.
> >
> > On balance, I would like to use JSON-like data quite a bit in ad hoc
> > data streams and in logs where the producer and consumer of the data
> > are not visible to parts of the data processing chain.
>
> That about sums it up. If you want schema, Thrift is your friend. If
> you don't, JSON probably will do pretty well for you.
>
> -Bryan
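Ted's trade-off is easy to see in a quick sketch (Python here just for brevity; the field names are made up): any consumer can read the record with no generated stubs, but nothing enforces types or required fields.

```python
import json

# A hypothetical log record serialized as one JSON line: no schema,
# no compiler-generated stubs required on either end.
line = json.dumps({"user": "ted", "action": "click", "count": 3})

# Any downstream consumer can parse it without prior knowledge of the
# fields -- but nothing stops a producer from omitting "count" or
# sending it as a string, which is exactly the typing guarantee given up.
record = json.loads(line)
print(record["user"], record["count"])  # -> ted 3
```

With Thrift or RecordIO, the equivalent of that missing-field case fails at compile/serialize time instead of deep inside a consumer.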