On Dec 1, 2010, at 11:11 AM, Owen O'Malley wrote:
All,
We really need some guidance on the general direction for the
project. Please comment and/or vote. If no one cares, then I'll
probably commit it to Yahoo's internal branch.
-- Owen
The question is how the Hadoop project wants to move forward.
It was motivated by Doug's veto of HADOOP-6685, which was based on
his personal decisions about how the project should go forward and
not on anything that had been decided by the PMC.
These decisions are much more important to MapReduce, which is a
framework, than to HDFS, which follows a client/server model.
1. Should Hadoop include a user-facing library of useful code?
There has been a suggestion that user-facing library code, such as
SequenceFile, TFile, DistCp, etc. should be deprecated and that
Hadoop should allow third party projects like Avro to supply the
user-facing library code that makes Hadoop usable. I think it is
critical that we keep those components as part of Hadoop and extend
them as the framework evolves. Users depend heavily on SequenceFile
for storing their data in Hadoop, and it should not be deprecated
as Doug has suggested.
2. Should MapReduce support non-Writables through the pipeline out
of the box?
There has also been a discussion about whether we should support
non-Writables natively. There is already library code in Avro that lets
users use Avro types in a custom MapReduce API. A general MapReduce
API that encompasses all of the serialization frameworks and does
not lock users into a particular one is much more powerful.
Furthermore, making it convenient for the users, by including the
plugins in the default configuration and class path, will enable the
use of Avro, Thrift and ProtoBuf objects by people who would rather
not focus on serialization. Avro and Writables should not be the
only first class serializations that Hadoop supports by default.
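To make the idea of a serialization-agnostic pipeline concrete, here is a
minimal sketch, not the actual Hadoop API. Every name in it (Plugin,
StringPlugin, IntPlugin, Registry, roundTrip) is hypothetical and exists
only for illustration. The two plugins stand in for real frameworks such
as Avro, Thrift, or ProtoBuf. The framework asks each registered plugin
whether it accepts a class and never hard-codes a particular one.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names are Hadoop's actual API. */
public class PluggableSerializationSketch {

    /** A plugin declares which classes it handles and how to encode them. */
    interface Plugin {
        boolean accepts(Class<?> type);
        byte[] serialize(Object value);
        Object deserialize(byte[] bytes);
    }

    /** Stand-in for one framework: encodes String as UTF-8 bytes. */
    static final class StringPlugin implements Plugin {
        public boolean accepts(Class<?> type) { return type == String.class; }
        public byte[] serialize(Object value) {
            return ((String) value).getBytes(StandardCharsets.UTF_8);
        }
        public Object deserialize(byte[] bytes) {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    /** Stand-in for another framework: encodes Integer as 4 big-endian bytes. */
    static final class IntPlugin implements Plugin {
        public boolean accepts(Class<?> type) { return type == Integer.class; }
        public byte[] serialize(Object value) {
            return ByteBuffer.allocate(4).putInt((Integer) value).array();
        }
        public Object deserialize(byte[] bytes) {
            return ByteBuffer.wrap(bytes).getInt();
        }
    }

    /** Ordered list of plugins; the first one that accepts a class wins. */
    static final class Registry {
        private final List<Plugin> plugins = new ArrayList<>();

        Registry register(Plugin plugin) {
            plugins.add(plugin);
            return this;
        }

        Plugin lookup(Class<?> type) {
            for (Plugin plugin : plugins) {
                if (plugin.accepts(type)) {
                    return plugin;
                }
            }
            throw new IllegalArgumentException(
                "No serialization registered for " + type.getName());
        }
    }

    /** Framework-side code: moves a value through without knowing its format. */
    static Object roundTrip(Registry registry, Object value) {
        Plugin plugin = registry.lookup(value.getClass());
        return plugin.deserialize(plugin.serialize(value));
    }

    public static void main(String[] args) {
        Registry registry = new Registry()
            .register(new StringPlugin())
            .register(new IntPlugin());
        System.out.println(roundTrip(registry, "key"));
        System.out.println(roundTrip(registry, 42));
    }
}
```

Shipping such plugins in the default configuration would correspond to
pre-registering them, so a user never has to touch the registry.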
3. Should a framework dependency on ProtoBuf be allowed?
Doug has added several framework dependencies on Avro. The question
is whether it is acceptable to use the ProtoBuf library in the
framework. Avro is good for uses where there are a lot of objects of
the same type. ProtoBuf is better for a small number of objects. The
question is whether Avro, JSON, and XML should be the only
serialization libraries that are acceptable to use in the framework.