The Hadoop 1.0 wiki has a section on compatibility.
http://wiki.apache.org/hadoop/Release1.0Requirements

Since the wiki is awkward for discussions, I am continuing the discussion here.
I or someone will update the wiki when agreements are reached.

Here is the current list of compatibility requirements on the Hadoop 1.0 Wiki for the convenience of this email thread.
--------
What does Hadoop 1.0 mean?
* Standard release numbering: only bug fixes in 1.x.y releases and new features in 1.x.0 releases.
* No need for client recompilation when upgrading from 1.x to 1.y, where x <= y
      o Can't remove deprecated classes or methods until 2.0
* Old 1.x clients can connect to new 1.y servers, where x <= y
* New FileSystem clients must be able to call old methods when talking to old servers. This will generally be done by having old methods continue to use old RPC methods. However, it is legal to have new implementations of old methods call new RPC methods, as long as the library transparently handles the fallback case for old servers.
-----------------
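To make the last wiki item above concrete, here is a rough sketch of the fallback pattern it describes. The RPC interface, method names and exception below are invented purely for illustration; they are not actual Hadoop classes.

import java.io.IOException;

// Hypothetical sketch of the "new client, old server" fallback rule.
public class FallbackClient {

  // Invented stand-in for the namenode RPC interface.
  interface NamenodeRpc {
    String[] getListing(String path) throws IOException;     // old RPC, understood by all servers
    String[] getListingV2(String path) throws IOException;   // new RPC, new servers only
  }

  // Invented stand-in for "the server does not implement this RPC".
  static class UnknownRpcException extends IOException {}

  private final NamenodeRpc namenode;

  FallbackClient(NamenodeRpc namenode) { this.namenode = namenode; }

  // Old public method: must keep working against both old and new servers.
  public String[] list(String path) throws IOException {
    try {
      // Prefer the new RPC when the server supports it.
      return namenode.getListingV2(path);
    } catch (UnknownRpcException e) {
      // Transparent fallback for old servers, as the requirement demands.
      return namenode.getListing(path);
    }
  }
}

The key point is that the fallback happens inside the client library, so applications built against the old API never see the difference.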

A couple of additional compatibility requirements:

* HDFS metadata and data are preserved across release changes, both major and minor. That is, whenever a cluster is upgraded to a new release, the HDFS metadata from the old release is converted automatically as needed.

The above has been followed so far in Hadoop; I am just documenting it in the 1.0 requirements list.

* In a major release transition [i.e. from a release x.y to a release (x+1).0], a user should be able to read data from a cluster running the old version. (Or shall we generalize this to: from x.y to (x+i).z?)

The motivation: copying data across clusters is a common operation for many customers (for example, it is done routinely at Yahoo). Today, http (or hftp) provides a guaranteed-compatible way of copying data across versions. Clearly one cannot force a customer to simultaneously upgrade all of its Hadoop clusters to a new major release. The above documents this requirement; we can satisfy it via the http/hftp mechanism or some other mechanism.
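As a concrete illustration of the http/hftp path, here is a minimal sketch of copying one file from an older cluster, read over its version-independent hftp interface, into a newer cluster over hdfs. The host names, ports and paths are made up for the example; in practice distcp (e.g. hadoop distcp hftp://old-nn.example.com:50070/data hdfs://new-nn.example.com:8020/data) does the same thing at scale.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CrossVersionCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Source: the old cluster, read over its HTTP-based hftp interface
    // (read-only, and stable across versions). Host/port are examples only.
    FileSystem srcFs = FileSystem.get(URI.create("hftp://old-nn.example.com:50070"), conf);

    // Destination: the new cluster, written over the normal hdfs RPC interface.
    FileSystem dstFs = FileSystem.get(URI.create("hdfs://new-nn.example.com:8020"), conf);

    // Copy a single path; do not delete the source.
    FileUtil.copy(srcFs, new Path("/data/input"), dstFs, new Path("/data/input"), false, conf);
  }
}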

Question: is one willing to break applications that operate across clusters (i.e. an application that accesses data across clusters that straddle a major release boundary)? I asked the operations team at Yahoo that runs our Hadoop clusters. We currently do not have any applications that access data across clusters as part of an MR job, the reason being that Hadoop routinely breaks wire compatibility across releases, so such apps would be very unreliable. However, copying data across clusters is crucial and needs to be supported.

Shall we add a stronger requirement for 1.0: wire compatibility across major versions? This can be supported by class loading or other games; note that we can wait to provide it until 2.0 happens. If Hadoop provided this guarantee, it would allow customers to partition their data across clusters without risking apps breaking across major releases due to wire incompatibility issues.
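To make the "class loading" option a little more concrete, here is one hypothetical sketch: ship a client jar per supported major version and load the one matching the remote cluster in an isolated classloader, invoking it reflectively so the two versions never collide. The jar layout and version-selection logic are pure assumptions for illustration; nothing like this exists in Hadoop today, and a real client would also need the jar's dependencies on the isolated classpath.

import java.io.File;
import java.lang.reflect.Method;
import java.net.URI;
import java.net.URL;
import java.net.URLClassLoader;

public class VersionedClientLoader {
  // Hypothetical: load the client jar that matches the remote cluster's major
  // version in its own classloader and obtain a FileSystem through reflection.
  public static Object openFileSystem(int remoteMajorVersion, String uri) throws Exception {
    File clientJar = new File("lib/hadoop-client-" + remoteMajorVersion + ".jar"); // assumed layout
    ClassLoader isolated = new URLClassLoader(new URL[] { clientJar.toURI().toURL() },
                                              null /* no parent: keep client versions separate */);

    Class<?> fsClass = Class.forName("org.apache.hadoop.fs.FileSystem", true, isolated);
    Class<?> confClass = Class.forName("org.apache.hadoop.conf.Configuration", true, isolated);

    Method get = fsClass.getMethod("get", URI.class, confClass);
    return get.invoke(null, URI.create(uri), confClass.newInstance());
  }
}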

