The Hadoop 1.0 wiki has a section on compatibility.
http://wiki.apache.org/hadoop/Release1.0Requirements

Since the wiki is awkward for discussions, I am continuing the discussion here.
I or someone will update the wiki when agreements are reached.

Here is the current list of compatibility requirements on the Hadoop 1.0 Wiki for the convenience of this email thread.
--------
What does Hadoop 1.0 mean?
* Standard release numbering: only bug fixes in 1.x.y releases and new features in 1.x.0 releases.
* No need for client recompilation when upgrading from 1.x to 1.y, where x <= y
      o Can't remove deprecated classes or methods until 2.0
* Old 1.x clients can connect to new 1.y servers, where x <= y
* New FileSystem clients must be able to call old methods when talking to old servers. This will generally be done by having old methods continue to use old RPC methods. However, it is legal to have new implementations of old methods call new RPC methods, as long as the library transparently handles the fallback case for old servers.
-----------------
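To make the last wiki item above concrete, here is a rough sketch of the fallback pattern it describes. The RPC interface, method names and exception below are invented purely for illustration; they are not actual Hadoop classes.

import java.io.IOException;

// Hypothetical sketch of the "new client, old server" fallback rule.
public class FallbackClient {

  // Invented stand-in for the namenode RPC interface.
  interface NamenodeRpc {
    String[] getListing(String path) throws IOException;     // old RPC, understood by all servers
    String[] getListingV2(String path) throws IOException;   // new RPC, new servers only
  }

  // Invented stand-in for "the server does not implement this RPC".
  static class UnknownRpcException extends IOException {}

  private final NamenodeRpc namenode;

  FallbackClient(NamenodeRpc namenode) { this.namenode = namenode; }

  // Old public method: must keep working against both old and new servers.
  public String[] list(String path) throws IOException {
    try {
      // Prefer the new RPC when the server supports it.
      return namenode.getListingV2(path);
    } catch (UnknownRpcException e) {
      // Transparent fallback for old servers, as the requirement demands.
      return namenode.getListing(path);
    }
  }
}

The key point is that the fallback happens inside the client library, so applications built against the old API never see the difference.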

A couple of additional compatibility requirements:

* HDFS metadata and data are preserved across release changes, both major and minor. That is, whenever a cluster is upgraded to a new release, the HDFS metadata from the old release is converted automatically as needed.

The above has been followed so far in Hadoop; I am just documenting it in the 1.0 requirements list.

* In a major release transition [i.e. from a release x.y to a release (x+1).0], a user should be able to read data from a cluster running the old version. (Or shall we generalize this to: from x.y to (x+i).z?)

The motivation: copying data across clusters is a common operation for many customers (for example, it is done routinely at Yahoo). Today, http (or hftp) provides a guaranteed-compatible way of copying data across versions. Clearly one cannot force a customer to simultaneously upgrade all of its Hadoop clusters to a new major release. The above documents this requirement; we can satisfy it via the http/hftp mechanism or some other mechanism.
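As a concrete illustration of the http/hftp path, here is a minimal sketch of copying one file from an older cluster, read over its version-independent hftp interface, into a newer cluster over hdfs. The host names, ports and paths are made up for the example; in practice distcp (e.g. hadoop distcp hftp://old-nn.example.com:50070/data hdfs://new-nn.example.com:8020/data) does the same thing at scale.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CrossVersionCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Source: the old cluster, read over its HTTP-based hftp interface
    // (read-only, and stable across versions). Host/port are examples only.
    FileSystem srcFs = FileSystem.get(URI.create("hftp://old-nn.example.com:50070"), conf);

    // Destination: the new cluster, written over the normal hdfs RPC interface.
    FileSystem dstFs = FileSystem.get(URI.create("hdfs://new-nn.example.com:8020"), conf);

    // Copy a single path; do not delete the source.
    FileUtil.copy(srcFs, new Path("/data/input"), dstFs, new Path("/data/input"), false, conf);
  }
}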

Question: is one willing to break applications that operate across clusters (i.e. an application that accesses data across clusters that straddle a major release boundary)? I asked the operations team at Yahoo that runs our Hadoop clusters. We currently do not have any applications that access data across clusters as part of an MR job, the reason being that Hadoop routinely breaks wire compatibility across releases, so such apps would be very unreliable. However, copying data across clusters is crucial and needs to be supported.

Shall we add a stronger requirement for 1.0: wire compatibility across major versions? This can be supported by class loading or other games; note that we can wait to provide it until 2.0 happens. If Hadoop provided this guarantee, it would allow customers to partition their data across clusters without risking apps breaking across major releases due to wire incompatibility issues.
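To make the "class loading" option a little more concrete, here is one hypothetical sketch: ship a client jar per supported major version and load the one matching the remote cluster in an isolated classloader, invoking it reflectively so the two versions never collide. The jar layout and version-selection logic are pure assumptions for illustration; nothing like this exists in Hadoop today, and a real client would also need the jar's dependencies on the isolated classpath.

import java.io.File;
import java.lang.reflect.Method;
import java.net.URI;
import java.net.URL;
import java.net.URLClassLoader;

public class VersionedClientLoader {
  // Hypothetical: load the client jar that matches the remote cluster's major
  // version in its own classloader and obtain a FileSystem through reflection.
  public static Object openFileSystem(int remoteMajorVersion, String uri) throws Exception {
    File clientJar = new File("lib/hadoop-client-" + remoteMajorVersion + ".jar"); // assumed layout
    ClassLoader isolated = new URLClassLoader(new URL[] { clientJar.toURI().toURL() },
                                              null /* no parent: keep client versions separate */);

    Class<?> fsClass = Class.forName("org.apache.hadoop.fs.FileSystem", true, isolated);
    Class<?> confClass = Class.forName("org.apache.hadoop.conf.Configuration", true, isolated);

    Method get = fsClass.getMethod("get", URI.class, confClass);
    return get.invoke(null, URI.create(uri), confClass.newInstance());
  }
}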

