Sanjay Radia wrote:
>>          o  Can't remove deprecated classes or methods until 2.0

Dhruba Borthakur wrote:
1. APIs that are deprecated in x.y release can be removed in (x+1).0 release.

The current rule is that APIs deprecated in M.x.y can be removed in M.(x+2).0.
I don't think we want to either relax or tighten this requirement.
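
As a small illustration of that rule (the class, method names, and release
labels below are made up, not real Hadoop APIs):

    /** Hypothetical example only. */
    public class ExampleFs {
      /**
       * @deprecated Deprecated in M.x.y; under the current rule it may be
       *             removed no earlier than M.(x+2).0. Use {@link #create()}.
       */
      @Deprecated
      public void createFile() {
        create();   // old entry point keeps delegating to the new method
      }

      public void create() {
        // new implementation
      }
    }

Stating the earliest removal release in the javadoc makes it easy for users
to see how long a deprecated method will stay around.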

2.  Old 1.x clients can connect to new 1.y servers, where x <= y but
the old clients might get reduced functionality or performance. 1.x
clients might not be able to connect to 2.z servers.

3. The HDFS disk format can change from a 1.x to a 1.y release and is
transparent to user applications. A cluster rolling back from 1.y to
1.x will revert to the old disk format.

 * In a major release transition [i.e. from a release x.y to a release
(x+1).0], a user should be able to read data from the cluster running the
old version.

I think this is a good requirement to have. This will be very useful
when we run multiple clusters, especially across data centers
(HADOOP-4058 is a use-case).

I don't see anything about the compatibility model going from 1.*.* to 2.0.0.
Does that mean we do not provide compatibility between those?
Does that mean compatibility between 1.*.* and 2.*.* is provided by distcp?
Or another way to ask the same question: will HDFS-1 and HDFS-2 be
as different as ext2 and ext3?
I am not saying this is bad; I just want it to be clarified.

Maybe we should structure this discussion into sections, e.g.:
- deprecation rules;
- client/server communication compatibility;
- inter version data format compatibility;
   = meta-data compatibility
   = block data compatibility

--Konstantin

--------
What does Hadoop 1.0 mean?
   * Standard release numbering: Only bug fixes in 1.x.y releases and new
features in 1.x.0 releases.
   * No need for client recompilation when upgrading from 1.x to 1.y, where
x <= y
         o  Can't remove deprecated classes or methods until 2.0
   * Old 1.x clients can connect to new 1.y servers, where x <= y
   * New FileSystem clients must be able to call old methods when talking to
old servers. This generally will be done by having old methods continue to
use old rpc methods. However, it is legal to have new implementations of old
methods call new rpc methods, as long as the library transparently handles
the fallback case for old servers.
-----------------
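
To make that last point concrete, here is a rough sketch of the fallback
pattern (the rpc method names and the stub interface are made up for
illustration; this is not the actual DFSClient code):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;

    // Hypothetical wrapper, only to illustrate the idea.
    class FallbackFileSystemClient {
      // Made-up stub standing in for the versioned namenode rpc protocol.
      interface NamenodeStub {
        FileStatus[] newListStatusRpc(String src) throws IOException;
        FileStatus[] oldListStatusRpc(String src) throws IOException;
      }

      private final NamenodeStub namenode;

      FallbackFileSystemClient(NamenodeStub namenode) {
        this.namenode = namenode;
      }

      // The old client-visible method keeps working against both old and
      // new servers: try the new rpc first, fall back if the server is
      // too old to know it.
      public FileStatus[] listStatus(Path p) throws IOException {
        try {
          return namenode.newListStatusRpc(p.toString());
        } catch (IOException serverTooOld) {
          return namenode.oldListStatusRpc(p.toString());
        }
      }
    }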

A couple of additional compatibility requirements:

* HDFS metadata and data are preserved across release changes, both major
and minor. That is, whenever a release is upgraded, the HDFS metadata from
the old release will be converted automatically as needed.

The above has been followed so far in Hadoop; I am just documenting it in
the 1.0 requirements list.

 * In a major release transition [i.e. from a release x.y to a release
(x+1).0], a user should be able to read data from the cluster running the
old version. (Or shall we generalize this to: from x.y to (x+i).z?)

The motivation: copying data across clusters is a common operation for many
customers (for example, this is routinely done at Yahoo). Today, http (or
hftp) provides a guaranteed-compatible way of copying data across versions.
Clearly one cannot force a customer to simultaneously update all of its
Hadoop clusters to a new major release. The above documents this
requirement; we can satisfy it via the http/hftp mechanism or some other
mechanism.
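
For instance, a client built against a newer release can read from an older
cluster through hftp roughly as follows (the host name, port, and path are
placeholders):

    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CrossVersionRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hftp is read-only and goes over HTTP to the namenode's web port,
        // so it does not depend on the old cluster's rpc wire format.
        FileSystem oldCluster =
            FileSystem.get(URI.create("hftp://old-nn.example.com:50070/"), conf);
        InputStream in = oldCluster.open(new Path("/data/part-00000"));
        // ... stream the bytes into the new cluster; in practice distcp
        // with an hftp:// source URI does this copy for you.
        in.close();
      }
    }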

Question: is one willing to break applications that operate across
clusters (i.e. an application that accesses data across clusters that span
a major release boundary)? I asked the operations team at Yahoo that runs
our Hadoop clusters. We currently do not have any applications that access
data across clusters as part of an MR job, the reason being that Hadoop
routinely breaks wire compatibility across releases, so such apps would be
very unreliable. However, copying data across clusters is crucial and needs
to be supported.

Shall we add a stronger requirement for 1.0: wire compatibility across
major versions? This can be supported by class loading or other games. Note
that we can wait to provide this until 2.0 happens. If Hadoop provided this
guarantee then it would allow customers to partition their data across
clusters without risking apps breaking across major releases due to wire
incompatibility issues.
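
To give a flavor of the class-loading idea (a purely hypothetical sketch;
nothing like this exists in Hadoop today), an application could isolate
each release's client library in its own classloader:

    import java.net.URL;
    import java.net.URLClassLoader;

    // Hypothetical sketch only: isolate one release's client classes so
    // that clients for two incompatible releases can coexist in one JVM.
    public class VersionedClientLoader {
      public static Class<?> loadFileSystemClass(String clientJarPath)
          throws Exception {
        URLClassLoader loader = new URLClassLoader(
            new URL[] { new URL("file:" + clientJarPath) },
            null /* no parent, so hadoop classes come only from this jar */);
        // The application would then drive this class reflectively, or
        // through a small version-neutral facade that is not compiled
        // against any one release.
        return loader.loadClass("org.apache.hadoop.fs.FileSystem");
      }
    }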



