The new API in 0.20.x is likely not what you'll see in the final Hadoop 1.0 release, which I've heard some people forecast within the next 18 months or so (we'll see). There will likely be a 0.21.x series, and then the final release.

That having been said, it's much closer to what you'll see in the final release than the old API is. Depending on how complex your jobs are, you may see minor or no changes in the final release, or you may see dramatic ones. I think (someone correct me if I'm wrong) the basic map and reduce abstract classes are just about set in stone, but if you're using other stuff like file formats, custom splits, etc. then you may see a lot of differences. I've also noticed a lot of changes in how the job and task trackers work, even in the current trunk. There's also some interesting work being done by Yahoo on pipelining MR jobs, which will not be in any 0.20.x release.

The other thing about 0.20.x is that a lot of the old API (like joins, etc.) has not been ported to the new one, so your application may end up as a patchwork of the two APIs.

Are there any portions of the new API which are particularly attractive to you? That might help people suggest whether or not you should switch to satisfy that need. If you don't have any needs particular to the 0.20.x API then there's probably little reason to switch.

If you do upgrade to 0.20.1, make sure to get the Cloudera or Yahoo distribution. The current "stable" (0.20.1) release on the Apache page is very buggy.

On 11/10/09 3:30 PM, Mark Kerzner wrote:
Hi,

I've been working on my project for about a year, and I decided to upgrade
from 0.18.3 (which was stable and already old even back then). I have
started, but I see that many classes have changed, many are deprecated, and
I need to re-write some code. Is it worth it? What are the advantages of
doing this? Other areas of concern are:

    - Will Amazon EMR work with the latest Hadoop?
    - What about Cloudera distribution or Yahoo distribution?

Thank you,
Mark

