Re: Large feature development

Todd Lipcon Mon, 03 Sep 2012 00:32:35 -0700

On Mon, Sep 3, 2012 at 12:05 AM, Arun C Murthy <a...@hortonworks.com> wrote:
>>
>> But, I'll stand by my point that YARN is at this point more "alpha"
>> than HDFS2.
>
> It's unfair to tag-team me while consistently ignoring what I write.


I'm not sure I ignored what you wrote. I understand that Yahoo is
deploying soon on one of their clusters. That's great news. My
original point was about the state of YARN when it was merged, and the
comment about its current state was more of an aside. Hardly worth
debating further. Best of luck with the deployment next week - I look
forward to reading about how it goes on the list.

>> You brought up two bugs in the HDFS2 code base as examples
>> of HDFS 2 not being high quality.
>
> Through a lot of words you just agreed with what I said - if people didn't 
> upgrade to HDFS2 (not just HA) they wouldn't hit any of these: HDFS-3626,

You could hit this on Hadoop 1, it was just harder to hit.

> HDFS-3731 etc.

The details of this bug have to do with the upgrade/snapshot behavior
of the blocksBeingWritten directory which was added in branch-1. In
fact, the same basic bug continues to exist in branch-1. If you
perform an upgrade, it doesn't hard-link the blocks into the new
"current" directory. Hence, if the upgraded cluster exits safe mode
(causing lease recovery of those blocks), and then the user issues a
rollback, the blocks will have been deleted from the pre-upgrade
image. This broken branch-1 behavior carried over into branch-2 as
well, but it's not a new bug, as I said before.

> There are more, for e.g. how do folks work around Secondary NN not starting 
> up on upgrades from hadoop-1 (HDFS-3597)? They just copy multiple PBs over to 
> a new hadoop-2 cluster, or patch SNN themselves post HDFS-1073?

No, they rm -Rf the contents of the 2NN directory, which is completely
safe and doesn't cause data loss in any way. In fact, the bug fix is exactly
that -- it just does the rm -Rf itself, automatically. It's a trivial
workaround similar to how other bugs in the Hadoop 1 branch have
required workarounds in the past. Certainly no data movement or local
patching. The SNN is transient state and can always be cleared.

If you have any questions about other bugs in the 2.x line, feel free
to ask on the relevant JIRAs. I'm still perfectly confident in the
stability of HDFS 2 vs HDFS 1. In fact my cell phone is likely the one
that would ring if any of these production HDFS 2 clusters had an
issue, and I'll offer the same publicly to anyone on this list. If you
experience a corruption or data loss issue on the tip of branch-2
HDFS, email me off-list and I'll personally diagnose the issue. I
would not make that same offer for branch-1 due to the fundamentally
less robust design which has caused a lot of subtle bugs over the past
several years.

Thanks
-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera
