On 9/24/09 11:29 PM, "Paul Smith" <psm...@aconex.com> wrote:
> I guess a secondary question goes to you as to how one could possibly
> manage large clusters with differing space requirements?  What does
> this feature mean to you with your background?  Here's little ole me
> thinking "this is probably such a dumb question to ask all these smart
> people who work at large sites".

A lot of this is in HDFS-284.

My reasoning was twofold:

A) The space calculations for HDFS get way easier (and much more consistent
from a capacity planning perspective).  Even with the changes in 0.20, it
still doesn't feel as though the name node tells the truth about how much
-real- space is usable.

B) MR's disk usage is unbounded by Hadoop. Being able to actually tie it to a
size restriction is nearly impossible since we're dealing with user-submitted
code. Ignoring dedicated file systems, the next best thing is a file system
quota based around groups.

The two settings combined would allow for administrators to have much better
knowledge and control over where their space is going.

> I'm still new to Hadoop and learning the terminology, so it's still
> not 100% clear where non-DFS, but -still-Hadoop disk usage comes into
> play (is there a good page in this? I have the Definitive book, but
> that doesn't go into this sort of detail).

I don't think there is.  The sort of implied assumption throughout the docs
(and likely the book... I'll be honest, I've only read the pages that Tom
sent prior to publication and did a quick pass over the finished versions of
those same pages) is that the directory structure of the system is sort of
well understood.  But in general, the big consumers of disk space are:

- hdfs
- logs
- mr spill, task output
- mr support bits (job cache, etc)
- random junk that users create as part of their job
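
To see how the list above breaks down on a given node, something like the
following rough sketch could tally bytes per consumer.  The directory paths
are purely hypothetical placeholders (they are not taken from any real
configuration), so substitute whatever your own data, log, and MR local
directories are.  It only walks the local file system and knows nothing
about Hadoop itself.

```python
import os

def dir_size_bytes(path):
    """Total size of the files under path; unreadable entries are skipped."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; ignore it
    return total

def usage_by_consumer(layout):
    """Map each consumer name to the total bytes under its directories."""
    return {name: sum(dir_size_bytes(p) for p in paths)
            for name, paths in layout.items()}

# Hypothetical layout (assumption): replace with the directories you use.
EXAMPLE_LAYOUT = {
    "hdfs": ["/data/dfs/data"],
    "logs": ["/var/log/hadoop"],
    "mr local (spill, task output, job cache)": ["/data/mapred/local"],
}

if __name__ == "__main__":
    for name, size in usage_by_consumer(EXAMPLE_LAYOUT).items():
        print(f"{name}: {size / 2**30:.2f} GiB")
```

A missing directory simply counts as zero bytes, so the same layout can be
run unchanged across nodes that do not all have every directory.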

> I'm a log4j committer, so
> I'd love to think I can give back to Hadoop, but I still expect this
> is outside my capabilities at this stage.  I'm very familiar with
> Maven, but not Ivy, so the current build system for Hadoop eludes me
> and has been a bit of a barrier for me to even start considering
> direct contributions back (although see a future email shortly).

I'm fairly certain the Hadoop build system is a re-enactment of the first
act of Macbeth in code.  [Although kudos to Lee, et al., for making it better!]

> I guess I'm surprised if people from Facebook and Yahoo (pardon me if
> I drop the !) are needing this why it hasn't bubbled up (or just got
> done by the people with the itch).

There are lots of places where Hadoop operability is basically crap.  A lot
of this has to do with having different/higher priorities than operability. 
