Hi all,

I'm trying to get the LinuxTaskController working (on the svn trunk) on a pseudo-distributed cluster. It's proving quite frustrating.
I compiled the common, hdfs, and mapred jars with 'ant jar' and copied everything together into the same directory structure. I then ran:

  $ cd src/git/mapred/src/c++/task-controller
  $ bash ./configure
  $ make
  $ cp task-controller ~/src/git/hadoop-common/bin
  $ cd ~/src/git/hadoop-common/bin
  $ sudo chown root:root task-controller
  $ sudo chmod 6655 task-controller
  $ ls -l task-controller
  -rwSr-sr-x 1 root root 45659 2009-10-23 00:31 task-controller

(See the aside after the client output below for a note on what those mode bits mean.)

My configuration is pretty minimal; I've not set much in mapred-site.xml besides mapred.job.tracker. I enabled the task controller with:

  <property>
    <name>mapreduce.tasktracker.taskcontroller</name>
    <value>org.apache.hadoop.mapred.LinuxTaskController</value>
  </property>

core-site.xml just sets fs.default.name, and hdfs-site.xml is empty. taskcontroller.cfg looks like:

  mapreduce.cluster.local.dir=/tmp/hadoop-aaron/mapred/local
  hadoop.pid.dir=/tmp
  hadoop.log.dir=/home/aaron/src/git/hadoop-common/logs
  hadoop.indent.str=#configured HADOOP_IDENT_STR

(NB: the typo "hadoop.indent.str" was already in the template file. I can't actually find a reference to either "hadoop.indent.str" or "hadoop.ident.str" in the task-controller C source, so I don't think this matters.)

I can verify that task-controller does work for some operations. For example, I can start another process (vim, say), find its pid, and then run:

  $ `readlink -f task-controller` aaron 6 <pid-of-vim>

and task-controller will kill it. (The `readlink -f ...` is needed because task-controller expects to get its full absolute path as argv[0], or else it segfaults on a malloc() -- but that's another story.)

Here are the permissions on my mapred.local.dir:

  aa...@jargon:/tmp/hadoop-aaron/mapred/local$ ls -l
  total 8
  drwxrwxr-x 2 aaron aaron 4096 2009-10-23 01:03 jobTracker
  drwxr-xr-x 3 aaron aaron 4096 2009-10-23 01:01 taskTracker

I start Hadoop using the standard scripts:

  $ bin/start-dfs.sh
  $ bin/start-mapred.sh

All of this is running as user "aaron", btw.

... so here's the problem -- I can't actually launch tasks! I try running a trivial job, and here's the output to the client:

  09/10/23 12:08:39 INFO mapreduce.JobSubmitter: number of splits:1
  09/10/23 12:08:40 INFO mapreduce.Job: Running job: job_200910231205_0002
  09/10/23 12:08:41 INFO mapreduce.Job: map 0% reduce 0%
  09/10/23 12:08:45 INFO mapreduce.Job: Task Id : attempt_200910231205_0002_m_000002_0, Status : FAILED
  Error initializing attempt_200910231205_0002_m_000002_0:
  java.io.IOException: Not able to initialize job directories in any of the configured local directories for job job_200910231205_0002
          at org.apache.hadoop.mapreduce.server.tasktracker.Localizer.initializeJobDirs(Localizer.java:318)
          at org.apache.hadoop.mapred.TaskTracker.localizeJobFiles(TaskTracker.java:904)
          at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:860)
          at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1849)
          at org.apache.hadoop.mapred.TaskTracker.access$1200(TaskTracker.java:106)
          at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1814)

  <repeated several more times for the subsequent task attempts>
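(Aside on the mode bits: 6655 is setuid+setgid on top of rw-/r-x/r-x, and the capital 'S' in the ls listing just means the setuid bit is set while the owner execute bit is clear. Any user can still execute the binary through the world x bit, and setuid then makes it run as root. A quick way to sanity-check that the bits and ownership survived the copy -- purely illustrative shell, not part of the Hadoop scripts:

  $ stat -c '%a %U:%G %n' task-controller   # GNU stat: octal mode, owner, group, name
  6655 root:root task-controller

Also worth remembering: 'cp' leaves the new copy owned by the copying user, and on Linux 'chown' clears any setuid/setgid bits, so the chown/chmod pair has to be redone after every rebuild.)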
Here are the log messages that appear in tasktracker.log:

  2009-10-23 12:09:02,268 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200910231205_0002_m_000001_3 which needs 1 slots
  2009-10-23 12:09:02,268 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 4 and trying to launch attempt_200910231205_0002_m_000001_3 which needs 1 slots
  2009-10-23 12:09:02,268 INFO org.apache.hadoop.mapreduce.server.tasktracker.Localizer: User-directories for the user aaron are already initialized on this TT. Not doing anything.
  2009-10-23 12:09:02,271 WARN org.apache.hadoop.mapreduce.server.tasktracker.Localizer: Not able to create job directory /tmp/hadoop-aaron/mapred/local/taskTracker/aaron/jobcache/job_200910231205_0002
  2009-10-23 12:09:02,272 WARN org.apache.hadoop.mapred.TaskTracker: Error initializing attempt_200910231205_0002_m_000001_3:
  java.io.IOException: Not able to initialize job directories in any of the configured local directories for job job_200910231205_0002
          at org.apache.hadoop.mapreduce.server.tasktracker.Localizer.initializeJobDirs(Localizer.java:318)
          at org.apache.hadoop.mapred.TaskTracker.localizeJobFiles(TaskTracker.java:904)
          at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:860)
          at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1849)
          at org.apache.hadoop.mapred.TaskTracker.access$1200(TaskTracker.java:106)
          at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1814)

If I look inside the tasktracker local dir, here's what I see:

  aa...@jargon:/tmp/hadoop-aaron/mapred/local$ cd taskTracker/
  aa...@jargon:/tmp/hadoop-aaron/mapred/local/taskTracker$ ls -l
  total 4
  dr-xrws--- 4 aaron root 4096 2009-10-23 12:06 aaron
  aa...@jargon:/tmp/hadoop-aaron/mapred/local/taskTracker$ cd aaron/
  aa...@jargon:/tmp/hadoop-aaron/mapred/local/taskTracker/aaron$ ls -l
  total 8
  dr-xrws--- 2 aaron root 4096 2009-10-23 12:06 distcache
  dr-xrws--- 2 aaron root 4096 2009-10-23 12:06 jobcache

Both of those dirs are empty. The /tmp/hadoop-aaron/mapred/local/taskTracker dir was created by the TT -- I had rm -rf'd it before starting Hadoop, so all the permissions under there were set by the TT itself.

... so this means the owning user (aaron) can't write into his own directories: as the owner, only the r-x owner bits apply to me, and the group write bit never comes into play (I'm not a member of group 'root' anyway, and the owner bits would win even if I were). A minimal reproduction of this is in the P.S. below.

I tried setting u+w on distcache and jobcache. Now it can create a job dir, but the job dir it creates has these permissions:

  aa...@jargon:/tmp/hadoop-aaron/mapred/local/taskTracker/aaron/jobcache$ ls -l
  total 4
  dr-xrws--- 4 aaron root 4096 2009-10-23 12:13 job_200910231205_0003

... so it fails to make any attempt dirs.

Is there something I should be doing differently with Linux users/groups? Or is this a bug in task-controller?

Thanks,
- Aaron
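P.S. For anyone who wants to see the failure mode in isolation, here's a minimal sketch of how a dr-xrws--- dir locks out its own owner (the scratch path is made up; run it as aaron):

  $ mkdir /tmp/permtest
  $ sudo chown aaron:root /tmp/permtest
  $ sudo chmod 2570 /tmp/permtest   # 2570 = dr-xrws---, the mode the TT sets on jobcache
  $ touch /tmp/permtest/foo         # fails with "Permission denied"

Since aaron owns the directory, the kernel consults only the owner bits (r-x) and never reaches the group class, so group root's rwx is irrelevant.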