Re: Accessing stderr with Hadoop Streaming
S D wrote: Is there a way to access stderr when using Hadoop Streaming? I see how stdout is written to the log files, but I'm more concerned about what happens when errors occur. Access to stderr would help debug when a run doesn't complete successfully, but I haven't been able to figure out how to retrieve what's written to stderr. Presumably another approach would be to redirect stderr to stdout, but I wanted to exhaust other approaches before trying that. Thanks, SD

I normally see what's been written to stderr through the web interface. The logs are in the 'userlogs' directory under /opt/hadoop/logs. M
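For illustration (not from the thread, and the attempt id below is made up): on the layout M describes, each task attempt gets its own directory under userlogs with separate stdout, stderr and syslog files, so a failed attempt's stderr can also be read straight off the tasktracker node, along the lines of:

    /opt/hadoop/logs/userlogs/attempt_200906060001_0001_m_000003_0/stderr

Substitute the attempt id shown for the failed task in the web interface.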
Re: Every time the mapping phase finishes I see this
I should mention: these are Hadoop Streaming jobs, Hadoop version hadoop-0.18.3. Any idea about the empty stdout/stderr/syslog logs? I have no way to really track down what's causing them. thanks

Steve Loughran wrote:

Mayuran Yogarajah wrote: There are always a few 'Failed/Killed Task Attempts' and when I view the logs for these I see:
- some that are empty, i.e. the stdout/stderr/syslog logs are all blank
- several that say:

2009-06-06 20:47:15,309 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: Filesystem closed
    at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:195)
    at org.apache.hadoop.dfs.DFSClient.access$600(DFSClient.java:59)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.close(DFSClient.java:1359)
    at java.io.FilterInputStream.close(FilterInputStream.java:159)
    at org.apache.hadoop.mapred.LineRecordReader$LineReader.close(LineRecordReader.java:103)
    at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:301)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:173)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:231)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)

Any idea why this happens? I don't understand why I'd be seeing these only as the mappers get to 100%.

Seen this when something in the same process got a FileSystem reference by FileSystem.get() and then called close() on it - it closes the client for every thread/class that has a reference to the same object. We're planning on adding more diagnostics, by tracking who closed the filesystem: https://issues.apache.org/jira/browse/HADOOP-5933
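To make the failure mode Steve describes concrete, here is a minimal sketch (not code from the thread; the class name and path are made up, and the "Filesystem closed" behaviour applies when the configuration points at HDFS):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsCloseGotcha {
        public static void run() throws Exception {
            Configuration conf = new Configuration();

            // FileSystem.get() hands back a cached, shared instance: every
            // caller in the JVM with the same URI and conf gets the same object.
            FileSystem mine = FileSystem.get(conf);
            FileSystem framework = FileSystem.get(conf); // same instance as 'mine'

            // ... do some side work with 'mine' ...

            // Closing "my" reference shuts down the shared DFS client.
            mine.close();

            // Any later use through another reference in the same process - for
            // example the LineRecordReader the framework closes when the map
            // finishes - now fails with "java.io.IOException: Filesystem closed".
            framework.exists(new Path("/user/hadoop/input")); // throws IOException on HDFS
        }
    }

The usual fix is simply not to close a FileSystem obtained from FileSystem.get() inside a task; the framework manages its lifetime.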
Every time the mapping phase finishes I see this
There are always a few 'Failed/Killed Task Attempts' and when I view the logs for these I see:
- some that are empty, i.e. the stdout/stderr/syslog logs are all blank
- several that say:

2009-06-06 20:47:15,309 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: Filesystem closed
    at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:195)
    at org.apache.hadoop.dfs.DFSClient.access$600(DFSClient.java:59)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.close(DFSClient.java:1359)
    at java.io.FilterInputStream.close(FilterInputStream.java:159)
    at org.apache.hadoop.mapred.LineRecordReader$LineReader.close(LineRecordReader.java:103)
    at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:301)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:173)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:231)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)

Any idea why this happens? I don't understand why I'd be seeing these only as the mappers get to 100%. thanks
Logging in Hadoop Stream jobs
How do people handle logging in a Hadoop Streaming job? I'm currently looking at using syslog for this, but would like to know how others are doing it. thanks
java.io.IOException: All datanodes are bad. Aborting...
I have 2 directories listed for dfs.data.dir, and one of them got to 100% used during a job I ran. I suspect that's the reason I see this error in the logs. Can someone please confirm this? thanks
Re: Sequence of Streaming Jobs
Billy Pearson wrote: I did this with an array of commands for the jobs in a PHP script, checking the return of each job to tell whether it failed or not. Billy

I have this same issue. How do you check if a job failed or not? You mentioned checking the return code - how are you doing that? thanks

"Dan Milstein" wrote: If I've got a sequence of streaming jobs, each of which depends on the output of the previous one, is there a good way to launch that sequence? Meaning, I want step B to only start once step A has finished. From within Java JobClient code, I can do submitJob/runJob, but is there any sort of clean way to do this for a sequence of streaming jobs? Thanks, -Dan Milstein
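A minimal sketch of the "array of commands, check each return code" approach Billy describes - written in Java here rather than PHP, since the thread also mentions driving jobs from Java; the streaming jar path, HDFS paths and script names are all made up:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.Arrays;
    import java.util.List;

    public class RunStreamingSequence {
        public static void main(String[] args) throws Exception {
            // One entry per streaming job, in the order they must run.
            List<String[]> jobs = Arrays.asList(
                new String[] {"hadoop", "jar", "/opt/hadoop/contrib/streaming/hadoop-streaming.jar",
                              "-input", "/data/raw", "-output", "/data/stepA",
                              "-mapper", "mapA.pl", "-reducer", "reduceA.pl"},
                new String[] {"hadoop", "jar", "/opt/hadoop/contrib/streaming/hadoop-streaming.jar",
                              "-input", "/data/stepA", "-output", "/data/stepB",
                              "-mapper", "mapB.pl", "-reducer", "reduceB.pl"});

            for (String[] cmd : jobs) {
                ProcessBuilder pb = new ProcessBuilder(cmd);
                pb.redirectErrorStream(true);  // merge the command's stderr into stdout
                Process p = pb.start();

                // Echo the job's console output so progress stays visible.
                BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
                String line;
                while ((line = out.readLine()) != null) {
                    System.out.println(line);
                }

                int rc = p.waitFor();
                if (rc != 0) {
                    // The streaming submitter should exit non-zero when the job
                    // fails, so stop here rather than running the next step
                    // against missing or partial output.
                    System.err.println("Step failed (exit " + rc + "): " + Arrays.toString(cmd));
                    System.exit(rc);
                }
            }
            System.out.println("All steps completed.");
        }
    }

The same structure works in any language that can spawn a process and inspect its exit status, which is essentially what the PHP version does.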
Re: Master crashed
Alex Loddengaard wrote: I'm confused. Why are you trying to stop things when you're bringing the name node back up? Try running start-all.sh instead. Alex

Won't that try to start the daemons on the slave nodes again? They're already running. M

On Tue, Apr 28, 2009 at 4:00 PM, Mayuran Yogarajah <mayuran.yogara...@casalemedia.com> wrote: The master in my cluster crashed, the dfs/mapred java processes are still running on the slaves. What should I do next? I brought the master back up and ran stop-mapred.sh and stop-dfs.sh and it said this:

slave1.test.com: no tasktracker to stop
slave1.test.com: no datanode to stop

Not sure what happened here, please advise. thanks, M
Master crashed
The master in my cluster crashed, the dfs/mapred java processes are still running on the slaves. What should I do next? I brought the master back up and ran stop-mapred.sh and stop-dfs.sh and it said this:

slave1.test.com: no tasktracker to stop
slave1.test.com: no datanode to stop

Not sure what happened here, please advise. thanks, M
Checking if a streaming job failed
Hello, does anyone know how I can check if a streaming job (in Perl) has failed or succeeded? The only way I can see at the moment is to check the web interface for that job ID and parse out the 'Status:' value. Is it not possible to do this using 'hadoop job -status'? I see there is a count for failed map/reduce tasks, but map/reduce tasks failing is normal (or so I thought). I am under the impression that if a task fails it will simply be reassigned to a different node. Is this not the case? If this is normal then I can't reliably use this count to check whether the job as a whole failed or succeeded. Any feedback is greatly appreciated. thanks, M
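One programmatic option (a sketch using the old mapred API, not something from the thread; the job id below is just an example): fetch the job by id and ask for its overall state. That state is independent of individual task-attempt failures, since failed attempts are retried and only count against the job if they exhaust their retries.

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class CheckJobStatus {
        public static void main(String[] args) throws Exception {
            // args[0] is the job id printed when the streaming job was
            // submitted, e.g. job_200906060001_0007 (made-up id).
            JobClient client = new JobClient(new JobConf());
            RunningJob job = client.getJob(args[0]);
            if (job == null) {
                System.err.println("No such job: " + args[0]);
                System.exit(2);
            }
            if (!job.isComplete()) {
                System.out.println("Job is still running");
            } else if (job.isSuccessful()) {
                System.out.println("Job succeeded");
            } else {
                System.out.println("Job failed");
                System.exit(1);
            }
        }
    }

For a streaming job launched from a wrapper script, the exit code of the hadoop streaming command itself is usually the simplest signal: it should be non-zero when the job fails.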
Hadoop Upgrade Wiki
Step 8 of the upgrade process mentions copying the 'edits' and 'fsimage' files to a backup directory. After step 19 it says: 'In case of failure the administrator should have the checkpoint files in order to be able to repeat the procedure from the appropriate point or to restart the old version of Hadoop.' Is this different from running 'start-dfs.sh -rollback'? I'm not sure whether the Wiki is outdated or not. If it's the same, then I'm guessing step #8 can be skipped altogether. thanks
Re: HDFS is corrupt, need to salvage the data.
Mayuran Yogarajah wrote:

Raghu Angadi wrote: The block files usually don't disappear easily. Check on the datanode whether you find any files starting with "blk". Also check the datanode log to see what happened there... maybe you started on a different directory or something like that. Raghu.

There are indeed blk files:

find -name 'blk*' | wc -l
158

I didn't see anything out of the ordinary in the datanode log. At this point is there anything I can do to recover the files? Or do I need to reformat the data node and load the data in again? thanks

Sorry to resend this, but I didn't receive a response and wanted to know how to proceed. Is it possible to recover the data at this stage? Or is it gone? thanks
Re: HDFS is corrupt, need to salvage the data.
Raghu Angadi wrote: The block files usually don't disappear easily. Check on the datanode whether you find any files starting with "blk". Also check the datanode log to see what happened there... maybe you started on a different directory or something like that. Raghu.

There are indeed blk files:

find -name 'blk*' | wc -l
158

I didn't see anything out of the ordinary in the datanode log. At this point is there anything I can do to recover the files? Or do I need to reformat the data node and load the data in again? thanks
Re: HDFS is corrupt, need to salvage the data.
lohit wrote: How many datanodes do you have? From the output it looks like at the point when you ran fsck, you had only one datanode connected to your NameNode. Did you have others? Also, I see that your default replication is set to 1. Can you check if your datanodes are up and running? Lohit

There is only one data node at the moment. Does this mean the data is not recoverable? The HD on the machine seems fine, so I'm a little confused as to what caused the HDFS to become corrupted. M
HDFS is corrupt, need to salvage the data.
Hello, it seems the HDFS in my cluster is corrupt. This is the output from hadoop fsck:

 Total size:    9196815693 B
 Total dirs:    17
 Total files:   157
 Total blocks:  157 (avg. block size 58578443 B)
  CORRUPT FILES:        157
  MISSING BLOCKS:       157
  MISSING SIZE:         9196815693 B
 Minimally replicated blocks:   0 (0.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    1
 Average block replication:     0.0
 Missing replicas:              0
 Number of data-nodes:          1
 Number of racks:               1

It seems to say that there is 1 block missing from every file that was in the cluster. I'm not sure how to proceed, so any guidance would be much appreciated. My primary concern is recovering the data. thanks