Re: Accessing stderr with Hadoop Streaming

2009-06-23 Thread Mayuran Yogarajah

S D wrote:

Is there a way to access stderr when using Hadoop Streaming? I see how
stdout is written to the log files but I'm more concerned about what happens
when errors occur. Access to stderr would help debug when a run doesn't
complete successfully but I haven't been able to figure out how to retrieve
what's written to stderr. Presumably another approach would be to redirect
stderr to stdout but I wanted to exhaust other approaches before trying
that.

Thanks,
SD
  

I normally see what's been written to stderr through the web interface.
The logs are under the 'userlogs' directory in /opt/hadoop/logs.


M
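
A minimal sketch of pulling those logs up from the shell, assuming the default
layout under /opt/hadoop/logs (the attempt ID below is made up; substitute one
from the job page):

  # Each task attempt gets its own directory holding stdout, stderr and syslog:
  ls /opt/hadoop/logs/userlogs/

  # Show one attempt's stderr:
  cat /opt/hadoop/logs/userlogs/attempt_200906231234_0001_m_000007_0/stderr

  # Or grep every attempt's stderr for your script's error output:
  grep -i 'error' /opt/hadoop/logs/userlogs/*/stderr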


Re: Every time the mapping phase finishes I see this

2009-06-08 Thread Mayuran Yogarajah
I should mention that these are Hadoop Streaming jobs, running Hadoop 0.18.3.

Any idea about the empty stdout/stderr/syslog logs? I have no way to really
track down what's causing them.


thanks


Steve Loughran wrote:

Mayuran Yogarajah wrote:
  

There are always a few 'Failed/Killed Task Attempts', and when I view the
logs for these I see:

- some that are empty, i.e. the stdout/stderr/syslog logs are all blank
- several that say:

2009-06-06 20:47:15,309 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: Filesystem closed
    at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:195)
    at org.apache.hadoop.dfs.DFSClient.access$600(DFSClient.java:59)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.close(DFSClient.java:1359)
    at java.io.FilterInputStream.close(FilterInputStream.java:159)
    at org.apache.hadoop.mapred.LineRecordReader$LineReader.close(LineRecordReader.java:103)
    at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:301)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:173)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:231)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)



Any idea why this happens? I don't understand why I'd be seeing these only
as the mappers get to 100%.



I've seen this when something in the same process got a FileSystem reference
via FileSystem.get() and then called close() on it; that closes the client for
every thread/class holding a reference to the same cached object.


We're planning to add more diagnostics by tracking who closed the filesystem:
https://issues.apache.org/jira/browse/HADOOP-5933
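
Until that lands, a rough way to see which task attempts hit this, and whether
they line up with the attempts whose logs came out empty, is to grep the logs
on the tasktracker nodes. A sketch, assuming the default log location under
/opt/hadoop/logs as in the stderr thread above:

  # Attempts whose task log mentions the closed filesystem:
  grep -l 'Filesystem closed' /opt/hadoop/logs/userlogs/*/syslog

  # Attempts whose stderr came out empty:
  find /opt/hadoop/logs/userlogs -name stderr -size 0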
  




Every time the mapping phase finishes I see this

2009-06-06 Thread Mayuran Yogarajah
There are always a few 'Failed/Killed Task Attempts', and when I view the
logs for these I see:

- some that are empty, i.e. the stdout/stderr/syslog logs are all blank
- several that say:

2009-06-06 20:47:15,309 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: Filesystem closed
    at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:195)
    at org.apache.hadoop.dfs.DFSClient.access$600(DFSClient.java:59)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.close(DFSClient.java:1359)
    at java.io.FilterInputStream.close(FilterInputStream.java:159)
    at org.apache.hadoop.mapred.LineRecordReader$LineReader.close(LineRecordReader.java:103)
    at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:301)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:173)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:231)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)



Any idea why this happens? I don't understand why I'd be seeing these only
as the mappers get to 100%.

thanks


Logging in Hadoop Stream jobs

2009-05-08 Thread Mayuran Yogarajah

How do people handle logging in a Hadoop Streaming job?

I'm currently looking at using syslog for this, but would like to know what
other approaches people are using.

thanks
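
For what it's worth, the two approaches I know of are sketched below; anything
the streaming script writes to stderr ends up in that task attempt's userlogs
directory, and logger(1) is the stock way to hand a line to the local syslog
daemon (the tag name is just an example):

  # 1. Write to stderr from the mapper/reducer script (any language);
  #    Hadoop captures it per attempt under .../userlogs/<attempt-id>/stderr.
  echo "processed batch 42" >&2

  # 2. Hand the message to syslog via logger(1), so a forwarding syslogd
  #    can collect messages from all nodes in one place.
  echo "processed batch 42" | logger -t my-streaming-job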


java.io.IOException: All datanodes are bad. Aborting...

2009-05-06 Thread Mayuran Yogarajah
I have 2 directories listed for dfs.data.dir, and one of them got to 100%
used during a job I ran.  I suspect that's the reason I see this error in
the logs.


Can someone please confirm this?

thanks
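
I can't confirm the diagnosis from here, but a full dfs.data.dir volume is a
common way to end up with no datanode able to accept a block. A quick check
(the mount points below are placeholders), plus the config knob that keeps
some headroom free:

  # Free space on each dfs.data.dir volume (placeholder paths):
  df -h /data/1/dfs /data/2/dfs

  # The datanode log on that node usually says why writes were rejected:
  tail -100 /opt/hadoop/logs/hadoop-*-datanode-*.log

  # In hadoop-site.xml, dfs.datanode.du.reserved (bytes per volume) tells the
  # datanode to leave that much space untouched.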


Re: Sequence of Streaming Jobs

2009-05-02 Thread Mayuran Yogarajah

Billy Pearson wrote:

I did this with an array of commands for the jobs in a PHP script, checking
the return code of each job to tell whether it failed or not.

Billy

  
I have this same issue. How do you check whether a job failed or not? You
mentioned checking the return code; how are you doing that?

thanks

Dan Milstein dmilst...@hubteam.com wrote in
message news:58d66a11-b59c-49f8-b72f-7507482c3...@hubteam.com...
  

If I've got a sequence of streaming jobs, each of which depends on the
output of the previous one, is there a good way to launch that sequence?
Meaning, I want step B to only start once step A has finished.

From within Java JobClient code, I can do submitJob/runJob, but is there
any sort of clean way to do this for a sequence of streaming jobs?

Thanks,
-Dan Milstein
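
In the same spirit as Billy's PHP loop, a minimal shell sketch: the streaming
driver exits non-zero when its job ends up failed, so plain shell chaining is
enough to make step B wait for (and depend on) step A. The jar path, HDFS
paths and script names below are placeholders:

  #!/bin/sh
  set -e   # stop at the first failing command
  HADOOP=/opt/hadoop/bin/hadoop
  JAR=/opt/hadoop/contrib/streaming/hadoop-0.18.3-streaming.jar

  # Step A
  $HADOOP jar $JAR -input /data/in -output /data/step_a \
          -mapper ./map_a.pl -reducer ./reduce_a.pl

  # Step B only runs if step A's job succeeded
  $HADOOP jar $JAR -input /data/step_a -output /data/step_b \
          -mapper ./map_b.pl -reducer ./reduce_b.pl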





  




Re: Master crashed

2009-04-30 Thread Mayuran Yogarajah

Alex Loddengaard wrote:

I'm confused.  Why are you trying to stop things when you're bringing the
name node back up?  Try running start-all.sh instead.

Alex

  
Won't that try to start the daemons on the slave nodes again? They're 
already running.


M
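
If the worry is touching the slaves at all, another option is to start only
the master-side daemons by hand with hadoop-daemon.sh (a sketch, assuming a
default install under /opt/hadoop). In practice start-all.sh should also be
harmless here, since as far as I recall the per-daemon start script refuses to
start a daemon whose pid file points at a running process.

  cd /opt/hadoop
  bin/hadoop-daemon.sh start namenode      # HDFS master
  bin/hadoop-daemon.sh start jobtracker    # MapReduce master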

On Tue, Apr 28, 2009 at 4:00 PM, Mayuran Yogarajah 
mayuran.yogara...@casalemedia.com wrote:

  

The master in my cluster crashed, but the dfs/mapred Java processes are
still running on the slaves.  What should I do next? I brought the master
back up, ran stop-mapred.sh and stop-dfs.sh, and they said this:

slave1.test.com: no tasktracker to stop
slave1.test.com: no datanode to stop

Not sure what happened here; please advise.

thanks,
M






Master crashed

2009-04-28 Thread Mayuran Yogarajah

The master in my cluster crashed, but the dfs/mapred Java processes are
still running on the slaves.  What should I do next? I brought the master
back up, ran stop-mapred.sh and stop-dfs.sh, and they said this:

slave1.test.com: no tasktracker to stop
slave1.test.com: no datanode to stop

Not sure what happened here; please advise.

thanks,
M


Checking if a streaming job failed

2009-04-02 Thread Mayuran Yogarajah

Hello, does anyone know how I can check whether a streaming job (in Perl)
has failed or succeeded? The only way I can see at the moment is to check
the web interface for that job ID and parse out the 'Status:' value.

Is it not possible to do this using 'hadoop job -status'? I see there is a
count for failed map/reduce tasks, but map/reduce tasks failing is normal
(or so I thought).  I am under the impression that if a task fails it will
simply be reassigned to a different node.  Is this not the case?  If this
is normal then I can't reliably use this count to check whether the job as
a whole failed or succeeded.


Any feedback is greatly appreciated.

thanks,
M
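
For what it's worth, failed task attempts are indeed retried (four times by
default), and only fail the job if the same task exhausts its attempts, so the
per-task failure counts aren't a reliable signal on their own. The simplest
check I know of is the exit status of the streaming command itself, which is
non-zero when the job as a whole fails, much like the chaining sketch in the
'Sequence of Streaming Jobs' thread above. Paths and scripts below are
placeholders:

  /opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-0.18.3-streaming.jar \
      -input /data/in -output /data/out \
      -mapper ./mapper.pl -reducer ./reducer.pl
  if [ $? -ne 0 ]; then
      echo "streaming job failed" >&2
      exit 1
  fi
  echo "streaming job succeeded"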


Hadoop Upgrade Wiki

2009-03-13 Thread Mayuran Yogarajah
Step 8 of the upgrade process mentions copying the 'edits' and 'fsimage'
files to a backup directory.  After step 19 it says:

'In case of failure the administrator should have the checkpoint files
in order to be able to repeat the procedure from the appropriate point
or to restart the old version of Hadoop.'

Is this different from running 'start-dfs.sh -rollback'?
I'm not sure whether the Wiki is outdated.  If it's the same, then I'm
guessing step 8 can be skipped altogether.

thanks
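
My reading, hedged since I haven't tried recovering from a failed upgrade
myself: -rollback restores the previous/ checkpoint that the namenode creates
when started with -upgrade, while the manual copy in step 8 is an out-of-band
backup that still works if that on-disk state itself gets damaged, so the two
aren't interchangeable. Roughly (the dfs.name.dir path below is a placeholder):

  # Step 8: plain copy of the namenode metadata to somewhere safe:
  cp -r /data/dfs/name/current /backup/namenode-pre-upgrade

  # The upgrade/rollback pair works off the previous/ directory instead:
  bin/start-dfs.sh -upgrade     # start the new version, preserving old state
  bin/start-dfs.sh -rollback    # start the old version, restoring that state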


Re: HDFS is corrupt, need to salvage the data.

2009-03-10 Thread Mayuran Yogarajah

lohit wrote:

How many datanodes do you have?
From the output, it looks like at the point when you ran fsck you had only
one datanode connected to your NameNode. Did you have others?
Also, I see that your default replication is set to 1. Can you check whether
your datanodes are up and running?
Lohit


  
There is only one datanode at the moment.  Does this mean the data is not
recoverable? The hard drive on the machine seems fine, so I'm a little
confused as to what caused the HDFS to become corrupted.

M
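
A quick way to confirm lohit's point about how many datanodes the namenode
actually sees (assuming the hadoop command is on your PATH and logs are in
the default place):

  # Registered datanodes, their capacity and last-contact time:
  hadoop dfsadmin -report

  # If the lone datanode isn't listed as live, its own log usually says why:
  tail -200 /opt/hadoop/logs/hadoop-*-datanode-*.log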


HDFS is corrupt, need to salvage the data.

2009-03-09 Thread Mayuran Yogarajah
Hello, it seems the HDFS in my cluster is corrupt.  This is the output 
from hadoop fsck:

 Total size:    9196815693 B
 Total dirs:    17
 Total files:   157
 Total blocks:  157 (avg. block size 58578443 B)

  CORRUPT FILES:        157
  MISSING BLOCKS:       157
  MISSING SIZE:         9196815693 B

 Minimally replicated blocks:   0 (0.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    1
 Average block replication:     0.0
 Missing replicas:              0
 Number of data-nodes:          1
 Number of racks:               1

It seems to say that there is 1 block missing from every file that was in
the cluster.

I'm not sure how to proceed, so any guidance would be much appreciated.  My
primary concern is recovering the data.

thanks
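
Before doing anything destructive, it may help to have fsck list exactly which
blocks are missing and where they were last reported; a sketch (note that the
-move / -delete options only relocate or remove the corrupt files, they don't
bring blocks back):

  # Per-file block detail:
  hadoop fsck / -files -blocks -locations

  # Last resorts once the missing blocks are confirmed gone:
  #   hadoop fsck / -move      # move corrupt files into /lost+found
  #   hadoop fsck / -delete    # delete corrupt files outright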