Rescheduling of already completed map/reduce task

2009-04-27 Thread Sagar Naik

Hi,
The job froze after the filesystem hung on a machine which had 
successfully completed a map task.

Is there a flag to enable rescheduling of such a task?


jstack of the JobTracker:

SocketListener0-2 prio=10 tid=0x08916000 nid=0x4a4f runnable 
[0x4d05c000..0x4d05ce30]

  java.lang.Thread.State: RUNNABLE
   at java.net.SocketInputStream.socketRead0(Native Method)
   at java.net.SocketInputStream.read(SocketInputStream.java:129)
   at org.mortbay.util.LineInput.fill(LineInput.java:469)
   at org.mortbay.util.LineInput.fillLine(LineInput.java:547)
   at org.mortbay.util.LineInput.readLineBuffer(LineInput.java:293)
   at org.mortbay.util.LineInput.readLineBuffer(LineInput.java:277)
   at org.mortbay.http.HttpRequest.readHeader(HttpRequest.java:238)
   at 
org.mortbay.http.HttpConnection.readRequest(HttpConnection.java:861)
   at 
org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:907)

   at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
   at 
org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)

   at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
   at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

  Locked ownable synchronizers:
   - None


SocketListener0-1 prio=10 tid=0x4da8c800 nid=0xeeb runnable 
[0x4d266000..0x4d2670b0]

  java.lang.Thread.State: RUNNABLE
   at java.net.SocketInputStream.socketRead0(Native Method)
   at java.net.SocketInputStream.read(SocketInputStream.java:129)
   at org.mortbay.util.LineInput.fill(LineInput.java:469)
   at org.mortbay.util.LineInput.fillLine(LineInput.java:547)
   at org.mortbay.util.LineInput.readLineBuffer(LineInput.java:293)
   at org.mortbay.util.LineInput.readLineBuffer(LineInput.java:277)
   at org.mortbay.http.HttpRequest.readHeader(HttpRequest.java:238)
   at 
org.mortbay.http.HttpConnection.readRequest(HttpConnection.java:861)
   at 
org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:907)

   at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
   at 
org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)

   at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
   at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

IPC Server listener on 54311 daemon prio=10 tid=0x4df70400 nid=0xe86 
runnable [0x4d9fe000..0x4d9feeb0]

  java.lang.Thread.State: RUNNABLE
   at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
   at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
   at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
   - locked 0x54fb4320 (a sun.nio.ch.Util$1)
   - locked 0x54fb4310 (a java.util.Collections$UnmodifiableSet)
   - locked 0x54fb40b8 (a sun.nio.ch.EPollSelectorImpl)
   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
   at org.apache.hadoop.ipc.Server$Listener.run(Server.java:296)

  Locked ownable synchronizers:
   - None

IPC Server Responder daemon prio=10 tid=0x4da22800 nid=0xe85 runnable 
[0x4db75000..0x4db75e30]

  java.lang.Thread.State: RUNNABLE
   at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
   at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
   at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
   - locked 0x54f0 (a sun.nio.ch.Util$1)
   - locked 0x54fdce10 (a java.util.Collections$UnmodifiableSet)
   - locked 0x54fdcc18 (a sun.nio.ch.EPollSelectorImpl)
   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
   at org.apache.hadoop.ipc.Server$Responder.run(Server.java:455)

  Locked ownable synchronizers:
   - None

RMI TCP Accept-0 daemon prio=10 tid=0x4da13400 nid=0xe31 runnable 
[0x4de55000..0x4de56130]

  java.lang.Thread.State: RUNNABLE
   at java.net.PlainSocketImpl.socketAccept(Native Method)
   at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:384)
   - locked 0x54f6dae0 (a java.net.SocksSocketImpl)
   at java.net.ServerSocket.implAccept(ServerSocket.java:453)
   at java.net.ServerSocket.accept(ServerSocket.java:421)
   at 
sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34)
   at 
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
   at 
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)

   at java.lang.Thread.run(Thread.java:619)

  Locked ownable synchronizers:
   - None

-Sagar


Multithreaded Reducer

2009-04-10 Thread Sagar Naik

Hi,
I would like to implement a multi-threaded reducer.
As per my understanding, the framework does not have one because we expect the
output to be sorted.

However, in my case I don't need the output sorted.

Can you please point out any other issues, or would it be safe to do so?

-Sagar


Re: Multithreaded Reducer

2009-04-10 Thread Sagar Naik


Two things:
- Multi-threading is preferred over multiple processes. The process I am
planning is IO bound, so I can really take advantage of many threads
(100 threads).
- Correct me if I am wrong: the next MR job in the pipeline will have an
increased number of splits to process, as the number of reducer outputs
(from the previous job) has increased. This leads to an increase
in the map-task completion time.



-Sagar

Aaron Kimball wrote:

Rather than implementing a multi-threaded reducer, why not simply increase
the number of reducer tasks per machine via
mapred.tasktracker.reduce.tasks.maximum, and increase the total number of
reduce tasks per job via mapred.reduce.tasks to ensure that they're all
filled. This will effectively utilize a higher number of cores.

- Aaron
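
A minimal sketch of the configuration Aaron describes, with illustrative values;
the tasktracker-level maximum lives in hadoop-site.xml on each node, while the
per-job reduce count can be set from the job driver (MyJob is a placeholder):

  <!-- conf/hadoop-site.xml on each tasktracker -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>

  // in the job driver (old mapred API)
  JobConf conf = new JobConf(MyJob.class);
  conf.setNumReduceTasks(16);   // sets mapred.reduce.tasks for this job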

On Fri, Apr 10, 2009 at 11:12 AM, Sagar Naik sn...@attributor.com wrote:

  

Hi,
I would like to implement a multi-threaded reducer.
As per my understanding, the framework does not have one because we expect the
output to be sorted.

However, in my case I don't need the output sorted.

Can you please point out any other issues, or would it be safe to do so?

-Sagar




  


Re: connecting two clusters

2009-04-07 Thread Sagar Naik

Hi,
I am not sure if you have looked at this option,
but instead of having two HDFS instances, you can have one HDFS and two map-red
clusters (pointing to the same HDFS) and then do the synchronization between the jobs.


-Sagar
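
A sketch of what that could look like in hadoop-site.xml, with illustrative host
names: both MapReduce clusters share one fs.default.name but point at their own
JobTracker.

  <!-- common to both clusters -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://shared-namenode:9000</value>
  </property>

  <!-- cluster A only -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-a:54311</value>
  </property>

  <!-- cluster B only -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-b:54311</value>
  </property>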

Mithila Nagendra wrote:

Hello Aaron
Yes it makes a lot of sense! Thank you! :)

The incremental wavefront model is another option we are looking at.
Currently we have two map/reduce levels; the upper level has to wait until
the lower map/reduce has produced the entire result set. We want to avoid
this... We were thinking of using two separate clusters so that these levels
can run on them - hoping to achieve better resource utilization. We were
hoping to connect the two clusters in some way so that the processes can
interact - but it seems like Hadoop is limited in that sense. I was
wondering how a common HDFS system can be setup for this purpose.

I tried looking for material for synchronization between two map-reduce
clusters - there is limited/no data available out on the Web! If we stick to
the incremental wavefront model, then we could probably work with one
cluster.

Mithila

On Tue, Apr 7, 2009 at 7:05 PM, Aaron Kimball aa...@cloudera.com wrote:

  

Hi Mithila,

Unfortunately, Hadoop MapReduce jobs determine their inputs as soon as they
begin; the inputs for the job are then fixed. So additional files that
arrive in the input directory after processing has begun, etc, do not
participate in the job.

And HDFS does not currently support appends to files, so existing files
cannot be updated.

A typical way in which this sort of problem is handled is to do processing
in incremental wavefronts; process A generates some data which goes in an
incoming directory for process B; process B starts on a timer every so
often and collects the new input files and works on them. After it's done,
it moves those inputs which it processed into a done directory. In the
mean time, new files may have arrived. After another time interval, another
round of process B starts.  The major limitation of this model is that it
requires that your process work incrementally, or that you are emitting a
small enough volume of data each time in process B that subsequent
iterations can load into memory a summary table of results from previous
iterations. Look into using the DistributedCache to disseminate such files.

Also, why are you using two MapReduce clusters for this, as opposed to one?
Is there a common HDFS cluster behind them?  You'll probably get much
better
performance for the overall process if the output data from one job does
not
need to be transferred to another cluster before it is further processed.

Does this model make sense?
- Aaron

On Tue, Apr 7, 2009 at 1:06 AM, Mithila Nagendra mnage...@asu.edu wrote:



Aaron,
We hope to achieve a level of pipelining between two clusters - similar to
how pipelining is done in executing RDB queries. You can look at it as the
producer-consumer problem, one cluster produces some data and the other
cluster consumes it. The issue that has to be dealt with here is the data
exchange between the clusters - synchronized interaction between the
map-reduce jobs on the two clusters is what I am hoping to achieve.

Mithila

On Tue, Apr 7, 2009 at 10:10 AM, Aaron Kimball aa...@cloudera.com wrote:

Clusters don't really have identities beyond the addresses of the NameNodes
and JobTrackers. In the example below, nn1 and nn2 are the hostnames of the
namenodes of the source and destination clusters. The 8020 in each address
assumes that they're on the default port.

Hadoop provides no inter-task or inter-job synchronization primitives, on
purpose (even within a cluster, the most you get in terms of synchronization
is the ability to join on the status of a running job to determine that
it's completed). The model is designed to be as identity-independent as
possible to make it more resilient to failure. If individual jobs/tasks
could lock common resources, then the intermittent failure of tasks could
easily cause deadlock.

Using a file as a scoreboard or other communication mechanism between
multiple jobs is not something explicitly designed for, and likely to end
in frustration. Can you describe the goal you're trying to accomplish? It's
likely that there's another, more MapReduce-y way of looking at the job and
refactoring the code to make it work more cleanly with the intended
programming model.

- Aaron

On Mon, Apr 6, 2009 at 10:08 PM, Mithila Nagendra mnage...@asu.edu wrote:

Thanks! I was looking at the link sent by Philip. The copy is done with the
following command:
hadoop distcp hdfs://nn1:8020/foo/bar \
   hdfs://nn2:8020/bar/foo

I was wondering if nn1 and nn2 are the names of the clusters or the name of
the 

Re: safemode forever

2009-04-07 Thread Sagar Naik

It means that not all blocks have been reported.
Can you check how many datanodes have reported, either in the web UI or with
bin/hadoop dfsadmin -report?

In case you have to disable safe mode, check the bin/hadoop dfsadmin -safemode
command; it has options to enter/leave/get.
-Sagar
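
For example (a sketch of the commands mentioned above):

  bin/hadoop dfsadmin -report          # shows how many datanodes have reported in
  bin/hadoop dfsadmin -safemode get    # check whether the namenode is in safe mode
  bin/hadoop dfsadmin -safemode leave  # force the namenode out of safe mode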

javateck javateck wrote:

Hi,
  I'm wondering if anyone has solutions about the nonstopped safe mode, any
way to get it around?

  thanks,

error: org.apache.hadoop.dfs.SafeModeException: Cannot delete
/mapred/system. Name node is in safe mode.
The ratio of reported blocks 0.4696 has not reached the threshold 0.9990.
Safe mode will be turned off automatically.

  


Re: hadoop-a small doubt

2009-03-29 Thread Sagar Naik

Yes, you can.
Java client:
copy the conf dir (the same one as on the namenode/datanode), and the Hadoop jars
should be in the classpath of the client.

Non-Java client:
http://wiki.apache.org/hadoop/MountableHDFS



-Sagar
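
A minimal sketch of such a Java client; the namenode host, port and file path
here are illustrative (normally fs.default.name comes from the copied conf dir
on the classpath):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // illustrative; usually picked up from the copied conf dir
      conf.set("fs.default.name", "hdfs://namenode-host:9000");
      FileSystem fs = FileSystem.get(conf);
      FSDataInputStream in = fs.open(new Path("/user/deepya/sample.txt"));
      try {
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) > 0) {
          System.out.write(buf, 0, n);   // copy the HDFS file to stdout
        }
      } finally {
        in.close();
      }
    }
  }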


deepya wrote:

Hi,
   I am SreeDeepya, doing an MTech at IIIT. I am working on a project named
cost-effective and scalable storage server. I configured a small Hadoop cluster
with only two nodes, one namenode and one datanode. I am new to Hadoop.
I have a small doubt.

Can a system not in the Hadoop cluster access the namenode or the
datanode? If yes, then can you please tell me the necessary configuration
that has to be done.

Thanks in advance.

SreeDeepya
  


Re: Design issue for a problem using Map Reduce

2009-02-14 Thread Sagar Naik

Here is one thought:
N maps and 1 reduce.
Input to map: t, w(t)
Output of map: t, w(t)*w(t)
I assume t is an integer. So in the case of 1 reducer, you will receive
t0, square(w(0))
t1, square(w(1))
t2, square(w(2))
t3, square(w(3))
Note this will be a series sorted on t.

In the reduce (pseudocode):

static prevF = 0;

reduce(t, square_w_t)
{
  f = A * square_w_t + B * prevF;
  output.collect(t, f);
  prevF = f;
}

In my view, the B*F(t-1) step is inherently sequential.
So all we can do is parallelize the A*w(t)*w(t) part.

-Sagar
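
A sketch of that single-reducer idea using the old mapred API; the class name,
key/value types and the A/B constants are illustrative, and it assumes the maps
emit (t, w(t)*w(t)) and that the job runs exactly one reducer so keys arrive
sorted by t:

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class RecurrenceReducer extends MapReduceBase
      implements Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {

    private static final double A = 0.5;   // alpha, assumed known
    private static final double B = 0.5;   // beta, assumed known
    private double prevF = 0.0;            // F(t-1), carried across calls; F(0) = 0

    public void reduce(LongWritable t, Iterator<DoubleWritable> values,
                       OutputCollector<LongWritable, DoubleWritable> output,
                       Reporter reporter) throws IOException {
      double squareWt = values.next().get();   // one squared value expected per t
      double f = A * squareWt + B * prevF;
      output.collect(t, new DoubleWritable(f));
      prevF = f;
    }
  }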

some speed wrote:

Hello all,

I am trying to implement a Map Reduce Chain to solve a particular statistic
problem. I have come to a point where I have to solve the following type of
equation in Hadoop:

F(t)= A*w(t)*w(t) + B*F(t-1); Given: F(0)=0, A and B are Alpha and Beta
and their values are known.

Now, W is series of numbers (There could be *a million* or more numbers).

So to Solve the equation in terms of Map Reduce, there are basically 2
issues which I can think of:

1) How will I be able to get the value of F(t-1), since at each step
I need the value from the previous iteration? And that is not possible while
computing in parallel.
2) The w(t) values also have to be read and applied in order, and, again,
that is a problem while computing in parallel.

Can someone please help me go about this problem and overcome these issues?

Thanks,

Sharath

  


Re: Not able to copy a file to HDFS after installing

2009-02-04 Thread Sagar Naik


Where is the namenode running? localhost or some other host?

-Sagar
Rajshekar wrote:
Hello, 
I am new to Hadoop and I just installed it on Ubuntu 8.04 LTS as per the guidance
of a web site. I tested it and found it working fine. I tried to copy a file
but it is giving some error; please help me out.

had...@excel-desktop:/usr/local/hadoop/hadoop-0.17.2.1$  bin/hadoop jar
hadoop-0.17.2.1-examples.jar wordcount /home/hadoop/Download\ URLs.txt
download-output
09/02/02 11:18:59 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 1 time(s).
09/02/02 11:19:00 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 2 time(s).
09/02/02 11:19:01 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 3 time(s).
09/02/02 11:19:02 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 4 time(s).
09/02/02 11:19:04 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 5 time(s).
09/02/02 11:19:05 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 6 time(s).
09/02/02 11:19:06 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 7 time(s).
09/02/02 11:19:07 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 8 time(s).
09/02/02 11:19:08 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 9 time(s).
09/02/02 11:19:09 INFO ipc.Client: Retrying connect to server:
localhost/127.0.0.1:9000. Already tried 10 time(s).
java.lang.RuntimeException: java.net.ConnectException: Connection refused
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:356)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:331)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:304)
at org.apache.hadoop.examples.WordCount.run(WordCount.java:146)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.examples.WordCount.main(WordCount.java:155)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:6
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:53)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:11
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:174)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:623)
at org.apache.hadoop.ipc.Client.call(Client.java:546)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313)
at org.apache.hadoop.dfs.DFSClient.createRPCNamenode(DFSClient.java:102)
at org.apache.hadoop.dfs.DFSClient.init(DFSClient.java:17
at org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:6
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1280)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1291)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:10
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:352)
  


Re: My tasktrackers keep getting lost...

2009-02-02 Thread Sagar Naik

Can you post the output from
hadoop-argus-hostname-jobtracker.out?

-Sagar

jason hadoop wrote:

When I was at Attributor we experienced periodic odd XFS hangs that would
freeze up the Hadoop Server processes resulting in them going away.
Sometimes XFS would deadlock all writes to the log file and the server would
freeze trying to log a message. Can't even JSTACK the jvm.
We never had any traction on resolving the XFS deadlocks and simply rebooted
the machines when the problem occurred.

On Mon, Feb 2, 2009 at 7:09 PM, Ian Soboroff ian.sobor...@nist.gov wrote:

  

I hope someone can help me out.  I'm getting started with Hadoop,
have written the first part of my project (a custom InputFormat), and am
now using that to test out my cluster setup.

I'm running 0.19.0.  I have five dual-core Linux workstations with most
of a 250GB disk available for playing, and am controlling things from my
Mac Pro.  (This is not the production cluster, that hasn't been
assembled yet.  This is just to get the code working and figure out the
bumps.)

My test data is about 18GB of web pages, and the test app at the moment
just counts the number of web pages in each bundle file.  The map jobs
run just fine, but when it gets into the reduce, the TaskTrackers all
get lost to the JobTracker.  I can't see why, because the TaskTrackers
are all still running on the slaves.  Also, the jobdetails URL starts
returning an HTTP 500 error, although other links from that page still
work.

I've tried going onto the slaves and manually restarting the
tasktrackers with hadoop-daemon.sh, and also turning on job restarting
in the site conf and then running stop-mapred/start-mapred.  The
trackers start up and try to clean up and get going again, but they then
just get lost again.

Here's some error output from the master jobtracker:

2009-02-02 13:39:40,904 INFO org.apache.hadoop.mapred.JobTracker: Removed
completed task 'attempt_200902021252_0002_r_05_1' from
'tracker_darling:localhost.localdomain/127.0.0.1:58336'
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker:
attempt_200902021252_0002_m_004592_1 is 796370 ms debug.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching
task attempt_200902021252_0002_m_004592_1 timed out.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker:
attempt_200902021252_0002_m_004582_1 is 794199 ms debug.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching
task attempt_200902021252_0002_m_004582_1 timed out.
2009-02-02 13:41:22,271 INFO org.apache.hadoop.mapred.JobTracker: Ignoring
'duplicate' heartbeat from 'tracker_cheyenne:localhost.localdomain/
127.0.0.1:52769'; resending the previous 'lost' response
2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring
'duplicate' heartbeat from 'tracker_tigris:localhost.localdomain/
127.0.0.1:52808'; resending the previous 'lost' response
2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring
'duplicate' heartbeat from 'tracker_monocacy:localhost.localdomain/
127.0.0.1:54464'; Resending the previous 'lost' response
2009-02-02 13:41:22,298 INFO org.apache.hadoop.mapred.JobTracker: Ignoring
'duplicate' heartbeat from 'tracker_129.6.101.41:127.0.0.1/127.0.0.1:58744';
resending the previous 'lost' response
2009-02-02 13:41:22,421 INFO org.apache.hadoop.mapred.JobTracker: Ignoring
'duplicate' heartbeat from 'tracker_rhone:localhost.localdomain/
127.0.0.1:45749'; resending the previous 'lost' response
2009-02-02 13:41:22,421 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 9
on 54311 caught: java.lang.NullPointerException
   at org.apache.hadoop.mapred.MapTask.write(MapTask.java:123)
   at
org.apache.hadoop.mapred.LaunchTaskAction.write(LaunchTaskAction.java
:48)
   at
org.apache.hadoop.mapred.HeartbeatResponse.write(HeartbeatResponse.ja
va:101)
   at
org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:1
59)
   at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:907)

2009-02-02 13:41:27,275 WARN org.apache.hadoop.mapred.JobTracker: Status
from unknown Tracker : tracker_monocacy:localhost.localdomain/
127.0.0.1:54464

And from a slave:

2009-02-02 13:26:39,440 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 129.6.101.18:50060,
dest: 129.6.101.12:37304, bytes: 6, op: MAPRED_SHUFFLE, cliID:
attempt_200902021252_0002_m_000111_0
2009-02-02 13:41:40,165 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
exception: java.io.IOException: Call to rogue/129.6.101.41:54311 failed on
local exception: null
   at org.apache.hadoop.ipc.Client.call(Client.java:699)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
   at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
   at
org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1164)
   at

Re: Question about HDFS capacity and remaining

2009-02-01 Thread Sagar Naik

Hi Brian,
Is it possible to publish these test results along with configuration 
options ?


-Sagar

Brian Bockelman wrote:
For what it's worth, our organization did extensive tests on many 
filesystems benchmarking their performance when they are 90 - 95% full.


Only XFS retained most of its performance when it was mostly full 
(ext4 was not tested)... so, if you are thinking of pushing things to 
the limits, that might be something worth considering.


Brian

On Jan 30, 2009, at 11:18 AM, stephen mulcahy wrote:



Bryan Duxbury wrote:
Hm, very interesting. Didn't know about that. What's the purpose of 
the reservation? Just to give root preference or leave wiggle room? 
If it's not strictly necessary it seems like it would make sense to 
reduce it to essentially 0%.


AFAIK It is needed for defragmentation / fsck to work properly and 
your filesystem performance will degrade a lot if you reduce this to 
0% (but I'd love to hear otherwise :)


-stephen




Re: sudden instability in 0.18.2

2009-01-28 Thread Sagar Naik

Please check which nodes have these failures.

I guess the new tasktrackers/machines are not configured correctly.
As a result, a map task dies there and the remaining map tasks get
pulled onto these machines.



-Sagar

David J. O'Dell wrote:

We've been running 0.18.2 for over a month on an 8 node cluster.
Last week we added 4 more nodes to the cluster and have experienced 2
failures to the tasktrackers since then.
The namenodes are running fine but all jobs submitted will die when
submitted with this error on the tasktrackers.

2009-01-28 08:07:55,556 INFO org.apache.hadoop.mapred.TaskTracker:
LaunchTaskAction: attempt_200901280756_0012_m_74_2
2009-01-28 08:07:55,682 WARN org.apache.hadoop.mapred.TaskRunner:
attempt_200901280756_0012_m_74_2 Child Error
java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)

I tried running the tasktrackers in debug mode but the entries above are
all that show up in the logs.
As of now my cluster is down.

  


Re: tools for scrubbing HDFS data nodes?

2009-01-28 Thread Sagar Naik

Check out fsck

bin/hadoop fsck <path> -files -blocks -locations

Sriram Rao wrote:

By scrub I mean, have a tool that reads every block on a given data
node.  That way, I'd be able to find corrupted blocks proactively
rather than having an app read the file and find it.

Sriram

On Wed, Jan 28, 2009 at 5:57 PM, Aaron Kimball aa...@cloudera.com wrote:
  

By scrub do you mean delete the blocks from the node?

Read your conf/hadoop-site.xml file to determine where dfs.data.dir points,
then for each directory in that list, just rm the directory. If you want to
ensure that your data is preserved with appropriate replication levels on
the rest of your cluster, you should use Hadoop's DataNode Decommission
feature to up-replicate the data before you blow a copy away.

- Aaron

On Wed, Jan 28, 2009 at 2:10 PM, Sriram Rao srirams...@gmail.com wrote:



Hi,

Is there a tool that one could run on a datanode to scrub all the
blocks on that node?

Sriram

  


Re: tools for scrubbing HDFS data nodes?

2009-01-28 Thread Sagar Naik

In addition to the datanode itself finding corrupted blocks (as Owen mentioned),
if the client finds a corrupted block, it will go to another replica.

What's your replication factor?

-Sagar

Sriram Rao wrote:

Does this read every block of every file from all replicas and verify
that the checksums are good?

Sriram

On Wed, Jan 28, 2009 at 6:20 PM, Sagar Naik sn...@attributor.com wrote:
  

Check out fsck

bin/hadoop fsck <path> -files -blocks -locations

Sriram Rao wrote:


By scrub I mean, have a tool that reads every block on a given data
node.  That way, I'd be able to find corrupted blocks proactively
rather than having an app read the file and find it.

Sriram

On Wed, Jan 28, 2009 at 5:57 PM, Aaron Kimball aa...@cloudera.com wrote:

  

By scrub do you mean delete the blocks from the node?

Read your conf/hadoop-site.xml file to determine where dfs.data.dir
points,
then for each directory in that list, just rm the directory. If you want
to
ensure that your data is preserved with appropriate replication levels on
the rest of your cluster, you should use Hadoop's DataNode Decommission
feature to up-replicate the data before you blow a copy away.

- Aaron

On Wed, Jan 28, 2009 at 2:10 PM, Sriram Rao srirams...@gmail.com wrote:




Hi,

Is there a tool that one could run on a datanode to scrub all the
blocks on that node?

Sriram


  


Re: HDFS - millions of files in one directory?

2009-01-27 Thread Sagar Naik


A system with 1 billion small files:
The namenode will need to maintain the data structure for all those files.
The system will have at least 1 block per file, and if you have the replication
factor set to 3, the system will have 3 billion block replicas.
Now, if you try to read all these files in a job, you will be making
as many as 1 billion socket connections to get these blocks. (Big
Brothers, correct me if I am wrong.)

Datanodes routinely check for available disk space and collect block
reports. These operations are directly dependent on the number of blocks on
a datanode.

Getting all the data into one file avoids all this unnecessary IO and the
memory occupied on the namenode.

The number of maps in a map-reduce job is based on the number of blocks. In the
case of many small files, we will have a large number of map tasks.


-Sagar
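
One common way to combine many small binary files into one large file without
inventing a delimiter is a SequenceFile keyed by file name. A sketch, with
illustrative local and HDFS paths:

  import java.io.DataInputStream;
  import java.io.File;
  import java.io.FileInputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path out = new Path("/user/mark/packed.seq");
      SequenceFile.Writer writer =
          SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
      try {
        for (File f : new File("/local/input/dir").listFiles()) {
          byte[] data = new byte[(int) f.length()];
          DataInputStream in = new DataInputStream(new FileInputStream(f));
          try {
            in.readFully(data);              // read the whole small file
          } finally {
            in.close();
          }
          writer.append(new Text(f.getName()), new BytesWritable(data));
        }
      } finally {
        writer.close();
      }
    }
  }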


Mark Kerzner wrote:

Carfield,

you might be right, and I may be able to combine them in one large file.
What would one use for a delimiter, so that it would never be encountered in
normal binary files? Performance does matter (rarely it doesn't). What are
the differences in performance between using multiple files and one large
file? I would guess that one file should in fact give better hardware/OS
performance, because it is more predictable and allows buffering.

thank you,
Mark

On Sun, Jan 25, 2009 at 9:50 PM, Carfield Yim carfi...@carfield.com.hk wrote:

  

Really? I thought any files can be combined as long as you can figure
out a delimiter - can you really not have some delimiter?
Like X? And in the worst case, or if performance is not
really a matter, maybe just encode all binary to and from ASCII?

On Mon, Jan 26, 2009 at 5:49 AM, Mark Kerzner markkerz...@gmail.com wrote:

Yes, flip suggested such a solution, but his files are text, so he could
combine them all in a large text file, with each line representing initial
files. My files, however, are binary, so I do not see how I could combine
them.

However, since my numbers are limited by about 1 billion files total, I
should be OK to put them all in a few directories with under, say, 10,000
files each. Maybe a little balanced tree, but 3-4 levels should
suffice.

Thank you,
Mark

On Sun, Jan 25, 2009 at 11:43 AM, Carfield Yim carfi...@carfield.com.hk wrote:

Possibly simply having a file large in size instead of having a lot of
small files?

On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner markkerz...@gmail.com wrote:

Hi,

there is a performance penalty in Windows (pardon the expression) if you put
too many files in the same directory. The OS becomes very slow, stops seeing
them, and lies about their status to my Java requests. I do not know if this
is also a problem in Linux, but in HDFS - do I need to balance a directory
tree if I want to store millions of files, or can I put them all in the same
directory?

Thank you,
Mark


Mapred job parallelism

2009-01-26 Thread Sagar Naik

Hi Guys,

I was trying to set up a cluster so that two jobs can run simultaneously.

The conf:
number of nodes: 4 (say)
mapred.tasktracker.map.tasks.maximum=2

and in the JobClient:
mapred.map.tasks=4 (# of nodes)

I also have a condition that each job should have only one map task per
node.

In short, I created 8 map slots and set the number of mappers per job to 4,
so now we have two jobs running simultaneously.

However, I realized that if a tasktracker happens to die, I will potentially
have 2 map tasks of one job running on a node.

Setting mapred.tasktracker.map.tasks.maximum=1 from the JobClient has no
effect; it is a tasktracker property and can't be changed per job.

Any ideas on how to have 2 jobs running simultaneously?


-Sagar
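
For reference, a sketch of the setup described above (values illustrative); the
slot count has to live in hadoop-site.xml on the tasktrackers, which is exactly
why the per-job, per-node limit cannot be enforced from the client:

  <!-- conf/hadoop-site.xml on each of the 4 tasktrackers: 2 map slots per node -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>

  // job driver: ask for 4 map tasks (a hint; the actual count depends on the splits)
  JobConf conf = new JobConf(MyJob.class);     // MyJob is a placeholder
  conf.setNumMapTasks(4);                      // mapred.map.tasks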








Re: Calling a mapreduce job from inside another

2009-01-19 Thread Sagar Naik
You can also play with the priority of the jobs to have the innermost 
job finish first


-Sagar

Devaraj Das wrote:

You can chain job submissions at the client. Also, you can run more than one
job in parallel (if you have enough task slots). An example of chaining jobs
is there in src/examples/org/apache/hadoop/examples/Grep.java where the jobs
grep-search and grep-sort are chained..


On 1/18/09 9:58 AM, Aditya Desai aditya3...@gmail.com wrote:

  

Is it possible to call a MapReduce job from inside another? If yes, how?
And is it possible to disable the reducer completely, that is, end the job
immediately after the map calls have finished?
I have tried -reducer NONE. I am using the streaming API to code in Python.

Regards,
Aditya Desai.




  


Locks in hadoop

2009-01-15 Thread Sagar Naik

I would like to implement a locking mechanism across the HDFS cluster.
I assume there is no inherent support for it.

I was going to do it with files. To my knowledge, file creation is an atomic
operation, so a file-based lock should work.
I need to think through all the conditions, but if someone has a better
idea/solution, please share.


Thanks
-Sagar
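
A minimal sketch of the file-based lock idea, relying on atomic file creation in
HDFS; the lock path and the absence of retry/lease handling are illustrative only:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsFileLock {
    private final FileSystem fs;
    private final Path lockPath;

    public HdfsFileLock(Configuration conf, Path lockPath) throws IOException {
      this.fs = FileSystem.get(conf);
      this.lockPath = lockPath;
    }

    // true if we created the lock file, i.e. acquired the lock
    public boolean tryLock() throws IOException {
      return fs.createNewFile(lockPath);
    }

    public void unlock() throws IOException {
      fs.delete(lockPath, false);   // non-recursive delete of the lock file
    }
  }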



Namenode freeze

2009-01-14 Thread Sagar Naik

Hi
A datanode goes down, and then it looks like the ReplicationMonitor tries to
even out the replication.

However, while doing so,
it holds the lock on FSNamesystem.
With this lock held, other threads wait on this lock to respond.
As a result, the namenode does not list dirs and the web UI does not respond.

I would appreciate any pointers on this problem.

(Hadoop 0.18.1)

-Sagar


Namenode freeze stackdump :


2009-01-14 00:57:02
Full thread dump Java HotSpot(TM) 64-Bit Server VM (10.0-b23 mixed mode):

SocketListener0-4 prio=10 tid=0x2aac54008000 nid=0x644d in 
Object.wait() [0x4535a000..0x4535aa80]

  java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on 0x2aab6cb1dba0 (a 
org.mortbay.util.ThreadPool$PoolThread)

   at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522)
   - locked 0x2aab6cb1dba0 (a org.mortbay.util.ThreadPool$PoolThread)

  Locked ownable synchronizers:
   - None

SocketListener0-5 prio=10 tid=0x2aac54008c00 nid=0x63f1 in 
Object.wait() [0x4545b000..0x4545bb00]

  java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on 0x2aab6c2ea1a8 (a 
org.mortbay.util.ThreadPool$PoolThread)

   at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522)
   - locked 0x2aab6c2ea1a8 (a org.mortbay.util.ThreadPool$PoolThread)

  Locked ownable synchronizers:
   - None

Trash Emptier daemon prio=10 tid=0x511ca400 nid=0x1fd waiting 
on condition [0x45259000..0x45259a00]

  java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at org.apache.hadoop.fs.Trash$Emptier.run(Trash.java:219)
   at java.lang.Thread.run(Thread.java:619)

  Locked ownable synchronizers:
   - None

org.apache.hadoop.dfs.dfsclient$leasechec...@767a9224 daemon prio=10 
tid=0x51384400 nid=0x1fc 
sleeping[0x45158000..0x45158a80]

  java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:792)
   at java.lang.Thread.run(Thread.java:619)

  Locked ownable synchronizers:
   - None

IPC Server handler 44 on 54310 daemon prio=10 tid=0x2aac40183c00 
nid=0x1f4 waiting for monitor entry [0x44f56000..0x44f56d80]

  java.lang.Thread.State: BLOCKED (on object monitor)
   at 
org.apache.hadoop.dfs.FSNamesystem.blockReportProcessed(FSNamesystem.java:1880)
   - waiting to lock 0x2aaab423a530 (a 
org.apache.hadoop.dfs.FSNamesystem)
   at 
org.apache.hadoop.dfs.FSNamesystem.handleHeartbeat(FSNamesystem.java:2127)

   at org.apache.hadoop.dfs.NameNode.sendHeartbeat(NameNode.java:602)
   at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

  Locked ownable synchronizers:
   - None

IPC Server handler 43 on 54310 daemon prio=10 tid=0x2aac40182400 
nid=0x1f3 waiting for monitor entry [0x44e55000..0x44e55a00]

  java.lang.Thread.State: BLOCKED (on object monitor)
   at 
org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:922)
   - waiting to lock 0x2aaab423a530 (a 
org.apache.hadoop.dfs.FSNamesystem)

   at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:903)
   at org.apache.hadoop.dfs.NameNode.create(NameNode.java:284)
   at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

  Locked ownable synchronizers:
   - None

IPC Server handler 42 on 54310 daemon prio=10 tid=0x2aac40181000 
nid=0x1f2 waiting for monitor entry [0x44d54000..0x44d54a80]

  java.lang.Thread.State: BLOCKED (on object monitor)
   at 
org.apache.hadoop.dfs.FSNamesystem.blockReportProcessed(FSNamesystem.java:1880)
   - waiting to lock 0x2aaab423a530 (a 
org.apache.hadoop.dfs.FSNamesystem)
   at 
org.apache.hadoop.dfs.FSNamesystem.handleHeartbeat(FSNamesystem.java:2127)

   at org.apache.hadoop.dfs.NameNode.sendHeartbeat(NameNode.java:602)
   at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

  Locked ownable synchronizers:
   - None

IPC Server handler 41 on 54310 daemon 

Re: 0.18.1 datanode psuedo deadlock problem

2009-01-10 Thread Sagar Naik

Hi Raghu,


The periodic 'du' and block-report threads thrash the disk (a block report
takes about 21 minutes on average), and I think all the datanode threads are
not able to do much and freeze.

org.apache.hadoop.dfs.datanode$dataxcei...@f2127a daemon prio=10 
tid=0x41f06000 nid=0x7c7c waiting for monitor entry [0x43918000..0x43918f50]

  java.lang.Thread.State: BLOCKED (on object monitor)
   at org.apache.hadoop.dfs.FSDataset.getFile(FSDataset.java:1158)
   - waiting to lock 0x54e550e0 (a org.apache.hadoop.dfs.FSDataset)
   at 
org.apache.hadoop.dfs.FSDataset.validateBlockFile(FSDataset.java:1074)

   at org.apache.hadoop.dfs.FSDataset.isValidBlock(FSDataset.java:1066)
   at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:894)
   at 
org.apache.hadoop.dfs.DataNode$BlockReceiver.init(DataNode.java:2322)
   at 
org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1187)

   at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
   at java.lang.Thread.run(Thread.java:619)

  Locked ownable synchronizers:
   - None

org.apache.hadoop.dfs.datanode$dataxcei...@1bcee17 daemon prio=10 
tid=0x4da8d000 nid=0x7ae4 waiting for monitor entry [0x459fe000..0x459ff0d0]

  java.lang.Thread.State: BLOCKED (on object monitor)
   at 
org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getNextVolume(FSDataset.java:473)
   - waiting to lock 0x551e8d48 (a 
org.apache.hadoop.dfs.FSDataset$FSVolumeSet)

   at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:934)
   - locked 0x54e550e0 (a org.apache.hadoop.dfs.FSDataset)
   at 
org.apache.hadoop.dfs.DataNode$BlockReceiver.init(DataNode.java:2322)
   at 
org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1187)

   at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1045)
   at java.lang.Thread.run(Thread.java:619)

  Locked ownable synchronizers:
   - None

DataNode: [/data/dfs-video-18/dfs/data] daemon prio=10 tid=0x4d7ad400 
nid=0x7c40 runnable [0x4c698000..0x4c6990d0]

  java.lang.Thread.State: RUNNABLE
   at java.lang.String.lastIndexOf(String.java:1628)
   at java.io.File.getName(File.java:399)
   at 
org.apache.hadoop.dfs.FSDataset$FSDir.getGenerationStampFromFile(FSDataset.java:148)
   at 
org.apache.hadoop.dfs.FSDataset$FSDir.getBlockInfo(FSDataset.java:181)
   at 
org.apache.hadoop.dfs.FSDataset$FSVolume.getBlockInfo(FSDataset.java:412)
   at 
org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getBlockInfo(FSDataset.java:511)

   - locked 0x551e8d48 (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
   at org.apache.hadoop.dfs.FSDataset.getBlockReport(FSDataset.java:1053)
   at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:708)
   at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2890)
   at java.lang.Thread.run(Thread.java:619)


and the lock 0x54e550e0 is held by another similar thread, and that
thread is waiting on the FSVolumeSet lock, which is held by getBlockReport().

In fact, during this time, the datanode appears as a dead node and clients
keep getting createBlockException with a timeout.

We don't see this problem on other DNs with fewer blocks, so I
think the 2 million files is the issue here.

Please correct me if I missed something.

-Sagar

Raghu Angadi wrote:


The scan required for each block report is well known issue and it can 
be fixed. It was discussed multiple times (e.g. 
https://issues.apache.org/jira/browse/HADOOP-3232?focusedCommentId=12587795#action_12587795 
).


Earlier, inline 'du' on datanodes used to cause the same problem, and
they were moved to a separate thread (HADOOP-3232). Block reports
can do the same...


Though 2M blocks on DN is very large, there is no reason block reports 
should break things. Once we fix block reports, something else might 
break.. but that is different issue.


Raghu.

Jason Venner wrote:
The problem we are having is that datanodes periodically stall for 
10-15 minutes and drop off the active list and then come back.


What is going on is that a long operation set is holding the lock on 
on FSDataset.volumes, and all of the other block service requests 
stall behind this lock.


DataNode: [/data/dfs-video-18/dfs/data] daemon prio=10 
tid=0x4d7ad400 nid=0x7c40 runnable [0x4c698000..0x4c6990d0]

  java.lang.Thread.State: RUNNABLE
   at java.lang.String.lastIndexOf(String.java:1628)
   at java.io.File.getName(File.java:399)
   at 
org.apache.hadoop.dfs.FSDataset$FSDir.getGenerationStampFromFile(FSDataset.java:148) 

   at 
org.apache.hadoop.dfs.FSDataset$FSDir.getBlockInfo(FSDataset.java:181)
   at 
org.apache.hadoop.dfs.FSDataset$FSVolume.getBlockInfo(FSDataset.java:412) 

   at 
org.apache.hadoop.dfs.FSDataset$FSVolumeSet.getBlockInfo(FSDataset.java:511) 


   - locked 0x551e8d48 (a org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
   at 
org.apache.hadoop.dfs.FSDataset.getBlockReport(FSDataset.java:1053)

   at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:708)
   at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2890)
   at 

Re: cannot allocate memory error

2008-12-31 Thread Sagar Naik

{HADOOP_HOME}/conf/hadoop-env.sh
export HADOOP_HEAPSIZE

The default is 1000 MB, so I think that there could be another issue.

-Sagar
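
For example (value illustrative, in MB), in {HADOOP_HOME}/conf/hadoop-env.sh:

  export HADOOP_HEAPSIZE=256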
sagar arlekar wrote:

Hello,

I am new to Hadoop. I am running Hadoop 0.17 in a Eucalyptus cloud
instance (it's a CentOS image on Xen).

bin/hadoop dfs -ls /
gives the following Exception

08/12/31 08:58:10 WARN fs.FileSystem: localhost:9000 is a deprecated
filesystem name. Use hdfs://localhost:9000/ instead.
08/12/31 08:58:10 WARN fs.FileSystem: uri=hdfs://localhost:9000
javax.security.auth.login.LoginException: Login failed: Cannot run
program whoami: java.io.IOException: error=12, Cannot allocate
memory
at 
org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at 
org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at 
org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at 
org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at 
org.apache.hadoop.fs.FileSystem$Cache$Key.init(FileSystem.java:1353)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1289)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:108)
at org.apache.hadoop.fs.FsShell.init(FsShell.java:87)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1717)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1866)
Bad connection to FS. command aborted.

Running the command again gives.
bin/hadoop dfs -ls /
Error occurred during initialization of VM
Could not reserve enough space for object heap

Changing value of 'mapred.child.java.opts' property in hadoop-site.xml
did not help.

Kindly help me. What could I do to give more memory to hadoop?

BTW is there a way to search through the mail archive? I only saw the
mails listed according to year and months.

Regards,
Sagar
  




Re: Threads per mapreduce job

2008-12-27 Thread Sagar Naik

mapred.map.multithreadedrunner.threads
is the property you are looking for.
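
A sketch of where that property applies: it is read by MultithreadedMapRunner,
which runs several concurrent map() calls inside one map task (the class name
and thread count here are illustrative):

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

  public class MultithreadedJobSetup {
    public static JobConf configure(Class jobClass) {
      JobConf conf = new JobConf(jobClass);
      conf.setMapRunnerClass(MultithreadedMapRunner.class);
      conf.setInt("mapred.map.multithreadedrunner.threads", 10);
      return conf;
    }
  }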


Michael wrote:

Hi everyone:
How do I control the number of threads per mapreduce job.  I am using
bin/hadoop jar wordcount to run jobs and even though I have found these
settings in hadoop-default.xml and changed the values to 1:
<name>mapred.tasktracker.map.tasks.maximum</name>
<name>mapred.tasktracker.reduce.tasks.maximum</name>

The output of the job seems to indicate otherwise.
08/12/26 18:21:12 INFO mapred.JobClient:   Job Counters
08/12/26 18:21:12 INFO mapred.JobClient: Launched reduce tasks=1
08/12/26 18:21:12 INFO mapred.JobClient: Rack-local map tasks=12
08/12/26 18:21:12 INFO mapred.JobClient: Launched map tasks=17
08/12/26 18:21:12 INFO mapred.JobClient: Data-local map tasks=4

I have 2 servers running the mapreduce process and the datanode process.
Thanks,
Michael

  




Re: Failed to start TaskTracker server

2008-12-19 Thread Sagar Naik
Well, you have some process which grabs this port, so Hadoop is not able to
bind to it.
By the time you check, there is a chance that the socket connection has died
but the port was occupied when the Hadoop process was attempting to bind.

Check all the processes running on the system.
Do any of the processes acquire that port?

-Sagar
ascend1 wrote:

I have made a Hadoop platform on 15 machines recently. NameNode - DataNodes 
work properly but when I use bin/start-mapred.sh to start MapReduce framework 
only 3 or 4 TaskTrackers could be started properly. All those that couldn't be 
started have the same error.
Here's the log:
 
2008-12-19 16:16:31,951 INFO org.apache.hadoop.mapred.TaskTracker: STARTUP_MSG: 
/

STARTUP_MSG: Starting TaskTracker
STARTUP_MSG:   host = msra-5lcd05/172.23.213.80
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.19.0
STARTUP_MSG:   build = 
https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 713890; 
compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008
/
2008-12-19 16:16:33,248 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
2008-12-19 16:16:33,248 INFO org.mortbay.util.Credential: Checking Resource 
aliases
2008-12-19 16:16:33,608 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@e51b2c
2008-12-19 16:16:33,655 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/static,/static]
2008-12-19 16:16:33,811 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@edf389
2008-12-19 16:16:33,936 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/logs,/logs]
2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@17b0998
2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/,/]
2008-12-19 16:16:34,155 WARN org.mortbay.util.ThreadedServer: Failed to start: 
socketlisten...@0.0.0.0:50060
2008-12-19 16:16:34,155 ERROR org.apache.hadoop.mapred.TaskTracker: Can not 
start task tracker because java.net.BindException: Address already in use: 
JVM_Bind
 at java.net.PlainSocketImpl.socketBind(Native Method)
 at java.net.PlainSocketImpl.bind(PlainSocketImpl.java:359)
 at java.net.ServerSocket.bind(ServerSocket.java:319)
 at java.net.ServerSocket.init(ServerSocket.java:185)
 at org.mortbay.util.ThreadedServer.newServerSocket(ThreadedServer.java:391)
 at org.mortbay.util.ThreadedServer.open(ThreadedServer.java:477)
 at org.mortbay.util.ThreadedServer.start(ThreadedServer.java:503)
 at org.mortbay.http.SocketListener.start(SocketListener.java:203)
 at org.mortbay.http.HttpServer.doStart(HttpServer.java:761)
 at org.mortbay.util.Container.start(Container.java:72)
 at org.apache.hadoop.http.HttpServer.start(HttpServer.java:321)
 at org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:894)
 at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698)
2008-12-19 16:16:34,155 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG: 
/

SHUTDOWN_MSG: Shutting down TaskTracker at msra-5lcd05/172.23.213.80
/

 Then I use netstat -an, but port 50060 isn't in the list. ps -af also shows that no program is 
using 50060. The strange point is that when I repeat bin/start-mapred.sh and bin/stop-mapred.sh 
several times, the list of machines that can start the TaskTracker seems random.
 
Could anybody help me solve this problem?
  




Re: Failed to start TaskTracker server

2008-12-19 Thread Sagar Naik

 - Check hadoop-default.xml:
there you will find all the ports used. Copy the relevant property elements from
hadoop-default.xml to hadoop-site.xml and change the port values in
hadoop-site.xml,

then deploy it on the datanodes.
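
For example, assuming 0.19's property name for the TaskTracker HTTP server
(50061 is an arbitrary free port):

  <property>
    <name>mapred.task.tracker.http.address</name>
    <value>0.0.0.0:50061</value>
  </property>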


Rico wrote:
Well, the machines are all servers that are probably running many services,
but I have no permission to change or modify other users' programs or
settings. Is there any way to change 50060 to another port?


Sagar Naik wrote:
Well u have some process which grabs this port and Hadoop is not able 
to bind the port
By the time u check, there is a chance that socket connection has 
died but was occupied when hadoop processes was attempting


Check all the processes running on the system
Do any of the processes acquire ports ?

-Sagar
ascend1 wrote:
I have made a Hadoop platform on 15 machines recently. NameNode - 
DataNodes work properly but when I use bin/start-mapred.sh to start 
MapReduce framework only 3 or 4 TaskTracker could be started 
properly. All those couldn't be started have the same error.

Here's the log:

2008-12-19 16:16:31,951 INFO org.apache.hadoop.mapred.TaskTracker: 
STARTUP_MSG: 
/

STARTUP_MSG: Starting TaskTracker
STARTUP_MSG: host = msra-5lcd05/172.23.213.80
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.19.0
STARTUP_MSG: build = 
https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 
713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008

/
2008-12-19 16:16:33,248 INFO org.mortbay.http.HttpServer: Version 
Jetty/5.1.4
2008-12-19 16:16:33,248 INFO org.mortbay.util.Credential: Checking 
Resource aliases
2008-12-19 16:16:33,608 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@e51b2c
2008-12-19 16:16:33,655 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/static,/static]
2008-12-19 16:16:33,811 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@edf389
2008-12-19 16:16:33,936 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/logs,/logs]
2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@17b0998
2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/,/]
2008-12-19 16:16:34,155 WARN org.mortbay.util.ThreadedServer: Failed 
to start: socketlisten...@0.0.0.0:50060
2008-12-19 16:16:34,155 ERROR org.apache.hadoop.mapred.TaskTracker: 
Can not start task tracker because java.net.BindException: Address 
already in use: JVM_Bind

at java.net.PlainSocketImpl.socketBind(Native Method)
at java.net.PlainSocketImpl.bind(PlainSocketImpl.java:359)
at java.net.ServerSocket.bind(ServerSocket.java:319)
at java.net.ServerSocket.init(ServerSocket.java:185)
at 
org.mortbay.util.ThreadedServer.newServerSocket(ThreadedServer.java:391) 


at org.mortbay.util.ThreadedServer.open(ThreadedServer.java:477)
at org.mortbay.util.ThreadedServer.start(ThreadedServer.java:503)
at org.mortbay.http.SocketListener.start(SocketListener.java:203)
at org.mortbay.http.HttpServer.doStart(HttpServer.java:761)
at org.mortbay.util.Container.start(Container.java:72)
at org.apache.hadoop.http.HttpServer.start(HttpServer.java:321)
at org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:894)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698)
2008-12-19 16:16:34,155 INFO org.apache.hadoop.mapred.TaskTracker: 
SHUTDOWN_MSG: 
/

SHUTDOWN_MSG: Shutting down TaskTracker at msra-5lcd05/172.23.213.80
/

Then I use netstat -an, but port 50060 isn't in the list. ps -af 
also show that no program using 50060. The strange point is that 
when I repeat bin/start-mapred.sh and bin/stop-mapred.sh several 
times, the machines list that could start TaskTracker seems randomly.


Could anybody help me solve this problem?










.18.1 jobtracker deadlock

2008-12-17 Thread Sagar Naik

Hi,

Found one Java-level deadlock:
=
SocketListener0-7:
 waiting to lock monitor 0x0845e1fc (object 0x54f95838, a 
org.apache.hadoop.mapred.JobTracker),

 which is held by IPC Server handler 0 on 54311
IPC Server handler 0 on 54311:
 waiting to lock monitor 0x4d671064 (object 0x57250a60, a 
org.apache.hadoop.mapred.JobInProgress),

 which is held by initJobs
initJobs:
 waiting to lock monitor 0x0845e1fc (object 0x54f95838, a 
org.apache.hadoop.mapred.JobTracker),

 which is held by IPC Server handler 0 on 54311

Java stack information for the threads listed above:
===
SocketListener0-7:
   at 
org.apache.hadoop.mapred.JobTracker.getClusterStatus(JobTracker.java:1826)
   - waiting to lock 0x54f95838 (a 
org.apache.hadoop.mapred.JobTracker)
   at 
org.apache.hadoop.mapred.jobtracker_jsp._jspService(jobtracker_jsp.java:135)
   at 
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)

   at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
   at 
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
   at 
org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
   at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)

   at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
   at 
org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)

   at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
   at org.mortbay.http.HttpServer.service(HttpServer.java:954)
   at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
   at 
org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)

   at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
   at 
org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)

   at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
   at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
IPC Server handler 0 on 54311:
   at 
org.apache.hadoop.mapred.JobInProgress.kill(JobInProgress.java:1451)
   - waiting to lock 0x57250a60 (a 
org.apache.hadoop.mapred.JobInProgress)

   at org.apache.hadoop.mapred.JobTracker.killJob(JobTracker.java:1843)
   - locked 0x54f95838 (a org.apache.hadoop.mapred.JobTracker)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
initJobs:
   at 
org.apache.hadoop.mapred.JobTracker.finalizeJob(JobTracker.java:1015)
   - waiting to lock 0x54f95838 (a 
org.apache.hadoop.mapred.JobTracker)
   at 
org.apache.hadoop.mapred.JobInProgress.garbageCollect(JobInProgress.java:1656)

   - locked 0x57250a60 (a org.apache.hadoop.mapred.JobInProgress)
   at 
org.apache.hadoop.mapred.JobInProgress.kill(JobInProgress.java:1469)

   - locked 0x57250a60 (a org.apache.hadoop.mapred.JobInProgress)
   at 
org.apache.hadoop.mapred.JobTracker$JobInitThread.run(JobTracker.java:416)

   at java.lang.Thread.run(Thread.java:619)

Found 1 deadlock.



I found this condition. I will try to work on this

-Sagar



DiskUsage ('du -sk') probably hangs Datanode

2008-12-17 Thread Sagar Naik


I see createBlockException and Abandoning block quite often.
When I check the datanode, it is running; I can browse the filesystem 
from that datanode:50075.
However, I also notice that a 'du' is forked off from the DN. This 'du' runs 
anywhere from 6 minutes to 30 minutes.


During this time no logs are generated. The DN appears in the S1 state and the 
'du' in the D (uninterruptible sleep) state.


Is it possible that the JVM has a bug or the HDD is bad?
I'm using /usr/java/jdk1.6.0_07/bin/java and planning to move to update 11.

However, I started noticing this after DFS became 50% (on average) full.

Please help me with some pointers.

Hadoop version: .18.1

-Sagar








Re: DiskUsage ('du -sk') probably hangs Datanode

2008-12-17 Thread Sagar Naik

Brian Bockelman wrote:

Hey Sagar,

If the 'du' is in the D state, then that probably means bad things 
for your hardware.


I recommend looking in dmesg and /var/log/messages for anything 
interesting, as well as performing a hard-drive diagnostic (it may be 
as simple as a SMART test) to see if there's an issue.


I can't say for sure, but the 'du' is probably not hanging the 
Datanode; it's probably a symptom of larger problems.



Thanks Brian.
I will start SMART tests.
Please tell me what direction I should look in, in case of larger problems.



Brian

On Dec 17, 2008, at 8:29 PM, Sagar Naik wrote:



I see createBlockException and Abandoning block quite often.
When I check the datanode, it is running; I can browse the filesystem 
from that datanode:50075.
However, I also notice that a 'du' is forked off from the DN. This 'du' 
runs anywhere from 6 minutes to 30 minutes.


During this time no logs are generated. The DN appears in the S1 state and 
the 'du' in the D state.


Is it possible that the JVM has a bug or the HDD is bad?
I'm using /usr/java/jdk1.6.0_07/bin/java and planning to move to update 11.

However, I started noticing this after DFS became 50% (on average) full.

Please help me with some pointers.

Hadoop version: .18.1

-Sagar











Re: occasional createBlockException in Hadoop .18.1

2008-12-15 Thread Sagar Naik

Hi,
Some data points on this issue:
1) du runs for 20-30 secs.
2) After some time, I don't see any activity in the datanode logs.
3) I can't even jstack the datanode (forced it, got a 
DebuggerException, double-checked the pid); datanode:50075/stacks 
takes forever to respond.


I can telnet to datanode:50010.

I think the disk is bad or something.

Please suggest some pointers to analyze this problem.

-Sagar
Sagar Naik wrote:



CLIENT EXCEPTION:

2008-12-14 08:41:46,919 [Thread-90] INFO 
org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream 
java.net.SocketTimeoutException: 69000 millis timeout while waiting 
for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/10.50.80.133:54045 
remote=/10.50.80.108:50010] 2008-12-14 08:41:46,919 [Thread-90] INFO 
org.apache.hadoop.dfs.DFSClient: Abandoning block 
blk_-7364265396616885025_5870078 2008-12-14 08:41:46,920 [Thread-90] 
INFO org.apache.hadoop.dfs.DFSClient: Waiting to find target node: 
10.50.80.108:50010



DATANODE

2008-12-14 08:40:39,215 INFO org.apache.hadoop.dfs.DataNode: Receiving 
block blk_-7364265396616885025_5870078 src: /10.50.80.133:54045 dest: 
/10.50.80.133:50010

.
.
.
.
. 
I occasionally see the datanode listed as a dead node. When the datanode is 
a dead node, I see the 'du' forked from the datanode. The 'du' is seen in the 
D state.




Any pointers for debugging this would help me.

-Sagar




Re: Q about storage architecture

2008-12-06 Thread Sagar Naik

http://hadoop.apache.org/core/docs/r0.18.2/hdfs_design.html

Sirisha Akkala wrote:

Hi
I would like to know if the Hadoop architecture more closely resembles SAN or 
NAS? I'm guessing it is NAS.
Or does it fall under a totally different category? If so, can you please email 
brief information?

Thanks, Sirisha.
  




Re: getting Configuration object in mapper

2008-12-05 Thread Sagar Naik

check : mapred.task.is.map
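
A minimal sketch (assuming the 0.18-era org.apache.hadoop.mapred API; the class 
name is illustrative) that combines the two answers quoted below: grab the 
JobConf in configure() and read the mapred.task.is.map flag there.

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;

  public class MapOrReduceAware extends MapReduceBase {
    private boolean isMapTask;   // true inside a map task, false inside a reduce task
    private String operator;     // example of reading back a value set via jobConf.set()

    public void configure(JobConf job) {
      // The framework sets mapred.task.is.map in the task-local conf.
      isMapTask = job.getBoolean("mapred.task.is.map", true);
      operator = job.get("Operator");   // the key used by the driver's jobConf.set() call
    }
  }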

Craig Macdonald wrote:
I have a related question - I have a class which is both a mapper and a 
reducer. How can I tell in configure() whether the current task is a map or a 
reduce task? Parse the task id?


C

Owen O'Malley wrote:


On Dec 4, 2008, at 9:19 PM, abhinit wrote:


I have set some variables using the JobConf object:

jobConf.set(Operator, operator), etc.

How can I get an instance of the Configuration/JobConf object inside
a map method so that I can retrieve these variables?


In your Mapper class, implement a method like:
 public void configure(JobConf job) { ... }

This will be called when the object is created with the job conf.

-- Owen






Re: Bad connection to FS. command aborted.

2008-12-04 Thread Sagar Naik

Check your conf in the classpath.
Check if the Namenode is running.
You are not able to connect to the intended Namenode.

-Sagar
elangovan anbalahan wrote:

I'm getting this error message when I am doing:

*bash-3.2$ bin/hadoop dfs -put urls urls*


Please let me know the resolution; I have a project submission in a few hours.

  




Re: Bad connection to FS. command aborted.

2008-12-04 Thread Sagar Naik

hadoop version ?
command : bin/hadoop version

-Sagar


elangovan anbalahan wrote:

I tried that but nothing happened:

bash-3.2$ bin/hadoop dfs -put urll urll
put: java.io.IOException: failed to create file /user/nutch/urll/.urls.crc
on client 192.168.1.6 because target-length is 0, below MIN_REPLICATION (1)
bash-3.2$ bin/hadoop dfs -cat urls/part-0*  urls
bash-3.2$ bin/hadoop dfs -ls urls
Found 0 items
bash-3.2$ bin/hadoop dfs -ls urll
Found 0 items
bash-3.2$ bin/hadoop dfs -ls
Found 2 items
/user/nutch/$dir
/user/nutch/urlldir


How do I get rid of the following error:
*put: java.io.IOException: failed to create file /user/nutch/urll/.urls.crc
on client 192.168.1.6 because target-length is 0, below MIN_REPLICATION (1)


*
On Thu, Dec 4, 2008 at 1:29 PM, Elia Mazzawi
[EMAIL PROTECTED]wrote:

  

You didn't say what the error was?

But you can try this; it should do the same thing:

bin/hadoop dfs -cat urls/part-0*  urls


elangovan anbalahan wrote:



I'm getting this error message when I am doing:

*bash-3.2$ bin/hadoop dfs -put urls urls*


Please let me know the resolution; I have a project submission in a few
hours.



  



  




Re: Hadoop datanode crashed - SIGBUS

2008-12-01 Thread Sagar Naik



Brian Bockelman wrote:

Hardware/memory problems?

I'm not sure.


SIGBUS is relatively rare; it sometimes indicates a hardware error in 
the memory system, depending on your arch.



*uname -a : *
Linux hdimg53 2.6.15-1.2054_FC5smp #1 SMP Tue Mar 14 16:05:46 EST 2006 
i686 i686 i386 GNU/Linux

*top's top*
Cpu(s):  0.1% us,  1.1% sy,  0.0% ni, 98.0% id,  0.8% wa,  0.0% hi,  0.0% si
Mem:   8288280k total,  1575680k used,  6712600k free, 5392k buffers
Swap: 16386292k total,   68k used, 16386224k free,   522408k cached

8-core Xeon, 2 GHz


Brian

On Dec 1, 2008, at 3:00 PM, Sagar Naik wrote:


A couple of the datanodes crashed with the following error.
/tmp is 15% occupied.

#
# An unexpected error has been detected by Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0xb4edcb6a, pid=10111, tid=1212181408
#
[Too many errors, abort]

Please suggest how I should go about debugging this particular problem.


-Sagar




Thanks to Brian

-Sagar


Re: Hadoop datanode crashed - SIGBUS

2008-12-01 Thread Sagar Naik

None of the jobs use compression for sure

-Sagar
Brian Bockelman wrote:
I'd run memcheck overnight on the nodes that caused the problem, just 
to be sure.


Another (unlikely) possibility is that the JNI callouts for the native 
libraries Hadoop use (for the Compression codecs, I believe) have 
crashed or were set up wrong, and died fatally enough to take out the 
JVM.  Are you using any compression?  Does your job complete 
successfully in local mode, if the crash correlates well with a job 
running?


Brian

On Dec 1, 2008, at 3:32 PM, Sagar Naik wrote:




Brian Bockelman wrote:

Hardware/memory problems?

I m not sure.


SIGBUS is relatively rare; it sometimes indicates a hardware error 
in the memory system, depending on your arch.



*uname -a : *
Linux hdimg53 2.6.15-1.2054_FC5smp #1 SMP Tue Mar 14 16:05:46 EST 
2006 i686 i686 i386 GNU/Linux

*top's top*
Cpu(s):  0.1% us,  1.1% sy,  0.0% ni, 98.0% id,  0.8% wa,  0.0% hi,  
0.0% si

Mem:   8288280k total,  1575680k used,  6712600k free, 5392k buffers
Swap: 16386292k total,   68k used, 16386224k free,   522408k cached

8 core , xeon  2GHz


Brian

On Dec 1, 2008, at 3:00 PM, Sagar Naik wrote:


Couple of the datanodes crashed with the following error
The /tmp is 15% occupied

#
# An unexpected error has been detected by Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0xb4edcb6a, pid=10111, tid=1212181408
#
[Too many errors, abort]

Pl suggest how should I go to debug this particular problem


-Sagar




Thanks to Brian

-Sagar






Re: Hadoop datanode crashed - SIGBUS

2008-12-01 Thread Sagar Naik


hi,
I don't have additional information on it. If you know of any other flag that I 
need to turn on, please do tell me. The flags that are currently on are 
-XX:+HeapDumpOnOutOfMemoryError -XX:+UseParallelGC 
-Dcom.sun.management.jmxremote

But this is what is listed in the stdout (datanode.out) file:

Java version :
java version 1.6.0_07
Java(TM) SE Runtime Environment (build 1.6.0_07-b06)
Java HotSpot(TM) Server VM (build 10.0-b23, mixed mode)


I will try to stress test the memory.

-Sagar

Chris Collins wrote:
Was there anything mentioned as part of the tombstone message about the 
problematic frame?  What Java are you using?  There are a few 
reasons for SIGBUS errors; one is illegal address alignment, but from 
Java that's very unlikely... there were some issues with the native zip 
library in older VMs.  As Brian pointed out, sometimes this points to 
a hardware issue.


C
On Dec 1, 2008, at 1:32 PM, Sagar Naik wrote:




Brian Bockelman wrote:

Hardware/memory problems?

I m not sure.


SIGBUS is relatively rare; it sometimes indicates a hardware error 
in the memory system, depending on your arch.



*uname -a : *
Linux hdimg53 2.6.15-1.2054_FC5smp #1 SMP Tue Mar 14 16:05:46 EST 
2006 i686 i686 i386 GNU/Linux

*top's top*
Cpu(s):  0.1% us,  1.1% sy,  0.0% ni, 98.0% id,  0.8% wa,  0.0% hi,  
0.0% si

Mem:   8288280k total,  1575680k used,  6712600k free, 5392k buffers
Swap: 16386292k total,   68k used, 16386224k free,   522408k cached

8 core , xeon  2GHz


Brian

On Dec 1, 2008, at 3:00 PM, Sagar Naik wrote:


Couple of the datanodes crashed with the following error
The /tmp is 15% occupied

#
# An unexpected error has been detected by Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0xb4edcb6a, pid=10111, tid=1212181408
#
[Too many errors, abort]

Pl suggest how should I go to debug this particular problem


-Sagar




Thanks to Brian

-Sagar






Re: Namenode BlocksMap on Disk

2008-11-26 Thread Sagar Naik
We can also try to mount the particular dir on ramfs to reduce the 
performance degradation.


-Sagar
Billy Pearson wrote:
I would like to see something like this also. I run 32-bit servers, so I 
am limited in how much memory I can use for heap. Besides just storing 
to disk, I would like to see some sort of cache, like a block cache, that 
will cache parts of the BlocksMap; this would help reduce the hits to disk 
for lookups and still give us the ability to lower the memory 
requirement for the namenode.


Billy


Dennis Kubes [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
From time to time a message pops up on the mailing list about OOM 
errors for the namenode because of too many files.  Most recently 
there was a 1.7 million file installation that was failing.  I know 
the simple solution to this is to have a larger java heap for the 
namenode.  But the non-simple way would be to convert the BlocksMap 
for the NameNode to be stored on disk and then queried and updated 
for operations.  This would eliminate memory problems for large file 
installations but also might degrade performance slightly.  Questions:


1) Is there any current work to allow the namenode to store on disk 
versus in memory?  This could be a configurable option.


2) Besides possible slight degradation in performance, is there a 
reason why the BlocksMap shouldn't or couldn't be stored on disk?


I am willing to put forth the work to make this happen.  Just want to 
make sure I am not going down the wrong path to begin with.


Dennis








64 bit namenode and secondary namenode 32 bit datanode

2008-11-25 Thread Sagar Naik

I am trying to migrate from a 32-bit JVM to a 64-bit JVM for the namenode only.
*setup*
NN - 64 bit
Secondary namenode (instance 1) - 64 bit
Secondary namenode (instance 2)  - 32 bit
datanode- 32 bit

From the mailing list I deduced that NN-64 bit and Datanode -32 bit 
combo works
But, I am not sure if S-NN-(instance 1--- 64 bit ) and S-NN (instance 2 
-- 32 bit) will work with this setup.


Also, should I be aware of any other issues when migrating over to a 64-bit 
namenode?


Thanks in advance for all the suggestions


-Sagar


Re: 64 bit namenode and secondary namenode 32 bit datanode

2008-11-25 Thread Sagar Naik



lohit wrote:
I might be wrong, but my assumption is that running the SNN in either 64-bit or 32-bit shouldn't matter. 
But I am curious how two instances of the Secondary namenode are set up; will both of them talk to the same NN and run in parallel? 
What are the advantages here?
  
I just have multiple entries in the masters file. I am not aware of image 
corruption (did not take a look into it). I did it for SNN redundancy.

Please correct me if I am wrong.
Thanks
Sagar

Wondering if there are chances of image corruption.

Thanks,
lohit

- Original Message 
From: Sagar Naik [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Tuesday, November 25, 2008 3:58:53 PM
Subject: 64 bit namenode and secondary namenode  32 bit datanode

I am trying to migrate from 32 bit jvm and 64 bit for namenode only.
*setup*
NN - 64 bit
Secondary namenode (instance 1) - 64 bit
Secondary namenode (instance 2)  - 32 bit
datanode- 32 bit

From the mailing list I deduced that NN-64 bit and Datanode -32 bit combo works
But, I am not sure if S-NN-(instance 1--- 64 bit ) and S-NN (instance 2 -- 32 
bit) will work with this setup.

Also, do shud I be aware of any other issues for migrating over to 64 bit 
namenode

Thanks in advance for all the suggestions


-Sagar

  




Re: Exception in thread main org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:

2008-11-24 Thread Sagar Naik

Include the ${HADOOP}/conf/ dir in the classpath of the Java program.
Alternatively, you can also try:
bin/hadoop jar your_jar main_class args

-Sagar
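
A quick sanity check (a sketch, not part of the original reply) is to print which 
default filesystem the Configuration actually resolved; without conf/hadoop-site.xml 
on the classpath it falls back to the local filesystem, which is why the input path 
in the error quoted below is reported as file:/opt/www/hadoop/hadoop-0.18.2/words.

  import org.apache.hadoop.conf.Configuration;

  public class WhichFs {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Prints file:/// when only hadoop-default.xml is visible, and the
      // configured hdfs://namenode:port once conf/hadoop-site.xml is on the classpath.
      System.out.println(conf.get("fs.default.name"));
    }
  }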

Saju K K wrote:
This is in reference to the sample application in the JavaWorld article: 
http://www.javaworld.com/javaworld/jw-09-2008/jw-09-hadoop.html?page=5



bin/hadoop dfs -mkdir /opt/www/hadoop/hadoop-0.18.2/words
bin/hadoop dfs -put word1 /opt/www/hadoop/hadoop-0.18.2/words
bin/hadoop dfs -put word2 /opt/www/hadoop/hadoop-0.18.2/words
bin/hadoop dfs -put word3 /opt/www/hadoop/hadoop-0.18.2/words
bin/hadoop dfs -put word4 /opt/www/hadoop/hadoop-0.18.2/words

When I browse through 
http://serdev40.apac.nokia.com:50075/browseDirectory.jsp I can see the
files in the directory.

Also, the commands below execute properly:


bin/hadoop dfs -ls /opt/www/hadoop/hadoop-0.18.2/words/
bin/hadoop dfs -ls /opt/www/hadoop/hadoop-0.18.2/words/word1
bin/hadoop dfs -cat /opt/www/hadoop/hadoop-0.18.2/words/word1

But on executing this command, I am getting an error:
java -Xms1024m -Xmx1024m com.nokia.tag.test.EchoOhce
/opt/www/hadoop/hadoop-0.18.2/words/ result

 java -Xms1024m -Xmx1024m com.nokia.tag.test.EchoOhce
/opt/www/hadoop/hadoop-0.18.2/words result
08/11/24 10:52:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
Exception in thread main org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: file:/opt/www/hadoop/hadoop-0.18.2/words
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:210)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:742)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1026)
at com.nokia.tag.test.EchoOhce.run(EchoOhce.java:123)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.nokia.tag.test.EchoOhce.main(EchoOhce.java:129)

Does anybody know why there is a failure from the Java application?

  




Re: Hadoop 18.1 ls stalls

2008-11-20 Thread Sagar Naik
Unfortunately, I am not getting it now, because we have turned off our 
services (and I can't start them immediately).


But I used to get RetryInvocationHandler (or similar) in the stack,


and the ls that took 4-5 secs had listed only 37 files.
It was not the -lsr option.
That's what surprised me.

-Sagar

Raghu Angadi wrote:

Sagar Naik wrote:

Thanks Raghu,
*datapoints:*
- So when I use FSShell client, it gets into retry mode for 
getFilesInfo() call and takes a long time.


What does retry mode mean?


 - Also, when do a ls operation, it takes secs(4/5) .

 - 1.6 million files and namenode is mostly full with heap(2400M)  
(from ui)


When you say 'ls', how many does it return? (ie. ls of one file, or 
-lsr of thousands of files etc).


None of the IPC threads in your stack trace is doing any work.






Re: Hadoop Installation

2008-11-19 Thread Sagar Naik

Mithila Nagendra wrote:

Hello
I'm currently a student at Arizona State University, Tempe, Arizona,
pursuing my masters in Computer Science. I'm currently involved in a
research project that makes use of Hadoop to run various map reduce
functions. Hence I searched the web for the best way to install Hadoop
on different nodes in a cluster, and I stumbled upon your website.

I used your tutorial How to install Hadoop on a Linux system by Michael
Noll to set up Hadoop on a UNIX system. I have a few questions related to it:

1. Does it matter that I m installing hadoop on UNIX and not on LINUX - do I
have to follow different steps?
2. The configuration for hadoop-site.xml - does it remain the same no matter
what platform is being used? Do I just type the same thing out in the file
hadoop-site.xml present in the hadoop installation on the node?
3. When I try to start the daemons by executing the command
conf/start-all.sh, I get an exception which says hadoop: user specified log
class 'org.apache.commons.logging.impl.Log4JLogger' cannot be found or is
not usable - this happens when tasktracker is being started. What steps do
I take to deal with this exception?

  

start-all.sh is in {HADOOP_HOME}/bin/. What is your Hadoop version?



I could send you the screen shot of the exception if you wish. It would be
of immense help if you could provide answers for the above questions.

Thank you! Looking forward to your reply.

Best Regards
Mithila Nagendra

  




Re: Recovering NN failure when the SNN data is on another server

2008-11-16 Thread Sagar Naik


Take a backup of your dfs.data.dir (on both the namenode and secondary namenode).
If the secondary namenode is not running on the same machine as the namenode, copy 
over the fs.checkpoint.dir from the secondary onto the namenode.


Start only the namenode. The importCheckpoint fails for a valid NN 
image. If you want to override the NN image with the SNN's image, delete the 
dfs.name.dir.


For additional info :
https://issues.apache.org/jira/browse/HADOOP-2585?focusedCommentId=12558173#action_12558173

Please note I am not an expert.
I just had a similar problem and this worked for me.

-Sagar


Yossi Ittach wrote:

Hi all

I apologize if the topic has already been answered - I couldn't find it.

I'm trying to restart a failed NN using hadoop namenode -importCheckpoint
, and the SNN is configured on another server. However , the NN keeps
looking for the SNN data folder on the local server , and not on the SNN
Server.
Any ideas?

10X!

Vale et me ama
Yossi

  




Re: Recovering NN failure when the SNN data is on another server

2008-11-16 Thread Sagar Naik

Let me correct myself:
- Back up dfs.data.dir and dfs.name.dir on the NN and SNN.
- If the secondary namenode is not running on the same machine as the namenode, 
copy over the fs.checkpoint.dir from the secondary onto the namenode.
- If you want to override the NN image with the SNN's image, delete the 
dfs.name.dir (dfs.name.dir has been backed up).

- Start only the namenode with -importCheckpoint.
For additional info :
https://issues.apache.org/jira/browse/HADOOP-2585?focusedCommentId=12558173#action_12558173 



-Sagar


Sagar Naik wrote:


Take backup of you dfs.data.dir (both on namenode and secondary 
namenode).
If secondary namenode is not running on same machine as namenode, copy 
over the fs.checkpoint.dir from secondary onto namenode.


start only the namenode . The importCheckpoint fails for a valid NN 
image. If you want to override NN image by SNN's image , delete the 
dfs.name.dir


For additional info :
https://issues.apache.org/jira/browse/HADOOP-2585?focusedCommentId=12558173#action_12558173 



Pl note I am not an expert.
Just had similar problem and this worked for me

-Sagar


Yossi Ittach wrote:

Hi all

I apologize if the topic has already been answered - I couldn't find it.

I'm trying to restart a failed NN using hadoop namenode 
-importCheckpoint

, and the SNN is configured on another server. However , the NN keeps
looking for the SNN data folder on the local server , and not on the SNN
Server.
Any ideas?

10X!

Vale et me ama
Yossi

  






Re: Recovery of files in hadoop 18

2008-11-14 Thread Sagar Naik

Hey Lohit,

Thanks for your help.
I did as per your suggestion and imported from the secondary namenode.
We have some corrupted files.

But for some reason, the namenode is still in safe_mode. It has been an 
hour or so.

The fsck report is :

Total size:6954466496842 B (Total open files size: 543469222 B)
Total dirs:1159
Total files:   1354155 (Files currently being written: 7673)
Total blocks (validated):  1375725 (avg. block size 5055128 B) 
(Total open file blocks (not validated): 50)

 
 CORRUPT FILES:1574
 MISSING BLOCKS:   1574
 MISSING SIZE: 1165735334 B
 CORRUPT BLOCKS:   1574
 
Minimally replicated blocks:   1374151 (99.88559 %)
Over-replicated blocks:0 (0.0 %)
Under-replicated blocks:   26619 (1.9349071 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor:3
Average block replication: 2.977127
Corrupt blocks:1574
Missing replicas:  26752 (0.65317154 %)


Do you think I should manually override the safemode, delete all the 
corrupted files, and restart?


-Sagar


lohit wrote:

If you have enabled trash, they should be moved to the trash folder before 
being permanently deleted; restore them back from there (hope you have 
fs.trash.interval set).

If not, shut down the cluster.
Take a backup of your dfs.data.dir (on both the namenode and secondary namenode).

The secondary namenode should have the last updated image; try to start the namenode from that image, and don't use the edits from the namenode yet. Try doing importCheckpoint as explained here: https://issues.apache.org/jira/browse/HADOOP-2585?focusedCommentId=12558173#action_12558173. Start only the namenode and run fsck -files. It will throw a lot of messages saying you are missing blocks, but that's fine since you haven't started the datanodes yet. But if it shows your files, that means they haven't been deleted yet. 
This will give you a view of the system as of the last backup. Start the datanodes. If everything is up, try running fsck and check the consistency of the system. You would lose all changes that have happened since the last checkpoint. 



Hope that helps,
Lohit



- Original Message 
From: Sagar Naik [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Friday, November 14, 2008 10:38:45 AM
Subject: Recovery of files in hadoop 18

Hi,
I accidentally deleted the root folder in our hdfs.
I have stopped the hdfs

Is there any way to recover the files from secondary namenode

Pl help


-Sagar
  




Re: Recovery of files in hadoop 18

2008-11-14 Thread Sagar Naik

I had a secondary namenode running on the namenode machine.
I deleted the dfs.name.dir,
then ran bin/hadoop namenode -importCheckpoint,

and restarted the dfs.

I guess the deletion of dfs.name.dir will delete the edit logs.
Can you please confirm that this will not lead to replaying the delete 
transactions?


Thanks for help/advice


-Sagar

lohit wrote:
NameNode would not come out of safe mode as it is still waiting for datanodes to report those blocks which it expects. 
I should have added, try to get a full output of fsck

fsck path -openforwrite -files -blocks -location.
-openforwrite files should tell you which files were open during the 
checkpoint; you might want to double check that that is the case, i.e. that the files were 
being written at that moment. Maybe by looking at the filename you could 
tell if it was part of a job which was running.

For any missing block, you might also want to cross-verify on the datanode to 
see if it is really missing.

Once you are convinced that those are the only corrupt files, and that you can live with losing them, start the datanodes. 
The Namenode would still not come out of safemode as you have missing blocks; leave it for a while, run fsck, look around, and if everything is OK, bring the namenode out of safemode.

I hope you started this namenode with the old image and empty edits. You do not 
want your latest edits to be replayed, as they contain your delete transactions.

Thanks,
Lohit



- Original Message 
From: Sagar Naik [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Friday, November 14, 2008 12:11:46 PM
Subject: Re: Recovery of files in hadoop 18

Hey Lohit,

Thanks for you help.
I did as per your suggestion. imported from secondary namenode.
we have some corrupted files.

But for some reason, the namenode is still in safe_mode. It has been an hour or 
so.
The fsck report is :

Total size:6954466496842 B (Total open files size: 543469222 B)
Total dirs:1159
Total files:   1354155 (Files currently being written: 7673)
Total blocks (validated):  1375725 (avg. block size 5055128 B) (Total open 
file blocks (not validated): 50)

CORRUPT FILES:1574
MISSING BLOCKS:   1574
MISSING SIZE: 1165735334 B
CORRUPT BLOCKS:   1574

Minimally replicated blocks:   1374151 (99.88559 %)
Over-replicated blocks:0 (0.0 %)
Under-replicated blocks:   26619 (1.9349071 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor:3
Average block replication: 2.977127
Corrupt blocks:1574
Missing replicas:  26752 (0.65317154 %)


Do you think, I should manually override the safemode and delete all the 
corrupted files and restart

-Sagar


lohit wrote:
  

If you have enabled thrash. They should be moved to trash folder before 
permanently deleting them, restore them back. (hope you have that set 
fs.trash.interval)

If not Shut down the cluster.
Take backup of you dfs.data.dir (both on namenode and secondary namenode).

Secondary namenode should have last updated image, try to start namenode from that image, dont use the edits from namenode yet. Try do importCheckpoint explained in here https://issues.apache.org/jira/browse/HADOOP-2585?focusedCommentId=12558173#action_12558173. Start only namenode and run fsck -files. it will throw lot of messages saying you are missing blocks but thats fine since you havent started datanodes yet. But if it shows your files, that means they havent been deleted yet. This will give you a view of system of last backup. Start datanode If its up, try running fsck and check consistency of the sytem. you would lose all changes that has happened since the last checkpoint. 


Hope that helps,
Lohit



- Original Message 
From: Sagar Naik [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Friday, November 14, 2008 10:38:45 AM
Subject: Recovery of files in hadoop 18

Hi,
I accidentally deleted the root folder in our hdfs.
I have stopped the hdfs

Is there any way to recover the files from secondary namenode

Pl help


-Sagar
 





Re: HDFS from non-hadoop Program

2008-11-07 Thread Sagar Naik
Can you make sure the files in the Hadoop conf dir are in the classpath of the 
Java program?


-Sagar
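
A minimal sketch (0.18-era API; the hadoop-site.xml path is only an example) of 
doing that explicitly from a stand-alone Java program, in case changing the 
classpath is awkward:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsFromPlainJava {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Either put ${HADOOP_HOME}/conf on the classpath, or add the site file by hand:
      conf.addResource(new Path("/opt/hadoop/conf/hadoop-site.xml"));  // example location
      FileSystem fs = FileSystem.get(conf);   // should now point at HDFS, not the local FS
      System.out.println("Connected to: " + fs.getUri());
    }
  }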
Wasim Bari wrote:
Hello, 
I am trying to access HDFS from a non-Hadoop program using Java.

When I try to get the Configuration, it shows an exception both in DEBUG mode and 
the normal one:

org.apache.hadoop.conf.Configuration: java.io.IOException: config() at 
org.apache.hadoop.conf.Configuration.<init>(Configuration.java:156)

With the same configuration files, when I try to access HDFS from a single stand-alone program, it runs perfectly fine. 
Some people have posted the same issue before but no solution was posted. Has anyone found a solution?


Thanks

wasim

  




Missing blocks from bin/hadoop text but fsck is all right

2008-11-04 Thread Sagar Naik

Hi,
We have a strange problem getting out some of our files.

bin/hadoop dfs -text dir/* gives me missing block exceptions:
08/11/04 10:45:09 [main] INFO dfs.DFSClient: Could not obtain block 
blk_6488385702283300787_1247408 from any node:  java.io.IOException: No 
live nodes contain current block
08/11/04 10:45:12 [main] INFO dfs.DFSClient: Could not obtain block 
blk_6488385702283300787_1247408 from any node:  java.io.IOException: No 
live nodes contain current block
08/11/04 10:45:15 [main] INFO dfs.DFSClient: Could not obtain block 
blk_6488385702283300787_1247408 from any node:  java.io.IOException: No 
live nodes contain current block
08/11/04 10:45:18 [main] WARN dfs.DFSClient: DFS Read: 
java.io.IOException: Could not obtain block: 
blk_6488385702283300787_1247408 file=some_filepath-1
at 
org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1462)
at 
org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1312)

at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1417)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1369)
at java.io.DataInputStream.readShort(DataInputStream.java:295)
at org.apache.hadoop.fs.FsShell.forMagic(FsShell.java:396)
at org.apache.hadoop.fs.FsShell.access$1(FsShell.java:394)
at org.apache.hadoop.fs.FsShell$2.process(FsShell.java:419)
at 
org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1865)

at org.apache.hadoop.fs.FsShell.text(FsShell.java:421)
at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1532)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1730)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)/

But when I do a
bin/hadoop dfs -text some_filepath-1, I do get all the data.


The fsck on the parent of this file revealed no problems.



jstack on the FsShell revealed nothing much:

/Debugger attached successfully.
Server compiler detected.
JVM version is 10.0-b19
Deadlock Detection:

No deadlocks found.

Thread 3358: (state = BLOCKED)
- java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
- org.apache.hadoop.dfs.DFSClient$LeaseChecker.run() @bci=124, line=792 
(Interpreted frame)

- java.lang.Thread.run() @bci=11, line=619 (Interpreted frame)


Thread 3357: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Interpreted frame)
- org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=62, 
line=397 (Interpreted frame)
- org.apache.hadoop.ipc.Client$Connection.run() @bci=63, line=440 
(Interpreted frame)



Thread 3342: (state = BLOCKED)


Thread 3341: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Interpreted frame)
- java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=116 
(Interpreted frame)
- java.lang.ref.ReferenceQueue.remove() @bci=2, line=132 (Interpreted 
frame)
- java.lang.ref.Finalizer$FinalizerThread.run() @bci=3, line=159 
(Interpreted frame)



Thread 3340: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Interpreted frame)
- java.lang.Object.wait() @bci=2, line=485 (Interpreted frame)
- java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=116 
(Interpreted frame)



Thread 3330: (state = BLOCKED)
- java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
- 
org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(org.apache.hadoop.dfs.LocatedBlock) 
@bci=181, line=1470 (Interpreted frame)
- org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(long) 
@bci=133, line=1312 (Interpreted frame)
- org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(byte[], int, int) 
@bci=61, line=1417 (Interpreted frame)
- org.apache.hadoop.dfs.DFSClient$DFSInputStream.read() @bci=7, 
line=1369 (Compiled frame)

- java.io.DataInputStream.readShort() @bci=4, line=295 (Compiled frame)
- org.apache.hadoop.fs.FsShell.forMagic(org.apache.hadoop.fs.Path, 
org.apache.hadoop.fs.FileSystem) @bci=7, line=396 (Interpreted frame)
- org.apache.hadoop.fs.FsShell.access$1(org.apache.hadoop.fs.FsShell, 
org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem) @bci=3, 
line=394 (Interpreted frame)
- org.apache.hadoop.fs.FsShell$2.process(org.apache.hadoop.fs.Path, 
org.apache.hadoop.fs.FileSystem) @bci=28, line=419 (Interpreted frame)
- 
org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(org.apache.hadoop.fs.Path, 
org.apache.hadoop.fs.FileSystem) @bci=40, line=1865 (Interpreted frame)
- org.apache.hadoop.fs.FsShell.text(java.lang.String) @bci=26, line=421 
(Interpreted frame)
- org.apache.hadoop.fs.FsShell.doall(java.lang.String, 
java.lang.String[], int) @bci=246, line=1532 (Interpreted frame)
- org.apache.hadoop.fs.FsShell.run(java.lang.String[]) @bci=586, 
line=1730 (Interpreted frame)
- 
org.apache.hadoop.util.ToolRunner.run(org.apache.hadoop.conf.Configuration, 
org.apache.hadoop.util.Tool, java.lang.String[]) @bci=38, line=65 
(Interpreted frame)
- 

Re: Missing blocks from bin/hadoop text but fsck is all right

2008-11-04 Thread Sagar Naik

Hi,

We were hitting file descriptor limits :). Increased the limit and the problem got solved.

Thanks Jason

-Sagar


Sagar Naik wrote:

Hi,
We have a strange problem on getting out some of our files

bin/hadoop dfs -text dir/*  gives me missing block exceptions.
0/8/11/04 10:45:09 [main] INFO dfs.DFSClient: Could not obtain block 
blk_6488385702283300787_1247408 from any node:  java.io.IOException: 
No live nodes contain current block
08/11/04 10:45:12 [main] INFO dfs.DFSClient: Could not obtain block 
blk_6488385702283300787_1247408 from any node:  java.io.IOException: 
No live nodes contain current block
08/11/04 10:45:15 [main] INFO dfs.DFSClient: Could not obtain block 
blk_6488385702283300787_1247408 from any node:  java.io.IOException: 
No live nodes contain current block
08/11/04 10:45:18 [main] WARN dfs.DFSClient: DFS Read: 
java.io.IOException: Could not obtain block: 
blk_6488385702283300787_1247408 file=some_filepath-1
at 
org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1462) 

at 
org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1312) 

at 
org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1417)
at 
org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1369)

at java.io.DataInputStream.readShort(DataInputStream.java:295)
at org.apache.hadoop.fs.FsShell.forMagic(FsShell.java:396)
at org.apache.hadoop.fs.FsShell.access$1(FsShell.java:394)
at org.apache.hadoop.fs.FsShell$2.process(FsShell.java:419)
at 
org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1865) 


at org.apache.hadoop.fs.FsShell.text(FsShell.java:421)
at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1532)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1730)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)/

but when I do a
bin/hadoop dfs -text some_filepath-1. I do get all the data


the fsck on this parent of this file revealed no problems.



jstack on FSshell revealed nothin much

/Debugger attached successfully.
Server compiler detected.
JVM version is 10.0-b19
Deadlock Detection:

No deadlocks found.

Thread 3358: (state = BLOCKED)
- java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
- org.apache.hadoop.dfs.DFSClient$LeaseChecker.run() @bci=124, 
line=792 (Interpreted frame)

- java.lang.Thread.run() @bci=11, line=619 (Interpreted frame)


Thread 3357: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Interpreted frame)
- org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=62, 
line=397 (Interpreted frame)
- org.apache.hadoop.ipc.Client$Connection.run() @bci=63, line=440 
(Interpreted frame)



Thread 3342: (state = BLOCKED)


Thread 3341: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Interpreted frame)
- java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=116 
(Interpreted frame)
- java.lang.ref.ReferenceQueue.remove() @bci=2, line=132 (Interpreted 
frame)
- java.lang.ref.Finalizer$FinalizerThread.run() @bci=3, line=159 
(Interpreted frame)



Thread 3340: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Interpreted frame)
- java.lang.Object.wait() @bci=2, line=485 (Interpreted frame)
- java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=116 
(Interpreted frame)



Thread 3330: (state = BLOCKED)
- java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
- 
org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(org.apache.hadoop.dfs.LocatedBlock) 
@bci=181, line=1470 (Interpreted frame)
- org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(long) 
@bci=133, line=1312 (Interpreted frame)
- org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(byte[], int, 
int) @bci=61, line=1417 (Interpreted frame)
- org.apache.hadoop.dfs.DFSClient$DFSInputStream.read() @bci=7, 
line=1369 (Compiled frame)

- java.io.DataInputStream.readShort() @bci=4, line=295 (Compiled frame)
- org.apache.hadoop.fs.FsShell.forMagic(org.apache.hadoop.fs.Path, 
org.apache.hadoop.fs.FileSystem) @bci=7, line=396 (Interpreted frame)
- org.apache.hadoop.fs.FsShell.access$1(org.apache.hadoop.fs.FsShell, 
org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem) @bci=3, 
line=394 (Interpreted frame)
- org.apache.hadoop.fs.FsShell$2.process(org.apache.hadoop.fs.Path, 
org.apache.hadoop.fs.FileSystem) @bci=28, line=419 (Interpreted frame)
- 
org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(org.apache.hadoop.fs.Path, 
org.apache.hadoop.fs.FileSystem) @bci=40, line=1865 (Interpreted frame)
- org.apache.hadoop.fs.FsShell.text(java.lang.String) @bci=26, 
line=421 (Interpreted frame)
- org.apache.hadoop.fs.FsShell.doall(java.lang.String, 
java.lang.String[], int) @bci=246, line=1532 (Interpreted frame)
- org.apache.hadoop.fs.FsShell.run(java.lang.String[]) @bci=586, 
line=1730 (Interpreted frame)
- 
org.apache.hadoop.util.ToolRunner.run

Re: namenode failure

2008-10-30 Thread Sagar Naik

Please check your classpath entries.
It looks like the hadoop-core jars from before you shut down the cluster and 
from after you changed hadoop-env.sh are different.


-Sagar

Songting Chen wrote:

Hi,
  I modified the classpath in hadoop-env.sh on the namenode and datanodes before shutting down the cluster. Then the problem appeared: I cannot stop the hadoop cluster at all. stop-all.sh shows no datanode/namenode, while all the java processes are running. 
  So I manually killed the java processes. Now the namenode seems to be corrupted and always stays in Safe mode, while the datanodes complain with the following weird error:


2008-10-27 17:28:44,141 FATAL org.apache.hadoop.dfs.DataNode: Incompatible 
build versions: namenode BV = ; datanode BV = 694836
2008-10-27 17:28:44,244 ERROR org.apache.hadoop.dfs.DataNode: 
java.io.IOException: Incompatible build versions: namenode BV = ; datanode BV = 
694836
at org.apache.hadoop.dfs.DataNode.handshake(DataNode.java:403)
at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:250)
at org.apache.hadoop.dfs.DataNode.init(DataNode.java:190)
at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:2987)
at 
org.apache.hadoop.dfs.DataNode.instantiateDataNode(DataNode.java:2942)
at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:2950)
at org.apache.hadoop.dfs.DataNode.main(DataNode.java:3072)

  My question is how to recover from such failure. And I guess the correct 
practice for changing the CLASSPATH is to shut down the cluster, apply the 
change, restart the cluster.

Thanks,
-Songting
  




Hadoop .16 : Task failures

2008-10-10 Thread Sagar Naik

Hi,

We are using Hadoop 0.16, and on our heavy-IO job we are seeing a lot of these 
exceptions.
We are seeing a lot of task failures, more than 50% :(. There are two reasons from the 
logs:
   a) Task task_200810092310_0003_m_20_0 failed to report status for 600 seconds. Killing!
   b) java.io.IOException: Could not get block locations. Aborting... at


org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1824)
at

org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1479)
at

org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1571)



Tasktracker log:

Exception in createBlockOutputStream java.net.SocketTimeoutException: Read 
timed out
2008-10-10 05:50:10,485 INFO org.apache.hadoop.fs.DFSClient: Abandoning block 
blk_-5660296346325180487
.
..
.
Parent Died.

Datanode log:
2008-10-10 00:00:23,066 INFO org.apache.hadoop.dfs.DataNode: PacketResponder 
blk_6562287961399683551 1 Exception java.net.SocketException: Broken pipe
   at java.net.SocketOutputStream.socketWrite0(Native Method)
   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
   at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
   at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
   at org.apache.hadoop.dfs.DataNode$PacketResponder.run(DataNode.java:1823)
   at java.lang.Thread.run(Thread.java:619)

2008-10-10 00:00:23,067 ERROR org.apache.hadoop.dfs.DataNode: /localhost ip 
/:50010:DataXceiver: java.io.EOFException
   at java.io.DataInputStream.readInt(DataInputStream.java:375)
   at 
org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2263)
   at 
org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150)
   at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
   at java.lang.Thread.run(Thread.java:619)


2008-10-10 00:53:53,790 INFO org.apache.hadoop.dfs.DataNode: Exception in 
receiveBlock for block blk_-3482274249842371655 java.net.SocketException: 
Connection reset
2008-10-10 00:53:53,791 INFO org.apache.hadoop.dfs.DataNode: writeBlock 
blk_-3482274249842371655 received exception java.net.SocketException: 
Connection reset
2008-10-10 00:53:53,791 ERROR org.apache.hadoop.dfs.DataNode: /localhost 
ip/:50010:DataXceiver: java.net.SocketException: Connection reset
   at java.net.SocketInputStream.read(SocketInputStream.java:168)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
   at java.io.DataInputStream.readInt(DataInputStream.java:370)
   at 
org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2263)
   at 
org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150)
   at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
   at java.lang.Thread.run(Thread.java:619)



Any pointer would help us a lot

-Sagar
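
For failure (a) above, where a task is killed after not reporting status for 600 
seconds (the mapred.task.timeout default), one common mitigation is to report 
progress from long-running per-record work. A sketch under the old mapred API 
(the chunked loop is illustrative):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class LongRunningMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      for (int chunk = 0; chunk < 100; chunk++) {
        // ... do one slice of the expensive per-record work here ...
        reporter.progress();   // tells the TaskTracker the task is still alive
      }
      out.collect(value, key);
    }
  }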



Re: Getting started questions

2008-09-08 Thread Sagar Naik

Dennis Kubes wrote:



John Howland wrote:

I've been reading up on Hadoop for a while now and I'm excited that I'm
finally getting my feet wet with the examples + my own variations. If 
anyone

could answer any of the following questions, I'd greatly appreciate it.

1. I'm processing document collections, with the number of documents 
ranging

from 10,000 - 10,000,000. What is the best way to store this data for
effective processing?


AFAIK hadoop doesn't do well with, although it can handle, a large 
number of small files.  So it would be better to read in the documents 
and store them in SequenceFile or MapFile format.  This would be 
similar to the way the Fetcher works in Nutch.  10M documents in a 
sequence/map file on DFS is comparatively small and can be handled 
efficiently.
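
A minimal sketch of that packing step (file names and the Text/BytesWritable 
choice are just an example) using SequenceFile on DFS:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class PackDocuments {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path out = new Path("/user/john/docs.seq");   // example output path
      SequenceFile.Writer writer =
          SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
      try {
        // One record per document: key = document id, value = raw document bytes.
        byte[] body = "example document body".getBytes("UTF-8");
        writer.append(new Text("doc-0001"), new BytesWritable(body));
      } finally {
        writer.close();
      }
    }
  }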




 - The bodies of the documents usually range from 1K-100KB in size, 
but some

outliers can be as big as 4-5GB.


I would say store your document objects as Text objects, not sure if 
Text has a max size.  I think it does but not sure what that is.  If 
it does you can always store as a BytesWritable which is just an array 
of bytes.  But you are going to have memory issues reading in and 
writing out that large of a record.
 - I will also need to store some metadata for each document which I 
figure

could be stored as JSON or XML.
 - I'll typically filter on the metadata and then doing standard 
operations

on the bodies, like word frequency and searching.


It is possible to create an OutputFormat that writes out multiple 
files.  You could also use a MapWritable as the value to store the 
document and associated metadata.




Is there a canned FileInputFormat that makes sense? Should I roll my 
own?
How can I access the bodies as streams so I don't have to read them 
into RAM


A writable is read into RAM so even treating it like a stream doesn't 
get around that.


One thing you might want to consider is to  tar up say X documents at 
a time and store that as a file in DFS.  You would have many of these 
files.  Then have an index that has the offsets of the files and their 
keys (document ids).  That index can be passed as input into a MR job 
that can then go to DFS and stream out the file as you need it.  The 
job will be slower because you are doing it this way but it is a 
solution to handling such large documents as streams.


all at once? Am I right in thinking that I should treat each document 
as a

record and map across them, or do I need to be more creative in what I'm
mapping across?

2. Some of the tasks I want to run are pure map operations (no 
reduction),
where I'm calculating new metadata fields on each document. To end up 
with a
good result set, I'll need to copy the entire input record + new 
fields into
another set of output files. Is there a better way? I haven't wanted 
to go

down the HBase road because it can't handle very large values (for the
bodies) and it seems to make the most sense to keep the document bodies
together with the metadata, to allow for the greatest locality of 
reference

on the datanodes.


If you don't specify a reducer, the IdentityReducer is run, which 
simply passes the output through.
   One can set the number of reducers to zero, and the reduce phase will not 
take place.




3. I'm sure this is not a new idea, but I haven't seen anything 
regarding
it... I'll need to run several MR jobs as a pipeline... is there any 
way for
the map tasks in a subsequent stage to begin processing data from 
previous

stage's reduce task before that reducer has fully finished?


Yup, just use FileOutputFormat.getOutputPath(previousJobConf);

Dennis
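
A sketch of that chaining pattern under the old mapred API (job and path names 
are illustrative; the first stage is map-only as mentioned above, and the second 
stage reads whatever the first one wrote):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class Pipeline {
    public static void main(String[] args) throws Exception {
      JobConf stage1 = new JobConf(Pipeline.class);
      FileInputFormat.setInputPaths(stage1, new Path("/user/john/docs.seq"));
      FileOutputFormat.setOutputPath(stage1, new Path("/user/john/stage1-out"));
      stage1.setNumReduceTasks(0);        // map-only first stage
      JobClient.runJob(stage1);           // blocks until stage 1 finishes

      JobConf stage2 = new JobConf(Pipeline.class);
      // Feed stage 2 from wherever stage 1 wrote its output.
      FileInputFormat.setInputPaths(stage2, FileOutputFormat.getOutputPath(stage1));
      FileOutputFormat.setOutputPath(stage2, new Path("/user/john/stage2-out"));
      JobClient.runJob(stage2);
    }
  }

Note that this runs the stages back to back; it does not start stage 2's maps 
before stage 1 has fully finished.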


Whatever insight folks could lend me would be a big help in crossing the
chasm from the Word Count and associated examples to something more 
real.

A whole heap of thanks in advance,

John





Re: Aborting Map Function

2008-04-16 Thread Sagar Naik

Chaman Singh Verma wrote:

Hello,

I am developing an application with MapReduce, and in it, whenever some
MapTask condition is met, I would like to broadcast to all other MapTasks to 
abort their work. I am not quite sure whether such broadcasting functionality 
currently exists in Hadoop MapReduce. Could someone give some hints?

Although extending this functionality may be easy as all the slaves
periodically ping the master,
I was just thinking of piggybacking one bit information from the slave to
the master and master
may send this information to all the slaves in the next round. Any
suggestions to this approach ?

Thanks.

With Regards 


-
Chaman Singh Verma
Poona, India
  
One possible solution could be to use Counters 
(http://hadoop.apache.org/core/docs/r0.16.2/api/org/apache/hadoop/mapred/Counters.html).
Though it is advisable to look into the details of their implementation, and 
see whether they can be used as a multi-process shared variable.
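
A rough sketch of that idea (old mapred API; the enum name and the stop-condition 
predicate are illustrative, and actually polling the counter from the other 
running tasks would need extra plumbing, e.g. a client watching the job's counters):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class AbortAwareMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {

    // Job-wide counter used as the shared "abort requested" signal.
    enum Control { ABORT_REQUESTED }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      if (stopConditionMet(value)) {
        reporter.incrCounter(Control.ABORT_REQUESTED, 1);  // visible in the job's counters
        return;   // stop doing useful work in this task
      }
      out.collect(value, key);
    }

    private boolean stopConditionMet(Text value) {
      return false;   // placeholder predicate
    }
  }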