I issued the command 'hadoop dfsadmin -report', but it did not return any
result for a long time. Also, I can open the NN UI (http://namenode:50070),
but it just stays in the connecting state and never returns any cluster
statistics.

The memory on the NN:
                  total       used       free
Mem:              3834       3686        148

After running the top command, I can see the following processes are taking
up memory: namenode, jobtracker, tasktracker, hbase, ...
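
To see exactly how much resident memory each daemon holds, rather than
eyeballing top, I believe something like this works on most Linux systems
(RSS is reported in KB):

    ps -eo rss,comm --sort=-rss | head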

I can restart the cluster, and then it is healthy again, but the issue
will probably recur a few days later. I think it is caused by a lack of
free/available memory, but I do not know how much extra free/available
memory each node requires beyond the memory needed to run the
datanode/tasktracker processes.
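
For example, would explicit heap caps like the following in
conf/hadoop-env.sh be a reasonable starting point? The -Xmx values below
are only my guesses for 3.8 GB nodes, not something I have verified:

    # Cap each daemon's JVM heap so the node keeps some headroom for
    # the OS; the -Xmx values here are placeholders to adjust.
    export HADOOP_NAMENODE_OPTS="-Xmx1024m $HADOOP_NAMENODE_OPTS"
    export HADOOP_JOBTRACKER_OPTS="-Xmx512m $HADOOP_JOBTRACKER_OPTS"
    export HADOOP_DATANODE_OPTS="-Xmx512m $HADOOP_DATANODE_OPTS"
    export HADOOP_TASKTRACKER_OPTS="-Xmx512m $HADOOP_TASKTRACKER_OPTS"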




2013/5/13 Nitin Pawar <nitinpawar...@gmail.com>

> Just one node running low on memory does not mean your cluster is down.
>
> Can you see your HDFS health on the NN UI?
>
> How much memory do you have on the NN? If there are no jobs running on the
> cluster, then you can safely restart the datanode and tasktracker.
>
> Also, run the top command and figure out which processes are taking up
> memory, and for what purpose.
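>
> If you do restart them, something along these lines is the usual way (this
> assumes a Hadoop 1.x layout; adjust HADOOP_HOME to your install):
>
>     # on the affected slave node
>     $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
>     $HADOOP_HOME/bin/hadoop-daemon.sh stop datanode
>     $HADOOP_HOME/bin/hadoop-daemon.sh start datanode
>     $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker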
>
>
> On Mon, May 13, 2013 at 11:28 AM, sam liu <samliuhad...@gmail.com> wrote:
>
>> Nitin,
>>
>> In my cluster, the tasktracker and datanode have already been launched
>> and are still running. But the free/available memory on node3 is now just
>> 167 MB; do you think that is why my Hadoop is unhealthy now (it
>> does not return a result for the command 'hadoop dfs -ls /')?
>>
>>
>> 2013/5/13 Nitin Pawar <nitinpawar...@gmail.com>
>>
>>> Sam,
>>>
>>> There is no formula for determining how much memory one should give to
>>> the datanode and tasktracker. The formula that does exist is for how many
>>> slots you want to have on a machine.
>>>
>>> In my prior experience, we gave 512 MB of memory each to the datanode and
>>> tasktracker.
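>>>
>>> As a rough sketch only (the numbers are placeholders, not a
>>> recommendation): size the slots to the cores, then the memory budget per
>>> slave is roughly daemon heaps + (map slots + reduce slots) * child task
>>> heap. In mapred-site.xml that looks like:
>>>
>>>     <property>
>>>       <name>mapred.tasktracker.map.tasks.maximum</name>
>>>       <value>2</value>
>>>     </property>
>>>     <property>
>>>       <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>       <value>1</value>
>>>     </property>
>>>     <property>
>>>       <name>mapred.child.java.opts</name>
>>>       <value>-Xmx512m</value>
>>>     </property>
>>>
>>> With 512 MB each for the DN and TT daemons, that example works out to
>>> roughly 1 GB of daemons + 3 * 512 MB of tasks, about 2.5 GB per slave.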
>>>
>>>
>>> On Mon, May 13, 2013 at 11:18 AM, sam liu <samliuhad...@gmail.com> wrote:
>>>
>>>> For node3, the memory is:
>>>>                    total       used       free     shared    buffers     cached
>>>> Mem:                3834       3666        167          0        187       1136
>>>> -/+ buffers/cache:             2342       1491
>>>> Swap:               8196          0       8196
>>>>
>>>> For a 3-node cluster like mine, what is the minimum required
>>>> free/available memory for the datanode process and tasktracker process,
>>>> without running any map/reduce tasks?
>>>> Is there any formula to determine it?
>>>>
>>>>
>>>> 2013/5/13 Rishi Yadav <ri...@infoobjects.com>
>>>>
>>>>> Can you share the specs of node3? In my experience, even on a test/demo
>>>>> cluster, anything below 4 GB of RAM makes the node almost inaccessible.
>>>>>
>>>>>
>>>>>
>>>>> On Sun, May 12, 2013 at 8:25 PM, sam liu <samliuhad...@gmail.com> wrote:
>>>>>
>>>>>> Got some exceptions on node3:
>>>>>> 1. datanode log:
>>>>>> 2013-04-17 11:13:44,719 INFO
>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
>>>>>> blk_2478755809192724446_1477 received exception
>>>>>> java.net.SocketTimeoutException: 63000 millis timeout while waiting for
>>>>>> channel to be ready for read. ch :
>>>>>> java.nio.channels.SocketChannel[connected local=/9.50.102.80:58371
>>>>>> remote=/9.50.102.79:50010]
>>>>>> 2013-04-17 11:13:44,721 ERROR
>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
>>>>>> 9.50.102.80:50010,
>>>>>> storageID=DS-2038715921-9.50.102.80-50010-1366091297051, infoPort=50075,
>>>>>> ipcPort=50020):DataXceiver
>>>>>> java.net.SocketTimeoutException: 63000 millis timeout while waiting
>>>>>> for channel to be ready for read. ch :
>>>>>> java.nio.channels.SocketChannel[connected local=/9.50.102.80:58371
>>>>>> remote=/9.50.102.79:50010]
>>>>>>         at
>>>>>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>>>>>         at
>>>>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>>>>>         at
>>>>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>>>>>         at
>>>>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:116)
>>>>>>         at java.io.DataInputStream.readShort(DataInputStream.java:306)
>>>>>>         at
>>>>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:359)
>>>>>>         at
>>>>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:112)
>>>>>>         at java.lang.Thread.run(Thread.java:738)
>>>>>> 2013-04-17 11:13:44,818 INFO
>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>>>>> blk_8413378381769505032_1477 src: /9.50.102.81:35279 dest: /9.50.102.80:50010
>>>>>>
>>>>>>
>>>>>> 2. tasktracker log:
>>>>>> 2013-04-23 11:48:26,783 INFO org.apache.hadoop.mapred.UserLogCleaner:
>>>>>> Deleting user log path job_201304152248_0011
>>>>>> 2013-04-30 14:48:15,506 ERROR org.apache.hadoop.mapred.TaskTracker:
>>>>>> Caught exception: java.io.IOException: Call to node1/9.50.102.81:9001
>>>>>> failed on local exception: java.io.IOException: Connection reset by peer
>>>>>>         at
>>>>>> org.apache.hadoop.ipc.Client.wrapException(Client.java:1144)
>>>>>>         at org.apache.hadoop.ipc.Client.call(Client.java:1112)
>>>>>>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
>>>>>>         at org.apache.hadoop.mapred.$Proxy2.heartbeat(Unknown Source)
>>>>>>         at
>>>>>> org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:2008)
>>>>>>         at
>>>>>> org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1802)
>>>>>>         at
>>>>>> org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:2654)
>>>>>>         at
>>>>>> org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3909)
>>>>>> Caused by: java.io.IOException: Connection reset by peer
>>>>>>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>>>>>>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:33)
>>>>>>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:210)
>>>>>>         at sun.nio.ch.IOUtil.read(IOUtil.java:183)
>>>>>>         at
>>>>>> sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:257)
>>>>>>         at
>>>>>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>>>>>         at
>>>>>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>>>>>         at
>>>>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>>>>>         at
>>>>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>>>>>         at java.io.FilterInputStream.read(FilterInputStream.java:127)
>>>>>>         at
>>>>>> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:361)
>>>>>>         at
>>>>>> java.io.BufferedInputStream.fill(BufferedInputStream.java:229)
>>>>>>         at
>>>>>> java.io.BufferedInputStream.read(BufferedInputStream.java:248)
>>>>>>         at java.io.DataInputStream.readInt(DataInputStream.java:381)
>>>>>>         at
>>>>>> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:841)
>>>>>>         at
>>>>>> org.apache.hadoop.ipc.Client$Connection.run(Client.java:786)
>>>>>>
>>>>>> 2013-04-30 14:48:15,517 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>> Resending 'status' to 'node1' with reponseId '-12904
>>>>>> 2013-04-30 14:48:16,404 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>> SHUTDOWN_MSG:
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2013/5/13 Rishi Yadav <ri...@infoobjects.com>
>>>>>>
>>>>>>> Do you get any error when trying to connect to the cluster, something
>>>>>>> like 'tried n times' or 'replicated 0 times'?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, May 12, 2013 at 7:28 PM, sam liu <samliuhad...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I set up a cluster with 3 nodes, and after that I did not submit any
>>>>>>>> jobs to it. But after a few days, I found the cluster is unhealthy:
>>>>>>>> - No result is returned, even after waiting a while, for the commands
>>>>>>>> 'hadoop dfs -ls /' and 'hadoop dfsadmin -report'
>>>>>>>> - The page 'http://namenode:50070' could not be opened as
>>>>>>>> expected...
>>>>>>>> - ...
>>>>>>>>
>>>>>>>> I did not find any useful info in the logs, but found that the
>>>>>>>> available memory on the cluster nodes was very low at that time:
>>>>>>>> - node1 (NN, JT, DN, TT): 158 MB of memory available
>>>>>>>> - node2 (DN, TT): 75 MB of memory available
>>>>>>>> - node3 (DN, TT): 174 MB of memory available
>>>>>>>>
>>>>>>>> I guess the issue with my cluster is caused by a lack of memory,
>>>>>>>> and my questions are:
>>>>>>>> - Without running jobs, what are the minimum memory requirements
>>>>>>>> for the datanode and namenode?
>>>>>>>> - How do I determine the minimum memory for the datanode and namenode?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Sam Liu
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
>
> --
> Nitin Pawar
>
