> Can you check the DN logs for "exceeds the limit of concurrent
> xcievers"? You may need to bump the dfs.datanode.max.xcievers
> parameter in hdfs-site.xml, and also possibly the nfiles ulimit.

Thanks Todd, and sorry for the late reply - I missed this message.

I didn't see any xciever messages in the DN logs, but I figured it might be a
good idea to raise the nofiles ulimit anyway. The result is a jsvc process that is eating memory:

$ top

Mem:  16320412k total, 16199036k used,   121376k free,    25412k buffers
Swap: 33554424k total,   291492k used, 33262932k free, 10966732k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24835 mapred    18   0 2644m 157m 8316 S 34.1  1.0   7031:27 java
14794 hdfs      18   0 2430m 1.5g  10m S  3.3  9.8   3:39.56 jsvc
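
For completeness, this is roughly how I've been checking what the jsvc / DataNode
process actually gets (14794 is the jsvc pid from the top output above; your pid
will differ, and the limit a fresh shell reports may not be exactly what the
daemon picked up when it was started):

$ su - hdfs -c 'ulimit -Sn; ulimit -Hn'   # soft/hard nofile limit a new hdfs shell gets
$ ls /proc/14794/fd | wc -l               # descriptors the running jsvc has open right now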


I'll revert it and see what effect dfs.datanode.max.xcievers will have.
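
If I understand correctly, that means adding something like the following to
hdfs-site.xml on each DataNode and restarting them. (The value 4096 is just my
guess at a reasonable number, not something from this thread, and the property
name really is spelled "xcievers".)

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>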

Cheers,
Evert

>
> -Todd
>
>
> On Wed, Mar 9, 2011 at 3:27 AM, Evert Lammerts <evert.lamme...@sara.nl>
> wrote:
> > We see a lot of IOExceptions coming from HDFS during a job that does nothing
> > but untar 100 files (1 per Mapper, sizes vary between 5GB and 80GB) that are
> > in HDFS, to HDFS. DataNodes are also showing Exceptions that I think are
> > related. (See stacktraces below.)
> >
> > This job should not be able to overload the system, I think... I realize that
> > much data needs to go over the lines, but HDFS should still be responsive.
> > Any ideas / help is much appreciated!
> >
> > Some details:
> > * Hadoop 0.20.2 (CDH3b4)
> > * 5 node cluster plus 1 node for JT/NN (Sun Thumpers)
> > * 4 cores/node, 4GB RAM/core
> > * CentOS 5.5
> >
> > Job output:
> >
> > java.io.IOException: java.io.IOException: Could not obtain block: blk_-3695352030358969086_130839 file=/user/emeij/icwsm-data-test/01-26-SOCIAL_MEDIA.tar.gz
> >        at ilps.DownloadICWSM$UntarMapper.map(DownloadICWSM.java:449)
> >        at ilps.DownloadICWSM$UntarMapper.map(DownloadICWSM.java:1)
> >        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:390)
> >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
> >        at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
> >        at java.security.AccessController.doPrivileged(Native Method)
> >        at javax.security.auth.Subject.doAs(Subject.java:396)
> >        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
> >        at org.apache.hadoop.mapred.Child.main(Child.java:234)
> > Caused by: java.io.IOException: Could not obtain block: blk_-3695352030358969086_130839 file=/user/emeij/icwsm-data-test/01-26-SOCIAL_MEDIA.tar.gz
> >        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1977)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1784)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1932)
> >        at java.io.DataInputStream.read(DataInputStream.java:83)
> >        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:55)
> >        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74)
> >        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:335)
> >        at ilps.DownloadICWSM$CopyThread.run(DownloadICWSM.java:149)
> >
> >
> > Example DataNode Exceptions (note that these come from the node at 192.168.28.211):
> >
> > 2011-03-08 19:40:40,297 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-9222067946733189014_3798233 java.io.EOFException: while trying to read 3067064 bytes
> > 2011-03-08 19:40:41,018 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.211:50050, dest: /192.168.28.211:49748, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201103071120_0030_m_000032_0, offset: 3072, srvID: DS-568746059-145.100.2.180-50050-1291128670510, blockid: blk_3596618013242149887_4060598, duration: 2632000
> > 2011-03-08 19:40:41,049 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-9221028436071074510_2325937 java.io.EOFException: while trying to read 2206400 bytes
> > 2011-03-08 19:40:41,348 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-9221549395563181322_4024529 java.io.EOFException: while trying to read 3037288 bytes
> > 2011-03-08 19:40:41,357 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-9221885906633018147_3895876 java.io.EOFException: while trying to read 1981952 bytes
> > 2011-03-08 19:40:41,434 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block blk_-9221885906633018147_3895876 unfinalized and removed.
> > 2011-03-08 19:40:41,434 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-9221885906633018147_3895876 received exception java.io.EOFException: while trying to read 1981952 bytes
> > 2011-03-08 19:40:41,434 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.211:50050, storageID=DS-568746059-145.100.2.180-50050-1291128670510, infoPort=50075, ipcPort=50020):DataXceiver
> > java.io.EOFException: while trying to read 1981952 bytes
> >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
> >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
> >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
> >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
> >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
> >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
> > 2011-03-08 19:40:41,465 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block blk_-9221549395563181322_4024529 unfinalized and removed.
> > 2011-03-08 19:40:41,466 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-9221549395563181322_4024529 received exception java.io.EOFException: while trying to read 3037288 bytes
> > 2011-03-08 19:40:41,466 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.211:50050, storageID=DS-568746059-145.100.2.180-50050-1291128670510, infoPort=50075, ipcPort=50020):DataXceiver
> > java.io.EOFException: while trying to read 3037288 bytes
> >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
> >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
> >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
> >        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
> >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
> >        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
> >
> > Cheers,
> >
> > Evert Lammerts
> > Consultant eScience & Cloud Services
> > SARA Computing & Network Services
> > Operations, Support & Development
> >
> > Phone: +31 20 888 4101
> > Email: evert.lamme...@sara.nl
> > http://www.sara.nl
> >
> >
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
