ack, after looking at the logs again, there are definitely xceiver errors. It's set to 256! I thought I had cleared this as a possible cause, but I guess I was wrong. Going to retest right away. Thanks!
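For anyone else who lands on this thread: on Hadoop 0.20.x the usual fix is to raise the DataNode transceiver cap in hdfs-site.xml on every DataNode and restart them. The property name genuinely is misspelled ("xcievers") in this version; 4096 is a commonly suggested value, not an official recommendation, so treat it as an assumption to tune:

```xml
<!-- hdfs-site.xml on each DataNode.
     4096 is an assumed, commonly-suggested value; tune for your cluster. -->
<property>
  <!-- note the historical misspelling "xcievers" in 0.20-era Hadoop -->
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```

The DataNodes need a restart for the change to take effect.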
On Fri, Feb 5, 2010 at 11:05 AM, Todd Lipcon <t...@cloudera.com> wrote:
> Yes, you're likely to see an error in the DN log. Do you see anything
> about max number of xceivers?
>
> -Todd
>
> On Thu, Feb 4, 2010 at 11:42 PM, Meng Mao <meng...@gmail.com> wrote:
> > not sure what else I could be checking to see where the problem lies.
> > Should I be looking in the datanode logs? I looked briefly in there and
> > didn't see anything from around the time exceptions started getting
> > reported. lsof during the job execution? Number of open threads?
> >
> > I'm at a loss here.
> >
> > On Thu, Feb 4, 2010 at 2:52 PM, Meng Mao <meng...@gmail.com> wrote:
> >
> >> I wrote a hadoop job that checks for ulimits across the nodes, and every
> >> node is reporting:
> >> core file size (blocks, -c) 0
> >> data seg size (kbytes, -d) unlimited
> >> scheduling priority (-e) 0
> >> file size (blocks, -f) unlimited
> >> pending signals (-i) 139264
> >> max locked memory (kbytes, -l) 32
> >> max memory size (kbytes, -m) unlimited
> >> open files (-n) 65536
> >> pipe size (512 bytes, -p) 8
> >> POSIX message queues (bytes, -q) 819200
> >> real-time priority (-r) 0
> >> stack size (kbytes, -s) 10240
> >> cpu time (seconds, -t) unlimited
> >> max user processes (-u) 139264
> >> virtual memory (kbytes, -v) unlimited
> >> file locks (-x) unlimited
> >>
> >> Is anything in there telling about file number limits? From what I
> >> understand, a high open files limit like 65536 should be enough. I
> >> estimate only a couple thousand part-files on HDFS being written to at
> >> once, and around 200 on the filesystem per node.
> >>
> >> On Wed, Feb 3, 2010 at 4:04 PM, Meng Mao <meng...@gmail.com> wrote:
> >>
> >>> also, which is the ulimit that's important, the one for the user who is
> >>> running the job, or the hadoop user that owns the Hadoop processes?
> >>> > >>> > >>> On Tue, Feb 2, 2010 at 7:29 PM, Meng Mao <meng...@gmail.com> wrote: > >>> > >>>> I've been trying to run a fairly small input file (300MB) on Cloudera > >>>> Hadoop 0.20.1. The job I'm using probably writes to on the order of > over > >>>> 1000 part-files at once, across the whole grid. The grid has 33 nodes > in it. > >>>> I get the following exception in the run logs: > >>>> > >>>> 10/01/30 17:24:25 INFO mapred.JobClient: map 100% reduce 12% > >>>> 10/01/30 17:24:25 INFO mapred.JobClient: Task Id : > >>>> attempt_201001261532_1137_r_000013_0, Status : FAILED > >>>> java.io.EOFException > >>>> at java.io.DataInputStream.readByte(DataInputStream.java:250) > >>>> at > >>>> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298) > >>>> at > >>>> org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319) > >>>> at org.apache.hadoop.io.Text.readString(Text.java:400) > >>>> at > >>>> > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869) > >>>> at > >>>> > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794) > >>>> at > >>>> > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077) > >>>> at > >>>> > org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263) > >>>> > >>>> ....lots of EOFExceptions.... 
> >>>>
> >>>> 10/01/30 17:24:25 INFO mapred.JobClient: Task Id :
> >>>> attempt_201001261532_1137_r_000019_0, Status : FAILED
> >>>> java.io.IOException: Bad connect ack with firstBadLink 10.2.19.1:50010
> >>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871)
> >>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
> >>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
> >>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)
> >>>>
> >>>> 10/01/30 17:24:36 INFO mapred.JobClient: map 100% reduce 11%
> >>>> 10/01/30 17:24:42 INFO mapred.JobClient: map 100% reduce 12%
> >>>> 10/01/30 17:24:49 INFO mapred.JobClient: map 100% reduce 13%
> >>>> 10/01/30 17:24:55 INFO mapred.JobClient: map 100% reduce 14%
> >>>> 10/01/30 17:25:00 INFO mapred.JobClient: map 100% reduce 15%
> >>>>
> >>>> From searching around, it seems like the most common cause of BadLink
> >>>> and EOFExceptions is when the nodes don't have enough file descriptors
> >>>> set. But across all the grid machines, the file-max has been set to
> >>>> 1573039. Furthermore, we set ulimit -n to 65536 using hadoop-env.sh.
> >>>>
> >>>> Where else should I be looking for what's causing this?
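For reference, the DataNode-side symptom Todd asks about looks roughly like the sample line below. The exact wording is an assumption modeled on 0.20-era DataNode logs, so adjust the pattern before grepping your own logs:

```shell
# Sample of the kind of DataNode log line seen when the transceiver cap
# is hit (wording is an assumption based on 0.20-era Hadoop, not verbatim
# from this cluster). In practice you would run something like:
#   grep -R "exceeds the limit of concurrent xcievers" /var/log/hadoop/
sample='java.io.IOException: xceiverCount 257 exceeds the limit of concurrent xcievers 256'

# Pull out the configured limit from the matching line.
echo "$sample" | grep -o 'xcievers [0-9]*'
```

If the grep turns up matches on any DataNode, the xceiver limit (not the open-files ulimit) is the bottleneck.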