ack, after looking at the logs again, there are definitely xcievers errors.
It's set to 256!
I had thought I had ruled this out as a possible cause, but I guess I was wrong.
Gonna retest right away.
Thanks!
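(For anyone finding this thread later: the datanode xceiver cap is raised in hdfs-site.xml and needs a datanode restart to take effect. Note the property name really is spelled "xcievers" in this Hadoop version. The value 4096 below is just a commonly used starting point, not a number from this thread.)

```xml
<!-- hdfs-site.xml on each datanode; 4096 is a common starting point,
     not a value taken from this thread -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```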

On Fri, Feb 5, 2010 at 11:05 AM, Todd Lipcon <t...@cloudera.com> wrote:

> Yes, you're likely to see an error in the DN log. Do you see anything
> about max number of xceivers?
>
> -Todd
>
> On Thu, Feb 4, 2010 at 11:42 PM, Meng Mao <meng...@gmail.com> wrote:
> > not sure what else I could be checking to see where the problem lies.
> > Should I be looking in the datanode logs? I looked briefly in there and
> > didn't see anything from around the time exceptions started getting
> > reported.
> > lsof during the job execution? Number of open threads?
> >
> > I'm at a loss here.
> >
> > On Thu, Feb 4, 2010 at 2:52 PM, Meng Mao <meng...@gmail.com> wrote:
> >
> >> I wrote a hadoop job that checks for ulimits across the nodes, and every
> >> node is reporting:
> >> core file size          (blocks, -c) 0
> >> data seg size           (kbytes, -d) unlimited
> >> scheduling priority             (-e) 0
> >> file size               (blocks, -f) unlimited
> >> pending signals                 (-i) 139264
> >> max locked memory       (kbytes, -l) 32
> >> max memory size         (kbytes, -m) unlimited
> >> open files                      (-n) 65536
> >> pipe size            (512 bytes, -p) 8
> >> POSIX message queues     (bytes, -q) 819200
> >> real-time priority              (-r) 0
> >> stack size              (kbytes, -s) 10240
> >> cpu time               (seconds, -t) unlimited
> >> max user processes              (-u) 139264
> >> virtual memory          (kbytes, -v) unlimited
> >> file locks                      (-x) unlimited
> >>
> >>
> >> Is anything in there telling about file number limits? From what I
> >> understand, a high open files limit like 65536 should be enough. I
> >> estimate only a couple thousand part-files on HDFS being written to at
> >> once, and around 200 on the filesystem per node.
> >>
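(A per-node check like the ulimit job above can also be sketched as a plain shell snippet, run on each node, e.g. over ssh or inside the same job. The 65536 threshold matches the value reported in this thread; everything else is an illustrative assumption, not something from the thread.)

```shell
# Sketch: warn if this node's open-files limit is below what the job needs.
# 65536 matches the limit reported in the thread; run once per node.
threshold=65536
limit=$(ulimit -n)
if [ "$limit" -lt "$threshold" ]; then
  echo "WARN: open files limit is $limit, below $threshold"
else
  echo "OK: open files limit is $limit"
fi
```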
> >> On Wed, Feb 3, 2010 at 4:04 PM, Meng Mao <meng...@gmail.com> wrote:
> >>
> >>> also, which is the ulimit that's important, the one for the user who is
> >>> running the job, or the hadoop user that owns the Hadoop processes?
> >>>
> >>>
> >>> On Tue, Feb 2, 2010 at 7:29 PM, Meng Mao <meng...@gmail.com> wrote:
> >>>
> >>>> I've been trying to run a fairly small input file (300MB) on Cloudera
> >>>> Hadoop 0.20.1. The job I'm using probably writes on the order of 1000
> >>>> part-files at once, across the whole grid. The grid has 33 nodes in it.
> >>>> I get the following exception in the run logs:
> >>>>
> >>>> 10/01/30 17:24:25 INFO mapred.JobClient:  map 100% reduce 12%
> >>>> 10/01/30 17:24:25 INFO mapred.JobClient: Task Id :
> >>>> attempt_201001261532_1137_r_000013_0, Status : FAILED
> >>>> java.io.EOFException
> >>>>     at java.io.DataInputStream.readByte(DataInputStream.java:250)
> >>>>     at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
> >>>>     at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
> >>>>     at org.apache.hadoop.io.Text.readString(Text.java:400)
> >>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869)
> >>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
> >>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
> >>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)
> >>>>
> >>>> ....lots of EOFExceptions....
> >>>>
> >>>> 10/01/30 17:24:25 INFO mapred.JobClient: Task Id :
> >>>> attempt_201001261532_1137_r_000019_0, Status : FAILED
> >>>> java.io.IOException: Bad connect ack with firstBadLink 10.2.19.1:50010
> >>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871)
> >>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
> >>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
> >>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)
> >>>>
> >>>> 10/01/30 17:24:36 INFO mapred.JobClient:  map 100% reduce 11%
> >>>> 10/01/30 17:24:42 INFO mapred.JobClient:  map 100% reduce 12%
> >>>> 10/01/30 17:24:49 INFO mapred.JobClient:  map 100% reduce 13%
> >>>> 10/01/30 17:24:55 INFO mapred.JobClient:  map 100% reduce 14%
> >>>> 10/01/30 17:25:00 INFO mapred.JobClient:  map 100% reduce 15%
> >>>>
> >>>> From searching around, it seems like the most common cause of BadLink
> >>>> and EOFExceptions is when the nodes don't have enough file descriptors
> >>>> set. But across all the grid machines, the file-max has been set to
> >>>> 1573039.
> >>>> Furthermore, we set ulimit -n to 65536 using hadoop-env.sh.
> >>>>
> >>>> Where else should I be looking for what's causing this?
> >>>>
> >>>
> >>>
> >>
> >
>
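(A back-of-the-envelope check that supports the xceiver theory: each block being written keeps one xceiver thread busy on every datanode in its write pipeline, so with ~1000 part-files open at once the per-node demand approaches the default cap of 256 as soon as writes are unevenly spread. The replication factor of 3 below is an assumption; the thread doesn't state it.)

```python
# Rough estimate of xceiver demand per datanode during the job.
# 1000 files and 33 nodes come from the thread; replication factor 3
# is an assumed default, not stated in the thread.
open_files = 1000      # part-files being written concurrently
replication = 3        # assumed HDFS replication factor
nodes = 33             # datanodes in the grid

# each open file holds one write-pipeline slot on `replication` datanodes
per_node = open_files * replication / nodes
print(f"~{per_node:.0f} xceivers per node if writes spread evenly")
```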
