Re: Map-Reduce Slow Down

jason hadoop Wed, 15 Apr 2009 19:24:12 -0700

Double check that there is no firewall in place.
At one point a bunch of new machines were kickstarted and placed in a
cluster and they all failed with something similar.
It turned out the kickstart script had enabled the firewall with a rule
that blocked ports in the 50k range.
It took us a while to even think to check for that, since it was not a part
of our normal machine configuration.
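
A quick way to test this from a slave is to try opening a TCP connection to
the NameNode port. The following is a minimal Python sketch, not part of
Hadoop; the helper name can_connect is made up for illustration, and the host
and port (node18, 54310) are simply the ones shown in the logs quoted below.
If ssh works but this fails, a firewall rule on that port is a plausible
suspect, though a NameNode that is down or listening on another interface
would look the same from the slave.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and name-resolution failures
        return False

if __name__ == "__main__":
    # Assumed values taken from the quoted logs; adjust for your cluster.
    print(can_connect("node18", 54310))
```

Running it on the master as well as on a slave helps separate "nothing is
listening" from "something in between is blocking the port".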


On Wed, Apr 15, 2009 at 11:04 AM, Mithila Nagendra <mnage...@asu.edu> wrote:

> Hi Aaron
> I will look into that thanks!
>
> I spoke to the admin who oversees the cluster. He said that the gateway
> comes into the picture only when one of the nodes communicates with a node
> outside of the cluster. But in my case the communication is carried out
> between nodes that all belong to the same cluster.
>
> Mithila
>
> On Wed, Apr 15, 2009 at 8:59 PM, Aaron Kimball <aa...@cloudera.com> wrote:
>
> > Hi,
> >
> > I wrote a blog post a while back about connecting nodes via a gateway. See
> > http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
> >
> > This assumes that the client is outside the gateway and all
> > datanodes/namenode are inside, but the same principles apply. You'll just
> > need to set up ssh tunnels from every datanode to the namenode.
> >
> > - Aaron
> >
> >
> > On Wed, Apr 15, 2009 at 10:19 AM, Ravi Phulari <rphul...@yahoo-inc.com> wrote:
> >
> >> Looks like your NameNode is down.
> >> Verify that the hadoop processes are running (jps should show you all
> >> running java processes).
> >> If your hadoop processes are running, try restarting them.
> >> I guess this problem is due to your fsimage not being correct.
> >> You might have to format your namenode.
> >> Hope this helps.
> >>
> >> Thanks,
> >> --
> >> Ravi
> >>
> >>
> >> On 4/15/09 10:15 AM, "Mithila Nagendra" <mnage...@asu.edu> wrote:
> >>
> >> The log file runs into thousands of lines with the same message being
> >> displayed every time.
> >>
> >> On Wed, Apr 15, 2009 at 8:10 PM, Mithila Nagendra <mnage...@asu.edu>
> >> wrote:
> >>
> >> > The log file : hadoop-mithila-datanode-node19.log.2009-04-14 has the
> >> > following in it:
> >> >
> >> > 2009-04-14 10:08:11,499 INFO org.apache.hadoop.dfs.DataNode:
> >> STARTUP_MSG:
> >> > /************************************************************
> >> > STARTUP_MSG: Starting DataNode
> >> > STARTUP_MSG:   host = node19/127.0.0.1
> >> > STARTUP_MSG:   args = []
> >> > STARTUP_MSG:   version = 0.18.3
> >> > STARTUP_MSG:   build =
> >> > https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
> >> > 736250; compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
> >> > ************************************************************/
> >> > 2009-04-14 10:08:12,915 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> >> > 2009-04-14 10:08:13,925 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> >> > 2009-04-14 10:08:14,935 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> >> > 2009-04-14 10:08:15,945 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 3 time(s).
> >> > 2009-04-14 10:08:16,955 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 4 time(s).
> >> > 2009-04-14 10:08:17,965 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 5 time(s).
> >> > 2009-04-14 10:08:18,975 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 6 time(s).
> >> > 2009-04-14 10:08:19,985 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 7 time(s).
> >> > 2009-04-14 10:08:20,995 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 8 time(s).
> >> > 2009-04-14 10:08:22,005 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 9 time(s).
> >> > 2009-04-14 10:08:22,008 INFO org.apache.hadoop.ipc.RPC: Server at
> >> node18/
> >> > 192.168.0.18:54310 not available yet, Zzzzz...
> >> > 2009-04-14 10:08:24,025 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> >> > 2009-04-14 10:08:25,035 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> >> > 2009-04-14 10:08:26,045 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> >> > 2009-04-14 10:08:27,055 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 3 time(s).
> >> > 2009-04-14 10:08:28,065 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 4 time(s).
> >> > 2009-04-14 10:08:29,075 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 5 time(s).
> >> > 2009-04-14 10:08:30,085 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 6 time(s).
> >> > 2009-04-14 10:08:31,095 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 7 time(s).
> >> > 2009-04-14 10:08:32,105 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 8 time(s).
> >> > 2009-04-14 10:08:33,115 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 9 time(s).
> >> > 2009-04-14 10:08:33,116 INFO org.apache.hadoop.ipc.RPC: Server at
> >> node18/
> >> > 192.168.0.18:54310 not available yet, Zzzzz...
> >> > 2009-04-14 10:08:35,135 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 0 time(s).
> >> > 2009-04-14 10:08:36,145 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 1 time(s).
> >> > 2009-04-14 10:08:37,155 INFO org.apache.hadoop.ipc.Client: Retrying
> >> connect
> >> > to server: node18/192.168.0.18:54310. Already tried 2 time(s).
> >> >
> >> >
> >> > Hmmm, I still can't figure it out.
> >> >
> >> > Mithila
> >> >
> >> >
> >> > On Tue, Apr 14, 2009 at 10:22 PM, Mithila Nagendra <mnage...@asu.edu
> >> >wrote:
> >> >
> >> >> Also, would the way the port is accessed change if all these nodes are
> >> >> connected through a gateway? I mean in the hadoop-site.xml file? The
> >> >> Ubuntu systems we worked with earlier didn't have a gateway.
> >> >> Mithila
> >> >>
> >> >> On Tue, Apr 14, 2009 at 9:48 PM, Mithila Nagendra <mnage...@asu.edu
> >> >wrote:
> >> >>
> >> >>> Aaron: Which log file do I look into? There are a lot of them. Here's
> >> >>> what the error looks like:
> >> >>> [mith...@node19:~]$ cd hadoop
> >> >>> [mith...@node19:~/hadoop]$ bin/hadoop dfs -ls
> >> >>> 09/04/14 10:09:29 INFO ipc.Client: Retrying connect to server:
> node18/
> >> >>> 192.168.0.18:54310. Already tried 0 time(s).
> >> >>> 09/04/14 10:09:30 INFO ipc.Client: Retrying connect to server:
> node18/
> >> >>> 192.168.0.18:54310. Already tried 1 time(s).
> >> >>> 09/04/14 10:09:31 INFO ipc.Client: Retrying connect to server:
> node18/
> >> >>> 192.168.0.18:54310. Already tried 2 time(s).
> >> >>> 09/04/14 10:09:32 INFO ipc.Client: Retrying connect to server:
> node18/
> >> >>> 192.168.0.18:54310. Already tried 3 time(s).
> >> >>> 09/04/14 10:09:33 INFO ipc.Client: Retrying connect to server:
> node18/
> >> >>> 192.168.0.18:54310. Already tried 4 time(s).
> >> >>> 09/04/14 10:09:34 INFO ipc.Client: Retrying connect to server:
> node18/
> >> >>> 192.168.0.18:54310. Already tried 5 time(s).
> >> >>> 09/04/14 10:09:35 INFO ipc.Client: Retrying connect to server:
> node18/
> >> >>> 192.168.0.18:54310. Already tried 6 time(s).
> >> >>> 09/04/14 10:09:36 INFO ipc.Client: Retrying connect to server:
> node18/
> >> >>> 192.168.0.18:54310. Already tried 7 time(s).
> >> >>> 09/04/14 10:09:37 INFO ipc.Client: Retrying connect to server:
> node18/
> >> >>> 192.168.0.18:54310. Already tried 8 time(s).
> >> >>> 09/04/14 10:09:38 INFO ipc.Client: Retrying connect to server:
> node18/
> >> >>> 192.168.0.18:54310. Already tried 9 time(s).
> >> >>> Bad connection to FS. command aborted.
> >> >>>
> >> >>> Node19 is a slave and Node18 is the master.
> >> >>>
> >> >>> Mithila
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Tue, Apr 14, 2009 at 8:53 PM, Aaron Kimball <aa...@cloudera.com
> >> >wrote:
> >> >>>
> >> >>>> Are there any error messages in the log files on those nodes?
> >> >>>> - Aaron
> >> >>>>
> >> >>>> On Tue, Apr 14, 2009 at 9:03 AM, Mithila Nagendra <
> mnage...@asu.edu>
> >> >>>> wrote:
> >> >>>>
> >> >>>> > I ve drawn a blank here! Can't figure out what s wrong with the
> >> ports.
> >> >>>> I
> >> >>>> > can
> >> >>>> > ssh between the nodes but cant access the DFS from the slaves -
> >> says
> >> >>>> "Bad
> >> >>>> > connection to DFS". Master seems to be fine.
> >> >>>> > Mithila
> >> >>>> >
> >> >>>> > On Tue, Apr 14, 2009 at 4:28 AM, Mithila Nagendra <
> >> mnage...@asu.edu>
> >> >>>> > wrote:
> >> >>>> >
> >> >>>> > > Yes I can..
> >> >>>> > >
> >> >>>> > >
> >> >>>> > > On Mon, Apr 13, 2009 at 5:12 PM, Jim Twensky <
> >> jim.twen...@gmail.com
> >> >>>> > >wrote:
> >> >>>> > >
> >> >>>> > >> Can you ssh between the nodes?
> >> >>>> > >>
> >> >>>> > >> -jim
> >> >>>> > >>
> >> >>>> > >> On Mon, Apr 13, 2009 at 6:49 PM, Mithila Nagendra <
> >> >>>> mnage...@asu.edu>
> >> >>>> > >> wrote:
> >> >>>> > >>
> >> >>>> > >> > Thanks Aaron.
> >> >>>> > >> > Jim: The three clusters I setup had ubuntu running on them
> and
> >> >>>> the dfs
> >> >>>> > >> was
> >> >>>> > >> > accessed at port 54310. The new cluster which I ve setup has
> >> Red
> >> >>>> Hat
> >> >>>> > >> Linux
> >> >>>> > >> > release 7.2 (Enigma)running on it. Now when I try to access
> >> the
> >> >>>> dfs
> >> >>>> > from
> >> >>>> > >> > one
> >> >>>> > >> > of the slaves i get the following response: dfs cannot be
> >> >>>> accessed.
> >> >>>> > When
> >> >>>> > >> I
> >> >>>> > >> > access the DFS throught the master there s no problem. So I
> >> feel
> >> >>>> there
> >> >>>> > a
> >> >>>> > >> > problem with the port. Any ideas? I did check the list of
> >> slaves,
> >> >>>> it
> >> >>>> > >> looks
> >> >>>> > >> > fine to me.
> >> >>>> > >> >
> >> >>>> > >> > Mithila
> >> >>>> > >> >
> >> >>>> > >> >
> >> >>>> > >> >
> >> >>>> > >> >
> >> >>>> > >> > On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky <
> >> >>>> jim.twen...@gmail.com>
> >> >>>> > >> > wrote:
> >> >>>> > >> >
> >> >>>> > >> > > Mithila,
> >> >>>> > >> > >
> >> >>>> > >> > > You said all the slaves were being utilized in the 3 node
> >> >>>> cluster.
> >> >>>> > >> Which
> >> >>>> > >> > > application did you run to test that and what was your
> input
> >> >>>> size?
> >> >>>> > If
> >> >>>> > >> you
> >> >>>> > >> > > tried the word count application on a 516 MB input file on
> >> both
> >> >>>> > >> cluster
> >> >>>> > >> > > setups, than some of your nodes in the 15 node cluster may
> >> not
> >> >>>> be
> >> >>>> > >> running
> >> >>>> > >> > > at
> >> >>>> > >> > > all. Generally, one map job is assigned to each input
> split
> >> and
> >> >>>> if
> >> >>>> > you
> >> >>>> > >> > are
> >> >>>> > >> > > running your cluster with the defaults, the splits are 64
> MB
> >> >>>> each. I
> >> >>>> > >> got
> >> >>>> > >> > > confused when you said the Namenode seemed to do all the
> >> work.
> >> >>>> Can
> >> >>>> > you
> >> >>>> > >> > > check
> >> >>>> > >> > > conf/slaves and make sure you put the names of all task
> >> >>>> trackers
> >> >>>> > >> there? I
> >> >>>> > >> > > also suggest comparing both clusters with a larger input
> >> size,
> >> >>>> say
> >> >>>> > at
> >> >>>> > >> > least
> >> >>>> > >> > > 5 GB, to really see a difference.
> >> >>>> > >> > >
> >> >>>> > >> > > Jim
> >> >>>> > >> > >
> >> >>>> > >> > > On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball <
> >> >>>> aa...@cloudera.com>
> >> >>>> > >> > wrote:
> >> >>>> > >> > >
> >> >>>> > >> > > > in hadoop-*-examples.jar, use "randomwriter" to generate
> >> the
> >> >>>> data
> >> >>>> > >> and
> >> >>>> > >> > > > "sort"
> >> >>>> > >> > > > to sort it.
> >> >>>> > >> > > > - Aaron
> >> >>>> > >> > > >
> >> >>>> > >> > > > On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi <
> >> >>>> > forpan...@gmail.com>
> >> >>>> > >> > > wrote:
> >> >>>> > >> > > >
> >> >>>> > >> > > > > Your data is too small I guess for 15 clusters ..So it
> >> >>>> might be
> >> >>>> > >> > > overhead
> >> >>>> > >> > > > > time of these clusters making your total MR jobs more
> >> time
> >> >>>> > >> consuming.
> >> >>>> > >> > > > > I guess you will have to try with larger set of data..
> >> >>>> > >> > > > >
> >> >>>> > >> > > > > Pankil
> >> >>>> > >> > > > > On Sun, Apr 12, 2009 at 6:54 PM, Mithila Nagendra <
> >> >>>> > >> mnage...@asu.edu>
> >> >>>> > >> > > > > wrote:
> >> >>>> > >> > > > >
> >> >>>> > >> > > > > > Aaron
> >> >>>> > >> > > > > >
> >> >>>> > >> > > > > > That could be the issue, my data is just 516MB -
> >> wouldn't
> >> >>>> this
> >> >>>> > >> see
> >> >>>> > >> > a
> >> >>>> > >> > > > bit
> >> >>>> > >> > > > > of
> >> >>>> > >> > > > > > speed up?
> >> >>>> > >> > > > > > Could you guide me to the example? I ll run my
> cluster
> >> on
> >> >>>> it
> >> >>>> > and
> >> >>>> > >> > see
> >> >>>> > >> > > > what
> >> >>>> > >> > > > > I
> >> >>>> > >> > > > > > get. Also for my program I had a java timer running
> to
> >> >>>> record
> >> >>>> > >> the
> >> >>>> > >> > > time
> >> >>>> > >> > > > > > taken
> >> >>>> > >> > > > > > to complete execution. Does Hadoop have an inbuilt
> >> timer?
> >> >>>> > >> > > > > >
> >> >>>> > >> > > > > > Mithila
> >> >>>> > >> > > > > >
> >> >>>> > >> > > > > > On Mon, Apr 13, 2009 at 1:13 AM, Aaron Kimball <
> >> >>>> > >> aa...@cloudera.com
> >> >>>> > >> > >
> >> >>>> > >> > > > > wrote:
> >> >>>> > >> > > > > >
> >> >>>> > >> > > > > > > Virtually none of the examples that ship with
> Hadoop
> >> >>>> are
> >> >>>> > >> designed
> >> >>>> > >> > > to
> >> >>>> > >> > > > > > > showcase its speed. Hadoop's speedup comes from
> its
> >> >>>> ability
> >> >>>> > to
> >> >>>> > >> > > > process
> >> >>>> > >> > > > > > very
> >> >>>> > >> > > > > > > large volumes of data (starting around, say, tens
> of
> >> GB
> >> >>>> per
> >> >>>> > >> job,
> >> >>>> > >> > > and
> >> >>>> > >> > > > > > going
> >> >>>> > >> > > > > > > up in orders of magnitude from there). So if you
> are
> >> >>>> timing
> >> >>>> > >> the
> >> >>>> > >> > pi
> >> >>>> > >> > > > > > > calculator (or something like that), its results
> >> won't
> >> >>>> > >> > necessarily
> >> >>>> > >> > > be
> >> >>>> > >> > > > > > very
> >> >>>> > >> > > > > > > consistent. If a job doesn't have enough fragments
> >> of
> >> >>>> data
> >> >>>> > to
> >> >>>> > >> > > > allocate
> >> >>>> > >> > > > > > one
> >> >>>> > >> > > > > > > per each node, some of the nodes will also just go
> >> >>>> unused.
> >> >>>> > >> > > > > > >
> >> >>>> > >> > > > > > > The best example for you to run is to use
> >> randomwriter
> >> >>>> to
> >> >>>> > fill
> >> >>>> > >> up
> >> >>>> > >> > > > your
> >> >>>> > >> > > > > > > cluster with several GB of random data and then
> run
> >> the
> >> >>>> sort
> >> >>>> > >> > > program.
> >> >>>> > >> > > > > If
> >> >>>> > >> > > > > > > that doesn't scale up performance from 3 nodes to
> >> 15,
> >> >>>> then
> >> >>>> > >> you've
> >> >>>> > >> > > > > > > definitely
> >> >>>> > >> > > > > > > got something strange going on.
> >> >>>> > >> > > > > > >
> >> >>>> > >> > > > > > > - Aaron
> >> >>>> > >> > > > > > >
> >> >>>> > >> > > > > > >
> >> >>>> > >> > > > > > > On Sun, Apr 12, 2009 at 8:39 AM, Mithila Nagendra
> <
> >> >>>> > >> > > mnage...@asu.edu>
> >> >>>> > >> > > > > > > wrote:
> >> >>>> > >> > > > > > >
> >> >>>> > >> > > > > > > > Hey all
> >> >>>> > >> > > > > > > > I recently setup a three node hadoop cluster and
> >> ran
> >> >>>> an
> >> >>>> > >> > examples
> >> >>>> > >> > > on
> >> >>>> > >> > > > > it.
> >> >>>> > >> > > > > > > It
> >> >>>> > >> > > > > > > > was pretty fast, and all the three nodes were
> >> being
> >> >>>> used
> >> >>>> > (I
> >> >>>> > >> > > checked
> >> >>>> > >> > > > > the
> >> >>>> > >> > > > > > > log
> >> >>>> > >> > > > > > > > files to make sure that the slaves are
> utilized).
> >> >>>> > >> > > > > > > >
> >> >>>> > >> > > > > > > > Now I ve setup another cluster consisting of 15
> >> >>>> nodes. I
> >> >>>> > ran
> >> >>>> > >> > the
> >> >>>> > >> > > > same
> >> >>>> > >> > > > > > > > example, but instead of speeding up, the
> >> map-reduce
> >> >>>> task
> >> >>>> > >> seems
> >> >>>> > >> > to
> >> >>>> > >> > > > > take
> >> >>>> > >> > > > > > > > forever! The slaves are not being used for some
> >> >>>> reason.
> >> >>>> > This
> >> >>>> > >> > > second
> >> >>>> > >> > > > > > > cluster
> >> >>>> > >> > > > > > > > has a lower, per node processing power, but
> should
> >> >>>> that
> >> >>>> > make
> >> >>>> > >> > any
> >> >>>> > >> > > > > > > > difference?
> >> >>>> > >> > > > > > > > How can I ensure that the data is being mapped
> to
> >> all
> >> >>>> the
> >> >>>> > >> > nodes?
> >> >>>> > >> > > > > > > Presently,
> >> >>>> > >> > > > > > > > the only node that seems to be doing all the
> work
> >> is
> >> >>>> the
> >> >>>> > >> Master
> >> >>>> > >> > > > node.
> >> >>>> > >> > > > > > > >
> >> >>>> > >> > > > > > > > Does 15 nodes in a cluster increase the network
> >> cost?
> >> >>>> What
> >> >>>> > >> can
> >> >>>> > >> > I
> >> >>>> > >> > > do
> >> >>>> > >> > > > > to
> >> >>>> > >> > > > > > > > setup
> >> >>>> > >> > > > > > > > the cluster to function more efficiently?
> >> >>>> > >> > > > > > > >
> >> >>>> > >> > > > > > > > Thanks!
> >> >>>> > >> > > > > > > > Mithila Nagendra
> >> >>>> > >> > > > > > > > Arizona State University
> >> >>>> > >> > > > > > > >
> >> >>>> > >> > > > > > >
> >> >>>> > >> > > > > >
> >> >>>> > >> > > > >
> >> >>>> > >> > > >
> >> >>>> > >> > >
> >> >>>> > >> >
> >> >>>> > >>
> >> >>>> > >
> >> >>>> > >
> >> >>>> >
> >> >>>>
> >> >>>
> >> >>>
> >> >>
> >> >
> >>
> >>
> >> Ravi
> >> --
> >>
> >>
> >
>
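
On the original question in the quoted thread (why 15 nodes were no faster
than 3), Jim's point about input splits can be checked with simple
arithmetic. The sketch below is only an estimate, assuming a single input
file and the 64 MB default split size mentioned in the thread; the function
name estimated_map_tasks is hypothetical, and Hadoop may fold a small
leftover piece into the last split, so the real count can be one lower.

```python
import math

def estimated_map_tasks(input_mb, split_mb=64):
    """Rough upper estimate: one map task per split of the input."""
    return math.ceil(input_mb / split_mb)

if __name__ == "__main__":
    nodes = 15  # cluster size from the thread
    tasks = estimated_map_tasks(516)  # 516 MB input from the thread
    print(tasks, "map tasks at most for", nodes, "nodes")
```

With about 9 map tasks for a 516 MB input, some of the 15 nodes get no map
work at all, which is why a larger input (Jim suggests at least 5 GB) is
needed to see a difference.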



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
