Re: Jobs run slower and slower

2009-03-04 Thread Sean Laurent
Hrmmm. I can see the init/execution breakdown at the job level, but I don't
know how to figure that out at the individual map-task level. What would be
the best way for me to determine that?

-Sean

On Wed, Mar 4, 2009 at 12:13 PM, Runping Qi  wrote:

> Do you know the breakdown of how long a mapper task takes to initialize
> versus how long it takes to execute the map function?
>
>
> On Wed, Mar 4, 2009 at 8:44 AM, Sean Laurent wrote:
>
> > On Tue, Mar 3, 2009 at 10:14 PM, Amar Kamat wrote:
> >
> > > Yeah. Maybe it's not a problem with the JobTracker. Can you check (via
> > > job history) what the best and the worst task runtimes are? You can
> > > analyze the jobs after they complete.
> >
> > Okay, I ran the same job 35 times last night. Each job was identical - it
> > parsed 1000 identical files that were already stored in HDFS, using map
> > tasks only (no reduce). Like all of my previous tests, each successive run
> > took longer than the previous one.
> >
> > Looking at the job history, the first run was the fastest; it took a total
> > of 2 min 28 sec (setup: 2 sec, map: 2 min 22 sec, cleanup: 0 sec). The last
> > run was the slowest; it took a total of 22 min 31 sec (setup: 16 sec, map:
> > 22 min 14 sec, cleanup: 16 sec).
> >
> > Memory usage on the JT/NN machine, as reported by sar, slowly increased
> > over the 7-hour window. Memory usage on a randomly selected DN/TT also
> > steadily increased over the 7-hour window, but far more rapidly. We also
> > looked at I/O usage and CPU utilization on both the JT/NN machine and the
> > same randomly selected DN/TT - nothing out of the ordinary. I/O waits
> > (measured both at the I/O subsystem level and from the CPU's perspective)
> > were consistently low over the 7-hour window and did not fluctuate
> > significantly on any of the machines. CPU utilization was practically
> > non-existent on the JT/NN and hovered between 40% and 60% on the DN/TT.
>


Re: Jobs run slower and slower

2009-03-04 Thread Sean Laurent
On Tue, Mar 3, 2009 at 10:14 PM, Amar Kamat  wrote:

> Yeah. Maybe it's not a problem with the JobTracker. Can you check (via
> job history) what the best and the worst task runtimes are? You can analyze
> the jobs after they complete.
> Amar


Okay, I ran the same job 35 times last night. Each job was identical - it
parsed 1000 identical files that were already stored in HDFS, using map
tasks only (no reduce). Like all of my previous tests, each successive run
took longer than the previous one.

Looking at the job history, the first run was the fastest; it took a total
of 2 min 28 sec (setup: 2 sec, map: 2 min 22 sec, cleanup: 0 sec). The last
run was the slowest; it took a total of 22 min 31 sec (setup: 16 sec, map:
22 min 14 sec, cleanup: 16 sec).

Memory usage on the JT/NN machine, as reported by sar, slowly increased over
the 7-hour window. Memory usage on a randomly selected DN/TT also steadily
increased over the 7-hour window, but far more rapidly. We also looked at I/O
usage and CPU utilization on both the JT/NN machine and the same randomly
selected DN/TT - nothing out of the ordinary. I/O waits (measured both at the
I/O subsystem level and from the CPU's perspective) were consistently low
over the 7-hour window and did not fluctuate significantly on any of the
machines. CPU utilization was practically non-existent on the JT/NN and
hovered between 40% and 60% on the DN/TT.

-Sean


Re: Jobs run slower and slower

2009-03-03 Thread Sean Laurent
It's quite possible that's the problem. I'll re-run the tests overnight and
collect the run times according to the JobTracker.

If I want to test the patch in HADOOP-4780, should I pull down branch-0.19
and go from there? This is not a production environment, so I'm not worried
about data loss or other problems. But I could certainly wait for 0.19.2 or
0.20.0 if either is expected to be released soon and that would make more sense.

-Sean

On Tue, Mar 3, 2009 at 10:14 PM, Amar Kamat  wrote:

> Runping Qi wrote:
>
>> Could it be the case that the later jobs ran slower because the tasks
>> took longer to get initialized?
>> If so, you may hit
>> https://issues.apache.org/jira/browse/HADOOP-4780
>>
>> Runping
>>
>> On Tue, Mar 3, 2009 at 2:02 PM, Sean Laurent wrote:
>>
>>> Hrmmm. According to hadoop-default.xml,
>>> mapred.jobtracker.completeuserjobs.maximum defaults to 100. So I tried
>>> setting it to 1, but that had no effect. I still see each successive run
>>> taking longer than the previous run.
>>>
>>> 1) Restart M/R
>>> 2) Run #1: 142.12 (secs)
>>> 3) Run #2: 181.96 (secs)
>>> 4) Run #3: 221.95 (secs)
>>> 5) Run #4: 281.96 (secs)
>>>
> Yeah. Maybe it's not a problem with the JobTracker. Can you check (via
> job history) what the best and the worst task runtimes are? You can
> analyze the jobs after they complete.
> Amar


Re: Jobs run slower and slower

2009-03-03 Thread Sean Laurent
Hrmmm. According to hadoop-default.xml,
mapred.jobtracker.completeuserjobs.maximum defaults to 100. So I tried
setting it to 1, but that had no effect. I still see each successive run
taking longer than the previous run.

1) Restart M/R
2) Run #1: 142.12 (secs)
3) Run #2: 181.96 (secs)
4) Run #3: 221.95 (secs)
5) Run #4: 281.96 (secs)

I don't think that's the problem here... :(

-S

On Tue, Mar 3, 2009 at 2:33 PM, Runping Qi  wrote:

> The JobTracker's memory increased as you ran more and more jobs because the
> JobTracker still keeps some data about those completed jobs. The maximum
> number of completed jobs kept is determined by the config variable
> mapred.jobtracker.completeuserjobs.maximum.
> You can lower that value to reduce the JobTracker's memory consumption.
>
>
> On Tue, Mar 3, 2009 at 10:01 AM, Sean Laurent wrote:
>
> > Interesting... from reading HADOOP-4766, I'm not entirely clear whether
> > that problem is related to the number of jobs or the number of tasks.
> >
> > - I'm only running a single job with approximately 900 map tasks, as
> > opposed to the 500-600+ jobs and 100K tasks described in HADOOP-4766.
> > - I am seeing increased memory use on the JobTracker.
> > - I am seeing elevated memory use over time on the DataNode/TaskTracker
> > machines.
> > - Amar's description in HADOOP-4766 from December 6th sounds pretty
> > similar.
> >
> > I also tried adjusting garbage collection via -XX:+UseParallelGC, but
> > that had no noticeable impact.
> >
> > It also wasn't clear to me what, if anything, I can do to fix or work
> > around the problem.
> >
> > Any advice would be greatly appreciated.
> >
> > -Sean
> >
> > On Mon, Mar 2, 2009 at 7:50 PM, Runping Qi wrote:
> >
> > > Your problem may be related to
> > > https://issues.apache.org/jira/browse/HADOOP-4766
> > >
> > > Runping
> > >
> > > [original post quoted in full - trimmed; it appears in full below]


Re: Jobs run slower and slower

2009-03-03 Thread Sean Laurent
Interesting... from reading HADOOP-4766, I'm not entirely clear whether that
problem is related to the number of jobs or the number of tasks.

- I'm only running a single job with approximately 900 map tasks, as opposed
to the 500-600+ jobs and 100K tasks described in HADOOP-4766.
- I am seeing increased memory use on the JobTracker.
- I am seeing elevated memory use over time on the DataNode/TaskTracker
machines.
- Amar's description in HADOOP-4766 from December 6th sounds pretty similar.

I also tried adjusting garbage collection via -XX:+UseParallelGC, but that
had no noticeable impact.

It also wasn't clear to me what, if anything, I can do to fix or work around
the problem.

Any advice would be greatly appreciated.

-Sean

On Mon, Mar 2, 2009 at 7:50 PM, Runping Qi  wrote:

> Your problem may be related to
> https://issues.apache.org/jira/browse/HADOOP-4766
>
> Runping
>
>
> On Mon, Mar 2, 2009 at 4:46 PM, Sean Laurent wrote:
>
> > [original post quoted in full - trimmed; it appears in full below]


Jobs run slower and slower

2009-03-02 Thread Sean Laurent
Hi all,
I'm conducting some initial tests with Hadoop to better understand how well
it will handle and scale with some of our specific problems. As a result,
I've written some M/R jobs that are representative of the work we want to
do. I then run the jobs multiple times in a row (sequentially) to get a
rough estimate for average run-time.

What I'm seeing is really strange... If I run the same job with the same
inputs multiple times, each successive run is slower than the previous run.
If I restart the cluster and re-run the tests, the first run is fast and
then each successive run is slower.

For example, I just started the cluster and ran the same job 4 times. The
run times for the jobs were as follows: 127 seconds, 177 seconds, 207
seconds and 218 seconds. I restarted HDFS and M/R, reran the job 3 more
times and got the following run times: 138 seconds, 187 seconds and 221
seconds. :(

The map task is pretty simple - parse XML files to extract specific
elements. I'm using Cascading and wrote a custom Scheme, which in turn uses
a custom FileInputFormat that treats each file as an entire record
(splitable = false). Each file is then processed by a separate map task,
with no reduce step.
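
For context, the non-splittable part of that input format looks roughly like
the sketch below. This is not the actual Cascading Scheme I'm using - the
class name and the NullWritable/BytesWritable key/value types are just
illustrative - but it shows the isSplitable() override and the
one-record-per-file reader I described:

---WholeFileInputFormat.java---
// Stripped-down sketch of a non-splittable, whole-file input format for the
// old (org.apache.hadoop.mapred) API. Class name and key/value types are
// illustrative only.
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

  // Never split a file, so each map task gets exactly one whole file.
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;
  }

  public RecordReader<NullWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }

  // Emits a single record whose value is the entire file contents.
  static class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
    private final FileSplit split;
    private final JobConf conf;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, JobConf conf) {
      this.split = split;
      this.conf = conf;
    }

    public boolean next(NullWritable key, BytesWritable value) throws IOException {
      if (processed) {
        return false;                       // only one record per file
      }
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(conf);
      byte[] contents = new byte[(int) split.getLength()];
      FSDataInputStream in = fs.open(file);
      try {
        in.readFully(0, contents);          // read the whole file into memory
      } finally {
        in.close();
      }
      value.set(contents, 0, contents.length);
      processed = true;
      return true;
    }

    public NullWritable createKey() { return NullWritable.get(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return processed ? split.getLength() : 0; }
    public float getProgress() { return processed ? 1.0f : 0.0f; }
    public void close() { }
  }
}
---WholeFileInputFormat.java---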

In this case I have an 8-node cluster. One node acts as a dedicated
NameNode/JobTracker and 7 nodes run the DataNode/TaskTracker daemons. Each
machine is identical: a Dell 1950 with a quad-core 2.5 GHz Intel CPU, 8 GB of
RAM and two 250 GB SATA2 drives. All 8 machines are in the same rack, running
on a dedicated Force10 gigabit switch.

I tried enabling JVM reuse via JobConf, which improved performance for the
initial few runs... but each successive job still took longer than the
previous one. I also tried increasing the maximum memory via the
mapred.child.java.opts property, but that didn't have any impact.
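
Concretely, the two per-job settings I mean look roughly like this (a minimal
sketch; the property name for JVM reuse assumes 0.19, and the reuse count of
-1, meaning unlimited reuse, is just an illustrative value):

---JvmReuseSketch.java---
// Sketch of the two per-job knobs mentioned above. The -1 reuse count
// (unlimited reuse) is an illustrative choice, not necessarily the exact
// value used in the runs described in this thread.
import org.apache.hadoop.mapred.JobConf;

public class JvmReuseSketch {
  public static JobConf tune(JobConf conf) {
    // Let each TaskTracker child JVM run multiple tasks instead of
    // forking a fresh JVM per task.
    conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
    // Raise the heap available to each task JVM.
    conf.set("mapred.child.java.opts", "-Xmx1532m");
    return conf;
  }
}
---JvmReuseSketch.java---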

I checked the logs, but I don't see any errors.

Here's my basic list of configured properties:

fs.default.name=hdfs://dn01.hadoop.mycompany.com:9000
mapred.job.tracker=dn01.hadoop.mycompany.com:9001
dfs.replication=3
dfs.block.size=1048576
dfs.name.dir=/opt/hadoop/volume1/name,/opt/hadoop/volume2/name
dfs.data.dir=/opt/hadoop/volume1/data,/opt/hadoop/volume2/data
mapred.local.dir=/opt/hadoop/volume1/mapred,/opt/hadoop/volume2/mapred
mapred.child.java.opts=-Xmx1532m

Frankly I'm stumped. I'm sure there is something obvious that I'm missing,
but I'm totally at a loss right now. Any suggestions would be ~greatly~
appreciated.

Thanks!

-Sean


Re: Quickstart: only replicated to 0 nodes

2008-11-14 Thread Sean Laurent
On Thu, Nov 6, 2008 at 12:45 AM, Sean Laurent <[EMAIL PROTECTED]> wrote:
>
> So I'm new to Hadoop and I have been trying unsuccessfully to work
> through the Quickstart tutorial to get a single node working in
> pseudo-distributed mode. I can't seem to put data into HDFS using
> release 0.18.2 under Java 1.6.0_04-b12:
>
> $ bin/hadoop fs -put conf input
> 08/11/05 18:32:23 INFO dfs.DFSClient:
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
> /user/slaurent/input/commons-logging.properties could only be
> replicated to 0 nodes, instead of 1
> ...

So I finally discovered my problems... :)

First, I didn't have an entry in /etc/hosts for my machine name.

Second (and far more important), the HDFS data was getting created under
/tmp, and the partition holding /tmp was running out of disk space. Once I
moved HDFS to a partition with enough space, my replication problems went
away.
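
For anyone who hits the same thing, a quick way to check is to dump where the
effective configuration actually points HDFS storage - a minimal sketch
(dfs.name.dir and dfs.data.dir fall back to directories under hadoop.tmp.dir,
which defaults to /tmp/hadoop-${user.name}):

---WhereIsMyDfs.java---
// Sketch: print where the current config will actually put HDFS storage.
import org.apache.hadoop.conf.Configuration;

public class WhereIsMyDfs {
  public static void main(String[] args) {
    Configuration conf = new Configuration();  // loads hadoop-site.xml from the classpath
    System.out.println("hadoop.tmp.dir = " + conf.get("hadoop.tmp.dir"));
    System.out.println("dfs.name.dir   = " + conf.get("dfs.name.dir"));
    System.out.println("dfs.data.dir   = " + conf.get("dfs.data.dir"));
  }
}
---WhereIsMyDfs.java---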

I have to admit that it kinda seems like a bug that Hadoop never gave
me ANY indication that I was out of disk space.

-Sean


Quickstart: only replicated to 0 nodes

2008-11-05 Thread Sean Laurent
So I'm new to Hadoop and I have been trying unsuccessfully to work
through the Quickstart tutorial to get a single node working in
pseudo-distributed mode. I can't seem to put data into HDFS using
release 0.18.2 under Java 1.6.0_04-b12:

$ bin/hadoop fs -put conf input
08/11/05 18:32:23 INFO dfs.DFSClient:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/user/slaurent/input/commons-logging.properties could only be
replicated to 0 nodes, instead of 1
...
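
(For what it's worth, that CLI put should be roughly equivalent to the
following through the Java FileSystem API - a minimal sketch, assuming the
same hadoop-site.xml shown below is on the classpath:)

---PutViaApi.java---
// Sketch: the Java-API equivalent of "bin/hadoop fs -put conf input".
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutViaApi {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);  // the filesystem named by fs.default.name
    fs.copyFromLocalFile(new Path("conf"), new Path("input"));  // local "conf" -> HDFS "input"
    fs.close();
  }
}
---PutViaApi.java---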

The dfshealth JSP page reports 1 live datanode. The strange thing is
that the node is listed as "dkz216" with a URL of
"http://dkz216.neoplus.adsl.tpnet.pl:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=%2F" -
not sure where that came from.

No errors in the log files, other than the replication error. However,
I do see one other oddity in the datanode logfile:

---hadoop-user-datanode-server.log---
2008-11-05 18:32:28,317 INFO org.apache.hadoop.dfs.DataNode:
dnRegistration =
DatanodeRegistration(dkz216.neoplus.adsl.tpnet.pl:50010, storageID=,
infoPort=50075, ipcPort=50020)
2008-11-05 18:32:28,317 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 50020: starting
2008-11-05 18:32:28,443 INFO org.apache.hadoop.dfs.DataNode: New
storage id DS-2140500399-83.24.29.216-50010-1225931548407 is assigned
to data-node 127.0.0.1:50010
2008-11-05 18:32:28,444 INFO org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(127.0.0.1:50010,
storageID=DS-2140500399-83.24.29.216-50010-1225931548407,
infoPort=50075, ipcPort=50020)In DataNode.run, data =
FSDataset{dirpath='/tmp/hadoop-slaurent/dfs/data/current'}
---hadoop-user-datanode-server.log---


Here are my config files:

---hadoop-site.xml---
<?xml version="1.0"?>

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://127.0.0.1:9000/</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>127.0.0.1:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
---hadoop-site.xml---

---masters
127.0.0.1
---masters

---slaves---
127.0.0.1
---slaves---

I originally started with localhost everywhere but then switched to
127.0.0.1 to see if that helped. No luck. I can't seem to copy any
files to HDFS.

Any suggestions would be greatly appreciated!

-Sean