Re: Jobs run slower and slower
Hrmmm. I can tell init/execution at the job level, but I don't know how to figure that out at the individual map task level. What would be the best way for me to determine that?

-Sean

On Wed, Mar 4, 2009 at 12:13 PM, Runping Qi wrote:

> Do you know the breakdown of times a mapper task takes to initialize and
> to execute the map function?
>
> On Wed, Mar 4, 2009 at 8:44 AM, Sean Laurent wrote:
>
>> [...]
Re: Jobs run slower and slower
On Tue, Mar 3, 2009 at 10:14 PM, Amar Kamat wrote:

> Yeah. Maybe it's not a problem with the JobTracker. Can you check (via
> job history) what the best and the worst task runtimes are? You can
> analyze the jobs after they complete.
> Amar

Okay, I ran the same job 35 times last night. Each job was exactly identical - it parsed 1000 identical files that were already stored in HDFS via a map task (no reduce). Like all of my previous tests, each successive run took longer than the previous run.

Looking at the job history, the first run was the fastest; it took a total of 2mins 28sec (setup: 2sec, map: 2mins 22sec, cleanup: 0sec). The last run was the slowest; it took a total of 22mins 31sec (setup: 16sec, map: 22mins 14sec, cleanup: 16sec).

Memory usage on the JT/NN machine, as reported by sar, slowly increased over the 7 hour window. Memory usage on a randomly selected DN/TT also steadily increased over the 7 hour window, but far more rapidly. We also looked at I/O usage and CPU utilization on both the JT/NN machine and the same randomly selected DN/TT - nothing out of the ordinary. I/O waits (both from the I/O subsystem perspective and from the CPU's perspective) were consistently low over the 7 hour window and did not fluctuate significantly on any of the machines. CPU utilization on the JT/NN was practically non-existent and hovered between 40%-60% on the DN/TT.

-Sean
Re: Jobs run slower and slower
It's quite possible that's the problem. I'll re-run the tests overnight and collect the run times according to the JobTracker.

If I want to test the patch in HADOOP-4780, should I pull down branch-0.19 and go from there? This is not a production environment, so I'm not worried about data loss or other problems. But I could certainly wait if 0.19.2 or 0.20.0 are expected to be released soon, if that would make more sense.

-Sean

On Tue, Mar 3, 2009 at 10:14 PM, Amar Kamat wrote:

> Runping Qi wrote:
>
>> Could it be the case that the latter jobs ran slower because the tasks
>> took longer to get initialized? If so, you may have hit
>> https://issues.apache.org/jira/browse/HADOOP-4780
>>
>> Runping
>>
>> [...]
>
> Yeah. Maybe it's not a problem with the JobTracker. Can you check (via
> job history) what the best and the worst task runtimes are? You can
> analyze the jobs after they complete.
> Amar
Re: Jobs run slower and slower
Hrmmm. According to hadoop-defaults.xml, mapred.jobtracker.completeuserjobs.maximum defaults to 100. So I tried setting it to 1, but that had no effect. I still see each successive run taking longer than the previous run.

1) Restart M/R
2) Run #1: 142.12 secs
3) Run #2: 181.96 secs
4) Run #3: 221.95 secs
5) Run #4: 281.96 secs

I don't think that's the problem here... :(

-S

On Tue, Mar 3, 2009 at 2:33 PM, Runping Qi wrote:

> The JobTracker's memory increased as you ran more and more jobs because
> the JobTracker still keeps some data about those completed jobs. The
> maximum number of completed jobs kept is determined by the config
> variable mapred.jobtracker.completeuserjobs.maximum. You can lower that
> to lower the JobTracker's memory consumption.
>
> [...]
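For anyone trying the same experiment: the retained-job setting discussed above goes in hadoop-site.xml. A minimal fragment, assuming the 0.18/0.19-era configuration format (the value of 1 is the one tried in this thread; the default is 100):

```xml
<!-- Cap how many completed jobs per user the JobTracker keeps in memory.
     Lower values reduce JobTracker memory growth across many job runs. -->
<property>
  <name>mapred.jobtracker.completeuserjobs.maximum</name>
  <value>1</value>
</property>
```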
Re: Jobs run slower and slower
Interesting... from reading HADOOP-4766, I'm not entirely clear whether that problem is related to the number of jobs or the number of tasks.

- I'm only running a single job with approximately 900 map tasks, as opposed to the 500-600+ jobs and 100K tasks described in HADOOP-4766.
- I am seeing increased memory use on the JobTracker.
- I am seeing elevated memory use over time on the DataNode/TaskTracker machines.
- Amar's description in HADOOP-4766 from December 6th sounds pretty similar.

I also tried adjusting garbage collection via -XX:+UseParallelGC, but that had no noticeable impact.

It also wasn't clear to me what, if anything, I can do to fix or work around the problem.

Any advice would be greatly appreciated.

-Sean

On Mon, Mar 2, 2009 at 7:50 PM, Runping Qi wrote:

> Your problem may be related to
> https://issues.apache.org/jira/browse/HADOOP-4766
>
> Runping
>
> [...]
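For reference, GC flags like the -XX:+UseParallelGC experiment above are passed to task JVMs through mapred.child.java.opts. A hadoop-site.xml sketch combining it with the heap size used elsewhere in this thread (both values are this thread's experimental settings, not recommendations):

```xml
<!-- JVM options handed to every spawned map/reduce task JVM.
     -Xmx1532m and -XX:+UseParallelGC are the values tried in this thread. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1532m -XX:+UseParallelGC</value>
</property>
```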
Jobs run slower and slower
Hi all,

I'm conducting some initial tests with Hadoop to better understand how well it will handle and scale with some of our specific problems. As a result, I've written some M/R jobs that are representative of the work we want to do. I then run the jobs multiple times in a row (sequentially) to get a rough estimate for average run-time.

What I'm seeing is really strange... If I run the same job with the same inputs multiple times, each successive run is slower than the previous run. If I restart the cluster and re-run the tests, the first run is fast and then each successive run is slower.

For example, I just started the cluster and ran the same job 4 times. The run times for the jobs were as follows: 127 seconds, 177 seconds, 207 seconds and 218 seconds. I restarted HDFS and M/R, reran the job 3 more times and got the following run times: 138 seconds, 187 seconds and 221 seconds. :(

The map task is pretty simple - parse XML files to extract specific elements. I'm using Cascading and wrote a custom Scheme, which in turn uses a custom FileInputFormat that treats each file as an entire record (splitable = false). Each file is then treated as a separate map task with no reduce step.

In this case I have an 8 node cluster. 1 node acts as a dedicated NameNode/JobTracker and 7 nodes run the DataNode/TaskTracker. Each machine is identical: Dell 1950 with Intel quad-core 2.5, 8GB RAM and 2 250GB SATA2 drives. All 8 machines are in the same rack running on a dedicated Force10 gigabit switch.

I tried enabling JVM reuse via JobConf, which improved performance for the initial few runs... but each successive job still took longer than the previous. I also tried increasing the maximum memory via the mapred.child.java.opts property, but that didn't have any impact.

I checked the logs, but I don't see any errors.

Here's my basic list of configured properties:

fs.default.name=hdfs://dn01.hadoop.mycompany.com:9000
mapred.job.tracker=dn01.hadoop.mycompany.com:9001
dfs.replication=3
dfs.block.size=1048576
dfs.name.dir=/opt/hadoop/volume1/name,/opt/hadoop/volume2/name
dfs.data.dir=/opt/hadoop/volume1/data,/opt/hadoop/volume2/data
mapred.local.dir=/opt/hadoop/volume1/mapred,/opt/hadoop/volume2/mapred
mapred.child.java.opts=-Xmx1532m

Frankly I'm stumped. I'm sure there is something obvious that I'm missing, but I'm totally at a loss right now. Any suggestions would be ~greatly~ appreciated.

Thanks!

-Sean
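A note on the JVM reuse mentioned above: besides setting it programmatically through JobConf, it is normally controlled by the mapred.job.reuse.jvm.num.tasks property (available from Hadoop 0.19; this name and semantics are an assumption for other releases). A hadoop-site.xml sketch:

```xml
<!-- Reuse each task JVM for multiple tasks of the same job.
     Default is 1 (fresh JVM per task); -1 means unlimited reuse. -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```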
Re: Quickstart: only replicated to 0 nodes
On Thu, Nov 6, 2008 at 12:45 AM, Sean Laurent <[EMAIL PROTECTED]> wrote:

> So I'm new to Hadoop and I have been trying unsuccessfully to work
> through the Quickstart tutorial to get a single node working in
> pseudo-distributed mode. I can't seem to put data into HDFS using
> release 0.18.2 under Java 1.6.0_04-b12:
>
> $ bin/hadoop fs -put conf input
> 08/11/05 18:32:23 INFO dfs.DFSClient:
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
> /user/slaurent/input/commons-logging.properties could only be
> replicated to 0 nodes, instead of 1
> ...

So I finally discovered my problems... :)

First, I didn't have an entry in /etc/hosts for my machine name. Second (and far more important), the HDFS system was getting created in /tmp, and the partition on which /tmp resides was running out of disk space. Once I moved the HDFS to a partition with enough space, my replication problems went away.

I have to admit that it kinda seems like a bug that Hadoop never gave me ANY indication that I was out of disk space.

-Sean
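For anyone hitting the same error: moving HDFS storage out of /tmp is usually done by pointing hadoop.tmp.dir (which dfs.name.dir and dfs.data.dir default underneath) at a larger partition in hadoop-site.xml. A sketch; /opt/hadoop/tmp is an assumed example path, not one from this thread:

```xml
<!-- Base directory for Hadoop's local state, including HDFS name and
     data directories by default. Keeping it off /tmp avoids both
     small-partition exhaustion and cleanup-on-reboot surprises.
     /opt/hadoop/tmp is an example; substitute a partition with space. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/hadoop/tmp</value>
</property>
```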
Quickstart: only replicated to 0 nodes
So I'm new to Hadoop and I have been trying unsuccessfully to work through the Quickstart tutorial to get a single node working in pseudo-distributed mode. I can't seem to put data into HDFS using release 0.18.2 under Java 1.6.0_04-b12:

$ bin/hadoop fs -put conf input
08/11/05 18:32:23 INFO dfs.DFSClient:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/user/slaurent/input/commons-logging.properties could only be
replicated to 0 nodes, instead of 1
...

The dfshealth jsp page reports 1 live datanode. The strange thing is that the node is listed as "dkz216" with a url of "http://dkz216.neoplus.adsl.tpnet.pl:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=%2F"... not sure where that came from.

No errors in the log files, other than the replication error. However, I do see one other oddity in the datanode logfile:

---hadoop-user-datanode-server.log---
2008-11-05 18:32:28,317 INFO org.apache.hadoop.dfs.DataNode: dnRegistration = DatanodeRegistration(dkz216.neoplus.adsl.tpnet.pl:50010, storageID=, infoPort=50075, ipcPort=50020)
2008-11-05 18:32:28,317 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50020: starting
2008-11-05 18:32:28,443 INFO org.apache.hadoop.dfs.DataNode: New storage id DS-2140500399-83.24.29.216-50010-1225931548407 is assigned to data-node 127.0.0.1:50010
2008-11-05 18:32:28,444 INFO org.apache.hadoop.dfs.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-2140500399-83.24.29.216-50010-1225931548407, infoPort=50075, ipcPort=50020)In DataNode.run, data = FSDataset{dirpath='/tmp/hadoop-slaurent/dfs/data/current'}
---hadoop-user-datanode-server.log---

Here are my config files:

---hadoop-site.xml---
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://127.0.0.1:9000/</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>127.0.0.1:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
---hadoop-site.xml---

---masters---
127.0.0.1
---masters---

---slaves---
127.0.0.1
---slaves---

I originally started with localhost everywhere but then switched to 127.0.0.1 to see if that helped. No luck. I can't seem to copy any files to HDFS.

Any suggestions would be greatly appreciated!

-Sean