Re: desperate question about NameNode startup sequence
The namenode eventually came up. Here's a summary of the logging:

2011-12-17 01:37:35,648 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 16978046
2011-12-17 01:43:24,023 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 1
2011-12-17 01:43:24,025 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 2589456651 loaded in 348 seconds.
2011-12-17 01:43:24,030 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /hadoop/hadoop-metadata/cache/dfs/name/current/edits of size 3885 edits # 43 loaded in 0 seconds.
2011-12-17 03:06:26,731 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Invalid opcode, reached end of edit log Number of transactions found 306757368
2011-12-17 03:06:26,732 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /hadoop/hadoop-metadata/cache/dfs/name/current/edits.new of size 41011966085 edits # 306757368 loaded in 4982 seconds.
2011-12-17 03:06:47,264 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 1724849462 saved in 19 seconds.
2011-12-17 03:07:09,051 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 5373458 msecs

It took a long time to load edits.new, consistent with the speed at which it loaded fsimage. So at 3:12 or so, the namenode came back up. At that time, the fsimage got updated on our secondary namenode:

[vmc@prod1-secondary ~]$ ls -l /hadoop/hadoop-metadata/cache/dfs/namesecondary/current/
total 1686124
-rw-r--r-- 1 hadoop hadoop      38117 Dec 17 03:12 edits
-rw-r--r-- 1 hadoop hadoop 1724849462 Dec 17 03:12 fsimage

But that's it. No updates since then. Is that normal 2NN behavior? I don't think we've tuned away from the defaults for fsimage and edits maintenance. How should I diagnose? Similarly, the primary namenode seems to continue to log changes to edits.new:

-bash-3.2$ ls -l /hadoop/hadoop-metadata/cache/dfs/name/current/
total 1728608
-rw-r--r-- 1 hadoop hadoop      38117 Dec 17 03:12 edits
-rw-r--r-- 1 hadoop hadoop   44064633 Dec 17 11:41 edits.new
-rw-r--r-- 1 hadoop hadoop 1724849462 Dec 17 03:06 fsimage
-rw-r--r-- 1 hadoop hadoop          8 Dec 17 03:07 fstime
-rw-r--r-- 1 hadoop hadoop        101 Dec 17 03:07 VERSION

Is this normal? Have I been misunderstanding normal NN operation?

On Sat, Dec 17, 2011 at 3:01 AM, Meng Mao meng...@gmail.com wrote: Maybe this is a bad sign -- the edits.new was created before the master node crashed, and is huge:

-bash-3.2$ ls -lh /hadoop/hadoop-metadata/cache/dfs/name/current
total 41G
-rw-r--r-- 1 hadoop hadoop 3.8K Jan 27 2011 edits
-rw-r--r-- 1 hadoop hadoop  39G Dec 17 00:44 edits.new
-rw-r--r-- 1 hadoop hadoop 2.5G Jan 27 2011 fsimage
-rw-r--r-- 1 hadoop hadoop    8 Jan 27 2011 fstime
-rw-r--r-- 1 hadoop hadoop  101 Jan 27 2011 VERSION

Could this mean something was up with our SecondaryNameNode and rolling the edits file?

On Sat, Dec 17, 2011 at 2:53 AM, Meng Mao meng...@gmail.com wrote: All of the worker nodes' datanode logs haven't logged anything after the initial startup announcement:

STARTUP_MSG: host = prod1-worker075/10.2.19.75
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.1+169.56
STARTUP_MSG: build = -r 8e662cb065be1c4bc61c55e6bff161e09c1d36f3; compiled by 'root' on Tue Feb 9 13:40:08 EST 2010

On Sat, Dec 17, 2011 at 2:00 AM, Meng Mao meng...@gmail.com wrote: Our CDH2 production grid just crashed with some sort of master node failure. When I went in there, JobTracker was missing and NameNode was up. Trying to ls on HDFS met with no connection.
We decided to go for a restart. This is in the namenode log right now: 2011-12-17 01:37:35,568 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2011-12-17 01:37:35,612 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop 2011-12-17 01:37:35,613 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup 2011-12-17 01:37:35,613 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true 2011-12-17 01:37:35,620 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext 2011-12-17 01:37:35,621 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean 2011-12-17 01:37:35,648 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 16978046 2011-12-17 01:43:24,023 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 1 2011-12-17 01:43:24,025 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 2589456651 loaded in 348 seconds. 2011-12-17 01:43:24,030 INFO
desperate question about NameNode startup sequence
Our CDH2 production grid just crashed with some sort of master node failure. When I went in there, JobTracker was missing and NameNode was up. Trying to ls on HDFS met with no connection. We decided to go for a restart. This is in the namenode log right now:

2011-12-17 01:37:35,568 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
2011-12-17 01:37:35,612 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
2011-12-17 01:37:35,613 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2011-12-17 01:37:35,613 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2011-12-17 01:37:35,620 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
2011-12-17 01:37:35,621 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean
2011-12-17 01:37:35,648 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 16978046
2011-12-17 01:43:24,023 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 1
2011-12-17 01:43:24,025 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 2589456651 loaded in 348 seconds.
2011-12-17 01:43:24,030 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /hadoop/hadoop-metadata/cache/dfs/name/current/edits of size 3885 edits # 43 loaded in 0 seconds.

What's coming up in the startup sequence? We have a ton of data on there. Is there any way to estimate startup time?
Re: desperate question about NameNode startup sequence
All of the worker nodes datanodes' logs haven't logged anything after the initial startup announcement: STARTUP_MSG: host = prod1-worker075/10.2.19.75 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.1+169.56 STARTUP_MSG: build = -r 8e662cb065be1c4bc61c55e6bff161e09c1d36f3; compiled by 'root' on Tue Feb 9 13:40:08 EST 2010 / On Sat, Dec 17, 2011 at 2:00 AM, Meng Mao meng...@gmail.com wrote: Our CDH2 production grid just crashed with some sort of master node failure. When I went in there, JobTracker was missing and NameNode was up. Trying to ls on HDFS met with no connection. We decided to go for a restart. This is in the namenode log right now: 2011-12-17 01:37:35,568 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2011-12-17 01:37:35,612 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop 2011-12-17 01:37:35,613 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup 2011-12-17 01:37:35,613 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true 2011-12-17 01:37:35,620 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext 2011-12-17 01:37:35,621 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean 2011-12-17 01:37:35,648 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 16978046 2011-12-17 01:43:24,023 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 1 2011-12-17 01:43:24,025 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 2589456651 loaded in 348 seconds. 2011-12-17 01:43:24,030 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /hadoop/hadoop-metadata/cache/dfs/name/current/edits of size 3885 edits # 43 loaded in 0 seconds. What's coming up in the startup sequence? We have a ton of data on there. Is there any way to estimate startup time?
Re: Do failed task attempts stick around the jobcache on local disk?
Am I being completely silly asking about this? Does anyone know? On Wed, Nov 2, 2011 at 6:27 PM, Meng Mao meng...@gmail.com wrote: Is there any mechanism in place to remove failed task attempt directories from the TaskTracker's jobcache? It seems like for us, the only way to get rid of them is manually.
Re: apache-solr-3.3.0, corrupt indexes, and speculative execution
Today we again had some leftover attempt dirs -- 2 out of 5 reduce tasks with a redundant speculative attempt left around their extra attempt: part-5: conf data solr_attempt_201101270134_42328_r_05_1.1 part-6: conf data solr_attempt_201101270134_42328_r_06_1.1 Is it possible that the disk being really full could cause transient filesystem glitches that aren't being thrown as exceptions? On Sat, Oct 29, 2011 at 2:23 PM, Meng Mao meng...@gmail.com wrote: We've been getting up to speed on SOLR, and one of the recent problems we've run into is with successful jobs delivering corrupt index shards. This is a look at a 12-sharded index we built and then copied to local disk off of HDFS. $ ls -l part-1 total 16 drwxr-xr-x 2 vmc visible 4096 Oct 29 09:39 conf drwxr-xr-x 4 vmc visible 4096 Oct 29 09:42 data $ ls -l part-4/ total 16 drwxr-xr-x 4 vmc visible 4096 Oct 29 09:54 data drwxr-xr-x 2 vmc visible 4096 Oct 29 09:54 solr_attempt_201101270134_42143_r_04_1.1 Right away, you can see that there's apparently some lack of cleanup or incompleteness. Shard 04 is missing a conf directory, and has an empty attempt directory lying around. This is what a complete shard listing looks like: $ ls -l part-1/*/* -rw-r--r-- 1 vmc visible 33402 Oct 29 09:39 part-1/conf/schema.xml part-1/data/index: total 13088776 -rw-r--r-- 1 vmc visible 6036701453 Oct 29 09:40 _1m3.fdt *-rw-r--r-- 1 vmc visible 246345692 Oct 29 09:40 _1m3.fdx #missing from shard 04* -rw-r--r-- 1 vmc visible211 Oct 29 09:40 _1m3.fnm -rw-r--r-- 1 vmc visible 3516034769 Oct 29 09:41 _1m3.frq -rw-r--r-- 1 vmc visible 92379637 Oct 29 09:41 _1m3.nrm -rw-r--r-- 1 vmc visible 695935796 Oct 29 09:41 _1m3.prx -rw-r--r-- 1 vmc visible 28548963 Oct 29 09:41 _1m3.tii -rw-r--r-- 1 vmc visible 2773769958 Oct 29 09:42 _1m3.tis -rw-r--r-- 1 vmc visible284 Oct 29 09:42 segments_2 -rw-r--r-- 1 vmc visible 20 Oct 29 09:42 segments.gen part-1/data/spellchecker: total 16 -rw-r--r-- 1 vmc visible 32 Oct 29 09:42 segments_1 -rw-r--r-- 1 vmc visible 20 Oct 29 09:42 segments.gen And shard 04: $ ls -l part-4/*/* #missing conf/ part-4/data/index: total 12818420 -rw-r--r-- 1 vmc visible 6036000614 Oct 29 09:52 _1m1.fdt -rw-r--r-- 1 vmc visible211 Oct 29 09:52 _1m1.fnm -rw-r--r-- 1 vmc visible 3515333900 Oct 29 09:53 _1m1.frq -rw-r--r-- 1 vmc visible 92361544 Oct 29 09:53 _1m1.nrm -rw-r--r-- 1 vmc visible 696258210 Oct 29 09:54 _1m1.prx -rw-r--r-- 1 vmc visible 28552866 Oct 29 09:54 _1m1.tii -rw-r--r-- 1 vmc visible 2744647680 Oct 29 09:54 _1m1.tis -rw-r--r-- 1 vmc visible283 Oct 29 09:54 segments_2 -rw-r--r-- 1 vmc visible 20 Oct 29 09:54 segments.gen part-4/data/spellchecker: total 16 -rw-r--r-- 1 vmc visible 32 Oct 29 09:54 segments_1 -rw-r--r-- 1 vmc visible 20 Oct 29 09:54 segments.gen What might cause that attempt path to be lying around at the time of completion? Has anyone seen anything like this? My gut says if we were able to disable speculative execution, we would probably see this go away. But that might be overreacting. In this job, of the 12 reduce tasks, 5 had an extra speculative attempt. Of those 5, 2 were cases where the speculative attempt won out over the first attempt. And one of them had this output inconsistency. 
Here is an excerpt from the task log for shard 04: 2011-10-29 07:08:33,152 INFO org.apache.solr.hadoop.SolrRecordWriter: SolrHome: /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip 2011-10-29 07:08:33,889 INFO org.apache.solr.hadoop.SolrRecordWriter: Constructed instance information solr.home /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip (/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip), instance dir /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip/, conf dir /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip/conf/, writing index to temporary directory /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/jobcache/job_201101270134_42143/work/solr_attempt_201101270134_42143_r_04_1.1/data, with permdir /PROD/output/solr/solr-20111029063514-12/part-4 2011-10-29 07:08:35,868 INFO org.apache.solr.schema.IndexSchema: Reading Solr Schema ... much later ... 2011-10-29 07:08:37,263 WARN org.apache.solr.core.SolrCore: [core1] Solr index directory '/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/jobcache/job_201101270134_42143/work/solr_attempt_201101270134_42143_r_04_1.1/data/index' doesn't exist. Creating new index... The log doesn't look much
apache-solr-3.3.0, corrupt indexes, and speculative execution
We've been getting up to speed on SOLR, and one of the recent problems we've run into is with successful jobs delivering corrupt index shards. This is a look at a 12-sharded index we built and then copied to local disk off of HDFS.

$ ls -l part-1
total 16
drwxr-xr-x 2 vmc visible 4096 Oct 29 09:39 conf
drwxr-xr-x 4 vmc visible 4096 Oct 29 09:42 data

$ ls -l part-4/
total 16
drwxr-xr-x 4 vmc visible 4096 Oct 29 09:54 data
drwxr-xr-x 2 vmc visible 4096 Oct 29 09:54 solr_attempt_201101270134_42143_r_04_1.1

Right away, you can see that there's apparently some lack of cleanup or incompleteness. Shard 04 is missing a conf directory, and has an empty attempt directory lying around. This is what a complete shard listing looks like:

$ ls -l part-1/*/*
-rw-r--r-- 1 vmc visible      33402 Oct 29 09:39 part-1/conf/schema.xml

part-1/data/index:
total 13088776
-rw-r--r-- 1 vmc visible 6036701453 Oct 29 09:40 _1m3.fdt
*-rw-r--r-- 1 vmc visible  246345692 Oct 29 09:40 _1m3.fdx   #missing from shard 04*
-rw-r--r-- 1 vmc visible        211 Oct 29 09:40 _1m3.fnm
-rw-r--r-- 1 vmc visible 3516034769 Oct 29 09:41 _1m3.frq
-rw-r--r-- 1 vmc visible   92379637 Oct 29 09:41 _1m3.nrm
-rw-r--r-- 1 vmc visible  695935796 Oct 29 09:41 _1m3.prx
-rw-r--r-- 1 vmc visible   28548963 Oct 29 09:41 _1m3.tii
-rw-r--r-- 1 vmc visible 2773769958 Oct 29 09:42 _1m3.tis
-rw-r--r-- 1 vmc visible        284 Oct 29 09:42 segments_2
-rw-r--r-- 1 vmc visible         20 Oct 29 09:42 segments.gen

part-1/data/spellchecker:
total 16
-rw-r--r-- 1 vmc visible 32 Oct 29 09:42 segments_1
-rw-r--r-- 1 vmc visible 20 Oct 29 09:42 segments.gen

And shard 04:

$ ls -l part-4/*/*   #missing conf/

part-4/data/index:
total 12818420
-rw-r--r-- 1 vmc visible 6036000614 Oct 29 09:52 _1m1.fdt
-rw-r--r-- 1 vmc visible        211 Oct 29 09:52 _1m1.fnm
-rw-r--r-- 1 vmc visible 3515333900 Oct 29 09:53 _1m1.frq
-rw-r--r-- 1 vmc visible   92361544 Oct 29 09:53 _1m1.nrm
-rw-r--r-- 1 vmc visible  696258210 Oct 29 09:54 _1m1.prx
-rw-r--r-- 1 vmc visible   28552866 Oct 29 09:54 _1m1.tii
-rw-r--r-- 1 vmc visible 2744647680 Oct 29 09:54 _1m1.tis
-rw-r--r-- 1 vmc visible        283 Oct 29 09:54 segments_2
-rw-r--r-- 1 vmc visible         20 Oct 29 09:54 segments.gen

part-4/data/spellchecker:
total 16
-rw-r--r-- 1 vmc visible 32 Oct 29 09:54 segments_1
-rw-r--r-- 1 vmc visible 20 Oct 29 09:54 segments.gen

What might cause that attempt path to be lying around at the time of completion? Has anyone seen anything like this? My gut says if we were able to disable speculative execution, we would probably see this go away. But that might be overreacting. In this job, of the 12 reduce tasks, 5 had an extra speculative attempt. Of those 5, 2 were cases where the speculative attempt won out over the first attempt. And one of them had this output inconsistency.
Here is an excerpt from the task log for shard 04: 2011-10-29 07:08:33,152 INFO org.apache.solr.hadoop.SolrRecordWriter: SolrHome: /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip 2011-10-29 07:08:33,889 INFO org.apache.solr.hadoop.SolrRecordWriter: Constructed instance information solr.home /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip (/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip), instance dir /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip/, conf dir /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip/conf/, writing index to temporary directory /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/jobcache/job_201101270134_42143/work/solr_attempt_201101270134_42143_r_04_1.1/data, with permdir /PROD/output/solr/solr-20111029063514-12/part-4 2011-10-29 07:08:35,868 INFO org.apache.solr.schema.IndexSchema: Reading Solr Schema ... much later ... 2011-10-29 07:08:37,263 WARN org.apache.solr.core.SolrCore: [core1] Solr index directory '/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/jobcache/job_201101270134_42143/work/solr_attempt_201101270134_42143_r_04_1.1/data/index' doesn't exist. Creating new index... The log doesn't look much different from the successful shard 3 (where the later attempt beat the earlier attempt). I have the full logs if anyone has diagnostic advice.
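If you do want to rule speculative execution out, it can be switched off for just this indexing job from the driver rather than cluster-wide. A minimal sketch against the 0.20-era JobConf API -- the class name, job name, and paths here are placeholders, not taken from the actual job:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SolrIndexDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SolrIndexDriver.class);
    conf.setJobName("solr-index-build"); // placeholder name

    // Only one attempt per task will ever be scheduled; equivalent to setting
    // mapred.map.tasks.speculative.execution and
    // mapred.reduce.tasks.speculative.execution to false for this job.
    conf.setMapSpeculativeExecution(false);
    conf.setReduceSpeculativeExecution(false);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

The trade-off is losing the straggler protection speculation provides, but it would at least tell you whether the leftover attempt dirs track the speculative attempts.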
Re: ways to expand hadoop.tmp.dir capacity?
So the only way we can expand to multiple mapred.local.dir paths is to configure our site.xml and to restart the DataNode?

On Mon, Oct 10, 2011 at 9:36 AM, Marcos Luis Ortiz Valmaseda marcosluis2...@googlemail.com wrote: 2011/10/9 Harsh J ha...@cloudera.com Hello Meng,

On Wed, Oct 5, 2011 at 11:02 AM, Meng Mao meng...@gmail.com wrote: Currently, we've got defined:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/hadoop/hadoop-metadata/cache/</value>
</property>

In our experiments with SOLR, the intermediate files are so large that they tend to blow out disk space and fail (and annoyingly leave behind their huge failed attempts). We've had issues with it in the past, but we're having real problems with SOLR if we can't comfortably get more space out of hadoop.tmp.dir somehow.

1) It seems we never set *mapred.system.dir* to anything special, so it's defaulting to ${hadoop.tmp.dir}/mapred/system. Is this a problem? The docs seem to recommend against it when hadoop.tmp.dir had ${user.name} in it, which ours doesn't.

The {mapred.system.dir} is an HDFS location, and you shouldn't really be worried about it as much.

1b) The doc says mapred.system.dir is the in-HDFS path to shared MapReduce system files. To me, that means there must be 1 single path for mapred.system.dir, which sort of forces hadoop.tmp.dir to be 1 path. Otherwise, one might imagine that you could specify multiple paths to store hadoop.tmp.dir, like you can for dfs.data.dir. Is this a correct interpretation? -- hadoop.tmp.dir could live on multiple paths/disks if there were more mapping/lookup between mapred.system.dir and hadoop.tmp.dir?

{hadoop.tmp.dir} is indeed reused for {mapred.system.dir}, although it is on HDFS, and hence is confusing, but there should just be one mapred.system.dir, yes. Also, the config {hadoop.tmp.dir} doesn't support >1 path. What you need here is a proper {mapred.local.dir} configuration.

2) IIRC, there's a -D switch for supplying config name/value pairs into individual jobs. Does such a switch exist? Googling for single letters is fruitless. If we had a path on our workers with more space (in our case, another hard disk), could we simply pass that path in as hadoop.tmp.dir for our SOLR jobs? Without incurring any consistency issues on future jobs that might use the SOLR output on HDFS?

Only a few parameters of a job are user-configurable. Stuff like hadoop.tmp.dir and mapred.local.dir are not override-able by user set parameters as they are server side configurations (static).

Given that the default value is ${hadoop.tmp.dir}/mapred/local, would the expanded capacity we're looking for be as easily accomplished as by defining mapred.local.dir to span multiple disks? Setting aside the issue of temp files so big that they could still fill a whole disk.

1. You can set mapred.local.dir independent of hadoop.tmp.dir
2. mapred.local.dir can have comma separated values in it, spanning multiple disks
3. Intermediate outputs may spread across these disks but shall not consume >1 disk at a time. So if your largest configured disk is 500 GB while the total set of them may be 2 TB, then your intermediate output size can't really exceed 500 GB, because only one disk is consumed by one task -- the multiple disks are for better I/O parallelism between tasks.

Know that hadoop.tmp.dir is a convenience property, for quickly starting up dev clusters and such.
For a proper configuration, you need to remove dependency on it (almost nothing uses hadoop.tmp.dir on the server side, once the right properties are configured - ex: dfs.data.dir, dfs.name.dir, fs.checkpoint.dir, mapred.local.dir, etc.)

-- Harsh J

Here is an excellent explanation of how to install Apache Hadoop manually, and Lars explains it very well: http://blog.lars-francke.de/2011/01/26/setting-up-a-hadoop-cluster-part-1-manual-installation/

Regards
-- Marcos Luis Ortíz Valmaseda
Linux Infrastructure Engineer
Linux User # 418229
http://marcosluis2186.posterous.com
http://www.linkedin.com/in/marcosluis2186
Twitter: @marcosluis2186
ways to expand hadoop.tmp.dir capacity?
Currently, we've got defined:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/hadoop/hadoop-metadata/cache/</value>
</property>

In our experiments with SOLR, the intermediate files are so large that they tend to blow out disk space and fail (and annoyingly leave behind their huge failed attempts). We've had issues with it in the past, but we're having real problems with SOLR if we can't comfortably get more space out of hadoop.tmp.dir somehow.

1) It seems we never set *mapred.system.dir* to anything special, so it's defaulting to ${hadoop.tmp.dir}/mapred/system. Is this a problem? The docs seem to recommend against it when hadoop.tmp.dir had ${user.name} in it, which ours doesn't.

1b) The doc says mapred.system.dir is the in-HDFS path to shared MapReduce system files. To me, that means there must be 1 single path for mapred.system.dir, which sort of forces hadoop.tmp.dir to be 1 path. Otherwise, one might imagine that you could specify multiple paths to store hadoop.tmp.dir, like you can for dfs.data.dir. Is this a correct interpretation? -- hadoop.tmp.dir could live on multiple paths/disks if there were more mapping/lookup between mapred.system.dir and hadoop.tmp.dir?

2) IIRC, there's a -D switch for supplying config name/value pairs into individual jobs. Does such a switch exist? Googling for single letters is fruitless. If we had a path on our workers with more space (in our case, another hard disk), could we simply pass that path in as hadoop.tmp.dir for our SOLR jobs? Without incurring any consistency issues on future jobs that might use the SOLR output on HDFS?
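On question 2: the -D switch is the generic options mechanism, but it only takes effect if the job's driver runs through ToolRunner/GenericOptionsParser, and it can only set job-level properties, not daemon-side ones. A rough sketch of a driver wired up that way; the class name MyJob and the job name are placeholders, not anything from your setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // Anything passed as -D key=value on the command line has already been
    // parsed into getConf() by ToolRunner before run() is called.
    JobConf conf = new JobConf(getConf(), MyJob.class);
    conf.setJobName("my-job"); // placeholder
    // ... set mapper/reducer, input and output paths from args ...
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
  }
}

Invoked as something like: hadoop jar myjob.jar MyJob -D mapred.child.java.opts=-Xmx1g <in> <out>. As noted elsewhere in the thread, though, hadoop.tmp.dir and mapred.local.dir are read by the daemons from their own configs, so they can't be overridden per job this way.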
Re: ways to expand hadoop.tmp.dir capacity?
I just read this: MapReduce performance can also be improved by distributing the temporary data generated by MapReduce tasks across multiple disks on each machine:

<property>
  <name>mapred.local.dir</name>
  <value>/d1/mapred/local,/d2/mapred/local,/d3/mapred/local,/d4/mapred/local</value>
  <final>true</final>
</property>

Given that the default value is ${hadoop.tmp.dir}/mapred/local, would the expanded capacity we're looking for be as easily accomplished as by defining mapred.local.dir to span multiple disks? Setting aside the issue of temp files so big that they could still fill a whole disk.

On Wed, Oct 5, 2011 at 1:32 AM, Meng Mao meng...@gmail.com wrote: Currently, we've got defined:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/hadoop/hadoop-metadata/cache/</value>
</property>

In our experiments with SOLR, the intermediate files are so large that they tend to blow out disk space and fail (and annoyingly leave behind their huge failed attempts). We've had issues with it in the past, but we're having real problems with SOLR if we can't comfortably get more space out of hadoop.tmp.dir somehow. 1) It seems we never set *mapred.system.dir* to anything special, so it's defaulting to ${hadoop.tmp.dir}/mapred/system. Is this a problem? The docs seem to recommend against it when hadoop.tmp.dir had ${user.name} in it, which ours doesn't. 1b) The doc says mapred.system.dir is the in-HDFS path to shared MapReduce system files. To me, that means there must be 1 single path for mapred.system.dir, which sort of forces hadoop.tmp.dir to be 1 path. Otherwise, one might imagine that you could specify multiple paths to store hadoop.tmp.dir, like you can for dfs.data.dir. Is this a correct interpretation? -- hadoop.tmp.dir could live on multiple paths/disks if there were more mapping/lookup between mapred.system.dir and hadoop.tmp.dir? 2) IIRC, there's a -D switch for supplying config name/value pairs into individual jobs. Does such a switch exist? Googling for single letters is fruitless. If we had a path on our workers with more space (in our case, another hard disk), could we simply pass that path in as hadoop.tmp.dir for our SOLR jobs? Without incurring any consistency issues on future jobs that might use the SOLR output on HDFS?
Re: operation of DistributedCache following manual deletion of cached files?
Who is in charge of getting the files there for the first time? The addCacheFile call in the mapreduce job? Or a manual setup by the user/operator? On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans ev...@yahoo-inc.com wrote: The problem is the step 4 in the breaking sequence. Currently the TaskTracker never looks at the disk to know if a file is in the distributed cache or not. It assumes that if it downloaded the file and did not delete that file itself then the file is still there in its original form. It does not know that you deleted those files, or if wrote to the files, or in any way altered those files. In general you should not be modifying those files. This is not only because it messes up the tracking of those files, but because other jobs running concurrently with your task may also be using those files. --Bobby Evans On 9/26/11 4:40 PM, Meng Mao meng...@gmail.com wrote: Let's frame the issue in another way. I'll describe a sequence of Hadoop operations that I think should work, and then I'll get into what we did and how it failed. Normal sequence: 1. have files to be cached in HDFS 2. Run Job A, which specifies those files to be put into DistributedCache space 3. job runs fine 4. Run Job A some time later. job runs fine again. Breaking sequence: 1. have files to be cached in HDFS 2. Run Job A, which specifies those files to be put into DistributedCache space 3. job runs fine 4. Manually delete cached files out of local disk on worker nodes 5. Run Job A again, expect it to push out cache copies as needed. 6. job fails because the cache copies didn't get distributed Should this second sequence have broken? On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao meng...@gmail.com wrote: Hmm, I must have really missed an important piece somewhere. This is from the MapRed tutorial text: DistributedCache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications. Applications specify the files to be cached via urls (hdfs://) in the JobConf. The DistributedCache* assumes that the files specified via hdfs:// urls are already present on the FileSystem.* *The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node*. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves. After some close reading, the two bolded pieces seem to be in contradiction of each other? I'd always that addCacheFile() would perform the 2nd bolded statement. If that sentence is true, then I still don't have an explanation of why our job didn't correctly push out new versions of the cache files upon the startup and execution of JobConfiguration. We deleted them before our job started, not during. On Fri, Sep 23, 2011 at 9:35 AM, Robert Evans ev...@yahoo-inc.com wrote: Meng Mao, The way the distributed cache is currently written, it does not verify the integrity of the cache files at all after they are downloaded. It just assumes that if they were downloaded once they are still there and in the proper shape. It might be good to file a JIRA to add in some sort of check. Another thing to do is that the distributed cache also includes the time stamp of the original file, just incase you delete the file and then use a different version. So if you want it to force a download again you can copy it delete the original and then move it back to what it was before. 
--Bobby Evans On 9/23/11 1:57 AM, Meng Mao meng...@gmail.com wrote: We use the DistributedCache class to distribute a few lookup files for our jobs. We have been aggressively deleting failed task attempts' leftover data , and our script accidentally deleted the path to our distributed cache files. Our task attempt leftover data was here [per node]: /hadoop/hadoop-metadata/cache/mapred/local/ and our distributed cache path was: hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/nameNode We deleted this path by accident. Does this latter path look normal? I'm not that familiar with DistributedCache but I'm up right now investigating the issue so I thought I'd ask. After that deletion, the first 2 jobs to run (which are use the addCacheFile method to distribute their files) didn't seem to push the files out to the cache path, except on one node. Is this expected behavior? Shouldn't addCacheFile check to see if the files are missing, and if so, repopulate them as needed? I'm trying to get a handle on whether it's safe to delete the distributed cache path when the grid is quiet and no jobs are running. That is, if addCacheFile is designed to be robust against the files it's caching not being at each job start.
Re: operation of DistributedCache following manual deletion of cached files?
I'm not concerned about disk space usage -- the script we used that deleted the taskTracker cache path has been fixed not to do so. I'm curious about the exact behavior of jobs that use DistributedCache files. Again, it seems safe from your description to delete files between completed runs. How could the job or the taskTracker distinguish between the files having been deleted and their not having been downloaded from a previous run of the job? Is it state in memory that the taskTracker maintains? On Tue, Sep 27, 2011 at 1:44 PM, Robert Evans ev...@yahoo-inc.com wrote: If you are never ever going to use that file again for any map/reduce task in the future then yes you can delete it, but I would not recommend it. If you want to reduce the amount of space that is used by the distributed cache there is a config parameter for that. local.cache.size it is the number of bytes per drive that will be used for storing data in the distributed cache. This is in 0.20 for hadoop I am not sure if it has changed at all for trunk. It is not documented as far as I can tell, and it defaults to 10GB. --Bobby Evans On 9/27/11 12:04 PM, Meng Mao meng...@gmail.com wrote: From that interpretation, it then seems like it would be safe to delete the files between completed runs? How could it distinguish between the files having been deleted and their not having been downloaded from a previous run? On Tue, Sep 27, 2011 at 12:25 PM, Robert Evans ev...@yahoo-inc.com wrote: addCacheFile sets a config value in your jobConf that indicates which files your particular job depends on. When the TaskTracker is assigned to run part of your job (map task or reduce task), it will download your jobConf, read it in, and then download the files listed in the conf, if it has not already downloaded them from a previous run. Then it will set up the directory structure for your job, possibly adding in symbolic links to these files in the working directory for your task. After that it will launch your task. --Bobby Evans On 9/27/11 11:17 AM, Meng Mao meng...@gmail.com wrote: Who is in charge of getting the files there for the first time? The addCacheFile call in the mapreduce job? Or a manual setup by the user/operator? On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans ev...@yahoo-inc.com wrote: The problem is the step 4 in the breaking sequence. Currently the TaskTracker never looks at the disk to know if a file is in the distributed cache or not. It assumes that if it downloaded the file and did not delete that file itself then the file is still there in its original form. It does not know that you deleted those files, or if wrote to the files, or in any way altered those files. In general you should not be modifying those files. This is not only because it messes up the tracking of those files, but because other jobs running concurrently with your task may also be using those files. --Bobby Evans On 9/26/11 4:40 PM, Meng Mao meng...@gmail.com wrote: Let's frame the issue in another way. I'll describe a sequence of Hadoop operations that I think should work, and then I'll get into what we did and how it failed. Normal sequence: 1. have files to be cached in HDFS 2. Run Job A, which specifies those files to be put into DistributedCache space 3. job runs fine 4. Run Job A some time later. job runs fine again. Breaking sequence: 1. have files to be cached in HDFS 2. Run Job A, which specifies those files to be put into DistributedCache space 3. job runs fine 4. Manually delete cached files out of local disk on worker nodes 5. 
Run Job A again, expect it to push out cache copies as needed. 6. job fails because the cache copies didn't get distributed Should this second sequence have broken? On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao meng...@gmail.com wrote: Hmm, I must have really missed an important piece somewhere. This is from the MapRed tutorial text: DistributedCache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications. Applications specify the files to be cached via urls (hdfs://) in the JobConf. The DistributedCache* assumes that the files specified via hdfs:// urls are already present on the FileSystem.* *The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node*. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves. After some close reading, the two bolded pieces seem to be in contradiction of each other? I'd always that addCacheFile() would perform the 2nd bolded statement. If that sentence is true, then I still don't have
Re: operation of DistributedCache following manual deletion of cached files?
So the proper description of how DistributedCache normally works is: 1. have files to be cached sitting around in HDFS 2. Run Job A, which specifies those files to be put into DistributedCache space. Each worker node copies the to-be-cached files from HDFS to local disk, but more importantly, the TaskTracker acknowledges this distribution and marks somewhere the fact that these files are cached for the first (and only time) 3. job runs fine 4. Run Job A some time later. TaskTracker simply assumes (by looking at its memory) that the files are still cached. The tasks on the workers, on this second call to addCacheFile, don't actually copy the files from HDFS to local disk again, but instead accept TaskTracker's word that they're still there. Because the files actually still exist, the workers run fine and the job finishes normally. Is that a correct interpretation? If so, the caution, then, must be that if you accidentally deleted the local disk cache file copies, you either repopulate them (as well as their crc checksums) or you restart the TaskTracker? On Tue, Sep 27, 2011 at 3:03 PM, Robert Evans ev...@yahoo-inc.com wrote: Yes, all of the state for the task tracker is in memory. It never looks at the disk to see what is there, it only maintains the state in memory. --bobby Evans On 9/27/11 1:00 PM, Meng Mao meng...@gmail.com wrote: I'm not concerned about disk space usage -- the script we used that deleted the taskTracker cache path has been fixed not to do so. I'm curious about the exact behavior of jobs that use DistributedCache files. Again, it seems safe from your description to delete files between completed runs. How could the job or the taskTracker distinguish between the files having been deleted and their not having been downloaded from a previous run of the job? Is it state in memory that the taskTracker maintains? On Tue, Sep 27, 2011 at 1:44 PM, Robert Evans ev...@yahoo-inc.com wrote: If you are never ever going to use that file again for any map/reduce task in the future then yes you can delete it, but I would not recommend it. If you want to reduce the amount of space that is used by the distributed cache there is a config parameter for that. local.cache.size it is the number of bytes per drive that will be used for storing data in the distributed cache. This is in 0.20 for hadoop I am not sure if it has changed at all for trunk. It is not documented as far as I can tell, and it defaults to 10GB. --Bobby Evans On 9/27/11 12:04 PM, Meng Mao meng...@gmail.com wrote: From that interpretation, it then seems like it would be safe to delete the files between completed runs? How could it distinguish between the files having been deleted and their not having been downloaded from a previous run? On Tue, Sep 27, 2011 at 12:25 PM, Robert Evans ev...@yahoo-inc.com wrote: addCacheFile sets a config value in your jobConf that indicates which files your particular job depends on. When the TaskTracker is assigned to run part of your job (map task or reduce task), it will download your jobConf, read it in, and then download the files listed in the conf, if it has not already downloaded them from a previous run. Then it will set up the directory structure for your job, possibly adding in symbolic links to these files in the working directory for your task. After that it will launch your task. --Bobby Evans On 9/27/11 11:17 AM, Meng Mao meng...@gmail.com wrote: Who is in charge of getting the files there for the first time? The addCacheFile call in the mapreduce job? 
Or a manual setup by the user/operator? On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans ev...@yahoo-inc.com wrote: The problem is the step 4 in the breaking sequence. Currently the TaskTracker never looks at the disk to know if a file is in the distributed cache or not. It assumes that if it downloaded the file and did not delete that file itself then the file is still there in its original form. It does not know that you deleted those files, or if wrote to the files, or in any way altered those files. In general you should not be modifying those files. This is not only because it messes up the tracking of those files, but because other jobs running concurrently with your task may also be using those files. --Bobby Evans On 9/26/11 4:40 PM, Meng Mao meng...@gmail.com wrote: Let's frame the issue in another way. I'll describe a sequence of Hadoop operations that I think should work, and then I'll get into what we did and how it failed. Normal sequence: 1. have files to be cached in HDFS 2. Run Job A, which specifies those files to be put into DistributedCache space 3. job runs fine 4. Run Job A some time later. job runs fine again. Breaking sequence
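For readers who land on this thread later, here is roughly what the two halves of the flow described above look like in code -- a minimal sketch against the 0.20 DistributedCache API, with a placeholder HDFS path rather than anything from the original jobs:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSketch {
  // Job submission side: only records the HDFS URI in the job conf.
  // The copy to local disk is done by each TaskTracker at task launch,
  // and skipped if the TaskTracker believes it already has the file.
  public static void addLookupFile(JobConf conf) {
    DistributedCache.addCacheFile(
        URI.create("hdfs://namenode/lookup/terms.txt"), conf); // placeholder
  }

  // Task side (e.g. in Mapper.configure): ask for the local copies that the
  // TaskTracker localized for this job.
  public static Path firstLocalCopy(JobConf conf) throws IOException {
    Path[] local = DistributedCache.getLocalCacheFiles(conf);
    return (local != null && local.length > 0) ? local[0] : null;
  }
}

Neither call checks that the localized copy still exists on disk, which matches the in-memory bookkeeping behavior described in this thread.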
Re: operation of DistributedCache following manual deletion of cached files?
Let's frame the issue in another way. I'll describe a sequence of Hadoop operations that I think should work, and then I'll get into what we did and how it failed. Normal sequence: 1. have files to be cached in HDFS 2. Run Job A, which specifies those files to be put into DistributedCache space 3. job runs fine 4. Run Job A some time later. job runs fine again. Breaking sequence: 1. have files to be cached in HDFS 2. Run Job A, which specifies those files to be put into DistributedCache space 3. job runs fine 4. Manually delete cached files out of local disk on worker nodes 5. Run Job A again, expect it to push out cache copies as needed. 6. job fails because the cache copies didn't get distributed Should this second sequence have broken? On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao meng...@gmail.com wrote: Hmm, I must have really missed an important piece somewhere. This is from the MapRed tutorial text: DistributedCache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications. Applications specify the files to be cached via urls (hdfs://) in the JobConf. The DistributedCache* assumes that the files specified via hdfs:// urls are already present on the FileSystem.* *The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node*. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves. After some close reading, the two bolded pieces seem to be in contradiction of each other? I'd always that addCacheFile() would perform the 2nd bolded statement. If that sentence is true, then I still don't have an explanation of why our job didn't correctly push out new versions of the cache files upon the startup and execution of JobConfiguration. We deleted them before our job started, not during. On Fri, Sep 23, 2011 at 9:35 AM, Robert Evans ev...@yahoo-inc.com wrote: Meng Mao, The way the distributed cache is currently written, it does not verify the integrity of the cache files at all after they are downloaded. It just assumes that if they were downloaded once they are still there and in the proper shape. It might be good to file a JIRA to add in some sort of check. Another thing to do is that the distributed cache also includes the time stamp of the original file, just incase you delete the file and then use a different version. So if you want it to force a download again you can copy it delete the original and then move it back to what it was before. --Bobby Evans On 9/23/11 1:57 AM, Meng Mao meng...@gmail.com wrote: We use the DistributedCache class to distribute a few lookup files for our jobs. We have been aggressively deleting failed task attempts' leftover data , and our script accidentally deleted the path to our distributed cache files. Our task attempt leftover data was here [per node]: /hadoop/hadoop-metadata/cache/mapred/local/ and our distributed cache path was: hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/nameNode We deleted this path by accident. Does this latter path look normal? I'm not that familiar with DistributedCache but I'm up right now investigating the issue so I thought I'd ask. After that deletion, the first 2 jobs to run (which are use the addCacheFile method to distribute their files) didn't seem to push the files out to the cache path, except on one node. Is this expected behavior? 
Shouldn't addCacheFile check to see if the files are missing, and if so, repopulate them as needed? I'm trying to get a handle on whether it's safe to delete the distributed cache path when the grid is quiet and no jobs are running. That is, if addCacheFile is designed to be robust against the files it's caching not being at each job start.
operation of DistributedCache following manual deletion of cached files?
We use the DistributedCache class to distribute a few lookup files for our jobs. We have been aggressively deleting failed task attempts' leftover data, and our script accidentally deleted the path to our distributed cache files. Our task attempt leftover data was here [per node]:
/hadoop/hadoop-metadata/cache/mapred/local/
and our distributed cache path was:
hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/nameNode
We deleted this path by accident. Does this latter path look normal? I'm not that familiar with DistributedCache but I'm up right now investigating the issue so I thought I'd ask.

After that deletion, the first 2 jobs to run (which use the addCacheFile method to distribute their files) didn't seem to push the files out to the cache path, except on one node. Is this expected behavior? Shouldn't addCacheFile check to see if the files are missing, and if so, repopulate them as needed? I'm trying to get a handle on whether it's safe to delete the distributed cache path when the grid is quiet and no jobs are running. That is, whether addCacheFile is designed to be robust against the files it's caching not being there at each job start.
Re: operation of DistributedCache following manual deletion of cached files?
Hmm, I must have really missed an important piece somewhere. This is from the MapRed tutorial text: DistributedCache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications. Applications specify the files to be cached via urls (hdfs://) in the JobConf. The DistributedCache* assumes that the files specified via hdfs:// urls are already present on the FileSystem.* *The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node*. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves.

After some close reading, the two bolded pieces seem to be in contradiction of each other? I'd always assumed that addCacheFile() would perform the 2nd bolded statement. If that sentence is true, then I still don't have an explanation of why our job didn't correctly push out new versions of the cache files upon the startup and execution of JobConfiguration. We deleted them before our job started, not during.

On Fri, Sep 23, 2011 at 9:35 AM, Robert Evans ev...@yahoo-inc.com wrote: Meng Mao, The way the distributed cache is currently written, it does not verify the integrity of the cache files at all after they are downloaded. It just assumes that if they were downloaded once they are still there and in the proper shape. It might be good to file a JIRA to add in some sort of check. Another thing to do is that the distributed cache also includes the time stamp of the original file, just in case you delete the file and then use a different version. So if you want to force a download again you can copy it, delete the original, and then move it back to what it was before. --Bobby Evans

On 9/23/11 1:57 AM, Meng Mao meng...@gmail.com wrote: We use the DistributedCache class to distribute a few lookup files for our jobs. We have been aggressively deleting failed task attempts' leftover data, and our script accidentally deleted the path to our distributed cache files. Our task attempt leftover data was here [per node]: /hadoop/hadoop-metadata/cache/mapred/local/ and our distributed cache path was: hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/nameNode We deleted this path by accident. Does this latter path look normal? I'm not that familiar with DistributedCache but I'm up right now investigating the issue so I thought I'd ask. After that deletion, the first 2 jobs to run (which use the addCacheFile method to distribute their files) didn't seem to push the files out to the cache path, except on one node. Is this expected behavior? Shouldn't addCacheFile check to see if the files are missing, and if so, repopulate them as needed? I'm trying to get a handle on whether it's safe to delete the distributed cache path when the grid is quiet and no jobs are running. That is, whether addCacheFile is designed to be robust against the files it's caching not being there at each job start.
Re: Disable Sorting?
Is there a way to collate the possibly large number of map output files, though? On Sat, Sep 10, 2011 at 2:48 PM, Arun C Murthy a...@hortonworks.com wrote: Run a map-only job with #reduces set to 0. Arun On Sep 10, 2011, at 2:06 AM, john smith wrote: Hi, Some of the MR jobs I run doesn't need sorting of map-output in each partition. Is there someway I can disable it? Any help? Thanks jS
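Collating them afterwards is something you would do yourself; here is a small sketch of both pieces against the 0.20 API, with placeholder paths: setNumReduceTasks(0) for the map-only job, then FileUtil.copyMerge to byte-concatenate the part files into a single file (fine for plain text output, not for binary formats with per-file headers).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class MapOnlySketch {
  // With zero reduces there is no sort/shuffle at all; each map task writes
  // its output straight to the job output directory as its own part file.
  public static void makeMapOnly(JobConf conf) {
    conf.setNumReduceTasks(0);
  }

  // Optional follow-up step: concatenate the many part files into one file.
  public static void collate(Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    FileUtil.copyMerge(fs, new Path("/out/maponly"),   // placeholder src dir
                       fs, new Path("/out/merged"),    // placeholder dst file
                       false, conf, null);
  }
}

The -getmerge shell command does the equivalent, but onto the local filesystem rather than HDFS.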
Re: do HDFS files starting with _ (underscore) have special properties?
I get the opposite behavior -- [this is more or less how I listed the files in the original email] hadoop dfs -ls /test/output/solr-20110901165238/part-0/data/index/* -rw-r--r-- 2 hadoopuser visible 8538430603 2011-09-01 18:58 /test/output/solr-20110901165238/part-0/data/index/_ox.fdt -rw-r--r-- 2 hadoopuser visible 233396596 2011-09-01 18:57 /test/output/solr-20110901165238/part-0/data/index/_ox.fdx -rw-r--r-- 2 hadoopuser visible130 2011-09-01 18:57 /test/output/solr-20110901165238/part-0/data/index/_ox.fnm -rw-r--r-- 2 hadoopuser visible 2147948283 2011-09-01 18:55 /test/output/solr-20110901165238/part-0/data/index/_ox.frq -rw-r--r-- 2 hadoopuser visible 87523726 2011-09-01 18:57 /test/output/solr-20110901165238/part-0/data/index/_ox.nrm -rw-r--r-- 2 hadoopuser visible 920936168 2011-09-01 18:57 /test/output/solr-20110901165238/part-0/data/index/_ox.prx -rw-r--r-- 2 hadoopuser visible 22619542 2011-09-01 18:58 /test/output/solr-20110901165238/part-0/data/index/_ox.tii -rw-r--r-- 2 hadoopuser visible 2070214402 2011-09-01 18:51 /test/output/solr-20110901165238/part-0/data/index/_ox.tis -rw-r--r-- 2 hadoopuser visible 20 2011-09-01 18:51 /test/output/solr-20110901165238/part-0/data/index/segments.gen -rw-r--r-- 2 hadoopuser visible282 2011-09-01 18:55 /test/output/solr-20110901165238/part-0/data/index/segments_2 Whereas my globStatus doesn't capture them. I thought we were on Cloudera's CDH3, but now I'm not sure. This is what version reports: $ hadoop version Hadoop 0.20.1+169.56 Subversion -r 8e662cb065be1c4bc61c55e6bff161e09c1d36f3 Compiled by root on Tue Feb 9 13:40:08 EST 2010 On Fri, Sep 2, 2011 at 11:45 PM, Harsh J ha...@cloudera.com wrote: Meng, What version of hadoop are you on? I'm able to use globStatus(Path) for '_' listing successfully, with a '*' glob. Although the same doesn't apply to what FsShell's ls utility provide (which is odd here!). Here's my test code which can validate that the listing is indeed done: http://pastebin.com/vCbd2wmK $ hadoop dfs -ls Found 4 items drwxr-xr-x - harshchouraria supergroup 0 2011-09-03 09:09 /user/harshchouraria/_abc -rw-r--r-- 1 harshchouraria supergroup 0 2011-09-03 09:10 /user/harshchouraria/_def drwxr-xr-x - harshchouraria supergroup 0 2011-09-03 08:10 /user/harshchouraria/abc -rw-r--r-- 1 harshchouraria supergroup 0 2011-09-03 09:10 /user/harshchouraria/def $ hadoop dfs -ls '*' -rw-r--r-- 1 harshchouraria supergroup 0 2011-09-03 09:10 /user/harshchouraria/_def -rw-r--r-- 1 harshchouraria supergroup 0 2011-09-03 09:10 /user/harshchouraria/def $ # No dir results! ^^ $ hadoop jar myjar.jar # (My code) hdfs://localhost/user/harshchouraria/_abc hdfs://localhost/user/harshchouraria/_def hdfs://localhost/user/harshchouraria/abc hdfs://localhost/user/harshchouraria/def I suppose that means globStatus is fine, but the FsShell.ls(…) code does something more than a simple glob status, and filters away directory results when used with a glob. On Sat, Sep 3, 2011 at 3:07 AM, Meng Mao meng...@gmail.com wrote: Is there a programmatic way to access these hidden files then? On Fri, Sep 2, 2011 at 5:20 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Fri, Sep 2, 2011 at 4:04 PM, Meng Mao meng...@gmail.com wrote: We have a compression utility that tries to grab all subdirs to a directory on HDFS. It makes a call like this: FileStatus[] subdirs = fs.globStatus(new Path(inputdir, *)); and handles files vs dirs accordingly. 
We tried to run our utility against a dir containing a computed SOLR shard, which has files that look like this: -rw-r--r-- 2 hadoopuser visible 8538430603 2011-09-01 18:58 /test/output/solr-20110901165238/part-0/data/index/_ox.fdt -rw-r--r-- 2 hadoopuser visible 233396596 2011-09-01 18:57 /test/output/solr-20110901165238/part-0/data/index/_ox.fdx -rw-r--r-- 2 hadoopuser visible130 2011-09-01 18:57 /test/output/solr-20110901165238/part-0/data/index/_ox.fnm -rw-r--r-- 2 hadoopuser visible 2147948283 2011-09-01 18:55 /test/output/solr-20110901165238/part-0/data/index/_ox.frq -rw-r--r-- 2 hadoopuser visible 87523726 2011-09-01 18:57 /test/output/solr-20110901165238/part-0/data/index/_ox.nrm -rw-r--r-- 2 hadoopuser visible 920936168 2011-09-01 18:57 /test/output/solr-20110901165238/part-0/data/index/_ox.prx -rw-r--r-- 2 hadoopuser visible 22619542 2011-09-01 18:58 /test/output/solr-20110901165238/part-0/data/index/_ox.tii -rw-r--r-- 2 hadoopuser visible 2070214402 2011-09-01 18:51 /test/output/solr-20110901165238/part-0/data/index/_ox.tis -rw-r--r-- 2 hadoopuser visible 20 2011-09-01 18:51 /test/output/solr-20110901165238/part-0/data/index/segments.gen -rw-r--r-- 2
do HDFS files starting with _ (underscore) have special properties?
We have a compression utility that tries to grab all the subdirs of a directory on HDFS. It makes a call like this:

FileStatus[] subdirs = fs.globStatus(new Path(inputdir, "*"));

and handles files vs dirs accordingly. We tried to run our utility against a dir containing a computed SOLR shard, which has files that look like this:

-rw-r--r-- 2 hadoopuser visible 8538430603 2011-09-01 18:58 /test/output/solr-20110901165238/part-0/data/index/_ox.fdt
-rw-r--r-- 2 hadoopuser visible  233396596 2011-09-01 18:57 /test/output/solr-20110901165238/part-0/data/index/_ox.fdx
-rw-r--r-- 2 hadoopuser visible        130 2011-09-01 18:57 /test/output/solr-20110901165238/part-0/data/index/_ox.fnm
-rw-r--r-- 2 hadoopuser visible 2147948283 2011-09-01 18:55 /test/output/solr-20110901165238/part-0/data/index/_ox.frq
-rw-r--r-- 2 hadoopuser visible   87523726 2011-09-01 18:57 /test/output/solr-20110901165238/part-0/data/index/_ox.nrm
-rw-r--r-- 2 hadoopuser visible  920936168 2011-09-01 18:57 /test/output/solr-20110901165238/part-0/data/index/_ox.prx
-rw-r--r-- 2 hadoopuser visible   22619542 2011-09-01 18:58 /test/output/solr-20110901165238/part-0/data/index/_ox.tii
-rw-r--r-- 2 hadoopuser visible 2070214402 2011-09-01 18:51 /test/output/solr-20110901165238/part-0/data/index/_ox.tis
-rw-r--r-- 2 hadoopuser visible         20 2011-09-01 18:51 /test/output/solr-20110901165238/part-0/data/index/segments.gen
-rw-r--r-- 2 hadoopuser visible        282 2011-09-01 18:55 /test/output/solr-20110901165238/part-0/data/index/segments_2

The globStatus call seems only able to pick up those last 2 files; the several files that start with _ don't register. I've skimmed the FileSystem and GlobExpander source to see if there's anything related to this, but didn't see it. Google didn't turn up anything about underscores. Am I misunderstanding something about the regex patterns needed to pick these up, or unaware of some filename convention in HDFS?
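While the glob behavior gets sorted out, one way to sidestep it entirely is to list the directory instead of globbing it -- listStatus applies no pattern and returns every child, underscores included. A small sketch, assuming (as in your listing) the files sit directly under the directory you pass in; the class and method names are just for illustration:

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListAllChildren {
  // Returns every entry under dir, including names starting with '_',
  // since no glob pattern or hidden-file filter is involved.
  public static FileStatus[] allChildren(FileSystem fs, Path dir)
      throws IOException {
    return fs.listStatus(dir);
  }
}

You can still check isDir() on each FileStatus to branch on files vs dirs as before.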
Re: do HDFS files starting with _ (underscore) have special properties?
Is there a programmatic way to access these hidden files then? On Fri, Sep 2, 2011 at 5:20 PM, Edward Capriolo edlinuxg...@gmail.comwrote: On Fri, Sep 2, 2011 at 4:04 PM, Meng Mao meng...@gmail.com wrote: We have a compression utility that tries to grab all subdirs to a directory on HDFS. It makes a call like this: FileStatus[] subdirs = fs.globStatus(new Path(inputdir, *)); and handles files vs dirs accordingly. We tried to run our utility against a dir containing a computed SOLR shard, which has files that look like this: -rw-r--r-- 2 hadoopuser visible 8538430603 2011-09-01 18:58 /test/output/solr-20110901165238/part-0/data/index/_ox.fdt -rw-r--r-- 2 hadoopuser visible 233396596 2011-09-01 18:57 /test/output/solr-20110901165238/part-0/data/index/_ox.fdx -rw-r--r-- 2 hadoopuser visible130 2011-09-01 18:57 /test/output/solr-20110901165238/part-0/data/index/_ox.fnm -rw-r--r-- 2 hadoopuser visible 2147948283 2011-09-01 18:55 /test/output/solr-20110901165238/part-0/data/index/_ox.frq -rw-r--r-- 2 hadoopuser visible 87523726 2011-09-01 18:57 /test/output/solr-20110901165238/part-0/data/index/_ox.nrm -rw-r--r-- 2 hadoopuser visible 920936168 2011-09-01 18:57 /test/output/solr-20110901165238/part-0/data/index/_ox.prx -rw-r--r-- 2 hadoopuser visible 22619542 2011-09-01 18:58 /test/output/solr-20110901165238/part-0/data/index/_ox.tii -rw-r--r-- 2 hadoopuser visible 2070214402 2011-09-01 18:51 /test/output/solr-20110901165238/part-0/data/index/_ox.tis -rw-r--r-- 2 hadoopuser visible 20 2011-09-01 18:51 /test/output/solr-20110901165238/part-0/data/index/segments.gen -rw-r--r-- 2 hadoopuser visible282 2011-09-01 18:55 /test/output/solr-20110901165238/part-0/data/index/segments_2 The globStatus call seems only able to pick up those last 2 files; the several files that start with _ don't register. I've skimmed the FileSystem and GlobExpander source to see if there's anything related to this, but didn't see it. Google didn't turn up anything about underscores. Am I misunderstanding something about the regex patterns needed to pick these up or unaware of some filename convention in HDFS? Files starting with '_' are considered 'hidden' like unix files starting with '.'. I did not know that for a very long time because not everyone follows this rule or even knows about it.
streaming: -f with a zip file?
We are trying to pass a zip file into a streaming program. An invocation looks like this: /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.1+152-streaming.jar ... -file 'myfile.zip' Within the program, it seems like a ZipFile invocation can't locate the zip file. Is there anything that might be problematic about a zip file when it is shipped with -file?
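For what it's worth, files shipped with -file are copied into each task's working directory, so the program generally has to open the archive by its bare name relative to the current directory, not by the path it had on the submitting host. A rough sketch of that pattern, with a hypothetical Java mapper and the literal file name myfile.zip assumed from the invocation above:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical streaming mapper: echoes stdin to stdout and also opens the
// zip that -file placed in the task's current working directory.
public class ZipAwareMapper {
    public static void main(String[] args) throws IOException {
        // Open by bare name; an absolute path from the submit host won't exist here.
        ZipFile zip = new ZipFile("myfile.zip");
        for (Enumeration<? extends ZipEntry> e = zip.entries(); e.hasMoreElements(); ) {
            System.err.println("shipped entry: " + e.nextElement().getName());
        }
        zip.close();

        // Trivial identity mapper: pass input lines through unchanged.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
    }
}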
Re: duplicate tasks getting started/killed
Right, so have you ever seen your non-idempotent DEFINE command have an incorrect result? That would essentially point to duplicate attempts behaving badly. To your second question -- I think spec exec assumes that not all machines run at the same speed. If a machine is free (not used for some other attempt), then Hadoop might schedule an attempt right away on it. It's possible, depending on the granularity of the work or issues with the original attempt, that this later attempt finishes first, and thus becomes the committing attempt. On Wed, Feb 10, 2010 at 12:10 PM, prasenjit mukherjee pmukher...@quattrowireless.com wrote: Correctness of the results actually depends on my DEFINE command. If the commands are idempotent ( which is not in my case ) then I believe it won't have any effect on the results, otherwise it will indeed make the results incorrect. For example if my command fetches some data and appends to a mysql db in that case it is undesirable. I have a question though, why in the first place the second attempt was kicked off just seconds after the first one. I mean what is the rationale behind kicking off a second attempt immediately afterwards? Baffling... -Prasen On Wed, Feb 10, 2010 at 10:32 PM, Meng Mao meng...@gmail.com wrote: That cleanup action looks promising in terms of preventing duplication. What I'd meant was, could you ever find an instance where the results of your DEFINE statement were made incorrect by multiple attempts? On Wed, Feb 10, 2010 at 5:05 AM, prasenjit mukherjee pmukher...@quattrowireless.com wrote: Below is the log : attempt_201002090552_0009_m_01_0 /default-rack/ip-10-242-142-193.ec2.internal SUCCEEDED 100.00% 9-Feb-2010 07:04:37 9-Feb-2010 07:07:00 (2mins, 23sec) attempt_201002090552_0009_m_01_1 Task attempt: /default-rack/ip-10-212-147-129.ec2.internal Cleanup Attempt: /default-rack/ip-10-212-147-129.ec2.internal KILLED 100.00% 9-Feb-2010 07:05:34 9-Feb-2010 07:07:10 (1mins, 36sec) So, here is the time-line for both the attempts : attempt_1's start_time=07:04:37 , end_time=07:07:00 attempt_2's start_time=07:05:34, end_time=07:07:10 -Thanks, Prasen On Wed, Feb 10, 2010 at 1:15 PM, Meng Mao meng...@gmail.com wrote: Can you confirm that duplication is happening in the case that one attempt gets underway but killed before the other's completion? I believe by default (though I'm not sure for Pig), each attempt's output is first isolated to a path keyed to its attempt id, and only committed when one and only one attempt is complete. On Tue, Feb 9, 2010 at 9:52 PM, prasenjit mukherjee pmukher...@quattrowireless.com wrote: Any thoughts on this problem ? I am using a DEFINE command ( in PIG ) and hence the actions are not idempotent. Because of which duplicate execution does have an effect on my results. Any way to overcome that ?
Re: duplicate tasks getting started/killed
Can you confirm that duplication is happening in the case that one attempt gets underway but killed before the other's completion? I believe by default (though I'm not sure for Pig), each attempt's output is first isolated to a path keyed to its attempt id, and only committed when one and only one attempt is complete. On Tue, Feb 9, 2010 at 9:52 PM, prasenjit mukherjee pmukher...@quattrowireless.com wrote: Any thoughts on this problem ? I am using a DEFINE command ( in PIG ) and hence the actions are not idempotent. Because of which duplicate execution does have an affect on my results. Any way to overcome that ? On Tue, Feb 9, 2010 at 9:26 PM, prasenjit mukherjee pmukher...@quattrowireless.com wrote: But the second attempted job got killed even before the first one was completed. How can we explain that. On Tue, Feb 9, 2010 at 7:38 PM, Eric Sammer e...@lifeless.net wrote: Prasen: This is most likely speculative execution. Hadoop fires up multiple attempts for the same task and lets them race to see which finishes first and then kills the others. This is meant to speed things along. Speculative execution is on by default, but can be disabled. See the configuration reference for mapred-*.xml. On 2/9/10 9:03 AM, prasenjit mukherjee wrote: Sometimes for the same task I see that a duplicate task gets run on a different machine and gets killed later. Not always but sometimes. Any reason why duplicate tasks get run. I thought tasks are duplicated only if either the first attempt exits( exceptions etc ) or exceeds mapred.task.timeout. In this case none of them happens. As can be seen from timestamp, the second attempt starts even though the first attempt is still running ( only for 1 minute ). Any explanation ? attempt_201002090552_0009_m_01_0 /default-rack/ip-10-242-142-193.ec2.internal SUCCEEDED 100.00% 9-Feb-2010 07:04:37 9-Feb-2010 07:07:00 (2mins, 23sec) attempt_201002090552_0009_m_01_1 Task attempt: /default-rack/ip-10-212-147-129.ec2.internal Cleanup Attempt: /default-rack/ip-10-212-147-129.ec2.internal KILLED 100.00% 9-Feb-2010 07:05:34 9-Feb-2010 07:07:10 (1mins, 36sec) -Prasen -- Eric Sammer e...@lifeless.net http://esammer.blogspot.com
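For reference, a minimal sketch of turning speculative execution off for a job whose side effects must not run twice, using the old-API JobConf; the property names in the comment are the 0.20-era ones and can also be set in mapred-site.xml or passed with -D, depending on how the job is launched:

import org.apache.hadoop.mapred.JobConf;

public class NoSpeculationConf {
    // Disable both map- and reduce-side speculative attempts for this job.
    public static JobConf disableSpeculation(JobConf conf) {
        conf.setMapSpeculativeExecution(false);
        conf.setReduceSpeculativeExecution(false);
        // Equivalent 0.20-era properties:
        //   mapred.map.tasks.speculative.execution    = false
        //   mapred.reduce.tasks.speculative.execution = false
        return conf;
    }
}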
Re: has anyone ported hadoop.lib.aggregate?
I'm not familiar with the current roadmap for 0.20, but is there any plan to backport the new mapreduce.lib.aggregate library into 0.20.x? I suppose our team could attempt to use the patch ourselves, but we'd be much more comfortable going with a standard release if at all possible. On Sun, Feb 7, 2010 at 10:41 PM, Amareshwari Sri Ramadasu amar...@yahoo-inc.com wrote: org.apache.hadoop.mapred.lib.aggregate has been ported to the new API in branch 0.21. See http://issues.apache.org/jira/browse/MAPREDUCE-358 Thanks Amareshwari On 2/7/10 5:34 AM, Meng Mao meng...@gmail.com wrote: From what I can tell, while the ValueAggregator stuff should be usable, the ValueAggregatorJob and ValueAggregatorJobBase classes still use the old Mapper and Reducer signatures, and basically aren't compatible with the new mapreduce.* API. Is that correct? Has anyone out there done a port? We've been dragging our feet very hard about getting away from the deprecated API for our classes that take advantage of the aggregate lib. It would be a huge boost if there were anything we could borrow to port over.
Re: Is it possible to write each key-value pair emitted by the reducer to a different output file
It's possible to write your own class to better encapsulate the writing of side-effect files, but as people have said, you can run into unanticipated issues if the number of files you try to write at once becomes high. On Sat, Feb 6, 2010 at 3:47 AM, Udaya Lakshmi udaya...@gmail.com wrote: Hi Amareshwari, But this feature is not available in Hadoop 0.18.3. Is there any workaround for this version? Thanks, Udaya. On Fri, Feb 5, 2010 at 10:49 AM, Amareshwari Sri Ramadasu amar...@yahoo-inc.com wrote: See MultipleOutputs at http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html -Amareshwari On 2/5/10 10:41 AM, Udaya Lakshmi udaya...@gmail.com wrote: Hi, I was wondering if it is possible to write each key-value pair produced by the reduce function to a different file. How could I open a new file in the reduce function of the reducer? I know it's possible in the configure function, but it will write all the output of that reducer to that file. Thanks, Udaya.
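For the 0.18.3 question, one workaround that predates MultipleOutputs is the old-API MultipleTextOutputFormat, which derives the output file name from each record's key. A minimal sketch, assuming Text keys and values (the class name is made up); as noted above, many distinct keys means many files open at once per reducer:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each reduce output record to a file named after its key,
// e.g. key "foo" written from part-00003 lands in <outputdir>/foo/part-00003.
public class KeyedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    protected String generateFileNameForKeyValue(Text key, Text value, String leafName) {
        return key.toString() + Path.SEPARATOR + leafName;
    }
}

In the driver, set it with jobConf.setOutputFormat(KeyedOutputFormat.class).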
has anyone ported hadoop.lib.aggregate?
From what I can tell, while the ValueAggregator stuff should be usable, the ValueAggregatorJob and ValueAggregatorJobBase classes still use the old Mapper and Reducer signatures, and basically aren't compatible with the new mapreduce.* API. Is that correct? Has anyone out there done a port? We've been dragging our feet very hard about getting away from the deprecated API for our classes that take advantage of the aggregate lib. It would be a huge boost if there were anything we could borrow to port over.
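To make the incompatibility concrete: the two APIs expect differently shaped mappers, which is why the aggregate lib's old-interface classes can't be dropped into a new-style job without a port. A bare-bones contrast (class names are just for illustration):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old API: Mapper is an interface taking an OutputCollector and a Reporter.
// This is the shape that ValueAggregatorJob/ValueAggregatorJobBase wire up.
public class OldApiIdentityMapper implements Mapper<LongWritable, Text, Text, Text> {
    public void configure(JobConf job) {}
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        output.collect(value, value);
    }
    public void close() throws IOException {}
}

// New API: Mapper is a class whose map() takes a single Context; an
// org.apache.hadoop.mapreduce.Job won't accept the old interface directly.
class NewApiIdentityMapper
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, value);
    }
}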
Re: EOFException and BadLink, but file descriptors number is ok?
Ack, after looking at the logs again, there are definitely xcievers errors. It's set to 256! I had thought I had cleared this as a possible cause, but I guess I was wrong. Gonna retest right away. Thanks! On Fri, Feb 5, 2010 at 11:05 AM, Todd Lipcon t...@cloudera.com wrote: Yes, you're likely to see an error in the DN log. Do you see anything about max number of xceivers? -Todd On Thu, Feb 4, 2010 at 11:42 PM, Meng Mao meng...@gmail.com wrote: not sure what else I could be checking to see where the problem lies. Should I be looking in the datanode logs? I looked briefly in there and didn't see anything from around the time exceptions started getting reported. lsof during the job execution? Number of open threads? I'm at a loss here. On Thu, Feb 4, 2010 at 2:52 PM, Meng Mao meng...@gmail.com wrote: I wrote a hadoop job that checks for ulimits across the nodes, and every node is reporting: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 139264 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 65536 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time (seconds, -t) unlimited max user processes (-u) 139264 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Is anything in there telling about file number limits? From what I understand, a high open files limit like 65536 should be enough. I estimate only a couple thousand part-files on HDFS being written to at once, and around 200 on the filesystem per node. On Wed, Feb 3, 2010 at 4:04 PM, Meng Mao meng...@gmail.com wrote: also, which is the ulimit that's important, the one for the user who is running the job, or the hadoop user that owns the Hadoop processes? On Tue, Feb 2, 2010 at 7:29 PM, Meng Mao meng...@gmail.com wrote: I've been trying to run a fairly small input file (300MB) on Cloudera Hadoop 0.20.1. The job I'm using probably writes to on the order of over 1000 part-files at once, across the whole grid. The grid has 33 nodes in it.
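For anyone hitting the same wall: in the 0.20/CDH era the datanode-side limit is dfs.datanode.max.xcievers in hdfs-site.xml (the property name really is misspelled), and datanodes must be restarted for a change to take effect. The value below is only an illustrative figure commonly suggested for clusters that write many blocks concurrently, not a measured recommendation:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>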
I get the following exception in the run logs: 10/01/30 17:24:25 INFO mapred.JobClient: map 100% reduce 12% 10/01/30 17:24:25 INFO mapred.JobClient: Task Id : attempt_201001261532_1137_r_13_0, Status : FAILED java.io.EOFException at java.io.DataInputStream.readByte(DataInputStream.java:250) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298) at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319) at org.apache.hadoop.io.Text.readString(Text.java:400) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263) lots of EOFExceptions 10/01/30 17:24:25 INFO mapred.JobClient: Task Id : attempt_201001261532_1137_r_19_0, Status : FAILED java.io.IOException: Bad connect ack with firstBadLink 10.2.19.1:50010 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263) 10/01/30 17:24:36 INFO mapred.JobClient: map 100% reduce 11% 10/01/30 17:24:42 INFO mapred.JobClient: map 100% reduce 12% 10/01/30 17:24:49 INFO mapred.JobClient: map 100% reduce 13% 10/01/30 17:24:55 INFO mapred.JobClient: map 100% reduce 14% 10/01/30 17:25:00 INFO mapred.JobClient: map 100% reduce 15% From searching around, it seems like the most common cause of BadLink and EOFExceptions is when the nodes don't have enough file descriptors set. But across all the grid machines, the file-max has been set to 1573039. Furthermore, we set ulimit -n to 65536 using hadoop-env.sh. Where else should I be looking for what's causing this?
Re: EOFException and BadLink, but file descriptors number is ok?
I wrote a hadoop job that checks for ulimits across the nodes, and every node is reporting: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 139264 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 65536 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time (seconds, -t) unlimited max user processes (-u) 139264 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Is anything in there telling about file number limits? From what I understand, a high open files limit like 65536 should be enough. I estimate only a couple thousand part-files on HDFS being written to at once, and around 200 on the filesystem per node. On Wed, Feb 3, 2010 at 4:04 PM, Meng Mao meng...@gmail.com wrote: also, which is the ulimit that's important, the one for the user who is running the job, or the hadoop user that owns the Hadoop processes? On Tue, Feb 2, 2010 at 7:29 PM, Meng Mao meng...@gmail.com wrote: I've been trying to run a fairly small input file (300MB) on Cloudera Hadoop 0.20.1. The job I'm using probably writes to on the order of over 1000 part-files at once, across the whole grid. The grid has 33 nodes in it. I get the following exception in the run logs: 10/01/30 17:24:25 INFO mapred.JobClient: map 100% reduce 12% 10/01/30 17:24:25 INFO mapred.JobClient: Task Id : attempt_201001261532_1137_r_13_0, Status : FAILED java.io.EOFException at java.io.DataInputStream.readByte(DataInputStream.java:250) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298) at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319) at org.apache.hadoop.io.Text.readString(Text.java:400) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263) lots of EOFExceptions 10/01/30 17:24:25 INFO mapred.JobClient: Task Id : attempt_201001261532_1137_r_19_0, Status : FAILED java.io.IOException: Bad connect ack with firstBadLink 10.2.19.1:50010 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263) 10/01/30 17:24:36 INFO mapred.JobClient: map 100% reduce 11% 10/01/30 17:24:42 INFO mapred.JobClient: map 100% reduce 12% 10/01/30 17:24:49 INFO mapred.JobClient: map 100% reduce 13% 10/01/30 17:24:55 INFO mapred.JobClient: map 100% reduce 14% 10/01/30 17:25:00 INFO mapred.JobClient: map 100% reduce 15% From searching around, it seems like the most common cause of BadLink and EOFExceptions is when the nodes don't have enough file descriptors set. But across all the grid machines, the file-max has been set to 1573039. Furthermore, we set ulimit -n to 65536 using hadoop-env.sh. Where else should I be looking for what's causing this?
Re: EOFException and BadLink, but file descriptors number is ok?
not sure what else I could be checking to see where the problem lies. Should I be looking in the datanode logs? I looked briefly in there and didn't see anything from around the time exceptions started getting reported. lsof during the job execution? Number of open threads? I'm at a loss here. On Thu, Feb 4, 2010 at 2:52 PM, Meng Mao meng...@gmail.com wrote: I wrote a hadoop job that checks for ulimits across the nodes, and every node is reporting: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 139264 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 65536 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time (seconds, -t) unlimited max user processes (-u) 139264 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Is anything in there telling about file number limits? From what I understand, a high open files limit like 65536 should be enough. I estimate only a couple thousand part-files on HDFS being written to at once, and around 200 on the filesystem per node. On Wed, Feb 3, 2010 at 4:04 PM, Meng Mao meng...@gmail.com wrote: also, which is the ulimit that's important, the one for the user who is running the job, or the hadoop user that owns the Hadoop processes? On Tue, Feb 2, 2010 at 7:29 PM, Meng Mao meng...@gmail.com wrote: I've been trying to run a fairly small input file (300MB) on Cloudera Hadoop 0.20.1. The job I'm using probably writes to on the order of over 1000 part-files at once, across the whole grid. The grid has 33 nodes in it. I get the following exception in the run logs: 10/01/30 17:24:25 INFO mapred.JobClient: map 100% reduce 12% 10/01/30 17:24:25 INFO mapred.JobClient: Task Id : attempt_201001261532_1137_r_13_0, Status : FAILED java.io.EOFException at java.io.DataInputStream.readByte(DataInputStream.java:250) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298) at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319) at org.apache.hadoop.io.Text.readString(Text.java:400) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263) lots of EOFExceptions 10/01/30 17:24:25 INFO mapred.JobClient: Task Id : attempt_201001261532_1137_r_19_0, Status : FAILED java.io.IOException: Bad connect ack with firstBadLink 10.2.19.1:50010 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263) 10/01/30 17:24:36 INFO mapred.JobClient: map 100% reduce 11% 10/01/30 17:24:42 INFO mapred.JobClient: map 100% reduce 12% 10/01/30 17:24:49 INFO mapred.JobClient: map 100% reduce 13% 10/01/30 17:24:55 INFO mapred.JobClient: map 100% reduce 14% 10/01/30 17:25:00 INFO mapred.JobClient: map 100% reduce 15% From searching around, it seems like the most common cause of BadLink and EOFExceptions is when the nodes don't have enough file 
descriptors set. But across all the grid machines, the file-max has been set to 1573039. Furthermore, we set ulimit -n to 65536 using hadoop-env.sh. Where else should I be looking for what's causing this?
Re: EOFException and BadLink, but file descriptors number is ok?
also, which is the ulimit that's important, the one for the user who is running the job, or the hadoop user that owns the Hadoop processes? On Tue, Feb 2, 2010 at 7:29 PM, Meng Mao meng...@gmail.com wrote: I've been trying to run a fairly small input file (300MB) on Cloudera Hadoop 0.20.1. The job I'm using probably writes to on the order of over 1000 part-files at once, across the whole grid. The grid has 33 nodes in it. I get the following exception in the run logs: 10/01/30 17:24:25 INFO mapred.JobClient: map 100% reduce 12% 10/01/30 17:24:25 INFO mapred.JobClient: Task Id : attempt_201001261532_1137_r_13_0, Status : FAILED java.io.EOFException at java.io.DataInputStream.readByte(DataInputStream.java:250) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298) at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319) at org.apache.hadoop.io.Text.readString(Text.java:400) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263) lots of EOFExceptions 10/01/30 17:24:25 INFO mapred.JobClient: Task Id : attempt_201001261532_1137_r_19_0, Status : FAILED java.io.IOException: Bad connect ack with firstBadLink 10.2.19.1:50010 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263) 10/01/30 17:24:36 INFO mapred.JobClient: map 100% reduce 11% 10/01/30 17:24:42 INFO mapred.JobClient: map 100% reduce 12% 10/01/30 17:24:49 INFO mapred.JobClient: map 100% reduce 13% 10/01/30 17:24:55 INFO mapred.JobClient: map 100% reduce 14% 10/01/30 17:25:00 INFO mapred.JobClient: map 100% reduce 15% From searching around, it seems like the most common cause of BadLink and EOFExceptions is when the nodes don't have enough file descriptors set. But across all the grid machines, the file-max has been set to 1573039. Furthermore, we set ulimit -n to 65536 using hadoop-env.sh. Where else should I be looking for what's causing this?
EOFException and BadLink, but file descriptors number is ok?
I've been trying to run a fairly small input file (300MB) on Cloudera Hadoop 0.20.1. The job I'm using probably writes to on the order of over 1000 part-files at once, across the whole grid. The grid has 33 nodes in it. I get the following exception in the run logs: 10/01/30 17:24:25 INFO mapred.JobClient: map 100% reduce 12% 10/01/30 17:24:25 INFO mapred.JobClient: Task Id : attempt_201001261532_1137_r_13_0, Status : FAILED java.io.EOFException at java.io.DataInputStream.readByte(DataInputStream.java:250) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298) at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319) at org.apache.hadoop.io.Text.readString(Text.java:400) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263) lots of EOFExceptions 10/01/30 17:24:25 INFO mapred.JobClient: Task Id : attempt_201001261532_1137_r_19_0, Status : FAILED java.io.IOException: Bad connect ack with firstBadLink 10.2.19.1:50010 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263) 10/01/30 17:24:36 INFO mapred.JobClient: map 100% reduce 11% 10/01/30 17:24:42 INFO mapred.JobClient: map 100% reduce 12% 10/01/30 17:24:49 INFO mapred.JobClient: map 100% reduce 13% 10/01/30 17:24:55 INFO mapred.JobClient: map 100% reduce 14% 10/01/30 17:25:00 INFO mapred.JobClient: map 100% reduce 15% From searching around, it seems like the most common cause of BadLink and EOFExceptions is when the nodes don't have enough file descriptors set. But across all the grid machines, the file-max has been set to 1573039. Furthermore, we set ulimit -n to 65536 using hadoop-env.sh. Where else should I be looking for what's causing this?
Re: Repeated attempts to kill old job?
We restarted the grid and that did kill the repeated KillJobAction attempts. I forgot to look around with hadoop -dfsadmin, though. On Mon, Feb 1, 2010 at 11:29 PM, Rekha Joshi rekha...@yahoo-inc.com wrote: I would say restart the cluster, but suspect that would not help either - instead try checking up your running process list (eg: perl/shell script or a ETL pipeline job) to analyze/kill. Also wondering if any hadoop -dfsadmin commands can supersede this scenario.. Cheers, /R On 2/2/10 2:50 AM, Meng Mao meng...@gmail.com wrote: On our worker nodes, I see repeated requests for KillJobActions for the same old job: 2010-01-31 00:00:01,024 INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_201001261532_0690 2010-01-31 00:00:01,064 WARN org.apache.hadoop.mapred.TaskTracker: Unknown job job_201001261532_0690 being deleted. This request and response is repeated almost 30k times over the course of the day. Other nodes have the same behavior, except with different job ids. The jobs presumably all ran in the past, to completion or got killed manually. We use the grid fairly actively and that job is several hundred increments old. Has anyone seen this before? Is there a way to stop it?
Repeated attempts to kill old job?
On our worker nodes, I see repeated requests for KillJobActions for the same old job: 2010-01-31 00:00:01,024 INFO org.apache.hadoop.mapred.TaskTracker: Received 'KillJobAction' for job: job_201001261532_0690 2010-01-31 00:00:01,064 WARN org.apache.hadoop.mapred.TaskTracker: Unknown job job_201001261532_0690 being deleted. This request and response is repeated almost 30k times over the course of the day. Other nodes have the same behavior, except with different job ids. The jobs presumably all ran in the past, to completion or got killed manually. We use the grid fairly actively and that job is several hundred increments old. Has anyone seen this before? Is there a way to stop it?
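One quick way to gauge the scale from a worker node is to count the matching TaskTracker log lines and see how many distinct job ids are involved; the log path below is a guess at a typical install location and will differ per cluster:

grep -c "Received 'KillJobAction' for job: job_201001261532_0690" /usr/local/hadoop/logs/hadoop-*-tasktracker-*.log
grep -o "job_[0-9_]*" /usr/local/hadoop/logs/hadoop-*-tasktracker-*.log | sort | uniq -c | sort -rn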