Re: How to manage large record in MapReduce
Jerome,

You can take a look at FileStreamInputFormat at https://github.com/sonalgoyal/hiho/tree/hihoApache0.20/src/co/nubetech/hiho/mapreduce/lib/input

This provides an input stream per file. In our case, we are using the input stream to load data into the database directly. Maybe you can use this or a similar approach for working with your videos.

HTH

Thanks and Regards,
Sonal
Connect Hadoop with databases, Salesforce, FTP servers and others: https://github.com/sonalgoyal/hiho
Nube Technologies
http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal

On Thu, Jan 6, 2011 at 4:23 PM, Jérôme Thièvre jthie...@gmail.com wrote:

Hi,

We are currently using Hadoop (version 0.20.2) to manage some web archiving processes like fulltext indexing, and it works very well with small records that contain HTML. Now we would like to work with other types of web data, like videos. This kind of data can be really large, and of course these records don't fit in memory.

Is it possible to manage records whose content doesn't reside in memory but on disk? A possibility would be to implement a Writable that reads its content from a DataInput but doesn't load it into memory; instead, it would copy that content to a temporary file in the local file system and allow that content to be streamed back through an InputStream (an InputStreamWritable).

Has somebody tested a similar approach, and if not, do you think big problems could arise (with an impact on performance) from this method?

Thanks,

Jérôme Thièvre
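For a concrete picture of the approach Sonal describes, here is a minimal sketch (illustrative only, not the actual hiho code; all class and member names are assumptions) of an InputFormat that makes each file a single split and hands the map task an open FSDataInputStream instead of materialized records:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileStreamInputFormat
    extends FileInputFormat<NullWritable, FSDataInputStream> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // one split per file, so one open stream per map task
  }

  @Override
  public RecordReader<NullWritable, FSDataInputStream> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new StreamRecordReader();
  }

  static class StreamRecordReader
      extends RecordReader<NullWritable, FSDataInputStream> {
    private FSDataInputStream stream;
    private boolean handedOut = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException {
      Path path = ((FileSplit) split).getPath();
      FileSystem fs = path.getFileSystem(context.getConfiguration());
      stream = fs.open(path); // the file content is never buffered in memory
    }

    @Override
    public boolean nextKeyValue() {
      if (handedOut) {
        return false;
      }
      handedOut = true; // exactly one "record" per file: the stream itself
      return true;
    }

    @Override
    public NullWritable getCurrentKey() {
      return NullWritable.get();
    }

    @Override
    public FSDataInputStream getCurrentValue() {
      return stream;
    }

    @Override
    public float getProgress() {
      return handedOut ? 1.0f : 0.0f;
    }

    @Override
    public void close() throws IOException {
      if (stream != null) {
        stream.close();
      }
    }
  }
}

The mapper then reads from the stream at its own pace, which is what makes arbitrarily large records (like videos) workable.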
ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed. java.io.EOFException
Hi,

After a power failure, the name node is not starting, giving the following error. Kindly let me know how to resolve it. Thanks.

2011-01-07 04:14:49,666 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/192.168.1.2
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2+737
STARTUP_MSG:   build = -r 98c55c28258aa6f42250569bd7fa431ac657bdbd; compiled by 'root' on Mon Oct 11 17:21:30 UTC 2010
************************************************************/
2011-01-07 04:14:50,610 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
2011-01-07 04:14:50,670 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext
2011-01-07 04:14:50,907 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hdfs
2011-01-07 04:14:50,908 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2011-01-07 04:14:50,908 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=false
2011-01-07 04:14:50,931 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
2011-01-07 04:14:52,378 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext
2011-01-07 04:14:52,392 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean
2011-01-07 04:14:52,651 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readLong(DataInputStream.java:399)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.readCheckpointTime(FSImage.java:571)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.getFields(FSImage.java:562)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.java:237)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.java:226)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:316)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:343)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:317)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:214)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:394)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1148)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1157)
2011-01-07 04:14:52,662 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readLong(DataInputStream.java:399)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.readCheckpointTime(FSImage.java:571)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.getFields(FSImage.java:562)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.java:237)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.java:226)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:316)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:343)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:317)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:214)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:394)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1148)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1157)
2011-01-07 04:14:52,673 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

--
Regards
Shuja-ur-Rehman Baig
http://pk.linkedin.com/in/shujamughal
Too-many fetch failure Reduce Error
Dear all,

I am researching the below error and have not been able to find the reason for it:

Data Size: 3.4 GB
Hadoop-0.20.0

had...@ws32-test-lin:~/project/hadoop-0.20.2$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount /user/hadoop/page_content.txt page_content_output.txt
11/01/07 16:11:14 INFO input.FileInputFormat: Total input paths to process : 1
11/01/07 16:11:15 INFO mapred.JobClient: Running job: job_201101071129_0001
11/01/07 16:11:16 INFO mapred.JobClient: map 0% reduce 0%
11/01/07 16:11:41 INFO mapred.JobClient: map 1% reduce 0%
11/01/07 16:11:45 INFO mapred.JobClient: map 2% reduce 0%
11/01/07 16:11:48 INFO mapred.JobClient: map 3% reduce 0%
11/01/07 16:11:52 INFO mapred.JobClient: map 4% reduce 0%
11/01/07 16:11:56 INFO mapred.JobClient: map 5% reduce 0%
11/01/07 16:12:00 INFO mapred.JobClient: map 6% reduce 0%
11/01/07 16:12:05 INFO mapred.JobClient: map 7% reduce 0%
11/01/07 16:12:08 INFO mapred.JobClient: map 8% reduce 0%
11/01/07 16:12:11 INFO mapred.JobClient: map 9% reduce 0%
11/01/07 16:12:14 INFO mapred.JobClient: map 10% reduce 0%
11/01/07 16:12:17 INFO mapred.JobClient: map 11% reduce 0%
11/01/07 16:12:21 INFO mapred.JobClient: map 12% reduce 0%
11/01/07 16:12:24 INFO mapred.JobClient: map 13% reduce 0%
11/01/07 16:12:27 INFO mapred.JobClient: map 14% reduce 0%
11/01/07 16:12:30 INFO mapred.JobClient: map 15% reduce 0%
11/01/07 16:12:33 INFO mapred.JobClient: map 16% reduce 0%
11/01/07 16:12:36 INFO mapred.JobClient: map 17% reduce 0%
11/01/07 16:12:40 INFO mapred.JobClient: map 18% reduce 0%
11/01/07 16:12:45 INFO mapred.JobClient: map 19% reduce 0%
11/01/07 16:12:48 INFO mapred.JobClient: map 20% reduce 0%
11/01/07 16:12:54 INFO mapred.JobClient: map 21% reduce 0%
11/01/07 16:13:00 INFO mapred.JobClient: map 22% reduce 0%
11/01/07 16:13:04 INFO mapred.JobClient: map 22% reduce 1%
11/01/07 16:13:13 INFO mapred.JobClient: map 23% reduce 1%
11/01/07 16:13:19 INFO mapred.JobClient: map 24% reduce 1%
11/01/07 16:13:25 INFO mapred.JobClient: map 25% reduce 1%
11/01/07 16:13:30 INFO mapred.JobClient: map 26% reduce 1%
11/01/07 16:13:34 INFO mapred.JobClient: map 26% reduce 3%
11/01/07 16:13:36 INFO mapred.JobClient: map 27% reduce 3%
11/01/07 16:13:37 INFO mapred.JobClient: map 27% reduce 4%
11/01/07 16:13:39 INFO mapred.JobClient: map 28% reduce 4%
11/01/07 16:13:43 INFO mapred.JobClient: map 29% reduce 4%
11/01/07 16:13:46 INFO mapred.JobClient: map 30% reduce 4%
11/01/07 16:13:49 INFO mapred.JobClient: map 31% reduce 4%
11/01/07 16:13:52 INFO mapred.JobClient: map 32% reduce 4%
11/01/07 16:13:55 INFO mapred.JobClient: map 33% reduce 4%
11/01/07 16:13:58 INFO mapred.JobClient: map 34% reduce 4%
11/01/07 16:14:02 INFO mapred.JobClient: map 35% reduce 4%
11/01/07 16:14:05 INFO mapred.JobClient: map 36% reduce 4%
11/01/07 16:14:08 INFO mapred.JobClient: map 37% reduce 4%
11/01/07 16:14:11 INFO mapred.JobClient: map 38% reduce 4%
11/01/07 16:14:15 INFO mapred.JobClient: map 39% reduce 4%
11/01/07 16:14:19 INFO mapred.JobClient: map 40% reduce 4%
11/01/07 16:14:20 INFO mapred.JobClient: map 40% reduce 5%
11/01/07 16:14:25 INFO mapred.JobClient: map 41% reduce 5%
11/01/07 16:14:32 INFO mapred.JobClient: map 42% reduce 5%
11/01/07 16:14:38 INFO mapred.JobClient: map 43% reduce 5%
11/01/07 16:14:41 INFO mapred.JobClient: map 43% reduce 6%
11/01/07 16:14:43 INFO mapred.JobClient: map 44% reduce 6%
11/01/07 16:14:47 INFO mapred.JobClient: map 45% reduce 6%
11/01/07 16:14:50 INFO mapred.JobClient: map 46% reduce 6%
11/01/07 16:14:54 INFO mapred.JobClient: map 47% reduce 7%
11/01/07 16:14:59 INFO mapred.JobClient: map 48% reduce 7%
11/01/07 16:15:02 INFO mapred.JobClient: map 49% reduce 7%
11/01/07 16:15:05 INFO mapred.JobClient: map 50% reduce 7%
11/01/07 16:15:11 INFO mapred.JobClient: map 51% reduce 7%
11/01/07 16:15:14 INFO mapred.JobClient: map 52% reduce 7%
11/01/07 16:15:16 INFO mapred.JobClient: map 52% reduce 8%
11/01/07 16:15:20 INFO mapred.JobClient: map 53% reduce 8%
11/01/07 16:15:25 INFO mapred.JobClient: map 54% reduce 8%
11/01/07 16:15:29 INFO mapred.JobClient: map 55% reduce 8%
11/01/07 16:15:31 INFO mapred.JobClient: map 55% reduce 9%
11/01/07 16:15:33 INFO mapred.JobClient: map 56% reduce 9%
11/01/07 16:15:38 INFO mapred.JobClient: map 57% reduce 9%
11/01/07 16:15:42 INFO mapred.JobClient: map 58% reduce 9%
11/01/07 16:15:43 INFO mapred.JobClient: map 58% reduce 10%
11/01/07 16:15:46 INFO mapred.JobClient: map 59% reduce 10%
11/01/07 16:15:49 INFO mapred.JobClient: map 60% reduce 10%
11/01/07 16:15:53 INFO mapred.JobClient: map 61% reduce 10%
11/01/07 16:15:56 INFO mapred.JobClient: map 62% reduce 10%
11/01/07 16:16:00 INFO mapred.JobClient: map 63% reduce 10%
11/01/07 16:16:06 INFO mapred.JobClient: map 64% reduce 10%
11/01/07 16:16:10 INFO mapred.JobClient: map 65% reduce 10%
11/01/07 16:16:15 INFO mapred.JobClient: map 66% reduce 10%
11/01/07 16:16:18 INFO mapred.JobClient: map 67% reduce 10%
Re: How to manage large record in MapReduce
Hi Sonal,

Thank you. I have just implemented a solution similar to yours (without copying to a temp file, as suggested in my initial post), and it seems to work.

Best Regards,

Jérôme

2011/1/7 Sonal Goyal sonalgoy...@gmail.com:
[...]
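For readers landing on this thread later, here is a hedged sketch of the InputStreamWritable idea from the original post: readFields() spills the incoming bytes to a local temp file instead of RAM, and the record is consumed afterwards as a stream. This is illustrative only; it is not the code Jérôme actually wrote (his version avoids the temp file):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.io.Writable;

public class InputStreamWritable implements Writable {
  private File spillFile;

  @Override
  public void readFields(DataInput in) throws IOException {
    long remaining = in.readLong(); // length header written by write()
    spillFile = File.createTempFile("large-record-", ".bin");
    spillFile.deleteOnExit();
    OutputStream out = new FileOutputStream(spillFile);
    try {
      byte[] buf = new byte[64 * 1024];
      while (remaining > 0) {
        int chunk = (int) Math.min(buf.length, remaining);
        in.readFully(buf, 0, chunk); // copied to local disk, never held whole in RAM
        out.write(buf, 0, chunk);
        remaining -= chunk;
      }
    } finally {
      out.close();
    }
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(spillFile.length());
    InputStream in = new FileInputStream(spillFile);
    try {
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
    } finally {
      in.close();
    }
  }

  /** Stream the record's content back from the local temp file. */
  public InputStream getInputStream() throws IOException {
    return new FileInputStream(spillFile);
  }
}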
Regarding chaining multiple map-reduce jobs in Hadoop streaming
Hi,

I need to chain a couple of MapReduce jobs in Hadoop streaming. I am planning to use Python to write the mapper and reducer scripts. Is there any other way to chain these jobs other than using a shell script and a temporary directory in HDFS?

--
Thanks,
M. Varadharajan

"Experience is what you get when you didn't get what you wanted" - Prof. Randy Pausch in The Last Lecture

My Journal: www.thinkasgeek.wordpress.com
Re: Regarding chaining multiple map-reduce jobs in Hadoop streaming
You can use Oozie from Yahoo! to build an elegant workflow out of your streaming jobs, but you would still need intermediate output locations on HDFS, as output records aren't really pipelined into the next job.

--
Harsh J
www.harshj.com
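As a point of comparison, the shell-script-plus-temporary-directory pattern the question mentions looks like this when expressed with the Java MapReduce API (class name and paths are made up; for streaming, a shell script invoking hadoop jar hadoop-streaming.jar once per stage follows the same shape):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobChainDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path output = new Path(args[1]);
    // Intermediate results live in a temporary HDFS directory between stages.
    Path tmp = new Path("/tmp/chain-" + System.currentTimeMillis());

    Job first = new Job(conf, "stage-1");
    first.setJarByClass(JobChainDriver.class);
    // ... set mapper/reducer/output classes for stage 1 here ...
    FileInputFormat.addInputPath(first, input);
    FileOutputFormat.setOutputPath(first, tmp);
    if (!first.waitForCompletion(true)) {
      System.exit(1); // don't start stage 2 if stage 1 failed
    }

    Job second = new Job(conf, "stage-2");
    second.setJarByClass(JobChainDriver.class);
    // ... set mapper/reducer/output classes for stage 2 here ...
    FileInputFormat.addInputPath(second, tmp);
    FileOutputFormat.setOutputPath(second, output);
    boolean ok = second.waitForCompletion(true);

    FileSystem.get(conf).delete(tmp, true); // clean up the intermediate data
    System.exit(ok ? 0 : 1);
  }
}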
Re: Too-many fetch failure Reduce Error
Adarsh,

Do you have the hostnames for the masters and slaves in /etc/hosts?

esteban.

On Fri, Jan 7, 2011 at 06:47, Adarsh Sharma adarsh.sha...@orkash.com wrote:
[...]
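A likely-relevant detail behind this suggestion: "Too many fetch-failures" usually means reducers cannot fetch map output from other nodes, often because hostnames don't resolve consistently across the cluster. A typical /etc/hosts, kept identical on every node, might look like this (names and addresses below are made up):

192.168.1.10   master
192.168.1.11   slave1
192.168.1.12   slave2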
Re: ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed. java.io.EOFException
Hi Shuja,

Can you paste the output of ls -lR on all of your dfs.name.dirs? (hopefully you have more than one, with one on an external machine via NFS, right?)

Thanks
-Todd

On Fri, Jan 7, 2011 at 4:39 AM, Shuja Rehman shujamug...@gmail.com wrote:
[...]

--
Todd Lipcon
Software
Help: How to increase amount of map tasks per job?
We have a job which runs in several map/reduce stages. In the first stage, a large number of map tasks (82) are initiated, as expected, and that causes all nodes to be used. In a later stage, where we are still dealing with large amounts of data, only 4 map tasks are initiated, so only 4 nodes are used. This later stage is actually the workhorse of the job and requires much more processing power than the initial stage. We are trying to understand why only a few map tasks are being used, as we are not getting the full advantage of our cluster.
Data Platform Engineer at Turn Inc.
Data Platform Engineer at Turn Inc.

If you're passionate about large-scale distributed systems, petabyte data warehouses, NoSQL key-value stores, and high-throughput real-time reporting systems, and are interested in joining a world-class engineering team, you might well be the person we're looking for. This hands-on role contributes to the organization's success through expertise in large-scale MapReduce systems, advanced database programming and database architecture. We are a small team but build innovative and powerful systems. Our results were published in top-tier conferences such as VLDB (http://www.vldb2010.org/proceedings/files/papers/I08.pdf). If any of the following areas interest you, please send your resume to j...@turn.com

* Hybrid MapReduce/database systems
* Distributed and parallel computing
* Performance tuning and optimization
* Large-scale semi-structured data stores
* Real-time reporting systems in a 24x7 environment
* Turning research results into enterprise-class software

About Turn Inc.
Turn was founded to bring the efficiencies of search to digital advertising and to empower the world's premier advertising agencies and brands to reach custom audiences at scale. We are a software and services company with the industry's only end-to-end platform for delivering the most effective data-driven digital advertising in the world. Our technology infrastructure, self-service interface, optimization algorithms, real-time analytics, and interoperability represent the future of media and data management. The company is based in Silicon Valley with locations in New York City, Charlotte, Chicago, London, Los Angeles, and San Francisco. We are a rapidly growing, well-funded startup in Redwood City, CA, with a growing business, a working business model and a seasoned executive team. We're changing the way the world thinks about online advertising, and we are looking for talented engineers to help us take it to the next level.
Re: Help: How to increase amount of map tasks per job?
Set higher values for mapred.tasktracker.map.tasks.maximum (and mapred.tasktracker.reduce.tasks.maximum) in mapred-site.xml.

On Fri, Jan 7, 2011 at 12:58 PM, Tali K ncherr...@hotmail.com wrote:
[...]
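For reference, a sketch of the corresponding mapred-site.xml entries (the values shown are examples, not recommendations); note that these caps apply per TaskTracker, a distinction that comes up in the next reply:

<!-- Per-TaskTracker slot counts; example values only. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>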
Re: Help: How to increase amount of map tasks per job?
Also make sure you have enough input files for the next stage's mappers to work with. Read through the input splits part of the tutorial: http://wiki.apache.org/hadoop/HadoopMapReduce

If the last stage had only 4 reducers running, they'd generate 4 output files. This will limit the number of mappers started in the next stage to 4, unless you tune your input split parameters or write a custom input split.

Hope this helps; there is a lot more literature on this on the web and in the Hadoop books released to date.

-Rahul

On Fri, Jan 7, 2011 at 1:19 PM, Ted Yu yuzhih...@gmail.com wrote:
[...]
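To make the split tuning concrete, here is a minimal sketch (class name and paths are made up) that caps the split size so a handful of large stage-1 output files still fan out to many map tasks; setMaxInputSplitSize is the new-API knob on org.apache.hadoop.mapreduce.lib.input.FileInputFormat, assuming your version exposes it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MoreMapsDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "workhorse-stage");
    job.setJarByClass(MoreMapsDriver.class);
    // ... set mapper/reducer/output classes here ...
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // Cap splits at ~64 MB so a few large input files still yield many map tasks.
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}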
RE: Help: How to increase amount of map tasks per job?
According to the documentation, that parameter is for the number of tasks *per TaskTracker*. I am asking about the number of tasks for the entire job and the entire cluster. That parameter is already set to 3, which is one less than the number of cores on each node's CPU, as recommended. In my question I stated that 82 tasks were run for the first job, yet only 4 for the second - both numbers being cluster-wide.

Date: Fri, 7 Jan 2011 13:19:42 -0800
Subject: Re: Help: How to increase amount of map tasks per job?
From: yuzhih...@gmail.com
To: common-user@hadoop.apache.org
[...]
Accessing HDFS
Hi everyone,

I'm wondering if there is a way of doing the equivalent of the following commands against HDFS:

File LocalinputDir = new File("/user/maha/inputDir");
String[] file = LocalinputDir.list();

I'm given a Hadoop input directory with files {f1, f2, ...}. I would like Hadoop to open the directory and list the file names. How can I do that? I know I can copyToLocal the directory and use the regular local fs.List() to see the file names, but the other way is faster.

Thanks,
Maha
Re: Help: How to increase amount of map tasks per job?
You said you have a large amount of data. How large is that, approximately? Did you compress the intermediate data (with what codec)?

Niels

2011/1/7 Tali K ncherr...@hotmail.com:
[...]

--
With kind regards,
Niels Basjes
Re: Help: How to increase amount of map tasks per job?
Check out mapred.map.tasks and mapred.reduce.tasks.

On Fri, Jan 7, 2011 at 1:40 PM, Tali K ncherr...@hotmail.com wrote:
[...]
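For completeness, a hedged two-line sketch of those job-level knobs, assuming an org.apache.hadoop.mapreduce.Job named job already exists. Note that mapred.map.tasks is only a hint (the InputFormat's split count decides the real number of map tasks), while the reduce count is honored exactly:

job.getConfiguration().setInt("mapred.map.tasks", 82); // a hint only; splits decide
job.setNumReduceTasks(16); // honored exactly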
Re: Accessing HDFS
[...]

You could do something like:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Path f = new Path("hdfs://user/maha/inputDir");
FileSystem fs = f.getFileSystem(new Configuration());
for (FileStatus s : fs.listStatus(f)) {
    System.out.println(s.getPath().getName());
}

Jacob RIDEOUT
Software Engineer
Return Path, Inc
Skype: jrideout
Twitter: @jrideout
Re: Accessing HDFS
Nice! I'd better try that. So the trick is just to add the hdfs scheme to the path to access that namespace.

Thanks a ton :)

Maha

On Jan 7, 2011, at 1:55 PM, Jacob R Rideout wrote:
[...]
Re: Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
Which Hadoop versions are you testing and compiling against?

Thanks and Regards,
Sonal
Connect Hadoop with databases, Salesforce, FTP servers and others: https://github.com/sonalgoyal/hiho
Nube Technologies
http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal

On Wed, Jan 5, 2011 at 3:20 PM, Cavus,M.,Fa. Post Direkt m.ca...@postdirekt.de wrote:

Hi, I get the following. Does anyone know why I get this error?

11/01/05 10:46:55 WARN conf.Configuration: fs.checkpoint.period is deprecated. Instead, use dfs.namenode.checkpoint.period
11/01/05 10:46:55 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
11/01/05 10:46:55 INFO mapreduce.JobSubmitter: number of splits:1
11/01/05 10:46:55 INFO mapreduce.JobSubmitter: adding the following namenodes' delegation tokens:null
11/01/05 10:46:56 INFO mapreduce.Job: Running job: job_201101051016_0008
11/01/05 10:46:57 INFO mapreduce.Job: map 0% reduce 0%
11/01/05 10:47:04 INFO mapreduce.Job: map 100% reduce 0%
11/01/05 10:47:13 INFO mapreduce.Job: Task Id : attempt_201101051016_0008_r_00_0, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
11/01/05 10:47:23 INFO mapreduce.Job: Task Id : attempt_201101051016_0008_r_00_1, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
11/01/05 10:47:34 INFO mapreduce.Job: Task Id : attempt_201101051016_0008_r_00_2, Status : FAILED
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
11/01/05 10:47:47 INFO mapreduce.Job: Job complete: job_201101051016_0008
11/01/05 10:47:47 INFO mapreduce.Job: Counters: 19
  FileSystemCounters
    FILE_BYTES_WRITTEN=38
    HDFS_BYTES_READ=69
  Job Counters
    Data-local map tasks=1
    Total time spent by all maps waiting after reserving slots (ms)=0
    Total time spent by all reduces waiting after reserving slots (ms)=0
    Failed reduce tasks=1
    SLOTS_MILLIS_MAPS=5781
    SLOTS_MILLIS_REDUCES=6379
    Launched map tasks=1
    Launched reduce tasks=4
  Map-Reduce Framework
    Combine input records=0
    Failed Shuffles=0
    GC time elapsed (ms)=97
    Map input records=0
    Map output bytes=0
    Map output records=0
    Merged Map outputs=0
    Spilled Records=0
    SPLIT_RAW_BYTES=69
11/01/05 10:47:47 INFO zookeeper.ZooKeeper: Session: 0x12d555a4ed80018 closed
Re: Help: How to increase amount of map tasks per job?
It would depend on your input format. If the job is using an InputFormat that does not let it split files, you would get only as many mappers as files (mappers == no. of files). For splittable input files, you get mappers >= no. of files. A little more information on what the input format is would help track down the problem a bit more.

On Sat, Jan 8, 2011 at 3:10 AM, Tali K ncherr...@hotmail.com wrote:
[...]

--
Harsh J
www.harshj.com
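To illustrate the first case, here is a minimal sketch (the class name is made up) of a deliberately non-splittable input format; with this, the job gets exactly one mapper per input file regardless of file size:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class UnsplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // forces mappers == number of input files
  }
}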