Re: Print execution time
It's also available on the JobTracker web UI. Farhan Husain wrote: The logs might help. On Tue, Apr 7, 2009 at 7:28 PM, Mithila Nagendra mnage...@asu.edu wrote: Hey all! Is there a way to print out the execution time of a map-reduce task? Is there an inbuilt function or option to be used with bin/hadoop? Thanks! Mithila Nagendra
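A minimal sketch of the programmatic route, assuming the 0.18-era org.apache.hadoop.mapred API and a JobConf that is already fully configured elsewhere: runJob() blocks until the job completes, so timing the call in the driver gives the job's wall-clock execution time.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class TimedDriver {
  // Assumes `conf` already has mapper, input/output paths, etc. set up.
  public static void runTimed(JobConf conf) throws Exception {
    long start = System.currentTimeMillis();
    RunningJob job = JobClient.runJob(conf); // blocks until the job finishes
    long elapsedMs = System.currentTimeMillis() - start;
    System.out.println("Job " + job.getJobID() + " finished in " + elapsedMs + " ms");
  }
}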
Re: Very asymmetric data allocation
Great, thanks for the info! Right after I finished my last question I started to think about how Hadoop measures data allocation. Are the figures presented actually the size of HDFS on each machine, or the amount of disk allocated as measured by issuing something like df? The reason I am asking is that df -h is quite close to the figures presented in the GUI, but it could be a coincidence. //Marcus

On Tue, Apr 7, 2009 at 4:02 PM, Koji Noguchi knogu...@yahoo-inc.com wrote: Marcus, One known issue in 0.18.3 is HADOOP-5465. Copy-paste from https://issues.apache.org/jira/browse/HADOOP-4489?focusedCommentId=12693956&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12693956 Hairong said: This bug might be caused by HADOOP-5465. Once a datanode hits HADOOP-5465, the NameNode sends an empty replication request to the data node on every reply to a heartbeat, thus not a single scheduled block deletion request can be sent to the data node. (Also, if you're always writing from one of the nodes, that node is more likely to get full.) Nigel, not sure if this is the issue, but it would be nice to have 0.18.4 out. Koji

-Original Message- From: Marcus Herou [mailto:marcus.he...@tailsweep.com] Sent: Tuesday, April 07, 2009 12:45 AM To: hadoop-u...@lucene.apache.org Subject: Very asymmetric data allocation

Hi. We are running Hadoop 0.18.3 and noticed a strange issue when one of our machines ran out of disk yesterday. As the table below shows, the server mapredcoord is 66.91% allocated while the others are almost empty. How can that be? Any information about this would be very helpful. mapredcoord is also our jobtracker.
//Marcus

Node          Last Contact  Admin State  Size (GB)  Used (%)  Remaining (GB)  Blocks
mapredcoord   2             In Service   416.69     66.91     90.94           19806
mapreduce2    2             In Service   416.69     6.71      303.5           4456
mapreduce3    2             In Service   416.69     0.44      351.69          3975
mapreduce4    0             In Service   416.69     0.25      355.82          1549
mapreduce5    2             In Service   416.69     0.42      347.68          3995
mapreduce6    0             In Service   416.69     0.43      352.7           3982
mapreduce7    0             In Service   416.69     0.5       351.91          4079
mapreduce8    1             In Service   416.69     0.48      350.15          4169

-- Marcus Herou CTO and co-founder Tailsweep AB +46702561312 marcus.he...@tailsweep.com http://www.tailsweep.com/ http://blogg.tailsweep.com/
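For what it's worth, the figures on that page are what each datanode reports to the namenode in its heartbeats (capacity, DFS used, remaining), computed from df over its local data directories, which is why they track df -h so closely. A small sketch that prints the same per-node figures programmatically, assuming the 0.18-era org.apache.hadoop.dfs package (it moved to org.apache.hadoop.hdfs in later releases) and that fs.default.name points at the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.dfs.DatanodeInfo;
import org.apache.hadoop.dfs.DistributedFileSystem;
import org.apache.hadoop.fs.FileSystem;

public class NodeUsage {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    DistributedFileSystem dfs = (DistributedFileSystem) fs;
    for (DatanodeInfo node : dfs.getDataNodeStats()) {
      // getName() is host:port; figures are bytes, as reported in heartbeats
      System.out.println(node.getName()
          + " capacity=" + node.getCapacity()
          + " dfsUsed=" + node.getDfsUsed()
          + " remaining=" + node.getRemaining());
    }
  }
}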
Getting free and used space
Hi. I'm trying to use the API to get the overall used and free space. I tried the getUsed() function, but it always returns 0. Any idea? Thanks.
Re: Too many fetch errors
I have checked the log and found that for each map task, there are 3 failures which look like machine1 (failed) -> machine2 (failed) -> machine1 (failed) -> machine2 (succeeded). All failures are "Too many fetch failures". And I am sure there is no firewall between the two nodes; at least port 50060 can be accessed from a web browser. How can I check whether two nodes can fetch mapper outputs from one another? I have no idea how reducers fetch these data ... Thanks!

On Wed, Apr 8, 2009 at 2:21 AM, Aaron Kimball aa...@cloudera.com wrote: Xiaolin, Are you certain that the two nodes can fetch mapper outputs from one another? If it's taking that long to complete, it might be the case that what makes it complete is just that eventually it abandons one of your two nodes and runs everything on a single node where it succeeds -- defeating the point, of course. Might there be a firewall between the two nodes that blocks the port used by the reducer to fetch the mapper outputs? (I think this is on 50060 by default.) - Aaron

On Tue, Apr 7, 2009 at 8:08 AM, xiaolin guo xiao...@hulu.com wrote: This simple map-reduce application will take nearly 1 hour to finish running on the two-node cluster, due to lots of Failed/Killed task attempts, while on the single-node cluster this application only takes 1 minute ... I am quite confused about why there are so many Failed/Killed attempts ..

On Tue, Apr 7, 2009 at 10:40 PM, xiaolin guo xiao...@hulu.com wrote: I am trying to set up a small Hadoop cluster; everything was OK before I moved from a single-node cluster to a two-node cluster. I followed the article http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster) to configure master and slaves. However, when I tried to run the example wordcount map-reduce application, the reduce task got stuck at 19% for a long time. Then I got a notice: INFO mapred.JobClient: TaskId : attempt_200904072219_0001_m_02_0, Status : FAILED too many fetch errors and an error message: Error reading task outputslave. All map tasks on both task nodes had finished, which could be verified in the task tracker pages. Both nodes work well in single-node mode. And the Hadoop file system seems to be healthy in multi-node mode. Can anyone help me with this issue? I have been entangled in this issue for a long time ... Thanks very much!
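One crude way to check the fetch path, as a sketch: reducers pull map output over the TaskTracker's HTTP port (50060 by default), and the trackers advertise themselves by hostname, so from each node try to open a connection to the other node's hostname. The port number here is the default and an assumption to adjust.

import java.net.HttpURLConnection;
import java.net.URL;

public class FetchProbe {
  public static void main(String[] args) throws Exception {
    // usage: java FetchProbe <other-node-hostname>
    URL url = new URL("http://" + args[0] + ":50060/");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setConnectTimeout(5000); // fail fast if the port is blocked
    System.out.println(args[0] + " answered: HTTP " + conn.getResponseCode());
  }
}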
Re: Too many fetch errors
Fixed the problem. The problem was that one of the nodes could not resolve the name of the other node. Even if I use IP addresses in the masters and slaves files, Hadoop will use the name of the node instead of the IP address ...

On Wed, Apr 8, 2009 at 7:26 PM, xiaolin guo xiao...@hulu.com wrote: I have checked the log and found that for each map task, there are 3 failures which look like machine1 (failed) -> machine2 (failed) -> machine1 (failed) -> machine2 (succeeded). All failures are "Too many fetch failures". And I am sure there is no firewall between the two nodes; at least port 50060 can be accessed from a web browser. How can I check whether two nodes can fetch mapper outputs from one another? I have no idea how reducers fetch these data ... Thanks!

On Wed, Apr 8, 2009 at 2:21 AM, Aaron Kimball aa...@cloudera.com wrote: Xiaolin, Are you certain that the two nodes can fetch mapper outputs from one another? If it's taking that long to complete, it might be the case that what makes it complete is just that eventually it abandons one of your two nodes and runs everything on a single node where it succeeds -- defeating the point, of course. Might there be a firewall between the two nodes that blocks the port used by the reducer to fetch the mapper outputs? (I think this is on 50060 by default.) - Aaron

On Tue, Apr 7, 2009 at 8:08 AM, xiaolin guo xiao...@hulu.com wrote: This simple map-reduce application will take nearly 1 hour to finish running on the two-node cluster, due to lots of Failed/Killed task attempts, while on the single-node cluster this application only takes 1 minute ... I am quite confused about why there are so many Failed/Killed attempts ..

On Tue, Apr 7, 2009 at 10:40 PM, xiaolin guo xiao...@hulu.com wrote: I am trying to set up a small Hadoop cluster; everything was OK before I moved from a single-node cluster to a two-node cluster. I followed the article http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster) to configure master and slaves. However, when I tried to run the example wordcount map-reduce application, the reduce task got stuck at 19% for a long time. Then I got a notice: INFO mapred.JobClient: TaskId : attempt_200904072219_0001_m_02_0, Status : FAILED too many fetch errors and an error message: Error reading task outputslave. All map tasks on both task nodes had finished, which could be verified in the task tracker pages. Both nodes work well in single-node mode. And the Hadoop file system seems to be healthy in multi-node mode. Can anyone help me with this issue? I have been entangled in this issue for a long time ... Thanks very much!
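Since the fix turned out to be name resolution, a tiny sketch that makes the failure mode visible: run it on each node with every other node's hostname as an argument; if it throws UnknownHostException, add the mapping to /etc/hosts (or DNS) on that node.

import java.net.InetAddress;

public class ResolveCheck {
  public static void main(String[] args) throws Exception {
    for (String host : args) {
      // throws java.net.UnknownHostException if the name cannot be resolved
      System.out.println(host + " -> " + InetAddress.getByName(host).getHostAddress());
    }
  }
}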
using cascading for map-reduce
Hi, I am trying to use the Cascading API as a replacement for writing raw map-reduce. Can anybody give me a detailed description of how to run it? I have followed the instructions given on http://www.cascading.org/ but couldn't get it working, so if anybody has run it, please guide me on this.
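For reference, a rough sketch of the canonical Cascading word-count flow, pieced together from the Cascading 1.x documentation; the class names and constructor signatures here are from memory and should be checked against the release you actually have, and the input/output paths are placeholders passed on the command line.

import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCountFlow {
  public static void main(String[] args) {
    Tap source = new Hfs(new TextLine(new Fields("offset", "line")), args[0]);
    Tap sink = new Hfs(new TextLine(), args[1], true); // true = replace existing output

    Pipe assembly = new Pipe("wordcount");
    // emit one tuple per whitespace-delimited word on each line
    assembly = new Each(assembly, new Fields("line"),
        new RegexGenerator(new Fields("word"), "\\S+"));
    assembly = new GroupBy(assembly, new Fields("word"));
    assembly = new Every(assembly, new Count()); // count occurrences per group

    Properties props = new Properties();
    FlowConnector.setApplicationJarClass(props, WordCountFlow.class);
    Flow flow = new FlowConnector(props).connect("wordcount", source, sink, assembly);
    flow.complete(); // plans and runs the underlying MapReduce job(s)
  }
}

Package the class into a jar and launch it with bin/hadoop jar, the same way as any other MapReduce driver.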
Re: Getting free and used space
Hey Stas, Did you try this as a privileged user? There might be some permission errors... in most of the released versions, getUsed() is only available to the Hadoop superuser. It may be that the exception isn't propagating correctly. Brian On Apr 8, 2009, at 3:13 AM, Stas Oskin wrote: Hi. I'm trying to use the API to get the overall used and free space. I tried the getUsed() function, but it always returns 0. Any idea? Thanks.
using cascading
Hi, I am new to the Cascading concept. Can anybody help me with how to run the Cascading example? I followed the http://cascading.org link but it's not helping me. Please can anybody explain the steps needed to run the Cascading example?
RE: Hadoop data nodes failing to start
FYI: Problem fixed. It was apparently a timeout condition present in 0.18.3 that only popped up when the additional nodes were added. The solution was to put the following entry in hadoop-site.xml:

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>

Thanks to 'jdcryans' and 'digarok' from IRC for the help. -kevin

-Original Message- From: Kevin Eppinger [mailto:keppin...@adknowledge.com] Sent: Tuesday, April 07, 2009 1:05 PM To: core-user@hadoop.apache.org Subject: Hadoop data nodes failing to start

Hello everyone- So I have a 5 node cluster that I've been running for a few weeks with no problems. Today I decided to add nodes and double its size to 10. After doing all the setup and starting the cluster, I discovered that four out of the 10 nodes had failed to start up. Specifically, the data nodes didn't start. The task trackers seemed to start fine. Thinking it was something I did incorrectly with the expansion, I then reverted back to the 5 node configuration but I'm experiencing the same problem...with only 2 of 5 nodes starting correctly. Here is what I'm seeing in the hadoop-*-datanode*.log files:

2009-04-07 12:35:40,628 INFO org.apache.hadoop.dfs.DataNode: Starting Periodic block scanner.
2009-04-07 12:35:45,548 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 9269 blocks got processed in 1128 msecs
2009-04-07 12:35:45,584 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.254.165.223:50010, storageID=DS-202528624-10.254.131.244-50010-1238604807366, infoPort=50075, ipcPort=50020):DataXceiveServer: Exiting due to:java.nio.channels.ClosedSelectorException
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:66)
        at sun.nio.ch.SelectorImpl.selectNow(SelectorImpl.java:88)
        at sun.nio.ch.Util.releaseTemporarySelector(Util.java:135)
        at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:120)
        at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)
        at java.lang.Thread.run(Thread.java:619)

After this the data node shuts down. This same message is appearing on all the failed nodes. Help! -kevin
Re: Hadoop data nodes failing to start
Kevin, I'm glad it worked for you. We talked a bit about 5114 yesterday, any chance of trying the 0.18 branch on that same cluster without the socket timeout thing? Thx, J-D

On Wed, Apr 8, 2009 at 9:24 AM, Kevin Eppinger keppin...@adknowledge.com wrote: FYI: Problem fixed. It was apparently a timeout condition present in 0.18.3 that only popped up when the additional nodes were added. The solution was to put the following entry in hadoop-site.xml:

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>

Thanks to 'jdcryans' and 'digarok' from IRC for the help. -kevin

-Original Message- From: Kevin Eppinger [mailto:keppin...@adknowledge.com] Sent: Tuesday, April 07, 2009 1:05 PM To: core-user@hadoop.apache.org Subject: Hadoop data nodes failing to start

Hello everyone- So I have a 5 node cluster that I've been running for a few weeks with no problems. Today I decided to add nodes and double its size to 10. After doing all the setup and starting the cluster, I discovered that four out of the 10 nodes had failed to start up. Specifically, the data nodes didn't start. The task trackers seemed to start fine. Thinking it was something I did incorrectly with the expansion, I then reverted back to the 5 node configuration but I'm experiencing the same problem...with only 2 of 5 nodes starting correctly. Here is what I'm seeing in the hadoop-*-datanode*.log files:

2009-04-07 12:35:40,628 INFO org.apache.hadoop.dfs.DataNode: Starting Periodic block scanner.
2009-04-07 12:35:45,548 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 9269 blocks got processed in 1128 msecs
2009-04-07 12:35:45,584 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.254.165.223:50010, storageID=DS-202528624-10.254.131.244-50010-1238604807366, infoPort=50075, ipcPort=50020):DataXceiveServer: Exiting due to:java.nio.channels.ClosedSelectorException
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:66)
        at sun.nio.ch.SelectorImpl.selectNow(SelectorImpl.java:88)
        at sun.nio.ch.Util.releaseTemporarySelector(Util.java:135)
        at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:120)
        at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)
        at java.lang.Thread.run(Thread.java:619)

After this the data node shuts down. This same message is appearing on all the failed nodes. Help! -kevin
Re: Example of deploying jars through DistributedCache?
Does it work if you use addArchiveToClassPath()? Also, it may be more convenient to use GenericOptionsParser's -libjars option. Tom

On Mon, Mar 2, 2009 at 7:42 AM, Aaron Kimball aa...@cloudera.com wrote: Hi all, I'm stumped as to how to use the distributed cache's classpath feature. I have a library of Java classes I'd like to distribute to jobs and use in my mapper; I figured the DCache's addFileToClassPath() method was the correct means, given the example at http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html. I've boiled it down to the following non-working example:

in TestDriver.java:

private void runJob() throws IOException {
  JobConf conf = new JobConf(getConf(), TestDriver.class);

  // do standard job configuration.
  FileInputFormat.addInputPath(conf, new Path("input"));
  FileOutputFormat.setOutputPath(conf, new Path("output"));
  conf.setMapperClass(TestMapper.class);
  conf.setNumReduceTasks(0);

  // load aaronTest2.jar into the dcache; this contains the class ValueProvider
  FileSystem fs = FileSystem.get(conf);
  fs.copyFromLocalFile(new Path("aaronTest2.jar"), new Path("tmp/aaronTest2.jar"));
  DistributedCache.addFileToClassPath(new Path("tmp/aaronTest2.jar"), conf);

  // run the job.
  JobClient.runJob(conf);
}

and then in TestMapper:

public void map(LongWritable key, Text value,
    OutputCollector<LongWritable, Text> output, Reporter reporter)
    throws IOException {
  try {
    ValueProvider vp = (ValueProvider) Class.forName("ValueProvider").newInstance();
    Text val = vp.getValue();
    output.collect(new LongWritable(1), val);
  } catch (ClassNotFoundException e) {
    throw new IOException("not found: " + e.toString()); // newInstance() throws to here.
  } catch (Exception e) {
    throw new IOException("Exception: " + e.toString());
  }
}

The class ValueProvider is to be loaded from aaronTest2.jar. I can verify that this code works if I put ValueProvider into the main jar I deploy. I can verify that aaronTest2.jar makes it into the ${mapred.local.dir}/taskTracker/archive/ But when run with ValueProvider in aaronTest2.jar, the job fails with:

$ bin/hadoop jar aaronTest1.jar TestDriver
09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process : 10
09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process : 10
09/03/01 22:36:04 INFO mapred.JobClient: Running job: job_200903012210_0005
09/03/01 22:36:05 INFO mapred.JobClient: map 0% reduce 0%
09/03/01 22:36:14 INFO mapred.JobClient: Task Id : attempt_200903012210_0005_m_00_0, Status : FAILED
java.io.IOException: not found: java.lang.ClassNotFoundException: ValueProvider
        at TestMapper.map(Unknown Source)
        at TestMapper.map(Unknown Source)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

Do I need to do something else (maybe in Mapper.configure()?) to actually classload the jar? The documentation makes me believe it should already be in the classpath by doing only what I've done above. I'm on Hadoop 0.18.3. Thanks, - Aaron
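To make the -libjars suggestion concrete: that option is handled by GenericOptionsParser, which a driver gets for free by going through ToolRunner. A sketch, reusing the TestDriver name from the example above:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TestDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), TestDriver.class);
    // ... standard job configuration as in the original example ...
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // now: bin/hadoop jar aaronTest1.jar TestDriver -libjars aaronTest2.jar
    System.exit(ToolRunner.run(new TestDriver(), args));
  }
}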
Re: using cascading for map-reduce
Hi! If you are interested in Cascading, I recommend you ask on the Cascading mailing list or come ask in the IRC channel. The mailing list can be found at the bottom left corner of www.cascading.org. Regards, Erik
Re: Getting free and used space
Hi. Thanks for the explanation. Now for the easier part - how do I specify the user when connecting? :) Is it a config-file-level or a run-time-level setting? Regards.

2009/4/8 Brian Bockelman bbock...@cse.unl.edu Hey Stas, Did you try this as a privileged user? There might be some permission errors... in most of the released versions, getUsed() is only available to the Hadoop superuser. It may be that the exception isn't propagating correctly. Brian On Apr 8, 2009, at 3:13 AM, Stas Oskin wrote: Hi. I'm trying to use the API to get the overall used and free space. I tried the getUsed() function, but it always returns 0. Any idea? Thanks.
Re: Getting free and used space
Hey Stas, What we do locally is apply the latest patch for this issue: https://issues.apache.org/jira/browse/HADOOP-4368 This makes getUsed() (actually, it switches to FileSystem.getStatus) not a privileged action. As far as specifying the user ... gee, I can't think of it off the top of my head. It's a variable you can insert into the JobConf, but I'd have to poke around Google or the code to remember which one (I try not to override it if possible). Brian

On Apr 8, 2009, at 8:51 AM, Stas Oskin wrote: Hi. Thanks for the explanation. Now for the easier part - how do I specify the user when connecting? :) Is it a config-file-level or a run-time-level setting? Regards. 2009/4/8 Brian Bockelman bbock...@cse.unl.edu Hey Stas, Did you try this as a privileged user? There might be some permission errors... in most of the released versions, getUsed() is only available to the Hadoop superuser. It may be that the exception isn't propagating correctly. Brian On Apr 8, 2009, at 3:13 AM, Stas Oskin wrote: Hi. I'm trying to use the API to get the overall used and free space. I tried the getUsed() function, but it always returns 0. Any idea? Thanks.
run the pipes wordcount example with nopipe
Hi, After several days of investigation, I can finally run the wordcount-nopipe example in the hadoop-0.19.1 package. However, there are some changes I made that I am not sure are the proper/correct way. Could anyone please help to verify?

1. Start the job with the -inputformat argument set to org.apache.hadoop.mapred.pipes.WordCountInputFormat.

2. Since the C++ RecordReader/Writer is used, no Java-based RecordReader/Writer should be used. However, I faced an error like "RecordReader defined while not needed" from the pipes. After checking org.apache.hadoop.mapred.pipes.Submitter.java, I found this code snippet:

if (results.hasOption("-inputformat")) {
  setIsJavaRecordReader(job, true);
  job.setInputFormat(getClass(results, "-inputformat", job, InputFormat.class));
}

So it seems that with -inputformat specified, the JavaRecordReader will be enabled. This caused the error. I then commented out the line setIsJavaRecordReader(job, true); and the example can be run. Is this the proper way to make the wordcount-nopipe example work? I see from the code that there is a commented line

//cli.addArgument("javareader", false, "is the RecordReader in Java");

Should this line be uncommented to support disabling the JavaRecordReader via a command-line option?

3. It seems that wordcount-nopipe only works for input/output with a local URI, file:///home/... Is this true?

Thanks, Jianmin
RE: Example of deploying jars through DistributedCache?
I use addArchiveToClassPath, and it works for me:

DistributedCache.addArchiveToClassPath(new Path(path), conf);

I was curious about this block of code. Why are you copying to tmp?

FileSystem fs = FileSystem.get(conf);
fs.copyFromLocalFile(new Path("aaronTest2.jar"), new Path("tmp/aaronTest2.jar"));

-Original Message- From: Tom White [mailto:t...@cloudera.com] Sent: Wednesday, April 08, 2009 9:36 AM To: core-user@hadoop.apache.org Subject: Re: Example of deploying jars through DistributedCache?

Does it work if you use addArchiveToClassPath()? Also, it may be more convenient to use GenericOptionsParser's -libjars option. Tom

On Mon, Mar 2, 2009 at 7:42 AM, Aaron Kimball aa...@cloudera.com wrote: Hi all, I'm stumped as to how to use the distributed cache's classpath feature. I have a library of Java classes I'd like to distribute to jobs and use in my mapper; I figured the DCache's addFileToClassPath() method was the correct means, given the example at http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html. I've boiled it down to the following non-working example:

in TestDriver.java:

private void runJob() throws IOException {
  JobConf conf = new JobConf(getConf(), TestDriver.class);

  // do standard job configuration.
  FileInputFormat.addInputPath(conf, new Path("input"));
  FileOutputFormat.setOutputPath(conf, new Path("output"));
  conf.setMapperClass(TestMapper.class);
  conf.setNumReduceTasks(0);

  // load aaronTest2.jar into the dcache; this contains the class ValueProvider
  FileSystem fs = FileSystem.get(conf);
  fs.copyFromLocalFile(new Path("aaronTest2.jar"), new Path("tmp/aaronTest2.jar"));
  DistributedCache.addFileToClassPath(new Path("tmp/aaronTest2.jar"), conf);

  // run the job.
  JobClient.runJob(conf);
}

and then in TestMapper:

public void map(LongWritable key, Text value,
    OutputCollector<LongWritable, Text> output, Reporter reporter)
    throws IOException {
  try {
    ValueProvider vp = (ValueProvider) Class.forName("ValueProvider").newInstance();
    Text val = vp.getValue();
    output.collect(new LongWritable(1), val);
  } catch (ClassNotFoundException e) {
    throw new IOException("not found: " + e.toString()); // newInstance() throws to here.
  } catch (Exception e) {
    throw new IOException("Exception: " + e.toString());
  }
}

The class ValueProvider is to be loaded from aaronTest2.jar. I can verify that this code works if I put ValueProvider into the main jar I deploy. I can verify that aaronTest2.jar makes it into the ${mapred.local.dir}/taskTracker/archive/ But when run with ValueProvider in aaronTest2.jar, the job fails with:

$ bin/hadoop jar aaronTest1.jar TestDriver
09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process : 10
09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process : 10
09/03/01 22:36:04 INFO mapred.JobClient: Running job: job_200903012210_0005
09/03/01 22:36:05 INFO mapred.JobClient: map 0% reduce 0%
09/03/01 22:36:14 INFO mapred.JobClient: Task Id : attempt_200903012210_0005_m_00_0, Status : FAILED
java.io.IOException: not found: java.lang.ClassNotFoundException: ValueProvider
        at TestMapper.map(Unknown Source)
        at TestMapper.map(Unknown Source)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

Do I need to do something else (maybe in Mapper.configure()?) to actually classload the jar? The documentation makes me believe it should already be in the classpath by doing only what I've done above.
I'm on Hadoop 0.18.3. Thanks, - Aaron
Re: Web ui
@Nick, I use Ajax very often and have previously done projects with ZK and jQuery; I can easily say that GWT was the easiest of them. JavaScript is only needed where the core features aren't enough. I can safely assume that we won't need any inline JavaScript. @Philip, Thanks for the pointer. That is a better solution than I imagined, actually, and I won't have to wait since it's a resolved issue. -- M. Raşit ÖZDAŞ
RE: Hadoop data nodes failing to start
Unfortunately not. I don't have much leeway to experiment with this cluster. -kevin

-Original Message- From: jdcry...@gmail.com [mailto:jdcry...@gmail.com] On Behalf Of Jean-Daniel Cryans Sent: Wednesday, April 08, 2009 8:30 AM To: core-user@hadoop.apache.org Subject: Re: Hadoop data nodes failing to start

Kevin, I'm glad it worked for you. We talked a bit about 5114 yesterday, any chance of trying the 0.18 branch on that same cluster without the socket timeout thing? Thx, J-D

On Wed, Apr 8, 2009 at 9:24 AM, Kevin Eppinger keppin...@adknowledge.com wrote: FYI: Problem fixed. It was apparently a timeout condition present in 0.18.3 that only popped up when the additional nodes were added. The solution was to put the following entry in hadoop-site.xml:

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>

Thanks to 'jdcryans' and 'digarok' from IRC for the help. -kevin

-Original Message- From: Kevin Eppinger [mailto:keppin...@adknowledge.com] Sent: Tuesday, April 07, 2009 1:05 PM To: core-user@hadoop.apache.org Subject: Hadoop data nodes failing to start

Hello everyone- So I have a 5 node cluster that I've been running for a few weeks with no problems. Today I decided to add nodes and double its size to 10. After doing all the setup and starting the cluster, I discovered that four out of the 10 nodes had failed to start up. Specifically, the data nodes didn't start. The task trackers seemed to start fine. Thinking it was something I did incorrectly with the expansion, I then reverted back to the 5 node configuration but I'm experiencing the same problem...with only 2 of 5 nodes starting correctly. Here is what I'm seeing in the hadoop-*-datanode*.log files:

2009-04-07 12:35:40,628 INFO org.apache.hadoop.dfs.DataNode: Starting Periodic block scanner.
2009-04-07 12:35:45,548 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 9269 blocks got processed in 1128 msecs
2009-04-07 12:35:45,584 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.254.165.223:50010, storageID=DS-202528624-10.254.131.244-50010-1238604807366, infoPort=50075, ipcPort=50020):DataXceiveServer: Exiting due to:java.nio.channels.ClosedSelectorException
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:66)
        at sun.nio.ch.SelectorImpl.selectNow(SelectorImpl.java:88)
        at sun.nio.ch.Util.releaseTemporarySelector(Util.java:135)
        at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:120)
        at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)
        at java.lang.Thread.run(Thread.java:619)

After this the data node shuts down. This same message is appearing on all the failed nodes. Help! -kevin
Chaining Multiple MapReduce Jobs.
Hi everyone, I have to chain multiple map-reduce jobs (actually 2 to 4 jobs), each of which depends on the output of the preceding job. In the reducer of each job I'm doing very little, just grouping by key from the maps. I want to give the output of one MapReduce job to the next job without having to go to disk. Does anyone have any ideas on how to do this? Thanx.
Re: Getting free and used space
You can insert this property into the JobConf, or specify it on the command line, e.g.: -D hadoop.job.ugi=username,group,group,group. - Aaron

On Wed, Apr 8, 2009 at 7:04 AM, Brian Bockelman bbock...@cse.unl.edu wrote: Hey Stas, What we do locally is apply the latest patch for this issue: https://issues.apache.org/jira/browse/HADOOP-4368 This makes getUsed() (actually, it switches to FileSystem.getStatus) not a privileged action. As far as specifying the user ... gee, I can't think of it off the top of my head. It's a variable you can insert into the JobConf, but I'd have to poke around Google or the code to remember which one (I try not to override it if possible). Brian

On Apr 8, 2009, at 8:51 AM, Stas Oskin wrote: Hi. Thanks for the explanation. Now for the easier part - how do I specify the user when connecting? :) Is it a config-file-level or a run-time-level setting? Regards. 2009/4/8 Brian Bockelman bbock...@cse.unl.edu Hey Stas, Did you try this as a privileged user? There might be some permission errors... in most of the released versions, getUsed() is only available to the Hadoop superuser. It may be that the exception isn't propagating correctly. Brian On Apr 8, 2009, at 3:13 AM, Stas Oskin wrote: Hi. I'm trying to use the API to get the overall used and free space. I tried the getUsed() function, but it always returns 0. Any idea? Thanks.
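A sketch of the run-time variant; under the 0.18-era permissions model, hadoop.job.ugi takes a comma-separated user name plus groups, and the user/group names below are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class UsedSpace {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("hadoop.job.ugi", "hadoop,supergroup"); // illustrative user,group
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Used bytes: " + fs.getUsed());
  }
}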
Re: How many people are using Hadoop Streaming?
Excellent. Thanks - A On Tue, Apr 7, 2009 at 2:16 PM, Owen O'Malley omal...@apache.org wrote: On Apr 7, 2009, at 11:41 AM, Aaron Kimball wrote: Owen, Is binary streaming actually readily available? https://issues.apache.org/jira/browse/HADOOP-1722
Re: Example of deploying jars through DistributedCache?
Ooh. The other DCache-based operations assume that you're dcaching files already resident in HDFS. I guess this assumes that the filenames are on the local filesystem. - Aaron

On Wed, Apr 8, 2009 at 8:32 AM, Brian MacKay brian.mac...@medecision.com wrote: I use addArchiveToClassPath, and it works for me:

DistributedCache.addArchiveToClassPath(new Path(path), conf);

I was curious about this block of code. Why are you copying to tmp?

FileSystem fs = FileSystem.get(conf);
fs.copyFromLocalFile(new Path("aaronTest2.jar"), new Path("tmp/aaronTest2.jar"));

-Original Message- From: Tom White [mailto:t...@cloudera.com] Sent: Wednesday, April 08, 2009 9:36 AM To: core-user@hadoop.apache.org Subject: Re: Example of deploying jars through DistributedCache?

Does it work if you use addArchiveToClassPath()? Also, it may be more convenient to use GenericOptionsParser's -libjars option. Tom

On Mon, Mar 2, 2009 at 7:42 AM, Aaron Kimball aa...@cloudera.com wrote: Hi all, I'm stumped as to how to use the distributed cache's classpath feature. I have a library of Java classes I'd like to distribute to jobs and use in my mapper; I figured the DCache's addFileToClassPath() method was the correct means, given the example at http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html. I've boiled it down to the following non-working example:

in TestDriver.java:

private void runJob() throws IOException {
  JobConf conf = new JobConf(getConf(), TestDriver.class);

  // do standard job configuration.
  FileInputFormat.addInputPath(conf, new Path("input"));
  FileOutputFormat.setOutputPath(conf, new Path("output"));
  conf.setMapperClass(TestMapper.class);
  conf.setNumReduceTasks(0);

  // load aaronTest2.jar into the dcache; this contains the class ValueProvider
  FileSystem fs = FileSystem.get(conf);
  fs.copyFromLocalFile(new Path("aaronTest2.jar"), new Path("tmp/aaronTest2.jar"));
  DistributedCache.addFileToClassPath(new Path("tmp/aaronTest2.jar"), conf);

  // run the job.
  JobClient.runJob(conf);
}

and then in TestMapper:

public void map(LongWritable key, Text value,
    OutputCollector<LongWritable, Text> output, Reporter reporter)
    throws IOException {
  try {
    ValueProvider vp = (ValueProvider) Class.forName("ValueProvider").newInstance();
    Text val = vp.getValue();
    output.collect(new LongWritable(1), val);
  } catch (ClassNotFoundException e) {
    throw new IOException("not found: " + e.toString()); // newInstance() throws to here.
  } catch (Exception e) {
    throw new IOException("Exception: " + e.toString());
  }
}

The class ValueProvider is to be loaded from aaronTest2.jar. I can verify that this code works if I put ValueProvider into the main jar I deploy. I can verify that aaronTest2.jar makes it into the ${mapred.local.dir}/taskTracker/archive/ But when run with ValueProvider in aaronTest2.jar, the job fails with:

$ bin/hadoop jar aaronTest1.jar TestDriver
09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process : 10
09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process : 10
09/03/01 22:36:04 INFO mapred.JobClient: Running job: job_200903012210_0005
09/03/01 22:36:05 INFO mapred.JobClient: map 0% reduce 0%
09/03/01 22:36:14 INFO mapred.JobClient: Task Id : attempt_200903012210_0005_m_00_0, Status : FAILED
java.io.IOException: not found: java.lang.ClassNotFoundException: ValueProvider
        at TestMapper.map(Unknown Source)
        at TestMapper.map(Unknown Source)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

Do I need to do something else (maybe in Mapper.configure()?) to actually classload the jar? The documentation makes me believe it should already be in the classpath by doing only what I've done above. I'm on Hadoop 0.18.3. Thanks, - Aaron
Re: Chaining Multiple MapReduce Jobs.
Hi, I am by far not a Hadoop expert, but I think you cannot start a Map task until the previous Reduce is finished. That means you probably have to store the Map output to disk first (because a) it may not fit into memory and b) you would risk data loss if the system crashes). As for the job chaining, you can check the JobControl class (http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html). Also you can look at https://issues.apache.org/jira/browse/HADOOP-3702 Regards, Lukas

On Wed, Apr 8, 2009 at 11:30 PM, asif md asif.d...@gmail.com wrote: Hi everyone, I have to chain multiple map-reduce jobs (actually 2 to 4 jobs), each of which depends on the output of the preceding job. In the reducer of each job I'm doing very little, just grouping by key from the maps. I want to give the output of one MapReduce job to the next job without having to go to disk. Does anyone have any ideas on how to do this? Thanx.

-- http://blog.lukas-vlcek.com/
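A sketch of the JobControl route mentioned above, assuming conf1 and conf2 are fully configured JobConfs and job1's output directory is job2's input directory:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainRunner {
  public static void runChain(JobConf conf1, JobConf conf2) throws Exception {
    Job job1 = new Job(conf1);
    Job job2 = new Job(conf2);
    job2.addDependingJob(job1); // job2 is held back until job1 succeeds

    JobControl control = new JobControl("chain");
    control.addJob(job1);
    control.addJob(job2);

    Thread runner = new Thread(control); // JobControl implements Runnable
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(1000); // poll until both jobs are done (or have failed)
    }
    control.stop();
  }
}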
Re: Chaining Multiple MapReduce Jobs.
You can also try decreasing the replication factor for the intermediate files between jobs. This will make writing those files faster.

On Apr 8, 2009, at 3:14 PM, Lukáš Vlček wrote: Hi, I am by far not a Hadoop expert, but I think you cannot start a Map task until the previous Reduce is finished. That means you probably have to store the Map output to disk first (because a) it may not fit into memory and b) you would risk data loss if the system crashes). As for the job chaining, you can check the JobControl class (http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html). Also you can look at https://issues.apache.org/jira/browse/HADOOP-3702 Regards, Lukas

On Wed, Apr 8, 2009 at 11:30 PM, asif md asif.d...@gmail.com wrote: Hi everyone, I have to chain multiple map-reduce jobs (actually 2 to 4 jobs), each of which depends on the output of the preceding job. In the reducer of each job I'm doing very little, just grouping by key from the maps. I want to give the output of one MapReduce job to the next job without having to go to disk. Does anyone have any ideas on how to do this? Thanx.

-- http://blog.lukas-vlcek.com/
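As a sketch of that suggestion: dfs.replication set in a JobConf applies to the files that job writes, so intermediate output can be replicated less aggressively than the final result. The factor 2 below is just an example; conf1 is assumed to be the JobConf of the job producing the intermediate files.

import org.apache.hadoop.mapred.JobConf;

JobConf conf1 = new JobConf();
// fewer replicas -> faster writes, at the cost of less redundancy
conf1.setInt("dfs.replication", 2);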
Re: BytesWritable get() returns more bytes than what's stored
Hi Bing, The issue here is that BytesWritable uses an internal buffer which is grown but not shrunk. The cause of this is that Writables in general are single instances that are shared across multiple input records. If you look at the internals of the input reader, you'll see that a single BytesWritable is instantiated, and then each time a record is read, it's read into that same instance. The purpose here is to avoid the allocation cost for each row. The end result is, as you've seen, that getBytes() returns an array which may be larger than the actual amount of data. In fact, the extra bytes (between .getSize() and .get().length) have undefined contents, not zero. Unfortunately, if the protobuffer API doesn't allow you to deserialize out of a smaller portion of a byte array, you're out of luck and will have to do the copy like you've mentioned. I imagine, though, that there's some way around this in the protobuffer API - perhaps you can use a ByteArrayInputStream here to your advantage. Hope that helps -Todd

On Wed, Apr 8, 2009 at 4:59 PM, bzheng bing.zh...@gmail.com wrote: I tried to store protocol buffers as BytesWritable in a sequence file <Text, BytesWritable>. It's stored using SequenceFile.Writer(new Text(key), new BytesWritable(protobuf.convertToBytes())). When reading the values from key/value pairs using value.get(), it returns more than what's stored. However, value.getSize() returns the correct number. This means in order to convert the byte[] to a protocol buffer again, I have to do Arrays.copyOf(value.get(), value.getSize()). This happens on both version 0.17.2 and 0.18.3. Does anyone know why this happens? Sample sizes for a few entries in the sequence file below. The extra bytes in value.get() all have values of zero.

value.getSize(): 7066   value.get().length: 10599
value.getSize(): 36456  value.get().length: 54684
value.getSize(): 32275  value.get().length: 54684
value.getSize(): 40561  value.get().length: 54684
value.getSize(): 16855  value.get().length: 54684
value.getSize(): 66304  value.get().length: 99456
value.getSize(): 26488  value.get().length: 99456
value.getSize(): 59327  value.get().length: 99456
value.getSize(): 36865  value.get().length: 99456
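To make the ByteArrayInputStream idea above concrete, a sketch; value is the BytesWritable read back from the sequence file, and MyProto stands in for whatever generated protocol buffer class is in use (it is hypothetical here):

import java.io.ByteArrayInputStream;

// Wrap only the valid region of the backing array -- no copy needed.
ByteArrayInputStream in =
    new ByteArrayInputStream(value.get(), 0, value.getSize());
MyProto message = MyProto.parseFrom(in); // MyProto is hypothetical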
BytesWritable get() returns more bytes than what's stored
I tried to store protocol buffers as BytesWritable in a sequence file <Text, BytesWritable>. It's stored using SequenceFile.Writer(new Text(key), new BytesWritable(protobuf.convertToBytes())). When reading the values from key/value pairs using value.get(), it returns more than what's stored. However, value.getSize() returns the correct number. This means in order to convert the byte[] to a protocol buffer again, I have to do Arrays.copyOf(value.get(), value.getSize()). This happens on both version 0.17.2 and 0.18.3. Does anyone know why this happens? Sample sizes for a few entries in the sequence file below. The extra bytes in value.get() all have values of zero.

value.getSize(): 7066   value.get().length: 10599
value.getSize(): 36456  value.get().length: 54684
value.getSize(): 32275  value.get().length: 54684
value.getSize(): 40561  value.get().length: 54684
value.getSize(): 16855  value.get().length: 54684
value.getSize(): 66304  value.get().length: 99456
value.getSize(): 26488  value.get().length: 99456
value.getSize(): 59327  value.get().length: 99456
value.getSize(): 36865  value.get().length: 99456
Re: BytesWritable get() returns more bytes than what's stored
Arrays.copyOf isn't required; protocol buffers have a method to merge from bytes. You can do:

protobuf.newBuilder().mergeFrom(value.getBytes(), 0, value.getLength())

The above is for Hadoop 0.19.1; the corresponding method names on BytesWritable in earlier versions of Hadoop might be slightly different. -- gaurav

On Apr 8, 2009, at 7:13 PM, Todd Lipcon wrote: Hi Bing, The issue here is that BytesWritable uses an internal buffer which is grown but not shrunk. The cause of this is that Writables in general are single instances that are shared across multiple input records. If you look at the internals of the input reader, you'll see that a single BytesWritable is instantiated, and then each time a record is read, it's read into that same instance. The purpose here is to avoid the allocation cost for each row. The end result is, as you've seen, that getBytes() returns an array which may be larger than the actual amount of data. In fact, the extra bytes (between .getSize() and .get().length) have undefined contents, not zero. Unfortunately, if the protobuffer API doesn't allow you to deserialize out of a smaller portion of a byte array, you're out of luck and will have to do the copy like you've mentioned. I imagine, though, that there's some way around this in the protobuffer API - perhaps you can use a ByteArrayInputStream here to your advantage. Hope that helps -Todd

On Wed, Apr 8, 2009 at 4:59 PM, bzheng bing.zh...@gmail.com wrote: I tried to store protocol buffers as BytesWritable in a sequence file <Text, BytesWritable>. It's stored using SequenceFile.Writer(new Text(key), new BytesWritable(protobuf.convertToBytes())). When reading the values from key/value pairs using value.get(), it returns more than what's stored. However, value.getSize() returns the correct number. This means in order to convert the byte[] to a protocol buffer again, I have to do Arrays.copyOf(value.get(), value.getSize()). This happens on both version 0.17.2 and 0.18.3. Does anyone know why this happens? Sample sizes for a few entries in the sequence file below. The extra bytes in value.get() all have values of zero.

value.getSize(): 7066   value.get().length: 10599
value.getSize(): 36456  value.get().length: 54684
value.getSize(): 32275  value.get().length: 54684
value.getSize(): 40561  value.get().length: 54684
value.getSize(): 16855  value.get().length: 54684
value.getSize(): 66304  value.get().length: 99456
value.getSize(): 26488  value.get().length: 99456
value.getSize(): 59327  value.get().length: 99456
value.getSize(): 36865  value.get().length: 99456
Re: BytesWritable get() returns more bytes than what's stored
Thanks for the clarification. Though I still find it strange that the get() method doesn't return only what's actually stored, regardless of buffer size. Is there any reason why you'd want to use/examine what's in the buffer?

Todd Lipcon-4 wrote: Hi Bing, The issue here is that BytesWritable uses an internal buffer which is grown but not shrunk. The cause of this is that Writables in general are single instances that are shared across multiple input records. If you look at the internals of the input reader, you'll see that a single BytesWritable is instantiated, and then each time a record is read, it's read into that same instance. The purpose here is to avoid the allocation cost for each row. The end result is, as you've seen, that getBytes() returns an array which may be larger than the actual amount of data. In fact, the extra bytes (between .getSize() and .get().length) have undefined contents, not zero. Unfortunately, if the protobuffer API doesn't allow you to deserialize out of a smaller portion of a byte array, you're out of luck and will have to do the copy like you've mentioned. I imagine, though, that there's some way around this in the protobuffer API - perhaps you can use a ByteArrayInputStream here to your advantage. Hope that helps -Todd

On Wed, Apr 8, 2009 at 4:59 PM, bzheng bing.zh...@gmail.com wrote: I tried to store protocol buffers as BytesWritable in a sequence file <Text, BytesWritable>. It's stored using SequenceFile.Writer(new Text(key), new BytesWritable(protobuf.convertToBytes())). When reading the values from key/value pairs using value.get(), it returns more than what's stored. However, value.getSize() returns the correct number. This means in order to convert the byte[] to a protocol buffer again, I have to do Arrays.copyOf(value.get(), value.getSize()). This happens on both version 0.17.2 and 0.18.3. Does anyone know why this happens? Sample sizes for a few entries in the sequence file below. The extra bytes in value.get() all have values of zero.

value.getSize(): 7066   value.get().length: 10599
value.getSize(): 36456  value.get().length: 54684
value.getSize(): 32275  value.get().length: 54684
value.getSize(): 40561  value.get().length: 54684
value.getSize(): 16855  value.get().length: 54684
value.getSize(): 66304  value.get().length: 99456
value.getSize(): 26488  value.get().length: 99456
value.getSize(): 59327  value.get().length: 99456
value.getSize(): 36865  value.get().length: 99456
Re: BytesWritable get() returns more bytes than what's stored
On Wed, Apr 8, 2009 at 7:14 PM, bzheng bing.zh...@gmail.com wrote: Thanks for the clarification. Though I still find it strange that the get() method doesn't return only what's actually stored, regardless of buffer size. Is there any reason why you'd want to use/examine what's in the buffer?

Because doing so requires an array copy. It's important for Hadoop performance to avoid needless copies of data. Most APIs that take byte[] arrays have a version that includes an offset and length. -Todd

Todd Lipcon-4 wrote: Hi Bing, The issue here is that BytesWritable uses an internal buffer which is grown but not shrunk. The cause of this is that Writables in general are single instances that are shared across multiple input records. If you look at the internals of the input reader, you'll see that a single BytesWritable is instantiated, and then each time a record is read, it's read into that same instance. The purpose here is to avoid the allocation cost for each row. The end result is, as you've seen, that getBytes() returns an array which may be larger than the actual amount of data. In fact, the extra bytes (between .getSize() and .get().length) have undefined contents, not zero. Unfortunately, if the protobuffer API doesn't allow you to deserialize out of a smaller portion of a byte array, you're out of luck and will have to do the copy like you've mentioned. I imagine, though, that there's some way around this in the protobuffer API - perhaps you can use a ByteArrayInputStream here to your advantage. Hope that helps -Todd

On Wed, Apr 8, 2009 at 4:59 PM, bzheng bing.zh...@gmail.com wrote: I tried to store protocol buffers as BytesWritable in a sequence file <Text, BytesWritable>. It's stored using SequenceFile.Writer(new Text(key), new BytesWritable(protobuf.convertToBytes())). When reading the values from key/value pairs using value.get(), it returns more than what's stored. However, value.getSize() returns the correct number. This means in order to convert the byte[] to a protocol buffer again, I have to do Arrays.copyOf(value.get(), value.getSize()). This happens on both version 0.17.2 and 0.18.3. Does anyone know why this happens? Sample sizes for a few entries in the sequence file below. The extra bytes in value.get() all have values of zero.

value.getSize(): 7066   value.get().length: 10599
value.getSize(): 36456  value.get().length: 54684
value.getSize(): 32275  value.get().length: 54684
value.getSize(): 40561  value.get().length: 54684
value.getSize(): 16855  value.get().length: 54684
value.getSize(): 66304  value.get().length: 99456
value.getSize(): 26488  value.get().length: 99456
value.getSize(): 59327  value.get().length: 99456
value.getSize(): 36865  value.get().length: 99456
Re: BytesWritable get() returns more bytes than what's stored
FYI: this (open) JIRA might be interesting to you: http://issues.apache.org/jira/browse/HADOOP-3788 Alex

On Wed, Apr 8, 2009 at 7:18 PM, Todd Lipcon t...@cloudera.com wrote: On Wed, Apr 8, 2009 at 7:14 PM, bzheng bing.zh...@gmail.com wrote: Thanks for the clarification. Though I still find it strange that the get() method doesn't return only what's actually stored, regardless of buffer size. Is there any reason why you'd want to use/examine what's in the buffer?

Because doing so requires an array copy. It's important for Hadoop performance to avoid needless copies of data. Most APIs that take byte[] arrays have a version that includes an offset and length. -Todd

Todd Lipcon-4 wrote: Hi Bing, The issue here is that BytesWritable uses an internal buffer which is grown but not shrunk. The cause of this is that Writables in general are single instances that are shared across multiple input records. If you look at the internals of the input reader, you'll see that a single BytesWritable is instantiated, and then each time a record is read, it's read into that same instance. The purpose here is to avoid the allocation cost for each row. The end result is, as you've seen, that getBytes() returns an array which may be larger than the actual amount of data. In fact, the extra bytes (between .getSize() and .get().length) have undefined contents, not zero. Unfortunately, if the protobuffer API doesn't allow you to deserialize out of a smaller portion of a byte array, you're out of luck and will have to do the copy like you've mentioned. I imagine, though, that there's some way around this in the protobuffer API - perhaps you can use a ByteArrayInputStream here to your advantage. Hope that helps -Todd

On Wed, Apr 8, 2009 at 4:59 PM, bzheng bing.zh...@gmail.com wrote: I tried to store protocol buffers as BytesWritable in a sequence file <Text, BytesWritable>. It's stored using SequenceFile.Writer(new Text(key), new BytesWritable(protobuf.convertToBytes())). When reading the values from key/value pairs using value.get(), it returns more than what's stored. However, value.getSize() returns the correct number. This means in order to convert the byte[] to a protocol buffer again, I have to do Arrays.copyOf(value.get(), value.getSize()). This happens on both version 0.17.2 and 0.18.3. Does anyone know why this happens? Sample sizes for a few entries in the sequence file below. The extra bytes in value.get() all have values of zero.

value.getSize(): 7066   value.get().length: 10599
value.getSize(): 36456  value.get().length: 54684
value.getSize(): 32275  value.get().length: 54684
value.getSize(): 40561  value.get().length: 54684
value.getSize(): 16855  value.get().length: 54684
value.getSize(): 66304  value.get().length: 99456
value.getSize(): 26488  value.get().length: 99456
value.getSize(): 59327  value.get().length: 99456
value.getSize(): 36865  value.get().length: 99456
Re: Chaining Multiple MapReduce Jobs.
Chapter 8 of my book covers this in detail; the alpha chapter should be available at the Apress web site. Chain mapping rules! http://www.apress.com/book/view/1430219424

On Wed, Apr 8, 2009 at 3:30 PM, Nathan Marz nat...@rapleaf.com wrote: You can also try decreasing the replication factor for the intermediate files between jobs. This will make writing those files faster.

On Apr 8, 2009, at 3:14 PM, Lukáš Vlček wrote: Hi, I am by far not a Hadoop expert, but I think you cannot start a Map task until the previous Reduce is finished. That means you probably have to store the Map output to disk first (because a) it may not fit into memory and b) you would risk data loss if the system crashes). As for the job chaining, you can check the JobControl class (http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html). Also you can look at https://issues.apache.org/jira/browse/HADOOP-3702 Regards, Lukas

On Wed, Apr 8, 2009 at 11:30 PM, asif md asif.d...@gmail.com wrote: Hi everyone, I have to chain multiple map-reduce jobs (actually 2 to 4 jobs), each of which depends on the output of the preceding job. In the reducer of each job I'm doing very little, just grouping by key from the maps. I want to give the output of one MapReduce job to the next job without having to go to disk. Does anyone have any ideas on how to do this? Thanx.

-- http://blog.lukas-vlcek.com/

-- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
CloudBurst: Hadoop for DNA Sequence Analysis
Hadoop Users, I just wanted to announce that my Hadoop application 'CloudBurst' is available open source at: http://cloudburst-bio.sourceforge.net In a nutshell, it is an application for mapping millions of short DNA sequences to a reference genome to, for example, map out differences in one individual's genome compared to the reference genome. As you might imagine, this is a very data-intensive problem, but Hadoop enables the application to scale linearly on large clusters. A full description of the program is available in the journal Bioinformatics: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp236 I also wanted to take this opportunity to thank everyone on this mailing list. The discussions posted were essential for navigating the ins and outs of Hadoop during the development of CloudBurst. Thanks everyone! Michael Schatz http://www.cbcb.umd.edu/~mschatz
RE: CloudBurst: Hadoop for DNA Sequence Analysis
As a matter of fact, it is nowhere near data-intensive: it does take gigabytes of input data to process, but it is mostly RAM- and CPU-intensive. Post-processing of alignment files, though, is exactly where Hadoop excels. At least as far as I understand, the majority of the time is spent on DP alignment, whereas navigation in seed space and the N*log(N) sort require only a fraction of that time - that was my experience applying a Hadoop cluster to sequencing human genomes. --- Dmitry Pushkarev +1-650-644-8988

-Original Message- From: michael.sch...@gmail.com [mailto:michael.sch...@gmail.com] On Behalf Of Michael Schatz Sent: Wednesday, April 08, 2009 9:19 PM To: core-user@hadoop.apache.org Subject: CloudBurst: Hadoop for DNA Sequence Analysis

Hadoop Users, I just wanted to announce that my Hadoop application 'CloudBurst' is available open source at: http://cloudburst-bio.sourceforge.net In a nutshell, it is an application for mapping millions of short DNA sequences to a reference genome to, for example, map out differences in one individual's genome compared to the reference genome. As you might imagine, this is a very data-intensive problem, but Hadoop enables the application to scale linearly on large clusters. A full description of the program is available in the journal Bioinformatics: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp236 I also wanted to take this opportunity to thank everyone on this mailing list. The discussions posted were essential for navigating the ins and outs of Hadoop during the development of CloudBurst. Thanks everyone! Michael Schatz http://www.cbcb.umd.edu/~mschatz