Question about writing to local file system in reduce job
Hello, I am using FileSystem.startLocalOutput() and FileSystem.completeLocalOutput() in my reduce tasks (more than 1) to produce some output. I have two questions: 1. if everything runs correctly, after the call to completeLocalOutput(), the output is copied to HDFS and the local file is deleted. However if the reduce tasks are killed in the middle for whatever reason, the local files are not deleted. How can I delete it when the reduce task failed. 2. if speculative execution is on, how can I force the two speculative tasks that are working on the same output to write to different local path, in case they happens to run on the same machine? Also, will there be any problem if they both succeed and copy the output back to HDFS in the same path in HDFS? appreciate any help. Cheers, Cedric
Re: Problem with start-all on 0.16.4
Hi Same problem for me. I tried to rm -rf the datastore as well (prior to reformat) but no change. Any clue is welcome Regards Adam Wynne wrote: Hi, I have a working 0.15.3 install and am trying to upgrade to 0.16.4. I want to start clean with an empty filesystem, so I just reformatted the filesystem instead of using the upgrade option. When I run start-all.sh, I get a null pointer exception originating from the NetUtils.getServerAddress() method. This cluster is on a private network, could there be a bug with the way hadoop is looking up the address? Other ideas? Here is the full error and stack trace from the namenode log: 2008-05-14 08:03:37,252 INFO org.apache.hadoop.fs.FSNamesystem: fsOwner=qeadmin,qeadmin,wheel 2008-05-14 08:03:37,253 INFO org.apache.hadoop.fs.FSNamesystem: supergroup=supergroup 2008-05-14 08:03:37,253 INFO org.apache.hadoop.fs.FSNamesystem: isPermissionEnabled=true 2008-05-14 08:03:37,358 INFO org.apache.hadoop.fs.FSNamesystem: Finished loading FSImage in 137 msecs 2008-05-14 08:03:37,362 INFO org.apache.hadoop.fs.FSNamesystem: Leaving safemode after 142 msecs 2008-05-14 08:03:37,362 INFO org.apache.hadoop.dfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes 2008-05-14 08:03:37,363 INFO org.apache.hadoop.dfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks 2008-05-14 08:03:37,377 INFO org.apache.hadoop.fs.FSNamesystem: Registered FSNamesystemStatusMBean 2008-05-14 08:03:37,398 ERROR org.apache.hadoop.dfs.NameNode: java.lang.NullPointerException at org.apache.hadoop.net.NetUtils.getServerAddress(NetUtils.java:148) at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:279) at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:235) at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:176) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:162) at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846) at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855) 2008-05-14 08:03:37,399 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NameNode at compute-0-0.local/192.168.1.254 / Thanks -- View this message in context: http://www.nabble.com/Problem-with-start-all-on-0.16.4-tp17233437p17364262.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Hadoop 0.17 AMI?
Any word on 0.17? I was able to build an AMI from a trunk checkout and deploy a single node cluster but the create-hadoop-image-remote script really wants a tarball in the archive. I'd rather not waste time munging the scripts if a release is near. Jeff Nigel Daley wrote: Hadoop 0.17 hasn't been released yet. I (or Mukund) is hoping to call a vote this afternoon or tomorrow. Nige On May 14, 2008, at 12:36 PM, Jeff Eastman wrote: I'm trying to bring up a cluster on EC2 using (http://wiki.apache.org/hadoop/AmazonEC2) and it seems that 0.17 is the version to use because of the DNS improvements, etc. Unfortunately, I cannot find a public AMI with this build. Is there one that I'm not finding or do I need to create one? Jeff
Hadoop Streaming - revised
Ok, I turned on verbose output. It looks as though it is adding everything in my /tmp directory to the jar file it builds. Where do I tell it not to do that? Thanks! Tanton
Hadoop Streaming - final
Ok, I figured it out. Hadoop Streaming adds the entire stream.shipped.hadoopstreaming directory to the jar file. For me, I wasn't setting it and it was defaulting to /tmp. That means my entire /tmp directory was getting added to the jar. I set that directory to the location of my hadoop streaming jar directory and it seemed to work fine. Sorry for the noise.
RE: Monthly Hadoop user group meetings
Reminder: the user group meeting is today at 6 pm at Yahoo! Mission College. Ajay From: Ajay Anand Sent: Wednesday, May 14, 2008 9:53 AM To: '[EMAIL PROTECTED]'; '[EMAIL PROTECTED]'; '[EMAIL PROTECTED]' Cc: 'Chad Walters'; 'Jeff Hammerbacher'; Owen O'Malley Subject: RE: Monthly Hadoop user group meetings Agenda for the Hadoop user group meeting on Wednesday 5/21 6:00-7:30 pm at Yahoo! Mission College: - Hadoop .17 release - Sameer Paranjpye - Mahout update - Jeff Eastman - And plenty of opportunity for networking, discussions and beer... Look forward to seeing you there. (Registration is at http://upcoming.yahoo.com/event/591971/. ) Ajay From: Ajay Anand Sent: Tuesday, May 06, 2008 9:53 AM To: [EMAIL PROTECTED]; [EMAIL PROTECTED] Cc: Chad Walters; Jeff Hammerbacher; Owen O'Malley Subject: Monthly Hadoop user group meetings One of the things we had discussed at the Hadoop summit was to set up monthly user group meetings to discuss topics of interest to the hadoop community. We have scheduled the first of these meetings for May 21st from 6 to 7:30 pm at the Yahoo! Mission College campus. You can register for this at http://upcoming.yahoo.com/event/591971/. The core group organizing these includes Chad Walters from Powerset, Jeff Hammerbacher from Facebook and Owen O'Malley from Yahoo. Please send us any suggestions for topics or things you would like to share with the group. Topics related to pig, hbase, zookeeper and mahout are welcome as well Look forward to seeing you there! Ajay
Re: joins in map reduce
On May 21, 2008, at 11:16 AM, Shirley Cohen wrote: How does one do a join operation in map reduce? Is there more than one way to do a join? Which way works better and why? There are a couple of ways, depending on what you need to do. If your input data is sorted and partitioned equivalently on the same key, you can do a join before the map (aka map-side join). The documentation is at: http://tinyurl.com/5v4rot If your data is not sorted and partitioned consistently, you need to do the join in the reduce. There is a library to help at: http:// tinyurl.com/5cz669 -- Owen
Re: Monthly Hadoop user group meetings
And anybody who wants to be early can meet some of us at Bennigan's. On Wed, May 21, 2008 at 11:20 AM, Ajay Anand [EMAIL PROTECTED] wrote: Reminder: the user group meeting is today at 6 pm at Yahoo! Mission College. Ajay From: Ajay Anand Sent: Wednesday, May 14, 2008 9:53 AM To: '[EMAIL PROTECTED]'; '[EMAIL PROTECTED]'; '[EMAIL PROTECTED]' Cc: 'Chad Walters'; 'Jeff Hammerbacher'; Owen O'Malley Subject: RE: Monthly Hadoop user group meetings Agenda for the Hadoop user group meeting on Wednesday 5/21 6:00-7:30 pm at Yahoo! Mission College: - Hadoop .17 release - Sameer Paranjpye - Mahout update - Jeff Eastman - And plenty of opportunity for networking, discussions and beer... Look forward to seeing you there. (Registration is at http://upcoming.yahoo.com/event/591971/. ) Ajay From: Ajay Anand Sent: Tuesday, May 06, 2008 9:53 AM To: [EMAIL PROTECTED]; [EMAIL PROTECTED] Cc: Chad Walters; Jeff Hammerbacher; Owen O'Malley Subject: Monthly Hadoop user group meetings One of the things we had discussed at the Hadoop summit was to set up monthly user group meetings to discuss topics of interest to the hadoop community. We have scheduled the first of these meetings for May 21st from 6 to 7:30 pm at the Yahoo! Mission College campus. You can register for this at http://upcoming.yahoo.com/event/591971/. The core group organizing these includes Chad Walters from Powerset, Jeff Hammerbacher from Facebook and Owen O'Malley from Yahoo. Please send us any suggestions for topics or things you would like to share with the group. Topics related to pig, hbase, zookeeper and mahout are welcome as well Look forward to seeing you there! Ajay -- ted
[ANNOUNCE] Hadoop release 0.17.0 available
Release 0.17.0 contains many improvements, new features, bug fixes and optimizations. For release details and downloads, visit: http://hadoop.apache.org/core/releases.html Hadoop 0.17.0 Release Notes are at http://hadoop.apache.org/core/docs/r0.17.0/releasenotes.html Thanks to all who contributed to this release! Mukund
Re: Hadoop experts wanted
Interesting!! BTW, Where do you work? On Thu, May 15, 2008 at 2:23 PM, Jim R. Wilson [EMAIL PROTECTED] wrote: Hi all, Hadoop is a great project and a growing niche. As it becomes even more popular, there will be increasing demand for experts in the field. I am compiling a contact list of Hadoop experts who may be interested in opportunities under the right circumstances. I am not a recruiter - I'm a regular developer who sometimes gets asked for referrals when I'm not personally available. If you'd like to be on my shortlist of go-to experts, please contact me off-list at: [EMAIL PROTECTED] Please be prepared to show your expertise by any of the following: * Committer status or patches accepted * Commit access to another open source project which uses Hadoop * Bugs reported which were either resolved or are still open (real bugs) * Articles / blog entries written about Hadoop concepts or development * Speaking engagements or user groups at which you've presented * Significant contributions to documentation * Other? (I'm sure I didn't think of everything) I'll be happy to answer any questions, and I look forward to hearing from you! -- Jim R. Wilson (jimbojw)
Re: Hadoop experts wanted
Oh, yes, i guessed wrong. Thanks :) Edward On Thu, May 22, 2008 at 7:42 AM, Jeff Eastman [EMAIL PROTECTED] wrote: Hi Edward, Check out this link (http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable) before you panic over the similar postings. Jim's a little vague about what he's actually going to do with this data or when, but I found it useful. Jeff Edward J. Yoon wrote: Hey Akshar! Just FYI, See http://www.nabble.com/Django-experts-wanted-td17322054.html -Edward On Thu, May 22, 2008 at 6:24 AM, Akshar [EMAIL PROTECTED] wrote: Interesting!! BTW, Where do you work? On Thu, May 15, 2008 at 2:23 PM, Jim R. Wilson [EMAIL PROTECTED] wrote: Hi all, Hadoop is a great project and a growing niche. As it becomes even more popular, there will be increasing demand for experts in the field. I am compiling a contact list of Hadoop experts who may be interested in opportunities under the right circumstances. I am not a recruiter - I'm a regular developer who sometimes gets asked for referrals when I'm not personally available. If you'd like to be on my shortlist of go-to experts, please contact me off-list at: [EMAIL PROTECTED] Please be prepared to show your expertise by any of the following: * Committer status or patches accepted * Commit access to another open source project which uses Hadoop * Bugs reported which were either resolved or are still open (real bugs) * Articles / blog entries written about Hadoop concepts or development * Speaking engagements or user groups at which you've presented * Significant contributions to documentation * Other? (I'm sure I didn't think of everything) I'll be happy to answer any questions, and I look forward to hearing from you! -- Jim R. Wilson (jimbojw) -- Best regards, Edward J. Yoon, http://blog.udanax.org
missing CompressionLevel for ZLibCompressor.
All, The class ZLibCompressor contains a enum for the CompressionLevel, and only a few compression levels have been implemented. Is there a reason for that? I 'd like to add all the levels (0-9). How do i proceed to check-in that change? Thanks, S.
Avoiding Newline Problems in Hadoop Streaming + StreamXMLRecordReader
Greetings, I have an interesting problem I'm trying to solve. I currently store a bunch of webpages in a large XML file in Hadoop. I'm trying to parse information out of these webpages using a complex C# program that I have running on Mono (I'm in a Linux environment). Therefore, I'm using Hadoop Streaming and the StreamXMLRecordReader in order to get the information to my C# parser. The problem is that even wrapped in XML, the Hadoop Streaming ends the records at newlines! This makes the map input data pretty useless. Does anyone have any hints on how to get around this? Here's the XML structure I'm trying to use: ContentRecordRecordURLhttp://www.blah/RecordURLPageContent![CDATA[page text would be here including newlines ]]/PageContent/ContentRecord Any ideas? Cheers, Bradford
Re: Hadoop experts wanted
Thanks Jeff - glad for the support :) I appreciate your concern Edward. My background is primarily in MediaWiki, and I'm a relative newcomer to Hadoop/Hbase - writing MapReduce Python jobs using Hadoop streaming and connecting PHP to HBase through Thrift. It's all been a very interesting journey which I plan to write more articles about as time permits. I'm also preparing a patch for HBase to support generating EC2 AMIs with Hadoop+HBase since all the latest public AMIs have only Hadoop. Regarding the feelers, I've posted feeler messages only in communities where I feel I could intelligently contribute to a conversation on the subject. I wouldn't, for example, post such a feeler on a Linux kernel development list as I have no experience or knowledge about it. Based on my recent experience with Hadoop/HBase, I felt I'd be able to vet any potentially interested experts by evaluating code samples, asking pointed questions, reading published articles etc. Being a wiki guy, the system I eventually create to present the expert list will almost certainly have a wiki component, giving experts the opportunity to elaborate on their experience or knowledge without restriction, but also have an uneditable (moderator only) section where I'd list the affirmed credentials (such as significant patches, enhancements, articles on the subject etc). I'm still not sure yet what the whole thing will look like, but I've gotten a fairly positive response to my query mails so far, so I'll begin cooking something up soon. Sorry for taking this so far off-topic, it wasn't my intent to do so. I appreciate your concern, and if you have suggestions on how I could make my emails seem less spammy, I'd be happy to alter them. :) -- Jim On Wed, May 21, 2008 at 7:12 PM, Edward J. Yoon [EMAIL PROTECTED] wrote: Oh, yes, i guessed wrong. Thanks :) Edward On Thu, May 22, 2008 at 7:42 AM, Jeff Eastman [EMAIL PROTECTED] wrote: Hi Edward, Check out this link (http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable) before you panic over the similar postings. Jim's a little vague about what he's actually going to do with this data or when, but I found it useful. Jeff Edward J. Yoon wrote: Hey Akshar! Just FYI, See http://www.nabble.com/Django-experts-wanted-td17322054.html -Edward On Thu, May 22, 2008 at 6:24 AM, Akshar [EMAIL PROTECTED] wrote: Interesting!! BTW, Where do you work? On Thu, May 15, 2008 at 2:23 PM, Jim R. Wilson [EMAIL PROTECTED] wrote: Hi all, Hadoop is a great project and a growing niche. As it becomes even more popular, there will be increasing demand for experts in the field. I am compiling a contact list of Hadoop experts who may be interested in opportunities under the right circumstances. I am not a recruiter - I'm a regular developer who sometimes gets asked for referrals when I'm not personally available. If you'd like to be on my shortlist of go-to experts, please contact me off-list at: [EMAIL PROTECTED] Please be prepared to show your expertise by any of the following: * Committer status or patches accepted * Commit access to another open source project which uses Hadoop * Bugs reported which were either resolved or are still open (real bugs) * Articles / blog entries written about Hadoop concepts or development * Speaking engagements or user groups at which you've presented * Significant contributions to documentation * Other? (I'm sure I didn't think of everything) I'll be happy to answer any questions, and I look forward to hearing from you! -- Jim R. Wilson (jimbojw) -- Best regards, Edward J. Yoon, http://blog.udanax.org
Re: Hadoop 0.17 AMI?
Hi Jeff, 0.17.0 was released yesterday, from what I can tell. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Jeff Eastman [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Wednesday, May 21, 2008 11:18:56 AM Subject: Re: Hadoop 0.17 AMI? Any word on 0.17? I was able to build an AMI from a trunk checkout and deploy a single node cluster but the create-hadoop-image-remote script really wants a tarball in the archive. I'd rather not waste time munging the scripts if a release is near. Jeff Nigel Daley wrote: Hadoop 0.17 hasn't been released yet. I (or Mukund) is hoping to call a vote this afternoon or tomorrow. Nige On May 14, 2008, at 12:36 PM, Jeff Eastman wrote: I'm trying to bring up a cluster on EC2 using (http://wiki.apache.org/hadoop/AmazonEC2) and it seems that 0.17 is the version to use because of the DNS improvements, etc. Unfortunately, I cannot find a public AMI with this build. Is there one that I'm not finding or do I need to create one? Jeff
Questions on how to use DistributedCache
Dear all, I am trying to use DistributedCache class for distributing files required for running my jobs. While API documentation provides good guidelines, Is there any tips or usage examples (e.g. sample codes)? If you could share your experience with me, I would really appreciate it. Thank you in advance, /Taeho
Confuse about the Client.Connection
hi,all I took a look at the source code of org.apache.hadoop.ipc.Client ,and i wonder if there are two client thread invoke the getConnection() specifing the same arguments,then they will get a same Connection object,how could they distinguish the results from each other? I noticed the results streamed back from the server is collected by the Connection's thread,not the callers' threads, and the Connection's thread expects reaults :callId_XX,reaultBody_XX. Is there a situation in which Connection's result thread collects callId_by_threadA,reaultBody_by_threadB,callId_by_threadB,resultBody_by_threadA?I think this situation is kind of reasonable,how does the current code handle this? heyongqiang [EMAIL PROTECTED] 2008-05-22