Re: Decommission multiple datanodes at a time?
Not sure if it's up to date, but the FAQ attempts to answer this question: http://wiki.apache.org/hadoop/FAQ#A17.

On Tue, Aug 31, 2010 at 1:04 PM, Allen Wittenauer awittena...@linkedin.com wrote:
> On Aug 31, 2010, at 12:58 PM, jiang licht wrote:
>> Is the number of nodes to be decommissioned bounded by replication factor?
> No, it is bounded mainly by network bandwidth and NN/DN RPC call rate. I've seen decommissions of like 400 nodes at once.
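For anyone finding this thread later, a sketch of the usual decommission mechanism in this era of Hadoop: the NameNode reads an exclude file named by dfs.hosts.exclude (the file path below is illustrative):

```xml
<!-- hadoop-site.xml on the NameNode; the exclude-file path is illustrative -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>
```

List the datanodes to retire in that file, one hostname per line (several at once is fine, per the above), then run `hadoop dfsadmin -refreshNodes` to start the decommission.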
Re: where do I see all hadoop logs?
Hey Mark, You might find this blog post useful: http://www.cloudera.com/blog/2009/09/apache-hadoop-log-files-where-to-find-them-in-cdh-and-what-info-they-contain/. Given that it's a year old now, I expect it's out of date, but it might help fill in some gaps. Thanks, Jeff

On Mon, Aug 30, 2010 at 1:51 PM, abhishek sharma absha...@usc.edu wrote:
> On Mon, Aug 30, 2010 at 1:43 PM, Mark Kerzner markkerz...@gmail.com wrote:
>> The statements are in my code. When I am running one node, I see them, but I am not sure what happens when I am running more.
> Do you see them in the .out files produced for each node? Abhishek
>
> On Mon, Aug 30, 2010 at 3:41 PM, abhishek sharma absha...@usc.edu wrote:
>> Hi Mark, Are these print statements in the code of your job or the Hadoop components (JobTracker, TaskTracker, etc.)? Depending on where you place them, they might show up in the *.out files that are created along with the log files (and not show up on the console). Abhishek
>>
>> On Mon, Aug 30, 2010 at 1:35 PM, Mark Kerzner markkerz...@gmail.com wrote:
>>> Hi, do all nodes send their System.out.println() logs to the same place in the Hadoop job console? I don't see the mixture I would expect. Thank you, Mark
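To summarize what usually answers this question: in a distributed run, System.out.println() output from tasks never reaches the submitting console; each TaskTracker captures it into per-attempt files under its local log directory. The layout below is typical of 0.20-era clusters (the attempt id is made up), and the same files are browsable per task through the JobTracker web UI:

```
${HADOOP_LOG_DIR}/userlogs/attempt_201008301200_0001_m_000000_0/
    stdout   <- System.out.println() output from the task
    stderr   <- System.err output
    syslog   <- log4j output from the task
```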
Re: Decommission multiple datanodes at a time?
On 01/09/10 08:35, Jeff Hammerbacher wrote:
> Not sure if it's up to date, but the FAQ attempts to answer this question: http://wiki.apache.org/hadoop/FAQ#A17.
>
> On Tue, Aug 31, 2010 at 1:04 PM, Allen Wittenauer awittena...@linkedin.com wrote:
>> On Aug 31, 2010, at 12:58 PM, jiang licht wrote:
>>> Is the number of nodes to be decommissioned bounded by replication factor?
>> No, it is bounded mainly by network bandwidth and NN/DN RPC call rate. I've seen decommissions of like 400 nodes at once.

I've been worrying a bit about network partition events; it seems to me that if a switch fails, a large cluster could get overloaded by excessive failing heartbeats, replication work, and incoming client calls. Some options to mitigate this (HDFS issues, obviously):
- Have a way to temporarily say "it's OK to under-replicate anything that was on these servers," with a list of servers; pull back the replication count for a while.
- Put the namenode into safe mode if a sufficiently large number of workers go offline within a specific time period.
namesecondary has no content
I have configured the secondary namenode on a separate node, but it only has the following content under the directory /namesecondary: current, in_use.lock, lastcheckpoint.tmp. The directories current and lastcheckpoint.tmp are empty. Does anyone know why? PS: everything else is running well in the cluster. 2010-09-01 shangan
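One likely explanation: the secondary namenode only writes into that directory when a checkpoint actually runs, and by default that happens once an hour (or when the edit log crosses a size threshold), so a freshly configured node shows empty directories until then. The relevant settings, shown here with their 0.20 defaults:

```xml
<!-- hadoop-site.xml on the secondary namenode; values are the 0.20 defaults -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>     <!-- seconds between checkpoints -->
</property>
<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value> <!-- edits size in bytes that forces an early checkpoint -->
</property>
```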
Re: Combining Only Once?
Thanks for the reply.

2010/8/31 Owen O'Malley omal...@apache.org:
> There used to be a compatibility switch, but I believe it was removed in 0.19 or 0.20.
> Can you describe what you are trying to accomplish? Combiners were always intended to only be used for operations that are idempotent, associative, and commutative. Clearly your combiner doesn't satisfy one of those properties or you wouldn't care if it was applied more than once.
> -- Owen

I noticed that; the switch has already been removed.

Actually, I have to apply an operation to the final output of each map task, but only once. For each map task, for each key, I need to see the fully-aggregated final value, and I then reduce the map output size by a large amount according to those values. Basically that is something people usually do in the reduce phase. However, since my keys are large and numerous for each mapper, I want to lower the network cost by pre-removing keys which I don't need in the final output. This can be done only if I can reach the locally aggregated final output of the map tasks in the map phase. Yagiz
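To make Owen's constraint concrete, here is a small plain-Java sketch (no Hadoop dependencies; the filtering step is a made-up stand-in for the "only once" operation described above). A summing combiner can be applied to spill fragments any number of times without changing the answer, while filtering on the aggregated value cannot:

```java
import java.util.Arrays;
import java.util.List;

public class CombinerDemo {
    // A sum combiner: associative, commutative, and safe to re-apply,
    // because combining partial sums gives the same total.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    // A hypothetical "drop totals below a threshold" step, like the
    // filtering described above. It is NOT safe as a combiner: applied
    // to partial sums it can drop keys whose true total passes the bar.
    static int sumThenFilter(List<Integer> values, int threshold) {
        int total = sum(values);
        return total >= threshold ? total : 0;
    }

    public static void main(String[] args) {
        List<Integer> spill1 = Arrays.asList(3, 4);  // first spill for one key
        List<Integer> spill2 = Arrays.asList(5);     // second spill, same key

        // Sum: combining the spills first changes nothing.
        int direct   = sum(Arrays.asList(3, 4, 5));
        int combined = sum(Arrays.asList(sum(spill1), sum(spill2)));
        System.out.println(direct == combined);      // prints true

        // Filter at threshold 10: the true total (12) survives, but
        // per-spill filtering discards both partial sums (7 and 5).
        int once  = sumThenFilter(Arrays.asList(3, 4, 5), 10);
        int twice = sumThenFilter(Arrays.asList(
                sumThenFilter(spill1, 10), sumThenFilter(spill2, 10)), 10);
        System.out.println(once);   // prints 12
        System.out.println(twice);  // prints 0
    }
}
```

This is why, once the framework is free to run the combiner zero or more times per key, a "run exactly once on the final map output" step has to move into the reduce phase (or into a map-side cache of fully aggregated values).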
Job performance issue: output.collect()
Hi all, My job (written in the old 0.18 API, but that's not the issue here) is producing large amounts of map output. Each map() call generates about ~20 output.collect() calls, and each output is pretty big (~1K), so each map() produces about 20K. All of this data is fed to a combiner that greatly reduces the output's size and record count. The job input is not so big: there are about 120M map input records. This job is pretty slow. Other jobs that work on the same input are much faster, since they do not produce so much output. Analyzing the job performance (timing the parts of the map() function), I've seen that much time is spent on the output.collect() line itself. I know that during the output.collect() call the output is being written to local filesystem spills (when the spill buffer reaches an 80% limit), so I guessed that reducing the size of each output would improve performance. This was not the case: after cutting 30% of the map output size, the job took the same amount of time. The thing that I cannot reduce is the number of output records being written out of the map. I would like to know what happens in the output.collect() line that takes lots of time, in order to cut down this job's running time. Please keep in mind that I have a combiner, and to my understanding different things happen to the map output when a combiner is present. Can anyone help me understand how I can save this precious time? Thanks, -- Oded
api doc incomplete
Hi all, has anybody noticed that the online API doc is incomplete? At http://hadoop.apache.org/common/docs/current/api/ there is not even a mapred or mapreduce package. I remember it worked before. What happened? Thanks, -Gang
Re: Job performance issue: output.collect()
Hey Oded Rosen, I am not sure what your map() method does. Intuitively, move the map() computation to the reduce() method if your map() output is problematic: just let the map() method act as a data input reader and divider, and let the reduce() method do all your computation. In this way, your intermediate results are smaller than before, and shuffle time can also be reduced. If the computation is still slow, I think it may not be a MapReduce framework problem, but a problem in your program. Hope this helps. Chen

On Wed, Sep 1, 2010 at 7:18 AM, Oded Rosen o...@legolas-media.com wrote:
> [original message snipped]
Re: api doc incomplete
On Sep 1, 2010, at 8:56 AM, Gang Luo wrote:
> Hi all, does anybody notice the online api doc is incomplete? At http://hadoop.apache.org/common/docs/current/api/ there is even no mapred or mapreduce package there. I remember I use it well before. What happen?

When {common,hdfs,mapreduce}-0.21.0 was released, it became current. Since the project split happened between 0.20 and 0.21, that means the current docs are now split. If you look at http://hadoop.apache.org/mapreduce/docs/current/api, you'll find what you are looking for. Additionally, we should make a stable link that points to the latest of the 0.20 line. -- Owen
Re: accounts permission on hadoop
On Tue, Aug 31, 2010 at 5:28 PM, Allen Wittenauer awittena...@linkedin.com wrote:
> On Aug 31, 2010, at 2:43 PM, Edward Capriolo wrote:
>> On Tue, Aug 31, 2010 at 5:07 PM, Gang Luo lgpub...@yahoo.com.cn wrote:
>>> Hi all, I am the administrator of a hadoop cluster. I want to know how to specify the group a user belongs to. Or does Hadoop just use the group/user information from the Linux system it runs on? For example, if a user 'smith' belongs to a group 'research' in the Linux system, what is his account and group on HDFS?
>> Currently hadoop gets its user groups from the posix user/groups.
> ... based upon what the client sends, not what the server knows.

Not anymore in trunk or the security branch - now it's mapped on the server side with a configurable resolver class. -Todd -- Todd Lipcon Software Engineer, Cloudera
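For reference, the server-side mapping Todd describes is controlled by a pluggable class; a sketch of the setting on the security branch (the shell-based resolver shown mirrors the old behavior of asking the local Unix groups, but on the NameNode instead of the client):

```xml
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.ShellBasedUnixGroupsMapping</value>
</property>
```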
Re: Job performance issue: output.collect()
On Sep 1, 2010, at 5:18 AM, Oded Rosen wrote:
> I would like to know what happens in the output.collect line that takes lots of time, in order to cut down this job's running time. Please keep in mind that I have a combiner, and to my understanding different things happen to the map output when a combiner is present.

The best presentation on the map-side sort is the one that Chris Douglas (who did most of the implementation) did for the Bay Area HUG: http://developer.yahoo.net/blogs/hadoop/2010/01/hadoop_bay_area_january_2010_u.html There are both slides and a video of the presentation. I'd run through that first. You most likely are getting more spills than you deserve. The variables to look at:
- io.sort.mb - should be most of the task's RAM budget
- io.sort.record.percent - depends on record size
- io.sort.factor - typically 25 * (# of disks / node)
-- Owen
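As a concrete starting point, those knobs go in the job configuration; the values below are only illustrative, not recommendations for any particular cluster:

```xml
<property>
  <name>io.sort.mb</name>
  <value>400</value>   <!-- map-side sort buffer in MB; most of the task's heap budget -->
</property>
<property>
  <name>io.sort.record.percent</name>
  <value>0.05</value>  <!-- fraction of the buffer reserved for record metadata -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>100</value>   <!-- streams merged at once; roughly 25 * disks per node -->
</property>
```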
Re: Classpath
From http://blog.sematext.com/2010/05/31/hadoop-digest-may-2010/ FAQ section: How can I attach external libraries (jars) which my jobs depend on? You can put them in a “lib” subdirectory of your jar root directory. Alternatively you can use the DistributedCache API. Alex Baranau Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - HBase

On Sun, Aug 29, 2010 at 8:29 AM, Mark static.void@gmail.com wrote:
> How can I add jars to Hadoop's classpath when running MapReduce jobs, for the following situations? 1) The jars are local to the nodes running the job. 2) The jars are only local to the client submitting the job. I'm assuming I can just jar up all required jars into the main job jar being submitted, but I was wondering if there was some other way. Thanks
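To illustrate the first option, the inside of the job jar would look something like this (class and jar names are made up); the framework unpacks the jar on each task node and puts everything under lib/ on the task's classpath:

```
myjob.jar
    com/example/MyMapper.class
    com/example/MyReducer.class
    lib/
        guava.jar          <- dependency jars ride along inside the job jar
        commons-lang.jar
```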
Re: missing part folder - how to debug?
Hi, Adding the Solr user list. We used a similar approach to the one in this patch, but with Hadoop Streaming. Did you determine that the indices are really missing? I mean, did you find missing documents in the output indices? Alex Baranau Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - HBase

2010/8/31 Koji Sekiguchi k...@r.email.ne.jp
> Hello, We are using Hadoop to build a Solr index. We are using SOLR-1301, which was first contributed by Andrzej: https://issues.apache.org/jira/browse/SOLR-1301 It works great in our testing environment of 4 servers. Today, we ran it in the production environment of 320 servers. We ran 5120 reducers (16 per server). This should result in 5120 indexes, i.e. part-X folders. But about 20 part folders were missing, and Hadoop didn't produce any error logs. How can we investigate/debug this problem? Any pointers or experiences would be highly appreciated! Thanks, Koji -- http://www.rondhuit.com/en/
Re: accounts permission on hadoop
On Sep 1, 2010, at 9:08 AM, Todd Lipcon wrote:
>>> Currently hadoop gets its user groups from the posix user/groups.
>> ... based upon what the client sends, not what the server knows.
> Not anymore in trunk or the security branch - now it's mapped on the server side with a configurable resolver class.

Yes, but only like 3 people use that stuff presently. Trunk = unicorns and ponies.
Re: how to revert from a new version to an older one (CDH3)?
Hey guys, In CDH3 you can pin your repo to a particular release. E.g., per the following docs, to use beta 1, specify redhat/cdh/3b1 instead of redhat/cdh/3 in the repo file (for RH), or DISTRO-cdh3b1 instead of DISTRO-cdh3 in the list file (for Debian). You'll need to do a yum clean metadata or an apt-get update so the new packages are seen. https://wiki.cloudera.com/display/DOC/Hadoop+Installation+(CDH3) Also, please direct CDH usage queries to the user list: https://groups.google.com/a/cloudera.org/group/cdh-user Thanks, Eli

On Tue, Aug 24, 2010 at 11:05 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
> On Tue, Aug 24, 2010 at 1:36 PM, jiang licht licht_ji...@yahoo.com wrote:
>> Thanks Sudhir and Michael. I want to replace a new release of CDH3 (0.20.2+320) with a previous release of CDH3 (0.20.2+228). The problem is that there is no installation package for the previous release of CDH3 and no source to rebuild from. If you do a yum install from the Cloudera repository, you always get the latest release. That's why I want to know a nice way to do this. Please correct me if I am wrong. I also noticed that people talked about a package for each release in the Cloudera-supported forum getsatisfaction.com, but I don't know the current status. In the end, to get work done, and since Hadoop is simply a Java application, I simply used the files installed by the previous release (on other machines) and set up configurations that point to the right locations. Thanks, Michael
>>
>> --- On Tue, 8/24/10, Sudhir Vallamkondu sudhir.vallamko...@icrossing.com wrote:
>>> More specifics on Michael's comment. You can use yum remove or apt-get purge to remove the existing install. For Red Hat systems, run this command: # yum remove hadoop -y For Debian systems, run this command: # apt-get purge hadoop Verify that you have no Hadoop packages installed on your cluster. For Red Hat systems, run this command, which should return no packages: $ rpm -qa | grep hadoop For Debian systems, run this command, which should return no packages: $ dpkg -l | grep hadoop References: https://docs.cloudera.com/display/DOC/Hadoop+Upgrade+from+CDH2+to+CDH3
>>>
>>> On Tue, 24 Aug 2010 06:21:30 -0500, Michael Segel michael_se...@hotmail.com wrote:
>>>> Not sure if you got your question answered... You need to delete the current version (via yum) and then specifically re-install the version you want by specifying the full name, including the version. HTH -Mike
>>>>
>>>> On Mon, 23 Aug 2010 15:00:39 -0700, licht_ji...@yahoo.com wrote:
>>>>> I want to replace the new CDH version 0.20.2+320 with an older one, 0.20.2+228. yum downgrade reports that the version can only be upgraded. I also didn't find a way to yum install the older version. I guess I could download a tarball of the old version and extract it over where the new version is installed, but that seems not a good solution, because it might have a negative impact on upgrading in the future. So, what is the best way to do this? Thanks, Michael

Ah. The dangers of installing things from the Internet! The Cloudera package for Hadoop is great. I use it, but I DO NOT download it from the internet every time. Why? Because of the exact problem you are having: packages get updated, and finding the older one can be hard. Always keep a copy of your RPMs locally (and run your own yum repo)! You used to be able to navigate around the Cloudera repo and find the older RPM inside the same folder. You can still probably hunt around, and you should be able to find it. http://archive.cloudera.com/cdh/3/ Good luck!
Re: From X to Hadoop MapReduce
Sounds good! Please give some examples :) I just got back from some holidays and will start posting some more stuff shortly. Cheers, James.

On 2010-07-21, at 7:22 PM, Jeff Zhang wrote:
> Cool, James. I am very interested in contributing to this. I think group by, join, and order by can be added to the examples. -- Best Regards, Jeff Zhang
>
> On Thu, Jul 22, 2010 at 4:59 AM, James Seigel ja...@tynt.com wrote:
>> Oh yeah, it would help if I put the url: http://github.com/seigel/MRPatterns James
>>
>> On 2010-07-21, at 2:55 PM, James Seigel wrote:
>>> Here is a skeleton project I stuffed up on github (feel free to offer other suggestions/alternatives). There is a wiki, a place to commit code, a place to fork around, etc. Over the next couple of days I’ll try and put up some samples for people to poke around with. Feel free to attack the wiki, contribute code, etc. If anyone can derive some cool pseudo code to write map-reduce-type algorithms, that’d be great. Cheers, James.
>>>
>>> On 2010-07-21, at 10:51 AM, James Seigel wrote:
>>>> Jeff, I agree that cascading looks cool and might/should have a place in everyone’s tool box; however, at some corps it takes a while to get those kinds of changes in place, and therefore they might have to hand craft some java code before moving (if they ever can) to a different technology. I will get something up and going and post a link back for whomever is interested. To answer Himanshu’s question, I am thinking something like this (with some code): Hadoop M/R Patterns, and ones that match Pig structures:
>>>> 1. COUNT: [Mapper] Spit out one key and the value of 1. [Combiner] Same as reducer. [Reducer] count = count + next.value. [Emit] Single result.
>>>> 2. FREQ COUNT: [Mapper] Item, 1. [Combiner] Same as reducer. [Reducer] count = count + next.value. [Emit] List of (key, count).
>>>> 3. UNIQUE: [Mapper] Item, One. [Combiner] None. [Reducer + Emit] Spit out the list of keys and no value.
>>>> I think adding a description of why the technique works would be helpful for people learning as well. I see some questions from people not understanding what happens to the data between mappers and reducers, or what data they will see when it gets to the reducer, etc. Cheers, James.
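The FREQ COUNT pattern above, sketched in plain Java (no Hadoop dependencies; names are made up) so the map step and the shared combine/reduce step can be followed end to end:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FreqCountSketch {
    // [Mapper] emit (item, 1) for every item seen.
    static List<Map.Entry<String, Integer>> map(List<String> items) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String item : items) out.add(Map.entry(item, 1));
        return out;
    }

    // [Combiner] / [Reducer] (same code): count = count + next.value, per key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "a", "c", "a");
        Map<String, Integer> counts = reduce(map(input));
        System.out.println(counts.get("a")); // prints 3
        System.out.println(counts.get("b")); // prints 1
    }
}
```

The combiner can reuse reduce() here precisely because summing counts is associative and commutative: reducing partial counts from each mapper gives the same totals as reducing the raw (item, 1) pairs.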
Re: From X to Hadoop MapReduce
'hamake' on github looks like a handy tool as well - I haven't used it. It does the old unix 'make' timestamp-dependency trick on the input/output file sets, to decide which jobs to run in sequence, and possibly in parallel. Lance

On Wed, Sep 1, 2010 at 12:27 PM, James Seigel ja...@tynt.com wrote:
> [original message snipped]

-- Lance Norskog goks...@gmail.com