Re: Decommission multiple datanodes at a time?

2010-09-01 Thread Jeff Hammerbacher
Not sure if it's up to date, but the FAQ attempts to answer this question:
http://wiki.apache.org/hadoop/FAQ#A17.

On Tue, Aug 31, 2010 at 1:04 PM, Allen Wittenauer
awittena...@linkedin.com wrote:


 On Aug 31, 2010, at 12:58 PM, jiang licht wrote:

  Is the number of nodes to be decommissioned bounded by replication
 factor?

 No, it is bounded mainly by network bandwidth and NN/DN RPC call rate.
 I've seen decommissions of like 400 nodes at once.




Re: where do I see all hadoop logs?

2010-09-01 Thread Jeff Hammerbacher
Hey Mark,

You might find this blog post useful:
http://www.cloudera.com/blog/2009/09/apache-hadoop-log-files-where-to-find-them-in-cdh-and-what-info-they-contain/.
Given that it's a year old now, I expect it's out of date, but it might help
fill in some gaps.

Thanks,
Jeff

On Mon, Aug 30, 2010 at 1:51 PM, abhishek sharma absha...@usc.edu wrote:

 On Mon, Aug 30, 2010 at 1:43 PM, Mark Kerzner markkerz...@gmail.com
 wrote:
  The statements are in my code. When I am running one node, I see them,
 but I
  am not sure what happens when I am running more.
 

 Do you see them in the .out files produced for each node?

 Abhishek
  On Mon, Aug 30, 2010 at 3:41 PM, abhishek sharma absha...@usc.edu
 wrote:
 
  Hi Mark,
 
  Are these print statements in the code of your job or the Hadoop
  components (JobTracker, TaskTracker, etc.)?
  Depending on where you place them, they might show up in the *.out
  files that are created along with the log files (and not show up on
  the console).
 
  Abhishek
 
 
  On Mon, Aug 30, 2010 at 1:35 PM, Mark Kerzner markkerz...@gmail.com
  wrote:
   Hi,
  
    do all nodes send their System.out.println() logs to the same place in the
    Hadoop job console? I don't see the mixture I would expect.
  
   Thank you,
   Mark
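
For illustration, one way to avoid chasing per-node stdout files is to log through commons-logging instead of System.out.println(), so messages land in each task's syslog file under logs/userlogs/ on the node that ran the task. A minimal sketch against the old mapred API; the class name and message are placeholders:

import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LoggingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final Log LOG = LogFactory.getLog(LoggingMapper.class);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    // Goes to the per-task syslog under logs/userlogs/<task-attempt>/ on the
    // node that ran the task, not to the console of the submitting client.
    LOG.info("processing record at offset " + key);
    output.collect(value, new LongWritable(1L));
  }
}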
  
 
 



Re: Decommission multiple datanodes at a time?

2010-09-01 Thread Steve Loughran

On 01/09/10 08:35, Jeff Hammerbacher wrote:

Not sure if it's up to date, but the FAQ attempts to answer this question:
http://wiki.apache.org/hadoop/FAQ#A17.

On Tue, Aug 31, 2010 at 1:04 PM, Allen Wittenauer
awittena...@linkedin.com wrote:



On Aug 31, 2010, at 12:58 PM, jiang licht wrote:


Is the number of nodes to be decommissioned bounded by replication

factor?

No, it is bounded mainly by network bandwidth and NN/DN RPC call rate.
I've seen decommissions of like 400 nodes at once.






I've been worrying a bit about network partition events; seems to me 
that if a switch fails a large cluster could get overloaded by excessive 
failing heartbeats, replication work and incoming client calls.


Some options to mitigate this (HDFS issues, obviously)

-have a way to temporarily say "it's OK to underreplicate anything that 
was on these servers" with a list of servers; pull back the replication 
count for a while.


-put the namenode into safe mode if a sufficiently large # of workers go 
offline within a specific time period




namesecondary has no content

2010-09-01 Thread shangan
I have configured the secondary namenode on a separate node, but it only has 
the following content under the directory /namesecondary:
current  
in_use.lock 
lastcheckpoint.tmp 

the directories current and lastcheckpoint.tmp are empty


Does anyone know why? 
PS: everything else is running well in the cluster.

2010-09-01 



shangan 
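
One thing to check: the secondary's checkpoint directory typically stays empty until the first checkpoint actually runs, which by default (0.20-era settings) happens once an hour (fs.checkpoint.period = 3600) or after 64 MB of edits (fs.checkpoint.size). A small sketch, assuming the cluster configuration is on the classpath, to print the effective settings:

import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
  public static void main(String[] args) {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    // 0.20-era property names; the fallbacks shown are the usual defaults.
    System.out.println("fs.checkpoint.dir    = " + conf.get("fs.checkpoint.dir"));
    System.out.println("fs.checkpoint.period = " + conf.get("fs.checkpoint.period", "3600"));
    System.out.println("fs.checkpoint.size   = " + conf.get("fs.checkpoint.size", "67108864"));
  }
}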


Re: Combining Only Once?

2010-09-01 Thread Yağız Kargın
Thanks for the reply.

2010/8/31 Owen O'Malley omal...@apache.org:
 There used to be a compatibility switch, but I believe it was removed
 in 0.19 or 0.20.

I noticed that; the switch has indeed already been removed.


 Can you describe what you are trying to accomplish? Combiners were
 always intended to only be used for  operations that are idempotent,
 associative, and commutative. Clearly your combiner doesn't satisfy
 one of those properties or you wouldn't care if it was applied more
 than once.

Actually, I have to apply an operation to the final output of each map
task, but only once. For each map task and each key, I look at the
fully aggregated final value and then shrink the map output by a
large amount based on those values. Basically, that is something
people usually do in the reduce phase. However, since my keys are
large and numerous for each mapper, I want to lower the network cost by
pre-removing keys that I don't need in the final output. This is only
possible if I can get at the locally aggregated final output of the
map tasks during the map phase.

Yagiz


 -- Owen
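
For reference, the kind of combiner Owen describes (one the framework may safely apply zero, one, or many times) is a plain per-key sum; re-combining already combined values gives the same answer. A minimal sketch against the old mapred API:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Summing is associative and commutative, and combining already-combined
// values does not change the final result, so the framework may run this
// combiner any number of times per key (including not at all).
public class SumCombiner extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}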



Job performance issue: output.collect()

2010-09-01 Thread Oded Rosen
Hi all,

My job (written in the old 0.18 API, but that's not the issue here) is producing
large amounts of map output.
Each map() call generates about ~20 output.collects, and each output is
pretty big (~1K), so each map() produces about 20K.
All of this data is fed to a combiner that really reduces the output's size
and count.
The job input is not so big: there are about 120M map input records.

This job is pretty slow. Other jobs that work on the same input are much
faster, since they do not produce so much output.
Analyzing the job performance (timing the map() function parts), I've seen
that much time is spent on the output.collect() line itself.

I know that during the output.collect() command the output is being written
to local filesystem spills (when the spill buffer reaches an 80% limit),
so I guessed that reducing the size of each output will improve performance.
This was not the case - after cutting 30% of the map output size, the job
took the same amount of time. The thing that I cannot reduce is the amount
of output lines being written out of the map.

I would like to know what happens in the output.collect line that takes lots
of time, in order to cut down this job's running time.
Please keep in mind that I have a combiner, and to my understanding
different things happen to the map output when a combiner is present.

Can anyone help me understand how I can save this precious time?
Thanks,

-- 
Oded


api doc incomplete

2010-09-01 Thread Gang Luo
Hi all,
has anybody noticed that the online API doc is incomplete? At 
http://hadoop.apache.org/common/docs/current/api/ there is no mapred or 
mapreduce package at all. I remember it being fine before. What happened?

Thanks,
-Gang






Re: Job performance issue: output.collect()

2010-09-01 Thread He Chen
Hey Oded Rosen

I am not sure what the functionality of your map() method is. Intuitively,
move the map() computation into the reduce() method if your map()
output is the problem. I mean, just let the map() method act as a data input
reader and divider and let the reduce() method do all your computation. In this
way, your intermediate results are smaller than before, and shuffle time can
also be reduced.

If the computation is still slow, I think it may not be a MapReduce
framework problem but rather your program. Hope this helps.


Chen

On Wed, Sep 1, 2010 at 7:18 AM, Oded Rosen o...@legolas-media.com wrote:

 Hi all,

 My job (written in old 0.18 api, but that's not the issue here) is
 producing
 large amounts of map output.
 Each map() call generates about ~20 output.collects, and each output is
 pretty big (~1K), so each map() produces about 20K.
 All of this data is fed to a combiner that really reduces the output's size
 and count.
 The job input is not so big: there are about 120M map input records.

 This job is pretty slow. Other jobs that work on the same input are much
 faster, since they do not produce so much output.
 Analyzing the job performance (timing the map() function parts), I've seen
 that much time is spent on the output.collect() line itself.

 I know that during the output.collect() command the output is being written
 to local filesystem spills (when the spill buffer reaches an 80% limit),
 so I guessed that reducing the size of each output will improve
 performance.
 This was not the case - after cutting 30% of the map output size, the job
 took the same amount of time. The thing that I cannot reduce is the amount
 of output lines being written out of the map.

 I would like to know what happens in the output.collect line that takes
 lots
 of time, in order to cut down this job's running time.
 Please keep in mind that I have a combiner, and to my understanding
 different things happen to the map output when a combiner is present.

 Can anyone help me understand how I can save this precious time?
 Thanks,

 --
 Oded



Re: api doc incomplete

2010-09-01 Thread Owen O'Malley


On Sep 1, 2010, at 8:56 AM, Gang Luo wrote:


Hi all,
has anybody noticed that the online API doc is incomplete? At
http://hadoop.apache.org/common/docs/current/api/ there is no mapred or
mapreduce package at all. I remember it being fine before. What happened?


When {common,hdfs,mapreduce}-0.21.0 was released, it became current.
Since the project split happened between 0.20 and 0.21, that means the
current docs are now split. If you look at
http://hadoop.apache.org/mapreduce/docs/current/api, you'll find what
you are looking for. Additionally, we should make a stable link that
points to the latest of the 0.20 line.


-- Owen


Re: accounts permission on hadoop

2010-09-01 Thread Todd Lipcon
On Tue, Aug 31, 2010 at 5:28 PM, Allen Wittenauer
awittena...@linkedin.com wrote:

 On Aug 31, 2010, at 2:43 PM, Edward Capriolo wrote:

 On Tue, Aug 31, 2010 at 5:07 PM, Gang Luo lgpub...@yahoo.com.cn wrote:
 Hi all,
 I am the administrator of a Hadoop cluster. I want to know how to specify the
 group a user belongs to. Or does Hadoop just use the group/user information
 from the Linux system it runs on? For example, if a user 'smith' belongs to a
 group 'research' on the Linux system, what are his account and group on HDFS?



 Currently hadoop gets its user groups from the posix user/groups.

 ... based upon what the client sends, not what the server knows.

Not anymore in trunk or the security branch - now it's mapped on the
server side with a configurable resolver class.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera
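
For anyone curious, a hedged sketch of the server-side resolver Todd mentions, using the security-branch/0.21 property name; in practice this goes into core-site.xml on the NameNode and JobTracker rather than into client code:

import org.apache.hadoop.conf.Configuration;

public class GroupMappingExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Security-branch / 0.21 naming: groups are resolved on the server via a
    // pluggable class; the shell-based mapper resolves them with local shell
    // commands on the server rather than trusting what the client sends.
    conf.set("hadoop.security.group.mapping",
             "org.apache.hadoop.security.ShellBasedUnixGroupsMapping");
    System.out.println(conf.get("hadoop.security.group.mapping"));
  }
}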


Re: Job performance issue: output.collect()

2010-09-01 Thread Owen O'Malley


On Sep 1, 2010, at 5:18 AM, Oded Rosen wrote:

I would like to know what happens in the output.collect line that takes
lots of time, in order to cut down this job's running time.
Please keep in mind that I have a combiner, and to my understanding
different things happen to the map output when a combiner is present.


The best presentation on the map side sort is the one that Chris  
Douglas (who did most of the implementation) did for the Bay Area HUG.


http://developer.yahoo.net/blogs/hadoop/2010/01/hadoop_bay_area_january_2010_u.html

There are both slides and a video of the presentation. I'd run through  
that first.


You most likely are getting more spills than you deserve. The  
variables to look at:


io.sort.mb - should be most of the task's ram budget
io.sort.record.percent - depends on record size
io.sort.factor - typically 25 * (# of disks / node)

-- Owen
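
A hedged illustration of setting those knobs from a job driver; the property names are the 0.20-era ones, and the class name and values below are placeholders to be tuned per task heap, record size, and disks per node:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SortTuningDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SortTuningDriver.class);
    conf.setJobName("sort-tuning-example");

    // Illustrative values only -- tune per task heap and record size.
    conf.setInt("io.sort.mb", 256);                  // sort buffer: most of the task's heap budget
    conf.setFloat("io.sort.record.percent", 0.15f);  // share of the buffer kept for record metadata
    conf.setInt("io.sort.factor", 25);               // streams merged at once, ~25 * disks per node

    // Identity map/reduce over text input; the point here is only the settings above.
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}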


Re: Classpath

2010-09-01 Thread Alex Baranau
From http://blog.sematext.com/2010/05/31/hadoop-digest-may-2010/ FAQ
section:

How can I attach external libraries (jars) which my jobs depend on?
You can put them in a “lib” subdirectory of your jar root directory.
Alternatively, you can use the DistributedCache API.

Alex Baranau

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - HBase

On Sun, Aug 29, 2010 at 8:29 AM, Mark static.void@gmail.com wrote:

  How can I add jars to Hadoop's classpath when running MapReduce jobs for
 the following situations?

 1) Assuming that the jars are local to the nodes that are running the job.
 2) The jars are only local to the client submitting the job.

 I'm assuming I can just jar up all the required jars into the main job jar
 being submitted, but I was wondering if there was some other way. Thanks
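
For the DistributedCache route Alex mentions, a minimal sketch; the HDFS path is a placeholder, and the jar must already have been copied into HDFS:

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class ClasspathExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ClasspathExample.class);
    // Adds an HDFS-resident jar to the task classpath of every task of this job.
    DistributedCache.addFileToClassPath(new Path("/libs/my-dependency.jar"), conf);
    // ... set mapper/reducer, input/output paths, then JobClient.runJob(conf) as usual ...
  }
}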



Re: missing part folder - how to debug?

2010-09-01 Thread Alex Baranau
Hi,

Adding Solr user list.

We used a similar approach to the one in this patch, but with Hadoop Streaming.
Did you determine that the indices are really missing? I mean, did you find
documents missing from the output indices?

Alex Baranau

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - HBase

2010/8/31 Koji Sekiguchi k...@r.email.ne.jp

  Hello,

 We are using Hadoop to make Solr index. We are using SOLR-1301
 that was first contributed by Andrzej:

 https://issues.apache.org/jira/browse/SOLR-1301

 It works great on the testing environment, 4 servers.
 Today, we ran it on the production environment, 320 servers.
 We ran 5120 reducers (16 per server). This should result in 5120 indexes,
 i.e. 5120 part-X folders should be created. But about 20 part
 folders were missing, and Hadoop didn't produce any error logs.
 How can we investigate/debug this problem?

 Any pointers, experiences would be highly appreciated!

 Thanks,

 Koji

 --
 http://www.rondhuit.com/en/
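
One way to narrow this down is to list the job output directory and work out exactly which part folders are absent, then chase those reducer attempts in the JobTracker web UI and their task logs. A minimal sketch; the output path and expected reducer count are passed as arguments, and part-NNNNN naming is assumed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckPartDirs {
  public static void main(String[] args) throws Exception {
    // args[0] is the job output directory, args[1] the expected reducer count.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FileStatus[] parts = fs.globStatus(new Path(args[0], "part-*"));
    if (parts == null) {
      parts = new FileStatus[0];
    }
    int expected = Integer.parseInt(args[1]);
    System.out.println("found " + parts.length + " of " + expected + " part folders");
    boolean[] seen = new boolean[expected];
    for (FileStatus part : parts) {
      String name = part.getPath().getName();            // e.g. part-00042
      int idx = Integer.parseInt(name.substring(name.indexOf('-') + 1));
      if (idx < expected) {
        seen[idx] = true;
      }
    }
    for (int i = 0; i < expected; i++) {
      if (!seen[i]) {
        System.out.println("missing: part-" + String.format("%05d", i));
      }
    }
  }
}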




Re: accounts permission on hadoop

2010-09-01 Thread Allen Wittenauer

On Sep 1, 2010, at 9:08 AM, Todd Lipcon wrote:
 
 Currently hadoop gets its user groups from the posix user/groups.
 
 ... based upon what the client sends, not what the server knows.
 
 Not anymore in trunk or the security branch - now it's mapped on the
 server side with a configurable resolver class.


Yes, but only like 3 people use that stuff presently.

Trunk=unicorns and ponies.




Re: how to revert from a new version to an older one (CDH3)?

2010-09-01 Thread Eli Collins
Hey guys,

In CDH3 you can pin your repo to a particular release. E.g., in the
following docs, to use beta 1 specify redhat/cdh/3b1 instead of
redhat/cdh/3 in the repo file (for RH), or DISTRO-cdh3b1 instead
of DISTRO-cdh3 in the list file (for Debian). You'll need to do a
"yum clean metadata" or "apt-get clean update" so the new packages are
seen.

https://wiki.cloudera.com/display/DOC/Hadoop+Installation+(CDH3)

Also, please direct CDH usage queries to the user list:

https://groups.google.com/a/cloudera.org/group/cdh-user

Thanks,
Eli

On Tue, Aug 24, 2010 at 11:05 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
 On Tue, Aug 24, 2010 at 1:36 PM, jiang licht licht_ji...@yahoo.com wrote:
 Thanks Sudhir and Michael. I want to replace a new release of CDH3 
 (0.20.2+320) with a previous release of CDH3 (0.20.2+228). The problem is that 
 there is no installation package for the previous release of CDH3 and no source 
 to rebuild from. If you do a yum install from the Cloudera repository, you always 
 get the latest release. That's why I want to know a nice way to do this. 
 Please correct me if I am wrong. I also noticed that people talked about a 
 package for each release on the Cloudera-supported forum getsatisfaction.com, but 
 I don't know the current status.

 In the end, to get work done and since Hadoop is simply a Java application, 
 I simply used the files installed by the previous release (on other machines) 
 and set up configurations pointing to the right locations.

 Thanks,

 Michael

 --- On Tue, 8/24/10, Sudhir Vallamkondu sudhir.vallamko...@icrossing.com 
 wrote:

 From: Sudhir Vallamkondu sudhir.vallamko...@icrossing.com
 Subject: RE: how to revert from a new version to an older one (CDH3)?
 To: common-user@hadoop.apache.org
 Date: Tuesday, August 24, 2010, 10:57 AM

 More specifics on Michael's comment. You can use yum remove or apt-get
 purge to remove the existing install.

 For Red Hat systems, run this command:
 # yum remove hadoop -y

 For Debian systems, run this command:
 # apt-get purge hadoop

 Verify that you have no Hadoop packages installed on your cluster.

 For Red Hat systems, run this command which should return no packages:
 $ rpm -qa | grep hadoop

 For Debian systems, run this command which should return no packages:
 $ dpkg -l | grep hadoop

 References:
 https://docs.cloudera.com/display/DOC/Hadoop+Upgrade+from+CDH2+to+CDH3

 On Aug/24/ 5:08 AM, common-user-digest-h...@hadoop.apache.org
 common-user-digest-h...@hadoop.apache.org wrote:

 From: Michael Segel michael_se...@hotmail.com
 Date: Tue, 24 Aug 2010 06:21:30 -0500
 To: common-user@hadoop.apache.org
 Subject: RE: how to revert from a new version to an older one (CDH3)?


 Not sure if you got your question answered...

 You need to delete the current version (via yum) and then specifically
 re-install the version you want by specifying the full name including 
 version.

 HTH
 -Mike


  Date: Mon, 23 Aug 2010 15:00:39 -0700
  From: licht_ji...@yahoo.com
  Subject: how to revert from a new version to an older one (CDH3)?
  To: common-user@hadoop.apache.org
 
  I want to replace a new CDH version 0.20.2+320 with an older one
 0.20.2+228.
 
  yum downgrade reports that version can only be upgraded. I also didn't
 find a way to yum install the older version.
 
  I guess I can download a tarball of the old version and extract it to where
 the new version is installed and overwrite it. But that seems not a good solution
 because it might have a negative impact on upgrading in the future.
 
  So, what is the best way to do this?
 
  Thanks,
 
  Michael
 
 
 










 Ah. The dangers of installing things from the Internet!!!

 The cloudera package for hadoop is great. I use it, but I DO NOT
 download it from the internet every time! Why?

 Because of the exact problem you are having, packages get updated and
 finding the older one can be hard. Always keep a copy of your RPMs
 locally! (and run your own yum repo)

 You used to be able to navigate around the Cloudera repo and find the
 older RPM inside the same folder. You still probably can hunt around
 and you should be able to find it.

 http://archive.cloudera.com/cdh/3/

 Good luck!



Re: From X to Hadoop MapReduce

2010-09-01 Thread James Seigel
Sounds good!  Please give some examples :)

I just got back from some holidays and will start posting some more stuff 
shortly

Cheers
James.


On 2010-07-21, at 7:22 PM, Jeff Zhang wrote:

 Cool, James. I am very interested to contribute to this.
 I think group by, join and order by can be added to the examples.
 
 
 On Thu, Jul 22, 2010 at 4:59 AM, James Seigel ja...@tynt.com wrote:
 
 Oh yeah, it would help if I put the url:
 
 http://github.com/seigel/MRPatterns
 
 James
 
 On 2010-07-21, at 2:55 PM, James Seigel wrote:
 
 Here is a skeleton project I stuffed up on github (feel free to offer
 other suggestions/alternatives).  There is a wiki, a place to commit code, a
 place to fork around, etc..
 
 Over the next couple of days I’ll try and put up some sample samples for
 people to poke around with.  Feel free to attack the wiki, contribute code,
 etc...
 
 If anyone can derive some cool pseudo code to write map reduce type
 algorithms that’d be great.
 
 Cheers
 James.
 
 
 On 2010-07-21, at 10:51 AM, James Seigel wrote:
 
 Jeff, I agree that cascading looks cool and might/should have a place in
 everyone’s tool box, however at some corps it takes a while to get those
 kinds of changes in place and therefore they might have to hand craft some
 java code before moving (if they ever can) to a different technology.
 
 I will get something up and going and post a link back for whomever is
 interested.
 
 To answer Himanshu’s question, I am thinking something like this (with
 some code):
 
 Hadoop M/R Patterns, and ones that match Pig Structures
 
 1. COUNT: [Mapper] Spit out one key and the value of 1. [Combiner] Same
 as reducer. [Reducer] count = count + next.value.  [Emit] Single result.
 2. FREQ COUNT: [Mapper] Item, 1.  [Combiner] Same as reducer. [Reducer]
 count = count + next.value.  [Emit] list of Key, count
 3. UNIQUE: [Mapper] Item, One.  [Combiner] None.  [Reducer + Emit] spit
 out list of keys and no value.
 
 I think adding a description of why the technique works would be helpful
 for people learning as well.  I see some questions from people not
 understanding what happens to the data between mappers and reducers, or what
 data they will see when it gets to the reducer...etc...
 
 Cheers
 James.
 
 
 
 
 
 
 -- 
 Best Regards
 
 Jeff Zhang



Re: From X to Hadoop MapReduce

2010-09-01 Thread Lance Norskog
'hamake' on github looks like a handy tool as well; I haven't used it.
It does the old Unix 'make' timestamp-dependency trick on the
input/output file sets to decide which jobs to run in sequence, and
possibly in parallel.

Lance

On Wed, Sep 1, 2010 at 12:27 PM, James Seigel ja...@tynt.com wrote:
 Sounds good!  Please give some examples :)

 I just got back from some holidays and will start posting some more stuff 
 shortly

 Cheers
 James.


 On 2010-07-21, at 7:22 PM, Jeff Zhang wrote:

 Cool, James. I am very interested to contribute to this.
 I think group by, join and order by can be added to the examples.


 On Thu, Jul 22, 2010 at 4:59 AM, James Seigel ja...@tynt.com wrote:

 Oh yeah, it would help if I put the url:

 http://github.com/seigel/MRPatterns

 James

 On 2010-07-21, at 2:55 PM, James Seigel wrote:

 Here is a skeleton project I stuffed up on github (feel free to offer
 other suggestions/alternatives).  There is a wiki, a place to commit code, a
 place to fork around, etc..

 Over the next couple of days I’ll try and put up some sample samples for
 people to poke around with.  Feel free to attack the wiki, contribute code,
 etc...

 If anyone can derive some cool pseudo code to write map reduce type
 algorithms that’d be great.

 Cheers
 James.


 On 2010-07-21, at 10:51 AM, James Seigel wrote:

 Jeff, I agree that cascading looks cool and might/should have a place in
 everyone’s tool box, however at some corps it takes a while to get those
 kinds of changes in place and therefore they might have to hand craft some
 java code before moving (if they ever can) to a different technology.

 I will get something up and going and post a link back for whomever is
 interested.

 To answer Himanshu’s question, I am thinking something like this (with
 some code):

 Hadoop M/R Patterns, and ones that match Pig Structures

 1. COUNT: [Mapper] Spit out one key and the value of 1. [Combiner] Same
 as reducer. [Reducer] count = count + next.value.  [Emit] Single result.
 2. FREQ COUNT: [Mapper] Item, 1.  [Combiner] Same as reducer. [Reducer]
 count = count + next.value.  [Emit] list of Key, count
 3. UNIQUE: [Mapper] Item, One.  [Combiner] None.  [Reducer + Emit] spit
 out list of keys and no value.

 I think adding a description of why the technique works would be helpful
 for people learning as well.  I see some questions from people not
 understanding what happens to the data between mappers and reducers, or what
 data they will see when it gets to the reducer...etc...

 Cheers
 James.






 --
 Best Regards

 Jeff Zhang





-- 
Lance Norskog
goks...@gmail.com