Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread Amareshwari Sriramadasu

Are you hitting HADOOP-2771?
-Amareshwari
Sandy wrote:

Hello all,

For the sake of benchmarking, I ran the standard hadoop wordcount example on
an input file using 2, 4, and 8 mappers and reducers for my job.
In other words,  I do:

time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
sample.txt output
time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
sample.txt output2
time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
sample.txt output3

Strangely enough, this increase in mappers and reducers results in
slower running times!
- On 2 mappers and reducers it ran for 40 seconds
- On 4 mappers and reducers it ran for 60 seconds
- On 8 mappers and reducers it ran for 90 seconds!

Please note that the "sample.txt" file is identical in each of these runs.

I have the following questions:
- Shouldn't wordcount get -faster- with additional mappers and reducers,
instead of slower?
- If it does get faster for other people, why does it become slower for me?
  I am running Hadoop in pseudo-distributed mode on a single 64-bit Mac Pro
with 2 quad-core processors, 16 GB of RAM, and four 1 TB HDs.

I would greatly appreciate it if someone could explain this behavior to me,
and tell me if I'm running this wrong. How can I change my settings (if at
all) to get wordcount running faster when I increase the number of maps
and reduces?

Thanks,
-SM

  




Re: Running 0.19.2 branch in production before release

2009-03-05 Thread Steve Loughran

Aaron Kimball wrote:

I recommend 0.18.3 for production use and avoid the 19 branch entirely. If
your priority is stability, then stay a full minor version behind, not just
a revision.


Of course, if everyone stays that far behind, they don't get to find the 
bugs for other people.


* If you play with the latest releases early, while they are in the beta 
phase, you will encounter the problems specific to your 
applications/datacentres, and get them fixed fast.


* If you work with stuff further back you get stability, but not only 
are you behind on features, you can't be sure that all "fixes" that 
matter to you get pushed back.


* If you plan on making changes, or adding features, get onto SVN_HEAD

* If you want to catch changes being made that break your site, track 
SVN_HEAD. Better yet, have a private Hudson server check out SVN_HEAD 
Hadoop and *then* build and test your app against it.


Normally I work with stable releases of things I don't depend on, and 
SVN_HEAD of OSS stuff whose code I have any intent to change; there is a 
price (merge time, the odd change breaking your code) but you get to 
make changes that help you long term.


Where Hadoop is different is that it is a filesystem, and you don't want 
to hit bugs that delete files that matter. I'm only bringing up 
transient clusters on VMs, pulling in data from elsewhere, so this isn't 
an issue. All that remains is changing APIs.


-Steve


Re: contrib EC2 with hadoop 0.17

2009-03-05 Thread Tom White
I haven't used Eucalyptus, but you could start by trying out the
Hadoop EC2 scripts (http://wiki.apache.org/hadoop/AmazonEC2) with your
Eucalyptus installation.

Cheers,
Tom

On Tue, Mar 3, 2009 at 2:51 PM, falcon164  wrote:
>
> I am new to hadoop. I want to run hadoop on eucalyptus. Please let me know
> how to do this.
> --
> View this message in context: 
> http://www.nabble.com/contrib-EC2-with-hadoop-0.17-tp17711758p22310068.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: Hadoop AMI for EC2

2009-03-05 Thread Tom White
Hi Richa,

Yes there is. Please see http://wiki.apache.org/hadoop/AmazonEC2.

Tom

On Thu, Mar 5, 2009 at 4:13 PM, Richa Khandelwal  wrote:
> Hi All,
> Is there an existing Hadoop AMI for EC2 which had Hadaoop setup on it?
>
> Thanks,
> Richa Khandelwal
>
>
> University Of California,
> Santa Cruz.
> Ph:425-241-7763
>


Re: Hadoop AMI for EC2

2009-03-05 Thread tim robertson
Yeps,

A good starting read: http://wiki.apache.org/hadoop/AmazonEC2

These are the AMIs:

$ ec2-describe-images -a | grep hadoop
IMAGE   ami-245db94d  cloudbase-1.1-hadoop-fc64/image.manifest.xml  247610401714  available  public  x86_64  machine
IMAGE   ami-791ffb10  cloudbase-hadoop-fc64/cloudbase-hadoop-fc64.manifest.xml  247610401714  available  public  x86_64  machine
IMAGE   ami-f73adf9e  cs345-hadoop-EC2-0.15.3/hadoop-0.15.3.manifest.xml  825431212034  available  public  i386  machine
IMAGE   ami-c55db8ac  fedora8-hypertable-hadoop-kfs/image.manifest.xml  291354417104  available  public  x86_64  machine  aki-b51cf9dc  ari-b31cf9da
IMAGE   ami-ce6b8fa7  hachero-hadoop/hadoop-0.19.0-i386.manifest.xml  118946012109  available  public  i386  machine  aki-a71cf9ce  ari-a51cf9cc
IMAGE   ami-dd48acb4  hachero-hadoop/hadoop-0.19.0-x86_64.manifest.xml  118946012109  available  public  x86_64  machine  aki-b51cf9dc  ari-b31cf9da
IMAGE   ami-ee53b687  hadoop-ec2-images/hadoop-0.17.0-i386.manifest.xml  111560892610  available  public  i386  machine  aki-a71cf9ce  ari-a51cf9cc
IMAGE   ami-f853b691  hadoop-ec2-images/hadoop-0.17.0-x86_64.manifest.xml  111560892610  available  public  x86_64  machine  aki-b51cf9dc  ari-b31cf9da
IMAGE   ami-65987c0c  hadoop-images/hadoop-0.17.1-i386.manifest.xml  914733919441  available  public  i386  machine  aki-a71cf9ce  ari-a51cf9cc
IMAGE   ami-4b987c22  hadoop-images/hadoop-0.17.1-x86_64.manifest.xml  914733919441  available  public  x86_64  machine  aki-b51cf9dc  ari-b31cf9da
IMAGE   ami-b0fe1ad9  hadoop-images/hadoop-0.18.0-i386.manifest.xml  914733919441  available  public  i386  machine  aki-a71cf9ce  ari-a51cf9cc
IMAGE   ami-90fe1af9  hadoop-images/hadoop-0.18.0-x86_64.manifest.xml  914733919441  available  public  x86_64  machine  aki-b51cf9dc  ari-b31cf9da
IMAGE   ami-ea36d283  hadoop-images/hadoop-0.18.1-i386.manifest.xml  914733919441  available  public  i386  machine  aki-a71cf9ce  ari-a51cf9cc
IMAGE   ami-fe37d397  hadoop-images/hadoop-0.18.1-x86_64.manifest.xml  914733919441  available  public  x86_64  machine  aki-b51cf9dc  ari-b31cf9da
IMAGE   ami-fa6a8e93  hadoop-images/hadoop-0.19.0-i386.manifest.xml  914733919441  available  public  i386  machine  aki-a71cf9ce  ari-a51cf9cc
IMAGE   ami-cd6a8ea4  hadoop-images/hadoop-0.19.0-x86_64.manifest.xml  914733919441  available  public  x86_64  machine  aki-b51cf9dc  ari-b31cf9da
IMAGE   ami-15e80f7c  hadoop-images/hadoop-base-20090210-i386.manifest.xml  914733919441  available  public  i386  machine  aki-a71cf9ce  ari-a51cf9cc
IMAGE   ami-1ee80f77  hadoop-images/hadoop-base-20090210-x86_64.manifest.xml  914733919441  available  public  x86_64  machine  aki-b51cf9dc  ari-b31cf9da
IMAGE   ami-4de30724  hbase-ami/hbase-0.2.0-hadoop-0.17.1-i386.manifest.xml  834125115996  available  public  i386  machine  aki-a71cf9ce  ari-a51cf9cc
IMAGE   ami-fe7c9997  radlab-hadoop-4-large/image.manifest.xml  117716615155  available  public  x86_64  machine
IMAGE   ami-7f7f9a16  radlab-hadoop-4/image.manifest.xml  117716615155  available  public  i386  machine
$

Cheers,

Tim



On Thu, Mar 5, 2009 at 5:13 PM, Richa Khandelwal  wrote:
> Hi All,
> Is there an existing Hadoop AMI for EC2 which had Hadaoop setup on it?
>
> Thanks,
> Richa Khandelwal
>
>
> University Of California,
> Santa Cruz.
> Ph:425-241-7763
>


System Layout Best Practices

2009-03-05 Thread David Ritch
Are there any published guidelines on system configuration for Hadoop?

I've seen hardware suggestions, but I'm really interested in recommendations
on disk layout and partitioning.  The defaults, as shipped and defined in
hadoop-default.xml, may be appropriate for testing, but are not really
appropriate for sustained use.  For example, data and metadata are both
stored in /tmp.  In typical use on a cluster with a couple hundred nodes,
the NameNode can generate 3-5GB of logs per day.  If you configure your
namenode host badly, it's easy to fill up the partition used by dfs for
metadata, and clobber your dfs filesystem.  I would think that thresholding
logs on WARN would be preferable to INFO.
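
For what it's worth, a small audit sketch (assuming the 0.18-era property
names; adjust for your version) that just prints where the resolved
configuration would put metadata, block data, and task scratch space makes it
easy to spot anything still defaulting to /tmp:

import org.apache.hadoop.conf.Configuration;

public class LayoutAudit {
  public static void main(String[] args) {
    // new Configuration() loads hadoop-default.xml and hadoop-site.xml from the classpath.
    Configuration conf = new Configuration();
    String[] keys = {
        "hadoop.tmp.dir",            // base directory most defaults derive from
        "dfs.name.dir",              // namenode metadata
        "dfs.data.dir",              // datanode block storage
        "mapred.local.dir",          // map-reduce temporary/scratch space
        "dfs.datanode.du.reserved"   // bytes held back for non-DFS use
    };
    for (String key : keys) {
      System.out.println(key + " = " + conf.get(key));
    }
  }
}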

On a datanode, we would like to reserve as much space as we can for data,
but we know that map-reduce jobs need some local storage.  How do people
generally estimate the amount of space required for temporary storage?  I
would assume that it would be good to partition it from data storage, to
prevent running out of temp space on some nodes.  I would also think that it
would be preferable for performance to have temp space on a different
spindle, so it and hdfs data can be accessed independently.

I would be interested to know how other sites configure their systems, and I
would love to see some guidelines for system configuration for Hadoop.

Thank you!

David


Re: Hadoop AMI for EC2

2009-03-05 Thread Richa Khandelwal
That's pretty cool. Thanks

On Thu, Mar 5, 2009 at 8:17 AM, tim robertson wrote:

> Yeps,
>
> A good starting read: http://wiki.apache.org/hadoop/AmazonEC2
>
> These are the AMIs:
>
> $ ec2-describe-images -a | grep hadoop
> IMAGE   ami-245db94dcloudbase-1.1-hadoop-fc64/image.manifest.xml
>  247610401714available   public  x86_64  machine
> IMAGE   ami-791ffb10
>  cloudbase-hadoop-fc64/cloudbase-hadoop-fc64.manifest.xml
>  247610401714available   public  x86_64  machine
> IMAGE   ami-f73adf9ecs345-hadoop-EC2-0.15.3/hadoop-0.15.3.manifest.xml
>  825431212034available   public  i386machine
> IMAGE   ami-c55db8acfedora8-hypertable-hadoop-kfs/image.manifest.xml
>  291354417104available   public  x86_64  machine
> aki-b51cf9dcari-b31cf9da
> IMAGE   ami-ce6b8fa7hachero-hadoop/hadoop-0.19.0-i386.manifest.xml
>  118946012109available   public  i386machine
> aki-a71cf9ceari-a51cf9cc
> IMAGE   ami-dd48acb4hachero-hadoop/hadoop-0.19.0-x86_64.manifest.xml
>  118946012109available   public  x86_64  machine
> aki-b51cf9dcari-b31cf9da
> IMAGE   ami-ee53b687hadoop-ec2-images/hadoop-0.17.0-i386.manifest.xml
> 111560892610available   public  i386machine
> aki-a71cf9ceari-a51cf9cc
> IMAGE   ami-f853b691hadoop-ec2-images/hadoop-0.17.0-x86_64.manifest.xml
> 111560892610available   public  x86_64  machine
> aki-b51cf9dcari-b31cf9da
> IMAGE   ami-65987c0chadoop-images/hadoop-0.17.1-i386.manifest.xml
> 914733919441available   public  i386machine aki-a71cf9ce
>ari-a51cf9cc
> IMAGE   ami-4b987c22hadoop-images/hadoop-0.17.1-x86_64.manifest.xml
> 914733919441available   public  x86_64  machine aki-b51cf9dc
>ari-b31cf9da
> IMAGE   ami-b0fe1ad9hadoop-images/hadoop-0.18.0-i386.manifest.xml
> 914733919441available   public  i386machine aki-a71cf9ce
>ari-a51cf9cc
> IMAGE   ami-90fe1af9hadoop-images/hadoop-0.18.0-x86_64.manifest.xml
> 914733919441available   public  x86_64  machine aki-b51cf9dc
>ari-b31cf9da
> IMAGE   ami-ea36d283hadoop-images/hadoop-0.18.1-i386.manifest.xml
> 914733919441available   public  i386machine aki-a71cf9ce
>ari-a51cf9cc
> IMAGE   ami-fe37d397hadoop-images/hadoop-0.18.1-x86_64.manifest.xml
> 914733919441available   public  x86_64  machine aki-b51cf9dc
>ari-b31cf9da
> IMAGE   ami-fa6a8e93hadoop-images/hadoop-0.19.0-i386.manifest.xml
> 914733919441available   public  i386machine aki-a71cf9ce
>ari-a51cf9cc
> IMAGE   ami-cd6a8ea4hadoop-images/hadoop-0.19.0-x86_64.manifest.xml
> 914733919441available   public  x86_64  machine aki-b51cf9dc
>ari-b31cf9da
> IMAGE   ami-15e80f7c
>  hadoop-images/hadoop-base-20090210-i386.manifest.xml914733919441
>  available   public  i386machine aki-a71cf9ce
>  ari-a51cf9cc
> IMAGE   ami-1ee80f77
>  hadoop-images/hadoop-base-20090210-x86_64.manifest.xml  914733919441
>  available   public  x86_64  machine aki-b51cf9dc
>  ari-b31cf9da
> IMAGE   ami-4de30724
>  hbase-ami/hbase-0.2.0-hadoop-0.17.1-i386.manifest.xml   834125115996
>  available   public  i386machine aki-a71cf9ce
>  ari-a51cf9cc
> IMAGE   ami-fe7c9997radlab-hadoop-4-large/image.manifest.xml
>  117716615155available   public  x86_64  machine
> IMAGE   ami-7f7f9a16radlab-hadoop-4/image.manifest.xml
>  117716615155available   public  i386machine
> $
>
> Cheers,
>
> Tim
>
>
>
> On Thu, Mar 5, 2009 at 5:13 PM, Richa Khandelwal 
> wrote:
> > Hi All,
> > Is there an existing Hadoop AMI for EC2 which had Hadaoop setup on it?
> >
> > Thanks,
> > Richa Khandelwal
> >
> >
> > University Of California,
> > Santa Cruz.
> > Ph:425-241-7763
> >
>



-- 
Richa Khandelwal


University Of California,
Santa Cruz.
Ph:425-241-7763


Re: Hadoop AMI for EC2

2009-03-05 Thread Richa Khandelwal
Hi All,
I am trying to log MapReduce jobs in HADOOP_LOG_DIR by setting its value in
hadoop-env.sh, but the directory has no log records when the job finishes
running. I am adding JobConf.setProfileEnabled(true) to my job. Can anyone
point out how to enable logging in Hadoop?
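
For context, here is roughly how the job is set up, as a minimal sketch
against the 0.18 JobConf API (ProfiledJob and the identity mapper/reducer are
placeholders for my actual job). As far as I can tell, profile.out ends up
with the task logs (userlogs) on each tasktracker rather than under the
client-side HADOOP_LOG_DIR, which may be why I see nothing there:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ProfiledJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ProfiledJob.class);
    conf.setJobName("profiled-identity-job");

    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Task profiling: profile.out is written alongside the task logs on each
    // tasktracker, not under the submitting client's HADOOP_LOG_DIR.
    conf.setProfileEnabled(true);
    conf.setProfileTaskRange(true, "0-2");   // profile the first three map tasks
    conf.setProfileTaskRange(false, "0-1");  // and the first two reduce tasks

    JobClient.runJob(conf);
  }
}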

Thanks,
Richa

On Thu, Mar 5, 2009 at 8:20 AM, Richa Khandelwal  wrote:

> Thats pretty cool. Thanks
>
>
> On Thu, Mar 5, 2009 at 8:17 AM, tim robertson 
> wrote:
>
>> Yeps,
>>
>> A good starting read: http://wiki.apache.org/hadoop/AmazonEC2
>>
>> These are the AMIs:
>>
>> $ ec2-describe-images -a | grep hadoop
>> IMAGE   ami-245db94dcloudbase-1.1-hadoop-fc64/image.manifest.xml
>>  247610401714available   public  x86_64  machine
>> IMAGE   ami-791ffb10
>>  cloudbase-hadoop-fc64/cloudbase-hadoop-fc64.manifest.xml
>>  247610401714available   public  x86_64  machine
>> IMAGE   ami-f73adf9ecs345-hadoop-EC2-0.15.3/hadoop-0.15.3.manifest.xml
>>  825431212034available   public  i386machine
>> IMAGE   ami-c55db8acfedora8-hypertable-hadoop-kfs/image.manifest.xml
>>  291354417104available   public  x86_64  machine
>> aki-b51cf9dcari-b31cf9da
>> IMAGE   ami-ce6b8fa7hachero-hadoop/hadoop-0.19.0-i386.manifest.xml
>>  118946012109available   public  i386machine
>> aki-a71cf9ceari-a51cf9cc
>> IMAGE   ami-dd48acb4hachero-hadoop/hadoop-0.19.0-x86_64.manifest.xml
>>  118946012109available   public  x86_64  machine
>> aki-b51cf9dcari-b31cf9da
>> IMAGE   ami-ee53b687hadoop-ec2-images/hadoop-0.17.0-i386.manifest.xml
>>   111560892610available   public  i386machine
>> aki-a71cf9ceari-a51cf9cc
>> IMAGE   ami-f853b691
>>  hadoop-ec2-images/hadoop-0.17.0-x86_64.manifest.xml 111560892610
>>  available   public  x86_64  machine aki-b51cf9dc
>>  ari-b31cf9da
>> IMAGE   ami-65987c0chadoop-images/hadoop-0.17.1-i386.manifest.xml
>> 914733919441available   public  i386machine aki-a71cf9ce
>>ari-a51cf9cc
>> IMAGE   ami-4b987c22hadoop-images/hadoop-0.17.1-x86_64.manifest.xml
>> 914733919441available   public  x86_64  machine aki-b51cf9dc
>>ari-b31cf9da
>> IMAGE   ami-b0fe1ad9hadoop-images/hadoop-0.18.0-i386.manifest.xml
>> 914733919441available   public  i386machine aki-a71cf9ce
>>ari-a51cf9cc
>> IMAGE   ami-90fe1af9hadoop-images/hadoop-0.18.0-x86_64.manifest.xml
>> 914733919441available   public  x86_64  machine aki-b51cf9dc
>>ari-b31cf9da
>> IMAGE   ami-ea36d283hadoop-images/hadoop-0.18.1-i386.manifest.xml
>> 914733919441available   public  i386machine aki-a71cf9ce
>>ari-a51cf9cc
>> IMAGE   ami-fe37d397hadoop-images/hadoop-0.18.1-x86_64.manifest.xml
>> 914733919441available   public  x86_64  machine aki-b51cf9dc
>>ari-b31cf9da
>> IMAGE   ami-fa6a8e93hadoop-images/hadoop-0.19.0-i386.manifest.xml
>> 914733919441available   public  i386machine aki-a71cf9ce
>>ari-a51cf9cc
>> IMAGE   ami-cd6a8ea4hadoop-images/hadoop-0.19.0-x86_64.manifest.xml
>> 914733919441available   public  x86_64  machine aki-b51cf9dc
>>ari-b31cf9da
>> IMAGE   ami-15e80f7c
>>  hadoop-images/hadoop-base-20090210-i386.manifest.xml914733919441
>>  available   public  i386machine aki-a71cf9ce
>>  ari-a51cf9cc
>> IMAGE   ami-1ee80f77
>>  hadoop-images/hadoop-base-20090210-x86_64.manifest.xml  914733919441
>>  available   public  x86_64  machine aki-b51cf9dc
>>  ari-b31cf9da
>> IMAGE   ami-4de30724
>>  hbase-ami/hbase-0.2.0-hadoop-0.17.1-i386.manifest.xml   834125115996
>>  available   public  i386machine aki-a71cf9ce
>>  ari-a51cf9cc
>> IMAGE   ami-fe7c9997radlab-hadoop-4-large/image.manifest.xml
>>  117716615155available   public  x86_64  machine
>> IMAGE   ami-7f7f9a16radlab-hadoop-4/image.manifest.xml
>>  117716615155available   public  i386machine
>> $
>>
>> Cheers,
>>
>> Tim
>>
>>
>>
>> On Thu, Mar 5, 2009 at 5:13 PM, Richa Khandelwal 
>> wrote:
>> > Hi All,
>> > Is there an existing Hadoop AMI for EC2 which had Hadaoop setup on it?
>> >
>> > Thanks,
>> > Richa Khandelwal
>> >
>> >
>> > University Of California,
>> > Santa Cruz.
>> > Ph:425-241-7763
>> >
>>
>
>
>
> --
> Richa Khandelwal
>
>
> University Of California,
> Santa Cruz.
> Ph:425-241-7763
>



-- 
Richa Khandelwal


University Of California,
Santa Cruz.
Ph:425-241-7763


Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread Sandy
I used three different sample.txt files, and was able to replicate the
error. The first was 1.5MB, the second 66MB, and the last 428MB. I get the
same problem despite what size of input file I use: the running time of
wordcount increases with the number of mappers and reducers specified. If it
is the problem of the input file, how big do I have to go before it
disappears entirely?

If it is pseudo-distributed mode that's the issue, what mode should I be
running on my machine, given its specs? Once again, it is a SINGLE Mac Pro
with 16 GB of RAM, four 1 TB hard disks, and 2 quad-core processors.

I'm not sure if it's HADOOP-2771, since the sort/merge(shuffle) is what
seems to be taking the longest:
2 M/R ==> map: 18 sec, shuffle: 15 sec, reduce: 9 sec
4 M/R ==> map: 19 sec, shuffle: 37 sec, reduce: 2 sec
8 M/R ==> map: 21 sec, shuffle: 1 min 10 sec, reduce: 1 sec

To make sure it's not because of the combiner, I removed it and reran
everything again, and got the same bottom line: with increasing maps and
reducers, running time goes up, with the majority of time apparently spent in
sort/merge.

Also, another thing we noticed is that the CPUs seem to be very active
during the map phase, but when the map phase reaches 100%, and only reduce
appears to be running, the CPUs all become idle. Furthermore, despite the
number of mappers I specify, all the CPUs become very active when a job is
running. Why is this so? If I specify 2 mappers and 2 reducers, won't there
be just 2 or 4 CPUs that should be active? Why are all 8 active?

Since I can reproduce this error using Hadoop's standard word count example,
I was hoping that someone else could tell me if they can reproduce this too.
Is it true that when you increase the number of mappers and reducers on your
systems, the running time of wordcount goes up?

Thanks for the help! I'm looking forward to your responses.

-SM

On Thu, Mar 5, 2009 at 2:57 AM, Amareshwari Sriramadasu <
amar...@yahoo-inc.com> wrote:

> Are you hitting HADOOP-2771?
> -Amareshwari
>
> Sandy wrote:
>
>> Hello all,
>>
>> For the sake of benchmarking, I ran the standard hadoop wordcount example
>> on
>> an input file using 2, 4, and 8 mappers and reducers for my job.
>> In other words,  I do:
>>
>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
>> sample.txt output
>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
>> sample.txt output2
>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
>> sample.txt output3
>>
>> Strangely enough, when this increase in mappers and reducers result in
>> slower running times!
>> -On 2 mappers and reducers it ran for 40 seconds
>> on 4 mappers and reducers it ran for 60 seconds
>> on 8 mappers and reducers it ran for 90 seconds!
>>
>> Please note that the "sample.txt" file is identical in each of these runs.
>>
>> I have the following questions:
>> - Shouldn't wordcount get -faster- with additional mappers and reducers,
>> instead of slower?
>> - If it does get faster for other people, why does it become slower for
>> me?
>>  I am running hadoop on psuedo-distributed mode on a single 64-bit Mac Pro
>> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
>>
>> I would greatly appreciate it if someone could explain this behavior to
>> me,
>> and tell me if I'm running this wrong. How can I change my settings (if at
>> all) to get wordcount running faster when i increases that number of maps
>> and reduces?
>>
>> Thanks,
>> -SM
>>
>>
>>
>
>


Re: System Layout Best Practices

2009-03-05 Thread Sandy
Hi David,

I don't know if you've seen this already, but this might be of some help:
http://hadoop.apache.org/core/docs/r0.18.3/cluster_setup.html

Near the bottom, there is a section called "Real-World Cluster
Configurations" with some sample configuration parameters that were used to
run a very large sort benchmark.

All the best,
-SM

On Thu, Mar 5, 2009 at 10:20 AM, David Ritch  wrote:

> Are there any published guidelines on system configuration for Hadoop?
>
> I've seen hardware suggestions, but I'm really interested in
> recommendations
> on disk layout and partitioning.  The defaults, as shipped and defined in
> hadoop-default.xml, may be appropriate for testing, but are not really
> appropriate for sustained use.  For example, data and metadata are both
> stored in /tmp.  In typical use on a cluster with a couple hundred nodes,
> the NameNode can generate 3-5GB of logs per day.  If you configure your
> namenode host badly, it's easy to fill up the partition used by dfs for
> metadata, and clobber your dfs filesystem.  I would think that thresholding
> logs on WARN would be preferable to INFO.
>
> On a datanode, we would like to reserve as much space as we can for data,
> but we know that map-reduce jobs need some local storage.  How do people
> generally estimate the amount of space required for temporary storage?  I
> would assume that it would be good to partition it from data storage, to
> prevent running out of temp space on some nodes.  I would also think that
> it
> would be preferable for performance to have temp space on a different
> spindle, so it and hdfs data can be accessed independently.
>
> I would be interested to know how other sites configure their systems, and
> I
> would love to see some guidelines for system configuration for Hadoop.
>
> Thank you!
>
> David
>


Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread Nick Cen
Try splitting your sample.txt into multiple files and try it again.
For the text input format, the number of map tasks is determined by the input size.


2009/3/6 Sandy 

> I used three different sample.txt files, and was able to replicate the
> error. The first was 1.5MB, the second 66MB, and the last 428MB. I get the
> same problem despite what size of input file I use: the running time of
> wordcount increases with the number of mappers and reducers specified. If
> it
> is the problem of the input file, how big do I have to go before it
> disappears entirely?
>
> If it is psuedo-distributed mode that's the issue, what mode should I be
> running on my machine, given it's specs? Once again, it is a SINGLE MacPro
> with 16GB of RAM, 4  1TB hard disks, and 2 quad-core processors.
>
> I'm not sure if it's HADOOP-2771, since the sort/merge(shuffle) is what
> seems to be taking the longest:
> 2 M/R ==> map: 18 sec, shuffle: 15 sec, reduce: 9 sec
> 4 M/R ==> map: 19 sec, shuffle: 37 sec, reduce: 2 sec
> 8 M/R ==> map: 21 sec, shuffle: 1 min 10 sec, 1 sec
>
> To make sure it's not because of the combiner, I removed it and reran
> everything again, and got the same bottom-line: With increasing maps and
> reducers, running time goes up, with majority of time seeming to be in
> sort/merge.
>
> Also, another thing we noticed is that the CPUs seem to be very active
> during the map phase, but when the map phase reaches 100%, and only reduce
> appears to be running, the CPUs all become idle. Furthermore, despite the
> number of mappers I specify, all the CPUs become very active when a job is
> running. Why is this so? If I specify 2 mappers and 2 reducers, won't there
> be just 2 or 4 CPUs that should be active? Why are all 8 active?
>
> Since I can reproduce this error using Hadoop's standard word count
> example,
> I was hoping that someone else could tell me if they can reproduce this
> too.
> Is it true that when you increase the number of mappers and reducers on
> your
> systems, the running time of wordcount goes up?
>
> Thanks for the help! I'm looking forward to your responses.
>
> -SM
>
> On Thu, Mar 5, 2009 at 2:57 AM, Amareshwari Sriramadasu <
> amar...@yahoo-inc.com> wrote:
>
> > Are you hitting HADOOP-2771?
> > -Amareshwari
> >
> > Sandy wrote:
> >
> >> Hello all,
> >>
> >> For the sake of benchmarking, I ran the standard hadoop wordcount
> example
> >> on
> >> an input file using 2, 4, and 8 mappers and reducers for my job.
> >> In other words,  I do:
> >>
> >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
> >> sample.txt output
> >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
> >> sample.txt output2
> >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
> >> sample.txt output3
> >>
> >> Strangely enough, when this increase in mappers and reducers result in
> >> slower running times!
> >> -On 2 mappers and reducers it ran for 40 seconds
> >> on 4 mappers and reducers it ran for 60 seconds
> >> on 8 mappers and reducers it ran for 90 seconds!
> >>
> >> Please note that the "sample.txt" file is identical in each of these
> runs.
> >>
> >> I have the following questions:
> >> - Shouldn't wordcount get -faster- with additional mappers and reducers,
> >> instead of slower?
> >> - If it does get faster for other people, why does it become slower for
> >> me?
> >>  I am running hadoop on psuedo-distributed mode on a single 64-bit Mac
> Pro
> >> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
> >>
> >> I would greatly appreciate it if someone could explain this behavior to
> >> me,
> >> and tell me if I'm running this wrong. How can I change my settings (if
> at
> >> all) to get wordcount running faster when i increases that number of
> maps
> >> and reduces?
> >>
> >> Thanks,
> >> -SM
> >>
> >>
> >>
> >
> >
>



-- 
http://daily.appspot.com/food/


Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread Matt Ingenthron

Sandy wrote:

I used three different sample.txt files, and was able to replicate the
error. The first was 1.5MB, the second 66MB, and the last 428MB. I get the
same problem despite what size of input file I use: the running time of
wordcount increases with the number of mappers and reducers specified. If it
is the problem of the input file, how big do I have to go before it
disappears entirely?
  


Keep in mind that as long as the file < memory, it's likely coming 
straight out of the filesystem cache.  In your kind of system configuration, 
running as fast as possible, a core or two can saturate a memory 
controller, and then there would be contention showing no speedup with 
more mappers.


If you really want a feel for what this would be like, you should 
probably have much more input data.  It will entirely change as soon as 
you have to wait on disk IO.


Hope that helps,

- Matt

If it is psuedo-distributed mode that's the issue, what mode should I be
running on my machine, given it's specs? Once again, it is a SINGLE MacPro
with 16GB of RAM, 4  1TB hard disks, and 2 quad-core processors.

I'm not sure if it's HADOOP-2771, since the sort/merge(shuffle) is what
seems to be taking the longest:
2 M/R ==> map: 18 sec, shuffle: 15 sec, reduce: 9 sec
4 M/R ==> map: 19 sec, shuffle: 37 sec, reduce: 2 sec
8 M/R ==> map: 21 sec, shuffle: 1 min 10 sec, 1 sec

To make sure it's not because of the combiner, I removed it and reran
everything again, and got the same bottom-line: With increasing maps and
reducers, running time goes up, with majority of time seeming to be in
sort/merge.

Also, another thing we noticed is that the CPUs seem to be very active
during the map phase, but when the map phase reaches 100%, and only reduce
appears to be running, the CPUs all become idle. Furthermore, despite the
number of mappers I specify, all the CPUs become very active when a job is
running. Why is this so? If I specify 2 mappers and 2 reducers, won't there
be just 2 or 4 CPUs that should be active? Why are all 8 active?

Since I can reproduce this error using Hadoop's standard word count example,
I was hoping that someone else could tell me if they can reproduce this too.
Is it true that when you increase the number of mappers and reducers on your
systems, the running time of wordcount goes up?

Thanks for the help! I'm looking forward to your responses.

-SM

On Thu, Mar 5, 2009 at 2:57 AM, Amareshwari Sriramadasu <
amar...@yahoo-inc.com> wrote:

  

Are you hitting HADOOP-2771?
-Amareshwari

Sandy wrote:



Hello all,

For the sake of benchmarking, I ran the standard hadoop wordcount example
on
an input file using 2, 4, and 8 mappers and reducers for my job.
In other words,  I do:

time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
sample.txt output
time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
sample.txt output2
time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
sample.txt output3

Strangely enough, when this increase in mappers and reducers result in
slower running times!
-On 2 mappers and reducers it ran for 40 seconds
on 4 mappers and reducers it ran for 60 seconds
on 8 mappers and reducers it ran for 90 seconds!

Please note that the "sample.txt" file is identical in each of these runs.

I have the following questions:
- Shouldn't wordcount get -faster- with additional mappers and reducers,
instead of slower?
- If it does get faster for other people, why does it become slower for
me?
 I am running hadoop on psuedo-distributed mode on a single 64-bit Mac Pro
with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs

I would greatly appreciate it if someone could explain this behavior to
me,
and tell me if I'm running this wrong. How can I change my settings (if at
all) to get wordcount running faster when i increases that number of maps
and reduces?

Thanks,
-SM



  



  




Re: System Layout Best Practices

2009-03-05 Thread David Ritch
Thank you - that certainly is useful, and I would love to see more
information and discussion on that sort of thing.  However, I'm also looking
for some lower-level configuration, such as disk partitioning.

David

On Thu, Mar 5, 2009 at 11:36 AM, Sandy  wrote:

> Hi David,
>
> I don't know if you've seen this already, but this might be of some help:
> http://hadoop.apache.org/core/docs/r0.18.3/cluster_setup.html
>
> Near the bottom, there is a section called "Real-World Cluster
> Configurations" with some sample configuration parameters that were used to
> run a very large sort benchmark.
>
> All the best,
> -SM
>
> On Thu, Mar 5, 2009 at 10:20 AM, David Ritch 
> wrote:
>
> > Are there any published guidelines on system configuration for Hadoop?
> >
> > I've seen hardware suggestions, but I'm really interested in
> > recommendations
> > on disk layout and partitioning.  The defaults, as shipped and defined in
> > hadoop-default.xml, may be appropriate for testing, but are not really
> > appropriate for sustained use.  For example, data and metadata are both
> > stored in /tmp.  In typical use on a cluster with a couple hundred nodes,
> > the NameNode can generate 3-5GB of logs per day.  If you configure your
> > namenode host badly, it's easy to fill up the partition used by dfs for
> > metadata, and clobber your dfs filesystem.  I would think that
> thresholding
> > logs on WARN would be preferable to INFO.
> >
> > On a datanode, we would like to reserve as much space as we can for data,
> > but we know that map-reduce jobs need some local storage.  How do people
> > generally estimate the amount of space required for temporary storage?  I
> > would assume that it would be good to partition it from data storage, to
> > prevent running out of temp space on some nodes.  I would also think that
> > it
> > would be preferable for performance to have temp space on a different
> > spindle, so it and hdfs data can be accessed independently.
> >
> > I would be interested to know how other sites configure their systems,
> and
> > I
> > would love to see some guidelines for system configuration for Hadoop.
> >
> > Thank you!
> >
> > David
> >
>


Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread Sandy
I specified a directory containing my 428MB file split into 8 files. Same
results.

I should summarize my hadoop-site.xml file:

mapred.tasktracker.tasks.maximum = 4
mapred.line.input.format.linespermap = 1
mapred.task.timeout = 0
mapred.min.split.size = 1
mapred.child.java.opts = -Xmx2M
io.sort.factor = 200
io.sort.mb = 100
fs.inmemory.size.mb = 200
mapred.inmem.merge.threshold = 1000
dfs.replication = 1
mapred.reduce.parallel.copies = 5

I know the mapred.child.java.opts parameter is a little ridiculous, but I
was just playing around and seeing what could possibly make things faster.
For some reason, that did.
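
For reference, most of these are per-job knobs; the same values expressed on a
JobConf would look roughly like the sketch below (0.18 API), with the caveat
that mapred.tasktracker.tasks.maximum is a tasktracker daemon setting and only
takes effect from the tasktracker's own hadoop-site.xml after a restart:

import org.apache.hadoop.mapred.JobConf;

public class JobTuning {
  // Sketch only: mirrors the hadoop-site.xml values above as per-job settings.
  public static void tune(JobConf conf) {
    conf.set("mapred.child.java.opts", "-Xmx2M");  // note: this is a 2 MB heap per task JVM
    conf.setLong("mapred.task.timeout", 0);        // 0 disables the task timeout
    conf.setLong("mapred.min.split.size", 1);
    conf.setInt("io.sort.factor", 200);
    conf.setInt("io.sort.mb", 100);
    conf.setInt("fs.inmemory.size.mb", 200);
    conf.setInt("mapred.inmem.merge.threshold", 1000);
    conf.setInt("mapred.reduce.parallel.copies", 5);
    // mapred.tasktracker.tasks.maximum is a tasktracker (daemon) setting; it
    // only takes effect from the tasktracker's own hadoop-site.xml, not here.
  }
}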

Nick, I'm going to try larger files and get back to you.

-SM

On Thu, Mar 5, 2009 at 10:37 AM, Nick Cen  wrote:

> Try to split your sample.txt into multi files.  and try it again.
> For text input format , the number of task is equals to the input size.
>
>
> 2009/3/6 Sandy 
>
> > I used three different sample.txt files, and was able to replicate the
> > error. The first was 1.5MB, the second 66MB, and the last 428MB. I get
> the
> > same problem despite what size of input file I use: the running time of
> > wordcount increases with the number of mappers and reducers specified. If
> > it
> > is the problem of the input file, how big do I have to go before it
> > disappears entirely?
> >
> > If it is psuedo-distributed mode that's the issue, what mode should I be
> > running on my machine, given it's specs? Once again, it is a SINGLE
> MacPro
> > with 16GB of RAM, 4  1TB hard disks, and 2 quad-core processors.
> >
> > I'm not sure if it's HADOOP-2771, since the sort/merge(shuffle) is what
> > seems to be taking the longest:
> > 2 M/R ==> map: 18 sec, shuffle: 15 sec, reduce: 9 sec
> > 4 M/R ==> map: 19 sec, shuffle: 37 sec, reduce: 2 sec
> > 8 M/R ==> map: 21 sec, shuffle: 1 min 10 sec, 1 sec
> >
> > To make sure it's not because of the combiner, I removed it and reran
> > everything again, and got the same bottom-line: With increasing maps and
> > reducers, running time goes up, with majority of time seeming to be in
> > sort/merge.
> >
> > Also, another thing we noticed is that the CPUs seem to be very active
> > during the map phase, but when the map phase reaches 100%, and only
> reduce
> > appears to be running, the CPUs all become idle. Furthermore, despite the
> > number of mappers I specify, all the CPUs become very active when a job
> is
> > running. Why is this so? If I specify 2 mappers and 2 reducers, won't
> there
> > be just 2 or 4 CPUs that should be active? Why are all 8 active?
> >
> > Since I can reproduce this error using Hadoop's standard word count
> > example,
> > I was hoping that someone else could tell me if they can reproduce this
> > too.
> > Is it true that when you increase the number of mappers and reducers on
> > your
> > systems, the running time of wordcount goes up?
> >
> > Thanks for the help! I'm looking forward to your responses.
> >
> > -SM
> >
> > On Thu, Mar 5, 2009 at 2:57 AM, Amareshwari Sriramadasu <
> > amar...@yahoo-inc.com> wrote:
> >
> > > Are you hitting HADOOP-2771?
> > > -Amareshwari
> > >
> > > Sandy wrote:
> > >
> > >> Hello all,
> > >>
> > >> For the sake of benchmarking, I ran the standard hadoop wordcount
> > example
> > >> on
> > >> an input file using 2, 4, and 8 mappers and reducers for my job.
> > >> In other words,  I do:
> > >>
> > >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
> > >> sample.txt output
> > >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
> > >> sample.txt output2
> > >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
> > >> sample.txt output3
> > >>
> > >> Strangely enough, when this increase in mappers and reducers result in
> > >> slower running times!
> > >> -On 2 mappers and reducers it ran for 40 seconds
> > >> on 4 mappers and reducers it ran for 60 seconds
> > >> on 8 mappers and reducers it ran for 90 seconds!
> > >>
> > >> Please note that the "sample.txt" file is identical in each of these
> > runs.
> > >>
> > >> I have the following questions:
> > >> - Shouldn't wordcount get -faster- with additional mappers and
> reducers,
> > >> instead of slower?
> > >> - If it does get faster for other people, why does it become slower
> for
> > >> me?
> > >>  I am running hadoop on psuedo-distributed mode on a single 64-bit Mac
> > Pro
> > >> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
> > >>
> > >> I would greatly appreciate it if someone could explain this behavior
> to
> > >> me,
> > >> and tell me if I'm running this wrong. How can I change my settings
> (if
> > at
> > >> all) to get wordcount running faster when i increases that number of
> > maps
> > >> and reduces?
> > >>
> > >> Thanks,
> > >> -SM
> > >>
> > >>
> > >>
> > >
> > >
> >
>
>
>
> --
> http://daily.appspot.com/food/
>


Recommend JSON Library? net.sf.json has memory leak

2009-03-05 Thread Kevin Peterson
We're using JSON serialization for all our data, but we can't seem to find a
good library. We just discovered that the root cause of out of memory errors
is a leak in the net.sf.json library. Can anyone out there recommend a java
json library that they have actually used successfully within Hadoop?


Re: Recommend JSON Library? net.sf.json has memory leak

2009-03-05 Thread Ian Swett

We've used Jackson(http://jackson.codehaus.org/), which we've found to be easy 
to use and faster than any other option.  We've also had problems with net.sf 
in terms of memory and performance.

You can see a performance comparison here: 
http://www.cowtowncoder.com/blog/archives/2009/02/entry_204.html
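
For what it's worth, usage inside an old-API mapper looks roughly like the
sketch below, assuming the Jackson 1.x tree model from org.codehaus.jackson;
the "category" field name is made up for illustration:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.codehaus.jackson.JsonNode;
import org.codehaus.jackson.map.ObjectMapper;

public class JsonFieldMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  // Reuse one ObjectMapper per task rather than creating one per record.
  private final ObjectMapper mapper = new ObjectMapper();
  private static final LongWritable ONE = new LongWritable(1);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    // Parse one JSON record per input line into a tree.
    JsonNode record = mapper.readValue(value.toString(), JsonNode.class);
    JsonNode category = record.path("category");  // "category" is a hypothetical field
    if (!category.isMissingNode()) {
      output.collect(new Text(category.getTextValue()), ONE);
    }
  }
}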

-Ian

--- On Thu, 3/5/09, Kevin Peterson  wrote:

> From: Kevin Peterson 
> Subject: Recommend JSON Library? net.sf.json has memory leak
> To: core-user@hadoop.apache.org
> Date: Thursday, March 5, 2009, 9:48 AM
> We're using JSON serialization for all our data, but we
> can't seem to find a
> good library. We just discovered that the root cause of out
> of memory errors
> is a leak in the net.sf.json library. Can anyone out there
> recommend a java
> json library that they have actually used successfully
> within Hadoop?


  


Re: Recommend JSON Library? net.sf.json has memory leak

2009-03-05 Thread Ken Weiner
I had discovered a memory leak in net.sf.json as well.  I filed an issue and
it got fixed in the latest release:
http://sourceforge.net/tracker/?func=detail&atid=857928&aid=2063201&group_id=171425

Have you tried the latest version 2.2.3?

On Thu, Mar 5, 2009 at 9:48 AM, Kevin Peterson  wrote:

> We're using JSON serialization for all our data, but we can't seem to find
> a
> good library. We just discovered that the root cause of out of memory
> errors
> is a leak in the net.sf.json library. Can anyone out there recommend a java
> json library that they have actually used successfully within Hadoop?
>


DataNode stops cleaning disk?

2009-03-05 Thread Igor Bolotin
Normally I dislike writing about problems without being able to provide
some more information, but unfortunately in this case I just can't find
anything.

 

Here is the situation - DFS cluster running Hadoop version 0.19.0. The
cluster is running on multiple servers with practically identical
hardware. Everything works perfectly well, except for one thing - from
time to time one of the data nodes (every time it's a different node)
starts to consume more and more disk space. The node keeps going and if
we don't do anything - it runs out of space completely (ignoring 20GB
reserved space settings). Once restarted - it cleans disk rapidly and
goes back to approximately the same utilization as the rest of data
nodes in the cluster.

 

Scanning datanodes and namenode logs and comparing thread dumps (stacks)
from nodes experiencing problem and those that run normally didn't
produce any clues. Running balancer tool didn't help at all. FSCK shows
that everything is healthy and number of over-replicated blocks is not
significant.

 

To me - it just looks like at some point the data node stops cleaning
invalidated/deleted blocks, but keeps reporting space consumed by these
blocks as "not used", but I'm not familiar enough with the internals and
just plain don't have enough free time to start digging deeper.

 

Does anyone have an idea what is wrong, or what else we can do to find out
what's wrong, or maybe where to start looking in the code?

 

Thanks,

Igor

 



Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread Arun C Murthy
I assume you have only 2 map and 2 reduce slots per tasktracker, which
totals to 2 maps/reduces for your cluster. This means with more
maps/reduces they are serialized to 2 at a time.


Also, the -m is only a hint to the JobTracker; you might see fewer or more
maps than the number you have specified on the command line.

The -r however is followed faithfully.
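
A quick sketch of where the two kinds of knobs live (0.18-era names; treat it
as an illustration rather than a recipe):

import org.apache.hadoop.mapred.JobConf;

public class SlotsVsTasks {
  public static void configure(JobConf job) {
    // Per-job: -m is only a hint (actual maps follow the input splits),
    // while -r is honored exactly.
    job.setNumMapTasks(8);
    job.setNumReduceTasks(8);

    // Per-node slot limits are read by each tasktracker from its own
    // hadoop-site.xml at startup; setting them on a job has no effect:
    //   mapred.tasktracker.map.tasks.maximum    = 4
    //   mapred.tasktracker.reduce.tasks.maximum = 4
  }
}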

Arun

On Mar 4, 2009, at 2:46 PM, Sandy wrote:


Hello all,

For the sake of benchmarking, I ran the standard hadoop wordcount  
example on

an input file using 2, 4, and 8 mappers and reducers for my job.
In other words,  I do:

time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
sample.txt output
time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
sample.txt output2
time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
sample.txt output3

Strangely enough, when this increase in mappers and reducers result in
slower running times!
-On 2 mappers and reducers it ran for 40 seconds
on 4 mappers and reducers it ran for 60 seconds
on 8 mappers and reducers it ran for 90 seconds!

Please note that the "sample.txt" file is identical in each of these  
runs.


I have the following questions:
- Shouldn't wordcount get -faster- with additional mappers and  
reducers,

instead of slower?
- If it does get faster for other people, why does it become slower  
for me?
 I am running hadoop on psuedo-distributed mode on a single 64-bit  
Mac Pro

with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs

I would greatly appreciate it if someone could explain this behavior  
to me,
and tell me if I'm running this wrong. How can I change my settings  
(if at
all) to get wordcount running faster when i increases that number of  
maps

and reduces?

Thanks,
-SM




Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread Sandy
Arun,

How can I check the number of slots per tasktracker? Which parameter
controls that?

Thanks,
-SM

On Thu, Mar 5, 2009 at 12:14 PM, Arun C Murthy  wrote:

> I assume you have only 2 map and 2 reduce slots per tasktracker - which
> totals to 2 maps/reduces for you cluster. This means with more maps/reduces
> they are serialized to 2 at a time.
>
> Also, the -m is only a hint to the JobTracker, you might see less/more than
> the number of maps you have specified on the command line.
> The -r however is followed faithfully.
>
> Arun
>
>
> On Mar 4, 2009, at 2:46 PM, Sandy wrote:
>
>  Hello all,
>>
>> For the sake of benchmarking, I ran the standard hadoop wordcount example
>> on
>> an input file using 2, 4, and 8 mappers and reducers for my job.
>> In other words,  I do:
>>
>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
>> sample.txt output
>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
>> sample.txt output2
>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
>> sample.txt output3
>>
>> Strangely enough, when this increase in mappers and reducers result in
>> slower running times!
>> -On 2 mappers and reducers it ran for 40 seconds
>> on 4 mappers and reducers it ran for 60 seconds
>> on 8 mappers and reducers it ran for 90 seconds!
>>
>> Please note that the "sample.txt" file is identical in each of these runs.
>>
>> I have the following questions:
>> - Shouldn't wordcount get -faster- with additional mappers and reducers,
>> instead of slower?
>> - If it does get faster for other people, why does it become slower for
>> me?
>>  I am running hadoop on psuedo-distributed mode on a single 64-bit Mac Pro
>> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
>>
>> I would greatly appreciate it if someone could explain this behavior to
>> me,
>> and tell me if I'm running this wrong. How can I change my settings (if at
>> all) to get wordcount running faster when i increases that number of maps
>> and reduces?
>>
>> Thanks,
>> -SM
>>
>
>


Re: Recommend JSON Library? net.sf.json has memory leak

2009-03-05 Thread Doug Cutting

Ian Swett wrote:

We've used Jackson(http://jackson.codehaus.org/), which we've found to be easy 
to use and faster than any other option.


I also use Jackson and recommend it.

Doug


Re: DataNode stops cleaning disk?

2009-03-05 Thread Raghu Angadi


This is unexpected unless some other process is eating up space.

Couple of things to collect next time (along with log):

 - All the contents under datanode-directory/ (especially including 
'tmp' and 'current')
 - Does 'du' of this directory match with what is reported to NameNode 
(shown on webui) by this DataNode.

 - Is there anything else taking disk space on the machine?

Raghu.

Igor Bolotin wrote:

Normally I dislike writing about problems without being able to provide
some more information, but unfortunately in this case I just can't find
anything.

 


Here is the situation - DFS cluster running Hadoop version 0.19.0. The
cluster is running on multiple servers with practically identical
hardware. Everything works perfectly well, except for one thing - from
time to time one of the data nodes (every time it's a different node)
starts to consume more and more disk space. The node keeps going and if
we don't do anything - it runs out of space completely (ignoring 20GB
reserved space settings). Once restarted - it cleans disk rapidly and
goes back to approximately the same utilization as the rest of data
nodes in the cluster.

 


Scanning datanodes and namenode logs and comparing thread dumps (stacks)
from nodes experiencing problem and those that run normally didn't
produce any clues. Running balancer tool didn't help at all. FSCK shows
that everything is healthy and number of over-replicated blocks is not
significant.

 


To me - it just looks like at some point the data node stops cleaning
invalidated/deleted blocks, but keeps reporting space consumed by these
blocks as "not used", but I'm not familiar enough with the internals and
just plain don't have enough free time to start digging deeper.

 


Anyone has an idea what is wrong or what else we can do to find out
what's wrong or maybe where to start looking in the code?

 


Thanks,

Igor

 







RE: DataNode stops cleaning disk?

2009-03-05 Thread Igor Bolotin
That's what I saw just yesterday on one of the data nodes with this
situation (will confirm also next time it happens):
- Tmp and current were either empty or almost empty last time I checked.
- du on the entire data directory matched exactly the used space reported
in the NameNode web UI, and it did report that it uses most of the
available disk space.
- nothing else was using disk space (actually - it's dedicated DFS
cluster).

Thank you for help!
Igor

-Original Message-
From: Raghu Angadi [mailto:rang...@yahoo-inc.com] 
Sent: Thursday, March 05, 2009 11:05 AM
To: core-user@hadoop.apache.org
Subject: Re: DataNode stops cleaning disk?


This is unexpected unless some other process is eating up space.

Couple of things to collect next time (along with log):

  - All the contents under datanode-directory/ (especially including 
'tmp' and 'current')
  - Does 'du' of this directory match with what is reported to NameNode 
(shown on webui) by this DataNode.
  - Is there anything else taking disk space on the machine?

Raghu.

Igor Bolotin wrote:
> Normally I dislike writing about problems without being able to
provide
> some more information, but unfortunately in this case I just can't
find
> anything.
> 
>  
> 
> Here is the situation - DFS cluster running Hadoop version 0.19.0. The
> cluster is running on multiple servers with practically identical
> hardware. Everything works perfectly well, except for one thing - from
> time to time one of the data nodes (every time it's a different node)
> starts to consume more and more disk space. The node keeps going and
if
> we don't do anything - it runs out of space completely (ignoring 20GB
> reserved space settings). Once restarted - it cleans disk rapidly and
> goes back to approximately the same utilization as the rest of data
> nodes in the cluster.
> 
>  
> 
> Scanning datanodes and namenode logs and comparing thread dumps
(stacks)
> from nodes experiencing problem and those that run normally didn't
> produce any clues. Running balancer tool didn't help at all. FSCK
shows
> that everything is healthy and number of over-replicated blocks is not
> significant.
> 
>  
> 
> To me - it just looks like at some point the data node stops cleaning
> invalidated/deleted blocks, but keeps reporting space consumed by
these
> blocks as "not used", but I'm not familiar enough with the internals
and
> just plain don't have enough free time to start digging deeper.
> 
>  
> 
> Anyone has an idea what is wrong or what else we can do to find out
> what's wrong or maybe where to start looking in the code?
> 
>  
> 
> Thanks,
> 
> Igor
> 
>  
> 
> 



Re: DataNode stops cleaning disk?

2009-03-05 Thread Raghu Angadi

Igor Bolotin wrote:

That's what I saw just yesterday on one of the data nodes with this
situation (will confirm also next time it happens):
- Tmp and current were either empty or almost empty last time I checked.
- du on the entire data directory matched exactly with reported used
space in NameNode web UI and it did report that it uses some most of the
available disk space. 
- nothing else was using disk space (actually - it's dedicated DFS

cluster).


If the 'du' command (which you can run in the shell) counts properly, then you 
should be able to see which files are taking space.


If 'du' can't, but 'df' reports very little space available, then it is 
possible (though I have never seen it) that the datanode is keeping a lot of 
these files open. 'ls -l /proc/<datanode pid>/fd' lists these files. If it is 
not the datanode, then check lsof to find out who is holding these files.


hope this helps.
Raghu.


Thank you for help!
Igor

-Original Message-
From: Raghu Angadi [mailto:rang...@yahoo-inc.com] 
Sent: Thursday, March 05, 2009 11:05 AM

To: core-user@hadoop.apache.org
Subject: Re: DataNode stops cleaning disk?


This is unexpected unless some other process is eating up space.

Couple of things to collect next time (along with log):

  - All the contents under datanode-directory/ (especially including 
'tmp' and 'current')
  - Does 'du' of this directory match with what is reported to NameNode 
(shown on webui) by this DataNode.

  - Is there anything else taking disk space on the machine?

Raghu.

Igor Bolotin wrote:

Normally I dislike writing about problems without being able to

provide

some more information, but unfortunately in this case I just can't

find

anything.

 


Here is the situation - DFS cluster running Hadoop version 0.19.0. The
cluster is running on multiple servers with practically identical
hardware. Everything works perfectly well, except for one thing - from
time to time one of the data nodes (every time it's a different node)
starts to consume more and more disk space. The node keeps going and

if

we don't do anything - it runs out of space completely (ignoring 20GB
reserved space settings). Once restarted - it cleans disk rapidly and
goes back to approximately the same utilization as the rest of data
nodes in the cluster.

 


Scanning datanodes and namenode logs and comparing thread dumps

(stacks)

from nodes experiencing problem and those that run normally didn't
produce any clues. Running balancer tool didn't help at all. FSCK

shows

that everything is healthy and number of over-replicated blocks is not
significant.

 


To me - it just looks like at some point the data node stops cleaning
invalidated/deleted blocks, but keeps reporting space consumed by

these

blocks as "not used", but I'm not familiar enough with the internals

and

just plain don't have enough free time to start digging deeper.

 


Anyone has an idea what is wrong or what else we can do to find out
what's wrong or maybe where to start looking in the code?

 


Thanks,

Igor

 









Live Datanodes only 1; all the time

2009-03-05 Thread Kumar, Amit H.
Hi All,

Very Interesting behavior:

http://machine2.xxx.xxx.xxx:50070/dfshealth.jsp shows that only one Live Node 
exists. Every time I refresh this page it shows a different node as alive, but 
the JobTracker shows there are 8 nodes in the cluster summary.

Any idea what could be going on here with the following detailed setup I am 
trying?

I am trying to configure hadoop as follows:

Cluster Setup: Version 0-18.3
1)  I want every user working on the Login Nodes of our cluster to have their 
own config dir. Hence I edited the following in $HADOOP_HOME/conf/hadoop-env.sh:
HADOOP_CONF_DIR=$HOME/hadoop/conf
Similarly, HADOOP_LOG_DIR=$HOME/hadoop/logs
Note: $HADOOP_HOME is a shared NFS Hadoop install folder on the cluster head 
node. There are three Login Nodes for our cluster, excluding the head node. The 
head node is inaccessible to users.

2)  Every user will have his own 'masters' and 'slaves' files under their 
$HADOOP_CONF_DIR.
a.  When I had this setup and removed the masters file from $HADOOP_HOME, it 
complained that it could not start the SecondaryNameNode. Hence I replaced the 
'masters' file with an entry for our Login Node. This worked, and the 
SecondaryNameNode now starts without any error.

3)  As a user, I chose one of the Login Nodes as the entry in my 
$HOME/hadoop/conf/masters file. The 'slaves' file includes a few compute nodes.
4)  I don't see any errors when I start the Hadoop daemons using 
start-dfs.sh and start-mapred.sh.
5)  Only when I try to put files onto HDFS with 'bin/hadoop fs -put conf input' 
does it complain, as shown below in the snip section.

NOTE: "grep ERROR *" in logs directory had no results.

Do any of the error messages below ring a bell? Please help me understand 
what I could be doing wrong.

Thank you,
Amit





[ahku...@machine2 ~/hadoop]$ $hbin/hadoop fs -put conf input
09/03/05 15:20:26 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: 
java.io.IOException: File /user/ahkumar/input/hadoop-metrics.properties could 
only be replicated to 0 nodes, instead of 1
at 
org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1123)
at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)

at org.apache.hadoop.ipc.Client.call(Client.java:716)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2450)
at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2333)
at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1745)
at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1922)

09/03/05 15:20:26 WARN dfs.DFSClient: NotReplicatedYetException sleeping 
/user/ahkumar/input/hadoop-metrics.properties retries left 4
09/03/05 15:20:26 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: 
java.io.IOException: File /user/ahkumar/input/hadoop-metrics.properties could 
only be replicated to 0 nodes, instead of 1
<... same as above>

09/03/05 15:20:26 WARN dfs.DFSClient: NotReplicatedYetException sleeping 
/user/ahkumar/input/hadoop-metrics.properties retries left 3
09/03/05 15:20:27 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: 
java.io.IOException: File /user/ahkumar/input/hadoop-metrics.properties could 
only be replicated to 0 nodes, instead of 1
<... same as above>

09/03/05 15:20:27 WARN dfs.DFSClient: NotReplicatedYetException sleeping 
/user/ahkumar/input/hadoop-metrics.properties retries left 2
09/03/05 15:20:29 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: 
java.io.IOException: File /user/ahkumar/input/hadoop-metrics.properties could 
only be replicated to 0 nodes, instead of 1
<... same as above>

09/03/05 15:20:29 WARN dfs.DFSClient: NotReplicatedYetException sleeping 
/use

Avoiding Ganglia NPE on EC2

2009-03-05 Thread Stuart Sierra
Hi all,

I'm getting this NPE on Hadoop 0.18.3, using the EC2 contrib scripts:

Exception in thread "Timer thread for monitoring dfs"
java.lang.NullPointerException
at 
org.apache.hadoop.metrics.ganglia.GangliaContext.xdr_string(GangliaContext.java:195)

This is reported as: https://issues.apache.org/jira/browse/HADOOP-4137

What's the easiest workaround?  Switch to another Hadoop version
(which one)?  Or disable Ganglia entirely (how)?

Thanks,
-Stuart


Re: Avoiding Ganglia NPE on EC2

2009-03-05 Thread Mark Kerzner
News from the ScaleUnlimited bootcamp (where I am now): use hadoop-0.17.2.1

On Thu, Mar 5, 2009 at 3:53 PM, Stuart Sierra  wrote:

> Hi all,
>
> I'm getting this NPE on Hadoop 0.18.3, using the EC2 contrib scripts:
>
>Exception in thread "Timer thread for monitoring dfs"
> java.lang.NullPointerException
>at
> org.apache.hadoop.metrics.ganglia.GangliaContext.xdr_string(GangliaContext.java:195)
>
> This is reported as: https://issues.apache.org/jira/browse/HADOOP-4137
>
> What's the easiest workaround?  Switch to another Hadoop version
> (which one)?  Or disable Ganglia entirely (how)?
>
> Thanks,
> -Stuart
>


Batch processing map reduce jobs

2009-03-05 Thread Richa Khandelwal
Hi All,
Does anyone know how to run MapReduce jobs using pipes, or how to batch-process
MapReduce jobs?

Thanks,
Richa Khandelwal


University Of California,
Santa Cruz.
Ph:425-241-7763


Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread Sandy
I was trying to control the maximum number of tasks per tasktracker by using
the
mapred.tasktracker.tasks.maximum parameter

I am interpreting your comment to mean that maybe this parameter is
malformed and should read:
mapred.tasktracker.map.tasks.maximum = 8
mapred.tasktracker.reduce.tasks.maximum = 8

I did that, and reran on a 428MB input, and got the same results as before.
I also ran it on a 3.3G dataset, and got the same pattern.

I am still trying to run it on a 20 GB input. This should confirm if the
filesystem cache thing is true.
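
For reference, here is a minimal 0.18-style driver sketch (the class name, paths
and values are placeholders, not taken from this thread) showing how the -m / -r
flags correspond to JobConf calls, and why the per-tasktracker slot limits are a
separate knob:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Sketch only: no mapper/reducer classes are set, so this particular job would
// just run the identity mapper and reducer over text input.
public class ParallelismSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ParallelismSketch.class);
    conf.setJobName("parallelism-sketch");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setNumMapTasks(8);     // equivalent of -m 8: only a hint, the actual map
                                // count follows the number of input splits
    conf.setNumReduceTasks(8);  // equivalent of -r 8: honored as given

    // The number of concurrent *slots* per tasktracker is a different,
    // tasktracker-side setting: mapred.tasktracker.map.tasks.maximum and
    // mapred.tasktracker.reduce.tasks.maximum (default 2 each). Those are read
    // from the tasktracker's hadoop-site.xml at startup, so changing them means
    // restarting the tasktrackers; putting them in the job conf has no effect.
    JobClient.runJob(conf);
  }
}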

-SM

On Thu, Mar 5, 2009 at 12:22 PM, Sandy  wrote:

> Arun,
>
> How can I check the number of slots per tasktracker? Which parameter
> controls that?
>
> Thanks,
> -SM
>
>
> On Thu, Mar 5, 2009 at 12:14 PM, Arun C Murthy  wrote:
>
>> I assume you have only 2 map and 2 reduce slots per tasktracker - which
>> totals to 2 maps/reduces for you cluster. This means with more maps/reduces
>> they are serialized to 2 at a time.
>>
>> Also, the -m is only a hint to the JobTracker, you might see less/more
>> than the number of maps you have specified on the command line.
>> The -r however is followed faithfully.
>>
>> Arun
>>
>>
>> On Mar 4, 2009, at 2:46 PM, Sandy wrote:
>>
>>  Hello all,
>>>
>>> For the sake of benchmarking, I ran the standard hadoop wordcount example
>>> on
>>> an input file using 2, 4, and 8 mappers and reducers for my job.
>>> In other words,  I do:
>>>
>>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
>>> sample.txt output
>>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
>>> sample.txt output2
>>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
>>> sample.txt output3
>>>
>>> Strangely enough, when this increase in mappers and reducers result in
>>> slower running times!
>>> -On 2 mappers and reducers it ran for 40 seconds
>>> on 4 mappers and reducers it ran for 60 seconds
>>> on 8 mappers and reducers it ran for 90 seconds!
>>>
>>> Please note that the "sample.txt" file is identical in each of these
>>> runs.
>>>
>>> I have the following questions:
>>> - Shouldn't wordcount get -faster- with additional mappers and reducers,
>>> instead of slower?
>>> - If it does get faster for other people, why does it become slower for
>>> me?
>>>  I am running hadoop on psuedo-distributed mode on a single 64-bit Mac
>>> Pro
>>> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
>>>
>>> I would greatly appreciate it if someone could explain this behavior to
>>> me,
>>> and tell me if I'm running this wrong. How can I change my settings (if
>>> at
>>> all) to get wordcount running faster when i increases that number of maps
>>> and reduces?
>>>
>>> Thanks,
>>> -SM
>>>
>>
>>
>


Mapreduce jobconf options:webpage

2009-03-05 Thread Saptarshi Guha
Hello,
I came across a page, I think on the Hadoop website, listing all the
MapReduce options. Does anyone have a link?

Regards

Saptarshi Guha


Re: Mapreduce jobconf options:webpage

2009-03-05 Thread james warren
Are you referring to

http://hadoop.apache.org/core/docs/current/hadoop-default.html

?  The default settings are also available in the conf/ directory of your
hadoop installation.

cheers,
-jw

On Thu, Mar 5, 2009 at 3:51 PM, Saptarshi Guha wrote:

> Hello,
> I came across a page, i think, on the hadoop website, listing all the
> mapreduce options. Does anyone have a link?
>
> Regards
>
> Saptarshi Guha
>


Re: Mapreduce jobconf options:webpage

2009-03-05 Thread Saptarshi Guha
Thank you
Saptarshi Guha



On Thu, Mar 5, 2009 at 6:56 PM, james warren  wrote:
> Are you referring to
>
> http://hadoop.apache.org/core/docs/current/hadoop-default.html
>
> ?  The default settings are also available in the conf/ directory of your
> hadoop installation.
>
> cheers,
> -jw
>
> On Thu, Mar 5, 2009 at 3:51 PM, Saptarshi Guha 
> wrote:
>
>> Hello,
>> I came across a page, i think, on the hadoop website, listing all the
>> mapreduce options. Does anyone have a link?
>>
>> Regards
>>
>> Saptarshi Guha
>>
>


Throw an exception if the configure method fails

2009-03-05 Thread Saptarshi Guha
Hello,
I'm not that comfortable with Java, so here is my question. In my classes
that extend MapReduceBase [1], I have implemented the configure method, which
does not throw an exception. Suppose I detect an error in some
options and I wish to raise an exception (in the configure method) - is
there a way to do that? Is there a way to stop the job in case the
configure method fails?
Saptarshi Guha

[1] My map extends MapReduceBase and my reduce extends MapReduceBase -
two separate classes.

Thank you


Throwing an IOException in Map, yet task does not fail

2009-03-05 Thread Saptarshi Guha
Hello,
I have set up a case where my mapper should fail. That is, based on a
result it throws an exception:
if (res == 0) throw new IOException("Error in code!, see stderr/out");

When I go to the JobTracker website, e.g.
http://tracker.com:50030/jobdetails.jsp?jobid=job_200903051709_0024&refresh=30
and click on one of the running tasks, I see an IOException in the
errors column.
But on the JobTracker page for the job, it doesn't fail - it stays in
the running column, never moving to the failed/killed columns (not
even after 10 minutes).

Why so?
Regards


Saptarshi Guha


Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread haizhou zhao
As I mentioned above, you should at least try something like this:
map2 reduce1
map4 reduce1
map8 reduce1

map4 reduce1
map4 reduce2
map4 reduce4

instead of :
map2 reduce2
map4 reduce4
map8 reduce8

2009/3/6 Sandy 

> I was trying to control the maximum number of tasks per tasktracker by
> using
> the
> mapred.tasktracker.tasks.maximum parameter
>
> I am interpreting your comment to mean that maybe this parameter is
> malformed and should read:
> mapred.tasktracker.map.tasks.maximum = 8
> mapred.tasktracker.reduce.tasks.maximum = 8
>
> I did that, and reran on a 428MB input, and got the same results as before.
> I also ran it on a 3.3G dataset, and got the same pattern.
>
> I am still trying to run it on a 20 GB input. This should confirm if the
> filesystem cache thing is true.
>
> -SM
>
> On Thu, Mar 5, 2009 at 12:22 PM, Sandy  wrote:
>
> > Arun,
> >
> > How can I check the number of slots per tasktracker? Which parameter
> > controls that?
> >
> > Thanks,
> > -SM
> >
> >
> > On Thu, Mar 5, 2009 at 12:14 PM, Arun C Murthy 
> wrote:
> >
> >> I assume you have only 2 map and 2 reduce slots per tasktracker - which
> >> totals to 2 maps/reduces for you cluster. This means with more
> maps/reduces
> >> they are serialized to 2 at a time.
> >>
> >> Also, the -m is only a hint to the JobTracker, you might see less/more
> >> than the number of maps you have specified on the command line.
> >> The -r however is followed faithfully.
> >>
> >> Arun
> >>
> >>
> >> On Mar 4, 2009, at 2:46 PM, Sandy wrote:
> >>
> >>  Hello all,
> >>>
> >>> For the sake of benchmarking, I ran the standard hadoop wordcount
> example
> >>> on
> >>> an input file using 2, 4, and 8 mappers and reducers for my job.
> >>> In other words,  I do:
> >>>
> >>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
> >>> sample.txt output
> >>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
> >>> sample.txt output2
> >>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
> >>> sample.txt output3
> >>>
> >>> Strangely enough, when this increase in mappers and reducers result in
> >>> slower running times!
> >>> -On 2 mappers and reducers it ran for 40 seconds
> >>> on 4 mappers and reducers it ran for 60 seconds
> >>> on 8 mappers and reducers it ran for 90 seconds!
> >>>
> >>> Please note that the "sample.txt" file is identical in each of these
> >>> runs.
> >>>
> >>> I have the following questions:
> >>> - Shouldn't wordcount get -faster- with additional mappers and
> reducers,
> >>> instead of slower?
> >>> - If it does get faster for other people, why does it become slower for
> >>> me?
> >>>  I am running hadoop on psuedo-distributed mode on a single 64-bit Mac
> >>> Pro
> >>> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs
> >>>
> >>> I would greatly appreciate it if someone could explain this behavior to
> >>> me,
> >>> and tell me if I'm running this wrong. How can I change my settings (if
> >>> at
> >>> all) to get wordcount running faster when i increases that number of
> maps
> >>> and reduces?
> >>>
> >>> Thanks,
> >>> -SM
> >>>
> >>
> >>
> >
>


The cpu preemption between MPI and Hadoop programs on Same Cluster

2009-03-05 Thread 柳松
Dear all:
    I run my Hadoop program alongside another MPI program on the same cluster. Here 
is the result of "top". 
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11750 qianglv   25   0  233m  99m 6100 R 99.7  2.5 116:05.59 rosetta.mpich  
18094 cip   17   0 3136m  68m  15m S  0.5  1.7   0:12.69 java   
18244 cip   17   0 3142m  80m  15m S  0.2  2.0   0:17.61 java   
18367 cip   18   0 2169m  88m  15m S  0.1  2.3   0:17.46 java   
18012 cip   18   0 3141m  77m  15m S  0.1  2.0   0:14.49 java   
18584 cip   21   0 m  46m  15m S  0.1  1.2   0:05.12 java  

My Hadoop program gets no more than 1 percent of the CPU time in total, 
compared with the rosetta.mpich program's 99.7%.

I'm sure my program is making progress, since the log files tell me the tasks are 
running normally.

Someone told me this is the nature of Java programs: low CPU priority, especially 
compared with C programs.

Is that true?

Regards
Song Liu in Suzhou University.  

Re: Throwing an IOException in Map, yet task does not fail

2009-03-05 Thread Jothi Padmanabhan
Just trying to understand this better, are you observing that the task,
which failed with the IOException, is not getting marked as killed? If yes,
that does not look right...
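
For a plain (non-streaming) Java job, the expectation would be that each failed
attempt of that map task is retried up to mapred.map.max.attempts times (4 by
default), and only then is the task, and with it the job, marked failed. A rough
sketch of such a mapper (the class and names here are made up for illustration):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch: a mapper that fails on a "bad" record. After mapred.map.max.attempts
// failed attempts of the same task, the whole job should be marked failed.
public class FailingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    if (value.toString().length() == 0) {          // stand-in for "res == 0"
      throw new IOException("Error in code!, see stderr/out");
    }
    output.collect(value, new Text(""));
  }
}

// In the driver, the retry budget can be tightened to surface the failure sooner:
//   JobConf conf = new JobConf(FailingMapper.class);
//   conf.setMaxMapAttempts(2);   // default is 4 (mapred.map.max.attempts)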

Jothi 

On 3/6/09 8:12 AM, "Saptarshi Guha"  wrote:

> Hello,
> I have given a case where my mapper should fail. That is, based on a
> result it throws an exception
> if(res==0) throw new IOException("Error in code!, see stderr/out");
> ,
> When i go to the JobTracker website, e.g
> http://tracker.com:50030/jobdetails.jsp?jobid=job_200903051709_0024&refresh=30
> and click on one of the running tasks, I see an IOException in the
> errors column.
> But on the jobtracker page for the job, it doesn't fail - it stays in
> the running column , never moving to the failed/killed columns (not
> even after 10 minutes)
> 
> Why so?
> Regards
> 
> 
> Saptarshi Guha



Re: Throwing an IOException in Map, yet task does not fail

2009-03-05 Thread Amareshwari Sriramadasu

Is your job a streaming job?
If so, which version of Hadoop are you using? What is the configured 
value for stream.non.zero.exit.is.failure? Can you set 
stream.non.zero.exit.is.failure to true and try again?

Thanks
Amareshwari
Saptarshi Guha wrote:

Hello,
I have given a case where my mapper should fail. That is, based on a
result it throws an exception
if(res==0) throw new IOException("Error in code!, see stderr/out");
,
When i go to the JobTracker website, e.g
http://tracker.com:50030/jobdetails.jsp?jobid=job_200903051709_0024&refresh=30
and click on one of the running tasks, I see an IOException in the
errors column.
But on the jobtracker page for the job, it doesn't fail - it stays in
the running column , never moving to the failed/killed columns (not
even after 10 minutes)

Why so?
Regards


Saptarshi Guha
  




Re: Throwing an IOException in Map, yet task does not fail

2009-03-05 Thread Jothi Padmanabhan
I meant, "not marked as failed" ...


On 3/6/09 10:37 AM, "Jothi Padmanabhan"  wrote:

> Just trying to understand this better, are you observing that the task, which
> failed with the IOException, is not getting marked as killed? If yes, that does
> not look right...
> 
> Jothi 
> 
> On 3/6/09 8:12 AM, "Saptarshi Guha"  wrote:
> 
>> Hello,
>> I have given a case where my mapper should fail. That is, based on a
>> result it throws an exception
>> if(res==0) throw new IOException("Error in code!, see stderr/out");
>> ,
>> When i go to the JobTracker website, e.g
>> http://tracker.com:50030/jobdetails.jsp?jobid=job_200903051709_0024&refresh=30
>> and click on one of the running tasks, I see an IOException in the
>> errors column.
>> But on the jobtracker page for the job, it doesn't fail - it stays in
>> the running column , never moving to the failed/killed columns (not
>> even after 10 minutes)
>> 
>> Why so?
>> Regards
>> 
>> 
>> Saptarshi Guha



Re: Running 0.19.2 branch in production before release

2009-03-05 Thread Aaron Kimball
Right, there's no sense in freezing your Hadoop version forever :)

But if you're an ops team tasked with keeping a production cluster running
24/7, running on 0.19 (or even more daringly, TRUNK) is not something that I
would consider a Best Practice. Ideally you'll be able to carve out some
spare capacity (maybe 3--5 nodes) to use as a staging cluster that runs on
0.19 or TRUNK that you can use to evaluate the next version. Then when you
are convinced that it's stable, and your staging cluster passes your
internal tests (e.g., running test versions of your critical nightly jobs
successfully), you can move that to production.

- Aaron


On Thu, Mar 5, 2009 at 2:33 AM, Steve Loughran  wrote:

> Aaron Kimball wrote:
>
>> I recommend 0.18.3 for production use and avoid the 19 branch entirely. If
>> your priority is stability, then stay a full minor version behind, not
>> just
>> a revision.
>>
>
> Of course, if everyone stays that far behind, they don't get to find the
> bugs for other people.
>
> * If you play with the latest releases early, while they are in the beta
> phase -you will encounter the problems specific to your
> applications/datacentres, and get them fixed fast.
>
> * If you work with stuff further back you get stability, but not only are
> you behind on features, you can't be sure that all "fixes" that matter to
> you get pushed back.
>
> * If you plan on making changes, of adding features, get onto SVN_HEAD
>
> * If you want to catch changes being made that break your site, SVN_HEAD.
> Better yet, have a private Hudson server checking out SVN_HEAD hadoop *then*
> building and testing your app against it.
>
> Normally I work with stable releases of things I dont depend on, and
> SVN_HEAD of OSS stuff whose code I have any intent to change; there is a
> price -merge time, the odd change breaking your code- but you get to make
> changes that help you long term.
>
> Where Hadoop is different is that it is a filesystem, and you don't want to
> hit bugs that delete files that matter. I'm only bringing up transient
> clusters on VMs, pulling in data from elsewhere, so this isn't an issue. All
> that remains is changing APIs.
>
> -Steve
>


Re: Repartitioned Joins

2009-03-05 Thread Aaron Kimball
Richa,

Since the mappers run independently, you'd have a hard time
determining whether a record in mapper A would be joined by a record
in mapper B. The solution, as it were, would be to do this in two
separate MapReduce passes:

* Take an educated guess at which table is the smaller data set.
* Run a MapReduce over this dataset, building up a bloom filter for
the record ids. Set entries in the filter to 1 for each record id you
see; leave the rest as 0.
* The bloom filter now has 1 meaning "maybe joinable" and 0 meaning
"definitely not joinable."
* Run a second MapReduce job over both datasets. Use the distributed
cache to send the filter to all mappers. Mappers emit all records
where filter[hash(record_id)] == 1 (see the sketch below).
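
To make the filter part concrete, here is a rough, self-contained sketch of such
a structure (deliberately simplified plain Java, not any particular Hadoop class;
the sizing and hashing choices are purely illustrative):

import java.util.BitSet;

// A deliberately simple Bloom filter sketch to make the idea concrete.
public class SimpleBloomFilter {
  private final BitSet bits;
  private final int size;
  private final int numHashes;

  public SimpleBloomFilter(int size, int numHashes) {
    this.bits = new BitSet(size);
    this.size = size;
    this.numHashes = numHashes;
  }

  // Derive k hash values from the record id; any decent mixing works here.
  private int hash(String key, int i) {
    int h = key.hashCode() * (i + 31) + i * 0x9E3779B9;
    return Math.abs(h % size);
  }

  public void add(String recordId) {             // first pass: mark ids we saw
    for (int i = 0; i < numHashes; i++) {
      bits.set(hash(recordId, i));
    }
  }

  public boolean mightContain(String recordId) { // second pass: "maybe joinable"
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(hash(recordId, i))) {
        return false;                            // definitely not joinable
      }
    }
    return true;                                 // possibly joinable (false positives allowed)
  }
}

The first pass would call add() for each record id of the smaller table and write
the serialized bit set out to HDFS; the second pass would load it via the
distributed cache and call mightContain() before emitting a record, accepting a
small rate of false positives.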

- Aaron

On Wed, Mar 4, 2009 at 11:18 AM, Richa Khandelwal  wrote:
> Hi All,
> Does anyone know of tweaking in map-reduce joins that will optimize it
> further in terms of the moving only those tuples to reduce phase that join
> in the two tables? There are replicated joins and semi-join strategies but
> they are more of databases than map-reduce.
>
> Thanks,
> Richa Khandelwal
> University Of California,
> Santa Cruz.
> Ph:425-241-7763
>


Re: Throw an exception if the configure method fails

2009-03-05 Thread Aaron Kimball
Try throwing RuntimeException, or any other unchecked exception (e.g., any
descendant classes of RuntimeException)
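
For example, a minimal sketch (the property name "my.required.option" is made up
for illustration) of a configure() that validates an option and aborts by
throwing an unchecked exception:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String option;

  @Override
  public void configure(JobConf job) {
    option = job.get("my.required.option");
    if (option == null || option.length() == 0) {
      // configure() cannot throw a checked exception, but an unchecked one
      // propagates out of the child task and fails this task attempt.
      throw new IllegalArgumentException("my.required.option is not set");
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    output.collect(new Text(option), value);
  }
}

The attempt is then marked failed, and after the configured number of retries the
job itself fails.
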
- Aaron

On Thu, Mar 5, 2009 at 4:24 PM, Saptarshi Guha wrote:

> hello,
> I'm not that comfortable with java, so here is my question. In the
> MapReduceBase class, i have implemented the configure method, which
> does not throw an exception. Suppose I detect an error in some
> options, i wish to raise an exception(in the configure method) - is
> there a way to do that? Is there a way to stop the job in case the
> configure method fails?,
> Saptarshi Guha
>
> [1] My map extends MapReduceBase and my reduce extends MapReduceBase -
> two separate classes.
>
> Thank you
>


Re: The cpu preemption between MPI and Hadoop programs on Same Cluster

2009-03-05 Thread Aaron Kimball
Song, you should be able to use 'nice' to reprioritize the MPI task
below that of your Hadoop jobs.
- Aaron

On Thu, Mar 5, 2009 at 8:26 PM, 柳松  wrote:
>
> Dear all:
> I run my hadoop program with another MPI program on the same cluster. 
> here is the result of "top".
>  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
> 11750 qianglv   25   0  233m  99m 6100 R 99.7  2.5 116:05.59 rosetta.mpich
> 18094 cip   17   0 3136m  68m  15m S  0.5  1.7   0:12.69 java
> 18244 cip   17   0 3142m  80m  15m S  0.2  2.0   0:17.61 java
> 18367 cip   18   0 2169m  88m  15m S  0.1  2.3   0:17.46 java
> 18012 cip   18   0 3141m  77m  15m S  0.1  2.0   0:14.49 java
> 18584 cip   21   0 m  46m  15m S  0.1  1.2   0:05.12 java
>
> My Hadoop program can only get no more than 1 percent cpu time slide in 
> total, compared with the rosetta.mpich program's 99.7%.
>
> I'm sure my program is in progress since the log files told me, they are 
> running normally.
>
> Someone told me, it's the nature of Java program, low cpu priority, 
> especially compared with C program.
>
> Is that true?
>
> Regards
> Song Liu in Suzhou University.


Fetch errors. 2 node cluster.

2009-03-05 Thread pavelkolodin


Hello to all.
I have 2 nodes in cluster - master + slave.
names "master1" and "slave1" stored in /etc/hosts on both hosts and they  
are 100% correct.


conf/masters:
master1

conf/slaves:
master1
slave1

"conf/slaves" + "conf/masters" are empty on "slave1" node. I tried to fill  
them in many ways - it didn't helped.


"master1" is AMD-64, "slave1" is Xeon-32.
I have compiled a C++ wordcount-simple binary on the 32-bit machine and put  
it on HDFS.

The binary successfully runs on both machines.

I have 5 files in "/input" on HDFS:
i1.txt - 2 MB
i2.txt - 2 MB
i3.txt - 2 MB
i4.txt ~ 50 MB
i5.txt ~ 50 MB

I have tried 0.18.3, 0.19.1, the "trunk" svn dir, and the "branch-0.20" svn dir.  
The result is the same...


running job on "master1":
localhost$> bin/hadoop pipes -conf src/examples/pipes/conf/word.xml -input  
/input -output /o1


word.xml: http://pastebin.com/m25577ea4
conf/hadoop-default.xml: http://pastebin.com/m199c08f0
conf/hadoop-site.xml: http://pastebin.com/m321ead97
conf/hadoop-env.sh: http://pastebin.com/m41c36f2f



Console output on "master1" contains WARN messages about fetching errors:

09/03/06 09:44:23 WARN mapred.JobClient: Error reading task  
outputhttp://localhost:50060/tasklog?plaintext=true&taskid=attempt_200903060939_0001_m_00_0&filter=stdout


[master1] logs/hadoop-hadoop-tasktracker-localhost.log contains this many  
times:


2009-03-06 09:41:51,178 WARN org.apache.hadoop.mapred.TaskTracker:  
getMapOutput(attempt_200903060939_0001_m_00_0,1) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find  
taskTracker/jobcache/job_200903060939_0001/attempt_200903060939_0001_m_00_0/output/file.out.index  
in any of the configured local directories
2009-03-06 09:41:51,179 WARN org.apache.hadoop.mapred.TaskTracker: Unknown  
child with bad map output: attempt_200903060939_0001_m_00_0. Ignored.
2009-03-06 09:41:51,224 INFO  
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 127.0.0.1:50060,  
dest: 127.0.0.1:53917, bytes: 0, op: MAPRED_SHUFFLE, cliID:  
attempt_200903060939_0001_m_00_0
2009-03-06 09:41:51,224 WARN org.mortbay.log: /mapOutput:  
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find  
taskTracker/jobcache/job_200903060939_0001/attempt_200903060939_0001_m_00_0/output/file.out.index  
in any of the configured local directories


[slave1] logs/hadoop-hadoop-tasktracker-srv.log contains this:

...
2009-03-06 09:40:50,094 INFO org.apache.hadoop.mapred.TaskTracker:  
attempt_200903060939_0001_m_00_0 0.61383915%  
hdfs://master1:9000/inputi4.txt:0+5336
2009-03-06 09:40:50,188 INFO org.apache.hadoop.mapred.TaskTracker:  
attempt_200903060939_0001_m_01_0 0.59977823%  
hdfs://master1:9000/inputi5.txt:0+5336
2009-03-06 09:40:53,097 INFO org.apache.hadoop.mapred.TaskTracker:  
attempt_200903060939_0001_m_00_0 0.66882175%  
hdfs://master1:9000/inputi4.txt:0+5336
2009-03-06 09:40:53,191 INFO org.apache.hadoop.mapred.TaskTracker:  
attempt_200903060939_0001_m_01_0 0.64430434%  
hdfs://master1:9000/inputi5.txt:0+5336
2009-03-06 09:40:56,100 INFO org.apache.hadoop.mapred.TaskTracker:  
attempt_200903060939_0001_m_00_0 0.7192957%  
hdfs://master1:9000/inputi4.txt:0+5336
2009-03-06 09:40:56,194 INFO org.apache.hadoop.mapred.TaskTracker:  
attempt_200903060939_0001_m_01_0 0.68883044%  
hdfs://master1:9000/inputi5.txt:0+5336
2009-03-06 09:40:59,103 INFO org.apache.hadoop.mapred.TaskTracker:  
attempt_200903060939_0001_m_00_0 0.7661652%  
hdfs://master1:9000/inputi4.txt:0+5336
2009-03-06 09:40:59,212 INFO org.apache.hadoop.mapred.TaskTracker:  
attempt_200903060939_0001_m_01_0 0.7263261%  
hdfs://master1:9000/inputi5.txt:0+5336
2009-03-06 09:41:02,106 INFO org.apache.hadoop.mapred.TaskTracker:  
attempt_200903060939_0001_m_00_0 0.80600435%  
hdfs://master1:9000/inputi4.txt:0+5336
2009-03-06 09:41:02,271 INFO org.apache.hadoop.mapred.TaskTracker:  
attempt_200903060939_0001_m_01_0 0.7802261%  
hdfs://master1:9000/inputi5.txt:0+5336

...

I have read some mailing lists and saw discussions about nodes needing to be  
able to open network connections to each other,
but I can't see where my error is... iptables is empty, and I can ssh  
from master to slave and from slave to master... I also checked  
TCP connections from one to the other on ports 9000, 9001 and others (by  
running "nc")...


Just another description of this problem:
http://dramele.livejournal.com/101634.html

Pavel.


Re: Reduce doesn't start until map finishes

2009-03-05 Thread Rasit OZDAS
So, is there currently no solution to my problem?
Should I live with it? Or do we have to have a JIRA for this?
What do you think?


2009/3/4 Nick Cen 

> Thanks, about the "Secondary Sort", can you provide some example. What does
> the intermediate keys stands for?
>
> Assume I have two mapper, m1 and m2. The output of m1 is (k1,v1),(k2,v2)
> and
> the output of m2 is (k1,v3),(k2,v4). Assume k1 and k2 belongs to the same
> partition and k1 < k2, so i think the order inside reducer maybe:
> (k1,v1)
> (k1,v3)
> (k2,v2)
> (k2,v4)
>
> can the Secondary Sort change this order?
>
>
>
> 2009/3/4 Chris Douglas 
>
> > The output of each map is sorted by partition and by key within that
> > partition. The reduce merges sorted map output assigned to its partition
> > into the reduce. The following may be helpful:
> >
> > http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
> >
> > If your job requires total order, consider
> > o.a.h.mapred.lib.TotalOrderPartitioner. -C
> >
> >
> > On Mar 3, 2009, at 7:24 PM, Nick Cen wrote:
> >
> >  can you provide more info about sortint? The sort is happend on the
> whole
> >> data set, or just on the specified partion?
> >>
> >> 2009/3/4 Mikhail Yakshin 
> >>
> >>  On Wed, Mar 4, 2009 at 2:09 AM, Chris Douglas wrote:
> >>>
>  This is normal behavior. The Reducer is guaranteed to receive all the
>  results for its partition in sorted order. No reduce can start until
> all
> 
> >>> the
> >>>
>  maps are completed, since any running map could emit a result that
> would
>  violate the order for the results it currently has. -C
> 
> >>>
> >>> _Reducers_ usually start almost immediately and start downloading data
> >>> emitted by mappers as they go. This is their first phase. Their second
> >>> phase can start only after completion of all mappers. In their second
> >>> phase, they're sorting received data, and in their third phase they're
> >>> doing real reduction.
> >>>
> >>> --
> >>> WBR, Mikhail Yakshin
> >>>
> >>>
> >>
> >>
> >> --
> >> http://daily.appspot.com/food/
> >>
> >
> >
>
>
> --
> http://daily.appspot.com/food/
>



-- 
M. Raşit ÖZDAŞ