Re: Spilled Records

2011-02-21 Thread maha
Thank you Saurabh, but the following settings didn't change the # of spilled records:

conf.set("mapred.job.shuffle.merge.percent", ".9");//instead of .66
conf.set("mapred.inmem.merge.threshold", "1000");// instead of 1000

Is it because my memory is 4GB?

I'm using the pseudo distributed mode. 

Thank you,
Maha

On Feb 21, 2011, at 7:46 PM, Saurabh Dutta wrote:

> Hi Maha,
> 
> Spilled records have to do with the transient data written during the map and 
> reduce operations. Note that it's not just the map operations that generate 
> the spilled records. When the in-memory buffer (controlled by 
> mapred.job.shuffle.merge.percent) runs out or reaches the threshold number of 
> map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk.
> 
> You are going in the right direction by tuning the io.sort.mb parameter; 
> try increasing it further. If it still doesn't work out, try io.sort.factor 
> and fs.inmemory.size.mb. Also, try the other two variables that I 
> mentioned earlier.
> 
> Let us know what worked for you.
> 
> Sincerely,
> Saurabh Dutta
> Impetus Infotech India Pvt. Ltd.,
> Sarda House, 24-B, Palasia, A.B.Road, Indore - 452 001
> Phone: +91-731-4269200 4623
> Fax: + 91-731-4071256
> Email: saurabh.du...@impetus.co.in
> www.impetus.com
> 
> From: maha [m...@umail.ucsb.edu]
> Sent: Tuesday, February 22, 2011 8:21 AM
> To: common-user
> Subject: Spilled Records
> 
> Hello everyone,
> 
> Do spilled records mean that the sort buffer is not large enough 
> to sort all the input records, so some records are written to local disk?
> 
> If so, I tried setting io.sort.mb from the default of 100 to 200 and there 
> was still the same # of spilled records. Why?
> 
> Might changing io.sort.record.percent to .9 instead of .8 produce 
> unexpected exceptions?
> 
> 
> Thank you,
> Maha
> 
> 
> 



Re: multiple hadoop instances on same cluster

2011-02-21 Thread Konstantin Boudnik
Make sure the instances' ports aren't conflicting and all directories
(NN, JT, etc.) are unique. That should do it.
--
  Take care,
Konstantin (Cos) Boudnik
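
For illustration, a minimal sketch of the kind of per-instance overrides Konstantin
describes, written in the conf.set() style used elsewhere in this digest. In practice
these would live in the second instance's conf/*-site.xml files; every hostname, port,
and path below is made up:

import org.apache.hadoop.conf.Configuration;

// Hypothetical settings for a second, independent instance (0.20-era names).
// Every RPC/HTTP port and every on-disk directory must differ from the first
// instance's values, or the daemons will collide.
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://master:9100");            // first instance might use 9000
conf.set("dfs.name.dir", "/data/hadoop2/dfs/name");            // unique NameNode metadata dir
conf.set("dfs.data.dir", "/data/hadoop2/dfs/data");            // unique DataNode block dir
conf.set("dfs.http.address", "0.0.0.0:50170");                 // default is 50070
conf.set("dfs.datanode.address", "0.0.0.0:50110");             // default is 50010
conf.set("mapred.job.tracker", "master:9201");                 // unique JobTracker RPC port
conf.set("mapred.job.tracker.http.address", "0.0.0.0:50130");  // default is 50030
conf.set("mapred.local.dir", "/data/hadoop2/mapred/local");    // unique TaskTracker scratch dir
conf.set("hadoop.tmp.dir", "/tmp/hadoop2");                    // default parent of many other paths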

On Mon, Feb 21, 2011 at 20:09, Gang Luo  wrote:
> Hello folks,
> I am trying to run multiple hadoop instances on the same cluster, and I find it
> hard to make them share. First I tried two instances, each running with the
> same master and slaves; only one of them would work. I then tried to divide the
> cluster so that hadoop 1 uses machines 0-9 and hadoop 2 uses machines 10-19.
> Still, only one of them would work. The HDFS of the second instance is working
> well, but start-mapred.sh results in the exception "java.io.IOException:
> Connection reset by peer" in the log.
>
>
> Any ideas on this, or suggestions on how to run multiple hadoop instances on
> one cluster? I can completely divide up the cluster so that different instances
> run on different sets of machines.
>
> Thanks.
>
> -Gang
>
>
>
>
>


multiple hadoop instances on same cluster

2011-02-21 Thread Gang Luo
Hello folks,
I am trying to run multiple hadoop instances on the same cluster, and I find it
hard to make them share. First I tried two instances, each running with the same
master and slaves; only one of them would work. I then tried to divide the
cluster so that hadoop 1 uses machines 0-9 and hadoop 2 uses machines 10-19.
Still, only one of them would work. The HDFS of the second instance is working
well, but start-mapred.sh results in the exception "java.io.IOException:
Connection reset by peer" in the log.


Any ideas on this, or suggestions on how to run multiple hadoop instances on one
cluster? I can completely divide up the cluster so that different instances run
on different sets of machines.

Thanks.

-Gang






RE: Spilled Records

2011-02-21 Thread Saurabh Dutta
Hi Maha,

Spilled records have to do with the transient data written during the map and
reduce operations. Note that it's not just the map operations that generate the
spilled records. When the in-memory buffer (controlled by
mapred.job.shuffle.merge.percent) runs out or reaches the threshold number of
map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk.

You are going in the right direction by tuning the io.sort.mb parameter; try
increasing it further. If it still doesn't work out, try io.sort.factor and
fs.inmemory.size.mb. Also, try the other two variables that I mentioned earlier.
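
For reference, here are the knobs mentioned in this thread gathered in the same
conf.set() style used elsewhere in it. The values are purely illustrative starting
points, not recommendations, and fs.inmemory.size.mb only exists in older releases:

conf.set("io.sort.mb", "200");                        // map-side sort buffer, in MB
conf.set("io.sort.factor", "50");                     // number of streams merged at once
conf.set("mapred.job.shuffle.merge.percent", "0.66"); // reduce-side in-memory merge trigger
conf.set("mapred.inmem.merge.threshold", "1000");     // map-output count that triggers an in-memory merge
conf.set("fs.inmemory.size.mb", "100");               // reduce-side in-memory FS size (older releases only)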

Let us know what worked for you.

Sincerely,
Saurabh Dutta
Impetus Infotech India Pvt. Ltd.,
Sarda House, 24-B, Palasia, A.B.Road, Indore - 452 001
Phone: +91-731-4269200 4623
Fax: + 91-731-4071256
Email: saurabh.du...@impetus.co.in
www.impetus.com

From: maha [m...@umail.ucsb.edu]
Sent: Tuesday, February 22, 2011 8:21 AM
To: common-user
Subject: Spilled Records

Hello everyone,

 Do spilled records mean that the sort buffer is not large enough to sort all 
the input records, so some records are written to local disk?

 If so, I tried setting io.sort.mb from the default of 100 to 200 and there was 
still the same # of spilled records. Why?

 Might changing io.sort.record.percent to .9 instead of .8 produce unexpected 
exceptions?


Thank you,
Maha





Spilled Records

2011-02-21 Thread maha
Hello everyone,

 Do spilled records mean that the sort buffer is not large enough to sort all 
the input records, so some records are written to local disk?

 If so, I tried setting io.sort.mb from the default of 100 to 200 and there was 
still the same # of spilled records. Why?

 Might changing io.sort.record.percent to .9 instead of .8 produce unexpected 
exceptions?


Thank you,
Maha
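
One hedged note on the counter itself, as I understand the 0.20-era code: every map
output record is written to local disk at least once (the final buffer flush is itself
a spill), so absent a combiner the Spilled Records counter can never drop below the Map
output records count no matter how large io.sort.mb is; tuning only helps when records
are being spilled and re-merged more than once. Within a task, the buffer that decides
when a background spill starts is split into a data part and a 16-byte-per-record
metadata part sized by io.sort.record.percent (default 0.05), and a spill begins when
either part passes io.sort.spill.percent (default 0.80). The sketch below is just that
arithmetic, with a made-up record size:

// Back-of-envelope accounting for the 0.20-era map-side sort buffer.
// The property names are real; the example record size is made up.
public class SpillMath {
  public static void main(String[] args) {
    int ioSortMb = 200;            // io.sort.mb
    double recordPercent = 0.05;   // io.sort.record.percent (default)
    double spillPercent = 0.80;    // io.sort.spill.percent (default)
    int avgRecordBytes = 100;      // hypothetical average serialized record size

    long bufferBytes = ioSortMb * 1024L * 1024L;
    long metaBytes   = (long) (bufferBytes * recordPercent);  // 16 bytes of accounting per record
    long dataBytes   = bufferBytes - metaBytes;

    long recordLimit = (long) ((metaBytes / 16) * spillPercent);
    long byteLimit   = (long) (dataBytes * spillPercent) / avgRecordBytes;

    // A background spill starts when EITHER limit is reached.
    System.out.println("records before spill (metadata side): " + recordLimit); // ~524k at 200 MB
    System.out.println("records before spill (data side):     " + byteLimit);   // ~1.6M with 100-byte records
    // With small records the metadata side is the binding constraint, so going
    // from io.sort.mb=100 to 200 only moves that limit from ~262k to ~524k records.
  }
}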

Re: how many output files can support by MultipleOutputs?

2011-02-21 Thread Jun Young Kim

hi,

I think the third error pattern is not caused by the xceiver setting.

org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle 
in fetcher#5
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.lang.OutOfMemoryError: Java heap space
at 
org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:58)
at 
org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:45)
at 
org.apache.hadoop.mapreduce.task.reduce.MapOutput.<init>(MapOutput.java:104)
at 
org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:267)
at org.apache.hadoop.mapreduce.task.re


According to Google, this is caused by wrong IP entries for one of the nodes in
my cluster, but I've checked several times and the IP addresses of my cluster
are normal.


my cluster size is 9 (1 master, 8 slaves)

this is my mapred-site.xml:

<?xml version="1.0"?>
<configuration>

  <property>
    <name>mapreduce.job.tracker</name>
    <value>thadpm01.scast:54311</value>
    <description>The host and port that the MapReduce job tracker runs
      at.  If "local", then jobs are run in-process as a single map
      and reduce task.</description>
  </property>

  <property>
    <name>mapreduce.jobtracker.taskscheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>

  <property>
    <name>mapreduce.child.java.opts</name>
    <value>-Xmx1024m</value>
    <final>true</final>
  </property>

  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024m</value>
    <final>true</final>
  </property>

  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx1024m</value>
    <final>true</final>
  </property>

  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>83</value>
    <final>true</final>
  </property>

  <property>
    <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
    <value>11</value>
    <final>true</final>
  </property>

  <property>
    <name>mapreduce.jobtracker.handler.count</name>
    <value>20</value>
    <final>true</final>
  </property>

  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>10</value>
    <final>true</final>
  </property>

  <property>
    <name>mapreduce.task.io.sort.factor</name>
    <value>100</value>
    <final>true</final>
  </property>

  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>400</value>
    <final>true</final>
  </property>

</configuration>

error log on stdout:
attempt_201102181827_0113_r_00_1: 2011-02-22 10:24:28[WARN 
][Child.java]main()(234) : Exception running child : 
org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in 
shuffle in fetcher#8
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapred.Child$4.run(Child.java:217)
attempt_201102181827_0113_r_00_1:   at 
java.security.AccessController.doPrivileged(Native Method)
attempt_201102181827_0113_r_00_1:   at 
javax.security.auth.Subject.doAs(Subject.java:396)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapred.Child.main(Child.java:211)
attempt_201102181827_0113_r_00_1: Caused by: 
java.lang.OutOfMemoryError: Java heap space
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:58)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:45)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapreduce.task.reduce.MapOutput.<init>(MapOutput.java:104)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:267)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapreduce.task.reduce.MergeManager.reserve(MergeManager.java:257)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:305)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:251)
attempt_201102181827_0113_r_00_1:   at 
org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:149)
attempt_201102181827_0113_r_00_1: 2011-02-22 10:24:28[INFO 
][Task.java]taskCleanup()(996) : Runnning cleanup for the task

11/02/22 10:24:44 INFO mapreduce.Job:  map 21% reduce 0%
11/02/22 10:24:54 INFO mapreduce.Job:  map 22% reduce 0%


thanks.

Junyoung Kim (juneng...@gmail.com)
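
One hedged observation on the trace above: the OutOfMemoryError is thrown while
MergeManager reserves in-memory space for fetched map outputs, and the fraction of the
reduce task's heap the shuffle may use is capped by the shuffle input buffer percent
property (around 0.70 of the heap by default). With -Xmx1024m and 10 parallel copies
that reservation can be tight, so one thing worth trying is sketched below; the values
are illustrative only and the exact property spellings should be checked against the
release in use:

// Illustrative only; verify the property names against your release
// (this cluster's config already uses the newer mapreduce.* names).
conf.set("mapreduce.reduce.shuffle.input.buffer.percent", "0.50"); // cap shuffle at 50% of reduce heap
conf.set("mapreduce.reduce.java.opts", "-Xmx1536m");               // or give the reduce JVM more headroom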


On 02/21/2011 10:47 AM, Yifeng Jiang wrote:
We were using 0.20.2 when the issue occurred, then we set it to 2048, 
and the failure was fixed.

Now we are using 0.20-append (HBase requires it), it works well too.

On 2011/02/21 10:35, Jun Young Kim wrote:

hi, yifeng.

Could I know which version of Hadoop you are using?

thanks for your response.

Junyoung Kim (juneng...@gmail.com)


On 02/21/2011 10:28 AM, Yifeng Jiang wrote:

Hi,

We have met the same issue.
It seems that this error occurs when the number of threads connected to the 
Datanode reaches the maximum
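
For reference, the datanode connection ceiling Yifeng is describing is controlled by
dfs.datanode.max.xcievers (the misspelling is the actual property name), and 2048 is
the value he mentioned earlier in the thread. Shown below in the conf.set() style used
elsewhere in this digest; in practice the entry belongs in hdfs-site.xml on every
datanode, followed by a datanode restart:

// Equivalent of the hdfs-site.xml entry on each datanode; 2048 is the value
// reported in this thread, not a general recommendation.
conf.setInt("dfs.datanode.max.xcievers", 2048);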

Re: benchmark choices

2011-02-21 Thread Shrinivas Joshi
I wonder what companies like Amazon, Cloudera, RackSpace, Facebook, Yahoo
etc. look at for the purpose of benchmarking. I guess GridMix v3 might be of
more interest to Yahoo.

I would appreciate it if someone could comment more on this.

Thanks,
-Shrinivas

On Fri, Feb 18, 2011 at 4:50 PM, Konstantin Boudnik  wrote:

> On Fri, Feb 18, 2011 at 14:35, Ted Dunning  wrote:
> > I just read the malstone report.  They report times for a Java version
> that
> > is many (5x) times slower than for a streaming implementation.  That
> single
> > fact indicates that the Java code is so appallingly bad that this is a
> very
> > bad benchmark.
>
> Slow Java code? That's funny ;) Running with Hotspot on by any chance?
>
> > On Fri, Feb 18, 2011 at 2:27 PM, Jim Falgout  >wrote:
> >
> >> We use MalStone and TeraSort. For Hive, you can use TPC-H, at least the
> >> data and the queries, if not the query generator. There is a Jira issue
> in
> >> Hive that discusses the TPC-H "benchmark" if you're interested. Sorry, I
> >> don't remember the issue number offhand.
> >>
> >> -Original Message-
> >> From: Shrinivas Joshi [mailto:jshrini...@gmail.com]
> >> Sent: Friday, February 18, 2011 3:32 PM
> >> To: common-user@hadoop.apache.org
> >> Subject: benchmark choices
> >>
> >> Which workloads are used for serious benchmarking of Hadoop clusters? Do
> >> you care about any of the following workloads :
> >> TeraSort, GridMix v1, v2, or v3, MalStone, CloudBurst, MRBench, NNBench,
> >> sample apps shipped with Hadoop distro like PiEstimator, dbcount etc.
> >>
> >> Thanks,
> >> -Shrinivas
> >>
> >>
> >
>


Re: ObjectWritable

2011-02-21 Thread Weishung Chung
Thank you for the explanation. Avro is a good serialization tool. I haven't
looked at the code yet; I will probably dig into it very soon.

On Mon, Feb 21, 2011 at 10:20 AM, Harsh J  wrote:

> Hello,
>
> On Mon, Feb 21, 2011 at 9:33 PM, Weishung Chung 
> wrote:
> > What is the main use of org.apache.hadoop.io.ObjectWritable ? Thank you
> :)
>
> To use any primitive Java object as a Writable without requiring it to
> implement that interface. When serializing, it writes out the class name of
> every object you put into it along with the object itself, so that it can be
> deserialized properly.
>
> Maybe not so offtopic: The more I see Writables being used, the more I
> feel like promoting the use of Apache's Avro instead.
>
> --
> Harsh J
> www.harshj.com
>


Re: Quick question

2011-02-21 Thread maha
How then can I produce an output file per mapper, not per map task?

Thank you,
Maha

On Feb 20, 2011, at 10:22 PM, Ted Dunning wrote:

> This is the most important thing that you have said. The map function
> is called once per unit of input, but the mapper object persists across
> many units of input.
> 
> You have a little bit of control over how many mapper objects there
> are and how many machines they are created on and how many pieces your
> input is broken into.  That control is limited, however, unless you
> build your own input format. The standard input formats are optimized
> for very large inputs and may not give you the flexibility that you
> want for your experiments. That is unfortunate for the purpose of
> learning about hadoop but hadoop is designed mostly for dealing with
> very large data and isn't usually designed to be easy to understand.
> Where easy coincides with powerful then easy is good but powerful
> isn't always easy.
> 
> On Sunday, February 20, 2011, maha  wrote:
>> So first question: is there a difference between Mappers and maps ?
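
A tiny old-API sketch of the distinction Ted is drawing: one Mapper object is created
per map task, its map() method then runs once per input record, and per-object state
such as the counter below survives across those calls (the class name and the logging
are made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// One instance of this class per map task; map() runs once per input record.
public class CountingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private long recordsSeen = 0;   // lives as long as the mapper object does

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    recordsSeen++;                             // per-call work
    output.collect(value, new LongWritable(1));
  }

  @Override
  public void close() throws IOException {
    // Runs once per mapper object, after the last map() call of this task.
    System.err.println("this map task processed " + recordsSeen + " records");
  }
}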



measure the time taken by stragglers

2011-02-21 Thread bikash sharma
Hi,
Is there a way to measure the execution time of straggler and non-straggler
tasks separately in Hadoop MapReduce?
-bikash


Re: ObjectWritable

2011-02-21 Thread Harsh J
Hello,

On Mon, Feb 21, 2011 at 9:33 PM, Weishung Chung  wrote:
> What is the main use of org.apache.hadoop.io.ObjectWritable ? Thank you :)

To use any primitive Java object as a Writable without requiring it to
implement that interface. When serializing, it writes out the class name of
every object you put into it along with the object itself, so that it can be
deserialized properly.

Maybe not so offtopic: The more I see Writables being used, the more I
feel like promoting the use of Apache's Avro instead.

-- 
Harsh J
www.harshj.com
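
A small sketch of the usage Harsh describes: wrap a value, serialize it (the class
name is written alongside it), and read it back. The stream plumbing is only there
to make the example self-contained:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;

public class ObjectWritableDemo {
  public static void main(String[] args) throws IOException {
    // Wrap a value; ObjectWritable records its class name when writing.
    ObjectWritable out = new ObjectWritable(new Text("hello"));
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    out.write(new DataOutputStream(bytes));

    // Read it back; the stored class name tells ObjectWritable what to instantiate.
    ObjectWritable in = new ObjectWritable();
    in.setConf(new Configuration());   // lets wrapped Writables be re-instantiated
    in.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    System.out.println(in.get());      // prints: hello
  }
}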


ObjectWritable

2011-02-21 Thread Weishung Chung
What is the main use of org.apache.hadoop.io.ObjectWritable ? Thank you :)


Re: Quick question

2011-02-21 Thread maha
Thanks for your answers Ted and Jim :)

Maha

On Feb 21, 2011, at 6:41 AM, Jim Falgout wrote:

> Your scenario matches the capability of NLineInputFormat exactly, so that 
> looks to be the best solution. If you wrote your own input format, it would 
> have to basically do what NLineInputFormat is already doing for you.
> 
> -Original Message-
> From: maha [mailto:m...@umail.ucsb.edu] 
> Sent: Sunday, February 20, 2011 2:00 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Quick question
> 
> Actually the following solved my problem ... but I'm a little suspicious of 
> the side effects of doing the following instead of writing my own InputSplit 
> of 5 lines.
> 
> conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class); // # of maps = # lines
> conf.setInt("mapred.line.input.format.linespermap", 5); // # of lines per mapper = 5
> 
> If you have any thoughts on whether the above solution is worse than writing 
> my own InputSplit of about 5 lines, let me know.
> 
> Thanks everyone !
> 
> Maha
>   
> On Feb 20, 2011, at 11:47 AM, maha wrote:
> 
>> Hi again Jim and Ted,
>> 
>> I understood that each mapper will be getting a block of lines... but even 
>> though I had only 2 mappers for a 16-line input file with TextInputFormat, 
>> the map function is called for each of those 16 lines!
>> 
>> I wanted a block of lines per map ... hence something like map1 has 8 lines 
>> and map2 has 8 lines. 
>> 
>> So first question: is there a difference between Mappers and maps ?
>> 
>> Second: Does that mean I need to write my own inputFormat to make the 
>> InputSplit equal to multipleLines ???
>> 
>> Thank you,
>> 
>> Maha
>> 
>> 
>> On Feb 18, 2011, at 11:55 AM, Jim Falgout wrote:
>> 
>>> That's right. The TextInputFormat handles situations where records cross 
>>> split boundaries. What your mapper will see is "whole" records. 
>>> 
>>> -Original Message-
>>> From: maha [mailto:m...@umail.ucsb.edu]
>>> Sent: Friday, February 18, 2011 1:14 PM
>>> To: common-user
>>> Subject: Quick question
>>> 
>>> Hi all,
>>> 
>>> I want to check if the following statement is right:
>>> 
>>> If I use TextInputFormat to process a text file with 2000 lines (each 
>>> ending with \n) with 20 mappers. Then each map will have a sequence of 
>>> COMPLETE LINES . 
>>> 
>>> In other words,  the input is not split byte-wise but by lines. 
>>> 
>>> Is that right?
>>> 
>>> 
>>> Thank you,
>>> Maha
>>> 
>> 
> 
> 



task scheduling based on slots in Hadoop

2011-02-21 Thread bikash sharma
Hi,
Can anyone throw some more light on resource-based scheduling in Hadoop?
Specifically, are resources like CPU and memory partitioned across slots?
From the blog by Arun on the capacity scheduler,
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/capacity-scheduler/
I understand that memory is the only resource supported; does that mean both
memory and CPU are partitioned across map/reduce tasks in slots?

Thanks in advance.

-bikash


RE: Quick question

2011-02-21 Thread Jim Falgout
Your scenario matches the capability of NLineInputFormat exactly, so that looks 
to be the best solution. If you wrote your own input format, it would have to 
basically do what NLineInputFormat is already doing for you.

-Original Message-
From: maha [mailto:m...@umail.ucsb.edu] 
Sent: Sunday, February 20, 2011 2:00 PM
To: common-user@hadoop.apache.org
Subject: Re: Quick question

Actually the following solved my problem ... but I'm a little suspicious of the 
side effects of doing the following instead of writing my own InputSplit of 5 
lines.

 conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class); // # of maps = # lines
 conf.setInt("mapred.line.input.format.linespermap", 5); // # of lines per mapper = 5

If you have any thoughts on whether the above solution is worse than writing my 
own InputSplit of about 5 lines, let me know.

Thanks everyone !

Maha

On Feb 20, 2011, at 11:47 AM, maha wrote:

> Hi again Jim and Ted,
> 
> I understood that each mapper will be getting a block of lines... but even 
> though I had only 2 mappers for a 16-line input file with TextInputFormat, 
> the map function is called for each of those 16 lines!
> 
> I wanted a block of lines per map ... hence something like map1 has 8 lines 
> and map2 has 8 lines. 
> 
> So first question: is there a difference between Mappers and maps ?
> 
> Second: Does that mean I need to write my own inputFormat to make the 
> InputSplit equal to multipleLines ???
> 
> Thank you,
> 
> Maha
> 
> 
> On Feb 18, 2011, at 11:55 AM, Jim Falgout wrote:
> 
>> That's right. The TextInputFormat handles situations where records cross 
>> split boundaries. What your mapper will see is "whole" records. 
>> 
>> -Original Message-
>> From: maha [mailto:m...@umail.ucsb.edu]
>> Sent: Friday, February 18, 2011 1:14 PM
>> To: common-user
>> Subject: Quick question
>> 
>> Hi all,
>> 
>> I want to check if the following statement is right:
>> 
>> If I use TextInputFormat to process a text file with 2000 lines (each ending 
>> with \n) with 20 mappers. Then each map will have a sequence of COMPLETE 
>> LINES . 
>> 
>> In other words,  the input is not split byte-wise but by lines. 
>> 
>> Is that right?
>> 
>> 
>> Thank you,
>> Maha
>> 
>