Behaviour of reducer's Iterable in MR unit.

2010-12-09 Thread James Hammerton
Hi,

This relates to a bug we had a while back.

When running a reducer, if you want to buffer the values, you normally need
to take a copy of each value as you iterate through them. This is because
the iterator always returns the same object but the contents of the object
get filled with each value as the iterator steps through.
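
To make the gotcha concrete, here is a minimal sketch of the copy that is needed when
buffering (the class is hypothetical; it assumes Text values and the new API, and the
same idea applies to any Writable):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BufferingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<Text> buffered = new ArrayList<Text>();
        for (Text value : values) {
            // new Text(value) takes a copy; adding 'value' itself would leave the
            // list holding one reused object containing only the last value seen
            buffered.add(new Text(value));
        }
        // ... work over 'buffered' here, then emit whatever is needed ...
        for (Text v : buffered) {
            context.write(key, v);
        }
    }
}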

However *this behaviour is not reproduced by the reducer drivers in MR unit*.
Even if you give the reduce driver a List (why do we have to give a List
when the reducer specifies merely an Iterable?) designed to behave this way, MR
unit copies the values into a normal List before presenting them to the
reducer. At least this is the case with the 0.20.1 install we have.

Anyway, in order to test our bug fix we extended the ReduceDriver class to
actually copy the values into an iterable that does reproduce the behaviour
so that we can test for bugs caused by failing to copy the values. In more
recent versions of Hadoop (we use 0.20.1), has the behaviour of the reduce
drivers been altered to match that of actual running reducers in this respect?
Are there any plans to do this? Alternatively, I'd be willing to fix this in
the Hadoop codebase myself if necessary.

Regards,

James

-- 
James Hammerton | Senior Data Mining Engineer
www.mendeley.com/profiles/james-hammerton

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015


Memory Manager in Hadoop MR

2010-12-09 Thread Pedro Costa
Hi,

1 - Hadoop MR contains a TaskMemoryManagerThread class that is used to
manage the memory usage of tasks running under a TaskTracker. Why does Hadoop
MR need a class to manage memory? Why can't it rely on the JVM, or is
this class here for another purpose?

2 - How does the JT know that a Map or Reduce Task has finished? Is it through
the heartbeat?

Thanks

-- 
Pedro


Map-Reduce Applicability With All-In Memory Data

2010-12-09 Thread Narinder Kumar
Hi All,

We have a problem in hand which we would like to solve using Distributed and
Parallel Processing.

*Problem context* : We have a Map (Entity, Value). The entity can have a
parent which in turn will have its parent and so on till we reach the head.
I have to traverse this tree and do some calculations at every step using
the value from the Map. The final output will again be a map containing the
aggregated results of the computation (Entity, Computed Value). The tree
structure can be quite deep and we have a huge number of entries in the Map to
process before coming to the final result. Processing them sequentially
takes quite a long time. We were thinking of using Map-Reduce to split the
computation across multiple nodes in a Hadoop Cluster and then aggregate the
results to get the final output.

Having had a quick read of the documentation and the samples, I see that both
Mapper and Reducer work with implementations of InputFormat and OutputFormat
respectively. Most of the implementations appeared to me to be either File
or DB based. Do we have some input/output format which directly
takes/updates things from/into memory, or do I need to provide my own custom
InputFormat/OutputFormat and RecordReader/RecordWriter implementations for the
purpose?

Based upon your experience, do you think Map-Reduce is the appropriate
platform for this kind of scenario, or should we think of it more for huge
file-based data only?

Best Regards
Narinder


Re: Memory Manager in Hadoop MR

2010-12-09 Thread Ted Yu
For 1, the TMMT (TaskMemoryManagerThread) uses a ProcessTree to check for tasks
that are running beyond their memory limits and kills them.

On Thu, Dec 9, 2010 at 3:05 AM, Pedro Costa  wrote:

> Hi,
>
> 1 - Hadoop MR contains a TaskMemoryManagerThread class that is used to
> manage the memory usage of tasks running under a TaskTracker. Why does Hadoop
> MR need a class to manage memory? Why can't it rely on the JVM, or is
> this class here for another purpose?
>
> 2 - How does the JT know that a Map or Reduce Task has finished? Is it through
> the heartbeat?
>
> Thanks
>
> --
> Pedro
>


Re: Map-Reduce Applicability With All-In Memory Data

2010-12-09 Thread Jason
Take a look at NLineInputFormat. You might want to use it in combination with 
DistributedCache. 
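
Roughly, that could look like the sketch below (old 0.20 API; the job set-up, cache
path and linespermap value are illustrative assumptions): entities are handed to
mappers a few lines at a time by NLineInputFormat, while the full (Entity, Value) map
is shipped to every node via the DistributedCache so each mapper can walk parent
chains locally.

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class EntityRollupJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(EntityRollupJob.class);
        conf.setJobName("entity-rollup");

        // each map task gets a fixed number of entity lines from the input
        conf.setInputFormat(NLineInputFormat.class);
        conf.setInt("mapred.line.input.format.linespermap", 1000);

        // ship the serialized (Entity, Value) map to every node; tasks read
        // it from the local cache
        DistributedCache.addCacheFile(new URI("/cache/entity-values.txt"), conf);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // identity classes only keep the sketch self-contained; the real
        // mapper would walk each entity's parent chain using the cached map,
        // and the reducer would aggregate per entity
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        JobClient.runJob(conf);
    }
}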

Sent from my iPhone

On Dec 9, 2010, at 5:02 AM, Narinder Kumar  wrote:

> Hi All,
> 
> We have a problem in hand which we would like to solve using Distributed and 
> Parallel Processing. 
> 
> Problem context : We have a Map (Entity, Value). The entity can have a parent 
> which in turn will have its parent and so on till we reach the head. I have 
> to traverse this tree and do some calculations at every step using the value 
> from the Map. The final output will again be a map containing the aggregated 
> results of the computation (Entity, Computed Value). The tree structure can 
> be quite deep and we have a huge number of entries in the Map to process before 
> coming to the final result. Processing them sequentially takes quite a long 
> time. We were thinking of using Map-Reduce to split the computation across 
> multiple nodes in a Hadoop Cluster and then aggregate the results to get the 
> final output.
> 
> Having had a quick read of the documentation and the samples, I see that both 
> Mapper and Reducer work with implementations of InputFormat and OutputFormat 
> respectively. Most of the implementations appeared to me to be either File or 
> DB based. Do we have some input/output format which directly takes/updates 
> things from/into memory, or do I need to provide my own custom 
> InputFormat/OutputFormat and RecordReader/RecordWriter implementations for the purpose?
> 
> Based upon your experience, do you think Map-Reduce is the appropriate 
> platform for this kind of scenario, or should we think of it more for huge 
> file-based data only?
> 
> Best Regards
> Narinder


MultipleInputs and Paths Containing Commas

2010-12-09 Thread Ghigliotti, Matthew
Hello.

I'm unsure if this is a bug or an oversight, but since I've not found any 
reference to this anywhere, I figured I might bring it to light.

I've been using MultipleInputs for several of my MapReduce jobs, where I am 
joining together different forms of data. However, I have encountered the 
following exception with some uses of MultipleInputs in Hadoop 0.20.2:

java.lang.ArrayIndexOutOfBoundsException: 1
 at 
org.apache.hadoop.mapred.lib.MultipleInputs.getInputFormatMap(MultipleInputs.java:94)
 at 
org.apache.hadoop.mapred.lib.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:51)
 at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
 at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)

After tracing through the source code, it appears that this occurs when the 
input path specified in MultipleInputs#addInputPath() contains a comma, which 
most often happens when using globs (for example, "/months/{March,April,May}.txt"). 
Because the path itself contains commas, and a comma is one of the two special 
delimiters used in MultipleInputs#getInputFormatMap(), the path/input-format data 
is parsed incorrectly when the input format map is being created.
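
To make the failing pattern concrete, here is a minimal sketch (the path and job
details are illustrative only, old 0.20 API):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class CommaGlobRepro {
    public static void main(String[] args) {
        JobConf conf = new JobConf(CommaGlobRepro.class);
        // The glob is a path string that itself contains commas, the same
        // character MultipleInputs uses to separate its path/format entries,
        // so getInputFormatMap() mis-parses the mapping at submission time.
        MultipleInputs.addInputPath(conf,
                new Path("/months/{March,April,May}.txt"),
                TextInputFormat.class, IdentityMapper.class);
        // ... remaining job set-up and JobClient.runJob(conf) omitted ...
    }
}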

Could someone verify this behavior in other versions of Hadoop? And possibly 
the more important question, should this actually be considered a bug in 
MultipleInputs?

Thanks.




RE: distcp just fails (was:distcp fails with ConnectException)

2010-12-09 Thread Deepika Khera
Thanks everyone. It turned out that I was using the wrong port. The issue was 
resolved.

From: hadoopman [mailto:hadoop...@gmail.com]
Sent: Monday, December 06, 2010 6:26 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: distcp just fails (was:distcp fails with ConnectException)

On 12/06/2010 06:52 PM, jingguo yao wrote:
A few things worth a check.

1. Can hadoop-A connect to hadoop-B using port 8020? You can check the firewall
settings. Maybe your firewall settings allow ssh and ping, but port 8020 is
disallowed.
2. Which user are you using to run distcp? Does this user have access rights to
hadoop-A?
I'm curious.  I've run into a problem with distcp also, and we're using ports 
9000, 9001 and 50010 for the two hadoop clusters we're running.  It looks like 
we're connecting; however, when it gets to 80% it bails.  I've validated that our 
firewall is open to those ports between the two clusters.  What are you 
referring to regarding access rights to hadoop-A?  I'm a little confused by 
that statement, I'm afraid :-)

Below is what we're seeing.  Thanks!!

--==--==--==--

# u...@hnn1:~$ hadoop distcp hdfs://hnn1:9000/user/testing hdfs://hnn2:9000/user
10/12/03 15:58:10 INFO tools.DistCp: srcPaths=[hdfs://hnn1:9000/user/testing]
10/12/03 15:58:10 INFO tools.DistCp: destPath=hdfs://hnn2:9000/user
10/12/03 15:58:11 INFO tools.DistCp: srcCount=6
10/12/03 15:58:11 INFO mapred.JobClient: Running job: job_201011221457_0019
10/12/03 15:58:12 INFO mapred.JobClient:  map 0% reduce 0%
 10/12/03 15:58:36 INFO mapred.JobClient:  map 19% reduce 0%
10/12/03 15:58:45 INFO mapred.JobClient:  map 39% reduce 0%
10/12/03 15:59:03 INFO mapred.JobClient:  map 60% reduce 0%
10/12/03 15:59:12 INFO mapred.JobClient:  map 80% reduce 0%
10/12/03 15:59:32 INFO mapred.JobClient: Task Id : 
attempt_201011221457_0019_m_00_0, Status : FAILED
java.io.IOException: Copied: 0 Skipped: 0 Failed: 5
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:572)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

10/12/03 15:59:33 INFO mapred.JobClient:  map 0% reduce 0%
10/12/03 15:59:55 INFO mapred.JobClient:  map 19% reduce 0%
10/12/03 16:00:04 INFO mapred.JobClient:  map 39% reduce 0%
10/12/03 16:00:22 INFO mapred.JobClient:  map 60% reduce 0%
10/12/03 16:00:31 INFO mapred.JobClient:  map 80% reduce 0%
10/12/03 16:00:51 INFO mapred.JobClient: Task Id : 
attempt_201011221457_0019_m_00_1, Status : FAILED
java.io.IOException: Copied: 0 Skipped: 0 Failed: 5
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:572)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)



Re: Memory Manager in Hadoop MR

2010-12-09 Thread Greg Roelofs
> 2 - How does the JT know that a Map or Reduce Task has finished? Is it through
> the heartbeat?

Exactly.  Tasks communicate with their TTs through the umbilical, and each
TT communicates with the JT via heartbeat (and heartbeat response).

Greg


Re: Behaviour of reducer's Iterable in MR unit.

2010-12-09 Thread Aaron Kimball
Hi James,

The ReduceDriver is configured to receive a list of inputs because
lists have ordering guarantees whereas other Iterables/Collections do
not; for determinism's sake, it is best to guarantee that you're
calling reduce() with an ordered set of values when testing.

It would be stellar if you could improve the ReduceDriver to reuse a
writable instance between calls. You'll need to infer the appropriate
container class type from the first instance you see in the reducer's
output, and use the serialization API to make a copy. If you look at
o.a.h.mrunit.mock.MockOutputCollector, this will show a pattern you
can work on.
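
Roughly, the wrapper could look like the sketch below (the class is made up, not an
existing MRUnit type; it assumes ReflectionUtils.copy() is available in your Hadoop
version, with older releases only having the Writable-specific cloneWritableInto()):

import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

public class ReusingIterable<T> implements Iterable<T> {
    private final List<T> values;
    private final Configuration conf;

    public ReusingIterable(List<T> values, Configuration conf) {
        this.values = values;
        this.conf = conf;
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private final Iterator<T> wrapped = values.iterator();
            private T reused;  // the single instance handed back on every next()

            public boolean hasNext() {
                return wrapped.hasNext();
            }

            public T next() {
                T src = wrapped.next();
                try {
                    if (reused == null) {
                        // infer the value class from the first instance we see
                        @SuppressWarnings("unchecked")
                        T instance = (T) ReflectionUtils.newInstance(src.getClass(), conf);
                        reused = instance;
                    }
                    // serialize src and deserialize it into the reused instance,
                    // overwriting its previous contents
                    return ReflectionUtils.copy(conf, src, reused);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }
}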

Cheers,
- Aaron

On Thu, Dec 9, 2010 at 2:21 AM, James Hammerton
 wrote:
> Hi,
>
> This relates to a bug we had a while back.
>
> When running a reducer, if you want to buffer the values, you normally need
> to take a copy of each value as you iterate through them. This is because
> the iterator always returns the same object but the contents of the object
> get filled with each value as the iterator steps through.
>
> However this behaviour is not reproduced by the reducer drivers in MR unit.
> Even if you give the reduce driver a List (why do we have to give a List
> when the reducer specifies merely an Iterable?) designed to behave this way, MR
> unit copies the values into a normal List before presenting them to the
> reducer. At least this is the case with the 0.20.1 install we have.
>
> Anyway, in order to test our bug fix we extended the ReduceDriver class to
> actually copy the values into an iterable that does reproduce the behaviour
> so that we can test for bugs caused by failing to copy the values. In more
> recent versions of Hadoop (we use 0.20.1), has the behaviour of the reduce
> drivers been altered to match that of actual running reducers in this respect?
> Are there any plans to do this? Alternatively, I'd be willing to fix this in
> the Hadoop codebase myself if necessary.
>
> Regards,
>
> James
>
> --
> James Hammerton | Senior Data Mining Engineer
> www.mendeley.com/profiles/james-hammerton
>
> Mendeley Limited | London, UK | www.mendeley.com
> Registered in England and Wales | Company Number 6419015
>
>
>
>


Re: Memory Manager in Hadoop MR

2010-12-09 Thread Hemanth Yamijala
Hi,

On Thu, Dec 9, 2010 at 4:35 PM, Pedro Costa  wrote:
> Hi,
>
> 1 - Hadoop MR contains a TaskMemoryManagerThread class that is used to
> manage the memory usage of tasks running under a TaskTracker. Why does Hadoop
> MR need a class to manage memory? Why can't it rely on the JVM, or is
> this class here for another purpose?
>

There are streaming and pipes map/reduce applications that launch
native processes from the map/reduce tasks that are outside the
control of the JVM. Indeed, even regular Java map/reduce programs
could fork/exec other programs. All of these processes could consume
memory that would not be accounted for if we relied only on the JVM to
get the memory usage. Hence the need for a separate class that looks at the
entire process tree of the map/reduce task to account for the memory consumed.

> 2 - How does the JT know that a Map or Reduce Task has finished? Is it through
> the heartbeat?
>

Yes.

> Thanks
>
> --
> Pedro
>


How to share Same Counter in Multiple Jobs?

2010-12-09 Thread Savannah Beckett
Hi,
  I chain multiple jobs in my program.  Job 1's reduce function has a counter.
I want job 3's reduce function to read this counter from Job 1.  How?

Thanks.


  

Re: How to share Same Counter in Multiple Jobs?

2010-12-09 Thread Ted Yu
I wrote the following code today. We have our own flow execution logic which
calls the following to collect counters.

enum COUNT_COLLECTION {
    LOG,          // log the counters
    ADD_TO_CONF   // add counters to JobConf
}

protected static void collectCounters(RunningJob running, JobConf jobConf,
        EnumSet<COUNT_COLLECTION> collFlags) {
    try {
        Counters counters = running.getCounters();
        Collection<String> counterGroupNames = counters.getGroupNames();
        if (counterGroupNames == null) {
            LOG.warn("No counters returned from job " + running.getJobName());
        } else {
            String[] groupsToCollect = { "Map-Reduce Framework",
                    "FileSystemCounters" };
            for (String counterGroupName : groupsToCollect) {
                for (Iterator<Counter> iterator =
                        counters.getGroup(counterGroupName).iterator(); iterator.hasNext();) {
                    Counter counter = iterator.next();
                    String counterName =
                            counters.getGroup(counterGroupName).getDisplayName() + "." +
                            counter.getDisplayName();
                    if (collFlags.contains(COUNT_COLLECTION.LOG)) {
                        LOG.info(counterName + ": " + counter.getCounter());
                    }
                    // handle the ADD_TO_CONF flag declared above so the counter
                    // value is available to a downstream job via its JobConf
                    if (collFlags.contains(COUNT_COLLECTION.ADD_TO_CONF)) {
                        jobConf.setLong(counterName, counter.getCounter());
                    }
                }
            }
        }
    } catch (IOException e) {
        LOG.error("unable to retrieve counters", e);
    }
}

You can pass the counter from Job 1 to Job 3 via JobConf.
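
A rough sketch of that hand-off with the old API follows; the counter group/name,
the conf key and the job set-up are all made up for illustration:

import java.io.IOException;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class CounterHandOff {
    public static void main(String[] args) throws IOException {
        JobConf job1Conf = new JobConf(CounterHandOff.class); // job 1 set up elsewhere
        RunningJob job1 = JobClient.runJob(job1Conf);

        // pull the counter written by job 1's reducer
        long recordsKept = job1.getCounters()
                .findCounter("MyCounters", "RECORDS_KEPT").getCounter();

        // hand it to job 3 through its configuration
        JobConf job3Conf = new JobConf(CounterHandOff.class); // job 3 set up elsewhere
        job3Conf.setLong("job1.records.kept", recordsKept);
        // job 3's reducer can then read it back in configure():
        //   this.recordsKept = job.getLong("job1.records.kept", 0L);
        JobClient.runJob(job3Conf);
    }
}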

On Thu, Dec 9, 2010 at 9:45 PM, Savannah Beckett <
savannah_becket...@yahoo.com> wrote:

> Hi,
>   I chain multiple jobs in my program.  Job 1's reduce function has a
> counter.  I want job 3's reduce function to read this counter from Job 1.
> How?
> Thanks.
>
>


-libjars?

2010-12-09 Thread Vipul Pandey
disclaimer : a newbie!!!

Howdy?

Got a quick question. The -libjars option doesn't seem to work for me in pretty 
much my first (or maybe second) mapreduce job. 
Here's what I'm doing:
$ bin/hadoop jar sherlock.jar somepkg.FindSchoolsJob -libjars HStats-1A18.jar 
input output

sherlock.jar has my main class (of course), FindSchoolsJob, which runs just fine 
by itself till I add a dependency on a class in HStats-1A18.jar.
When I run the above command with -libjars specified, it fails to find my 
classes that 'are' inside the HStats jar file.

Exception in thread "main" java.lang.NoClassDefFoundError: com/*/HAgent
at com.*.FindSchoolsJob.run(FindSchoolsJob.java:46)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.**.FindSchoolsJob.main(FindSchoolsJob.java:101)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.ClassNotFoundException:com/*/HAgent
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
... 8 more


My main class is defined as below : 
public class FindSchoolsJob extends Configured implements Tool {
    :
    public int run(String[] args) throws Exception {
        :
        :
    }
    :
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new FindSchoolsJob(), args);
        System.exit(res);
    }
}

Any hint would be highly appreciated.

Thank You!
~V