Re: Multiple data centre in Hadoop

2012-04-11 Thread Abhishek Pratap Singh
Thanks Robert.
Is there a best practice or design that can address High Availability to a
certain extent?

~Abhishek

On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans  wrote:

> No it does not. Sorry
>
>
> On 4/11/12 1:44 PM, "Abhishek Pratap Singh"  wrote:
>
> Hi All,
>
> Just wanted to know if Hadoop supports more than one data centre. This is
> basically for DR purposes and High Availability, where if one centre goes
> down the other can be brought up.
>
>
> Regards,
> Abhishek
>
>


Re: Multiple data centre in Hadoop

2012-04-11 Thread Robert Evans
No it does not. Sorry


On 4/11/12 1:44 PM, "Abhishek Pratap Singh"  wrote:

Hi All,

Just wanted to know if Hadoop supports more than one data centre. This is
basically for DR purposes and High Availability, where if one centre goes
down the other can be brought up.


Regards,
Abhishek



Hadoop map task initialization takes too long (3 minutes, 10 seconds to be exact)

2012-04-11 Thread Nikos Stasinopoulos
Greetings people,

Well, lately, in any Hadoop flow I run, I encounter a 3-minute, 10-second
delay for a certain map task (on the master, which also works as a slave).
After that initialization delay, it goes back to normal and executes instantly.

For example, when running QuasiMonteCarlo example:

Task Id Start Time Finish Time
attempt_201204101957_0006_m_03_0 10/04 20:14:54 10/04 20:18:05 (3mins,
10sec) /default-rack/master

2012-04-10 20:18:04,470 INFO org.apache.hadoop.util.NativeCodeLoader:
Loaded the native-hadoop library
2012-04-10 20:18:04,646 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=
2012-04-10 20:18:04,647 WARN org.apache.hadoop.conf.Configuration:
user.name is deprecated. Instead, use mapreduce.job.user.name
2012-04-10 20:18:04,751 INFO org.apache.hadoop.mapreduce.util.ProcessTree:
setsid exited with exit code 0
2012-04-10 20:18:04,754 INFO org.apache.hadoop.mapred.Task: Using
ResourceCalculatorPlugin :
org.apache.hadoop.mapreduce.util.LinuxResourceCalculatorPlugin@79ee2c2c
2012-04-10 20:18:04,912 INFO org.apache.hadoop.mapred.MapTask: (EQUATOR) 0
kvi 26214396(104857584)
2012-04-10 20:18:04,912 INFO org.apache.hadoop.mapred.MapTask:
mapreduce.task.io.sort.mb: 100
2012-04-10 20:18:04,912 INFO org.apache.hadoop.mapred.MapTask: soft limit
at 83886080
2012-04-10 20:18:04,912 INFO org.apache.hadoop.mapred.MapTask: bufstart =
0; bufvoid = 104857600
2012-04-10 20:18:04,912 INFO org.apache.hadoop.mapred.MapTask: kvstart =
26214396; length = 6553600
2012-04-10 20:18:04,939 INFO org.apache.hadoop.mapred.MapTask:
Starting flush of map output
2012-04-10 20:18:04,940 INFO org.apache.hadoop.mapred.MapTask: Spilling map
output
2012-04-10 20:18:04,940 INFO org.apache.hadoop.mapred.MapTask: bufstart =
0; bufend = 18; bufvoid = 104857600
2012-04-10 20:18:04,940 INFO org.apache.hadoop.mapred.MapTask: kvstart =
26214396(104857584); kvend = 26214392(104857568); length = 5/6553600
2012-04-10 20:18:04,972 INFO org.apache.hadoop.mapred.MapTask: Finished
spill 0
2012-04-10 20:18:04,975 INFO org.apache.hadoop.mapred.Task:
Task:attempt_201204101957_0006_m_03_0 is done. And is in the process of
commiting
2012-04-10 20:18:05,058 INFO org.apache.hadoop.mapred.Task: Task
'attempt_201204101957_0006_m_03_0' done.

Task tracker log is more telling:

2012-04-10 *20:14:54,615* INFO org.apache.hadoop.mapred.TaskTracker: In
TaskLauncher, current free slots : 1 and trying to launch
attempt_201204101957_0006_m_03_0 which needs 1 slots
2012-04-10 20:14:54,685 INFO org.apache.hadoop.mapred.JvmManager: JVM
Runner jvm_201204101957_0006_m_377512887 spawned.
2012-04-10 20:16:34,041 INFO org.apache.hadoop.mapred.TaskTracker:
addFreeSlot : current free slots : 1
2012-04-10 *20:18:04,433* INFO org.apache.hadoop.mapred.TaskTracker: JVM
with ID: jvm_201204101957_0006_m_377512887 given task:
attempt_201204101957_0006_m_03_0
2012-04-10 20:18:04,938 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201204101957_0006_m_03_0 0.0%
2012-04-10 20:18:05,056 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201204101957_0006_m_03_0 0.667% Generated 1000 samples.

sort
2012-04-10 20:18:05,058 INFO org.apache.hadoop.mapred.TaskTracker: Task
attempt_201204101957_0006_m_03_0 is done.
2012-04-10 20:18:05,058 INFO org.apache.hadoop.mapred.TaskTracker: reported
output size for attempt_201204101957_0006_m_03_0 was 28
2012-04-10 20:18:05,058 INFO org.apache.hadoop.mapred.TaskTracker:
addFreeSlot : current free slots : 2
2012-04-10 20:18:05,213 INFO org.apache.hadoop.mapreduce.util.ProcessTree:
Sending signal to all members of process group -23030: SIGTERM. Exit code 1
2012-04-10 20:18:08,478 INFO org.apache.hadoop.mapred.TaskTracker: Sent out
28 bytes to reduce 0 from map: attempt_201204101957_0006_m_03_0 given
28/24
2012-04-10 20:18:08,478 INFO org.apache.hadoop.mapred.TaskTracker: Shuffled
1maps (mapIds=attempt_201204101957_0006_m_03_0) to reduce 0 in 29s
2012-04-10 20:18:08,478 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 147.102.7.173:50060,
dest: 147.102.7.175:57289, maps: 1, op: MAPRED_SHUFFLE, reduceID: 0,
duration: 29
2012-04-10 20:18:10,217 INFO org.apache.hadoop.mapred.JvmManager: JVM :
jvm_201204101957_0006_m_377512887 exited with exit code 0. Number of tasks
it ran: 1

I suspect a network issue here, but I can ping and ssh with no problem.


Thank you in advance,

Nikos Stasinopoulos


Re: Map Reduce Job Help

2012-04-11 Thread Raj Vishwanathan
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ 




>
> From: hellooperator 
>To: core-u...@hadoop.apache.org 
>Sent: Wednesday, April 11, 2012 11:15 AM
>Subject: Map Reduce Job Help
> 
>
>Hello,
>
>I'm just starting out with Hadoop and writing some Map Reduce jobs.  I'm
>looking for help writing an MR job in Python that takes some emails and
>puts them into HDFS so I can search the text or attachments of the
>emails.
>
>Thank you!
>-- 
>View this message in context: 
>http://old.nabble.com/Map-Reduce-Job-Help-tp33670645p33670645.html
>Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>
>
>

Map Reduce Job Help

2012-04-11 Thread hellooperator

Hello,

I'm just starting out with Hadoop and writing some Map Reduce jobs.  I'm
looking for help writing an MR job in Python that takes some emails and
puts them into HDFS so I can search the text or attachments of the
emails.

Thank you!
-- 
View this message in context: 
http://old.nabble.com/Map-Reduce-Job-Help-tp33670645p33670645.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: How do I include the newer version of Commons-lang in my jar?

2012-04-11 Thread John George
Have you tried setting 'mapreduce.user.classpath.first'? It allows user
jars to be put in the classpath before hadoop jars.
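For what it's worth, setting this from a job driver might look like the sketch below. This is untested here and the class name is hypothetical; note also that the exact property name differs between versions (both 'mapreduce.user.classpath.first' and 'mapreduce.job.user.classpath.first' appear in these threads, per MAPREDUCE-1938 and MAPREDUCE-3696):

```java
// Sketch only: assumes a 0.23.x/2.x-era MapReduce driver and a cluster
// whose version honours the property.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitWithUserClasspathFirst {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask the framework to place user jars ahead of the Hadoop jars
        // on the task classpath.
        conf.setBoolean("mapreduce.job.user.classpath.first", true);
        Job job = Job.getInstance(conf, "user-classpath-first-demo");
        job.setJarByClass(SubmitWithUserClasspathFirst.class);
        // ... set mapper/reducer and input/output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```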

-Original Message-
From: Sky USC 
Reply-To: "common-user@hadoop.apache.org" 
Date: Mon, 9 Apr 2012 15:46:52 -0500
To: "common-user@hadoop.apache.org" 
Subject: RE: How do I include the newer version of Commons-lang in my jar?

>
>
>
>
>Thanks for the reply. I appreciate your helpfulness. I created the jars by
>following the instructions at
>"http://blog.mafr.de/2010/07/24/maven-hadoop-job/". So external jars are
>stored in the lib/ folder within the jar.
>
>Am I summarizing this correctly:
>1. If hadoop version = 0.20.203 or lower - then it is not possible
>for me to use an external jar such as "commons-lang" from Apache in my
>application. Any external jars packaged within my jar under the "lib"
>directory are not picked up. This seems like a huge limitation to me.
>2. If hadoop version >  0.20.204 to 1.0.x - then setting the
>"HADOOP_USER_CLASSPATH_FIRST=true" environment variable before launching
>"hadoop jar" might help. I tried this for version 0.20.205 but it didn't
>work. 
>3. If hadoop version > 2.x or formerly 0.23.x - then this can be set via
>API?
>
>Is there a working, testable jar with these dependencies that I can try,
>to figure out whether it's my way of packaging the jar or something
>else?
>
>Thx
>
>> From: ha...@cloudera.com
>> Date: Mon, 9 Apr 2012 13:50:37 +0530
>> Subject: Re: How do I include the newer version of Commons-lang in my
>>jar?
>> To: common-user@hadoop.apache.org
>> 
>> Answer is a bit messy.
>> 
>> Perhaps you can set the environment variable "export
>> HADOOP_USER_CLASSPATH_FIRST=true" before you do a "hadoop jar …" to
>> launch your job. However, although this approach is present in
>> 0.20.204+ (0.20.205, and 1.0.x), am not sure if it makes an impact on
>> the tasks as well. I don't see it changing anything but for the driver
>> CP. I've not tested it - please let us know if it works in your
>> environment.
>> 
>> In higher versions (2.x or formerly 0.23.x), this is doable from
>> within your job if you set "mapreduce.job.user.classpath.first" to
>> true inside your job, and ship your replacement jars along.
>> 
>> Some versions would also let you set this via
>> "JobConf/Job.setUserClassesTakesPrecedence(true/false)" API calls.
>> 
>> On Mon, Apr 9, 2012 at 11:14 AM, Sky  wrote:
>> > Hi.
>> >
>> > I am new to Hadoop and I am working on project on AWS Elastic
>>MapReduce.
>> >
>> > The problem I am facing is:
>> > * org.apache.commons.lang.time.DateUtils: parseDate() works OK but
>> > parseDateStrictly() fails.
>> > I think parseDateStrictly might be new in lang 2.5. I thought I
>>included all
>> > dependencies. However, for some reason, during runtime, my app is not
>> > picking up the newer commons-lang.
>> >
>> > Would love some help.
>> >
>> > Thx
>> > - sky
>> >
>> >
>> 
>> 
>> 
>> -- 
>> Harsh J
>
> 



Re: Hadoopp_ClassPath issue.

2012-04-11 Thread John George
Dharin,
I believe the properties you are looking for are the following:
HADOOP_USER_CLASSPATH_FIRST: When defined, this will put the user-suggested
classpath at the beginning of the global classpath. So, you would
have to do something like 'export HADOOP_USER_CLASSPATH_FIRST=true'. If
you are on 2.0 (or 0.23), please refer to bin/hadoop-config.sh for more
information. If you are on 1.0 (or 0.20), refer to the hadoop script.

Now, if you want to run an M/R job by passing your own jar and you want
that jar to be used first, you want to set the config parameter
'mapreduce.job.user.classpath.first', and then the user-provided jar will
be put before $HADOOP_CLASSPATH.
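The "first entry wins" ordering described here can be illustrated with a small, Hadoop-free Java sketch (all paths and class names below are hypothetical; the Map stands in for "which entry contains which class"):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClasspathOrderDemo {
    // Return the first classpath entry that "provides" the wanted class,
    // mimicking how the JVM resolves a class from an ordered classpath.
    static String resolve(List<String> classpath, Map<String, String> provides,
                          String wanted) {
        for (String entry : classpath) {
            if (wanted.equals(provides.get(entry))) {
                return entry;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical entries: both contain an Application class.
        Map<String, String> provides = new LinkedHashMap<>();
        provides.put("/home/user/app/www/WEB-INF/classes", "Application");
        provides.put("/jobcache/jars/MyJar.jar", "Application");

        // Default order: the hadoop-env.sh entry comes first and wins.
        List<String> defaultOrder = Arrays.asList(
                "/home/user/app/www/WEB-INF/classes", "/jobcache/jars/MyJar.jar");
        // With user-classpath-first semantics, the job jar is moved ahead.
        List<String> userFirst = Arrays.asList(
                "/jobcache/jars/MyJar.jar", "/home/user/app/www/WEB-INF/classes");

        System.out.println(resolve(defaultOrder, provides, "Application"));
        System.out.println(resolve(userFirst, provides, "Application"));
    }
}
```

Whichever entry appears first supplies the class, which is why moving the user jar ahead of the global entries changes which Application class the task sees.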

Hope this makes sense.

Also, these will work on 1.0 (or 0.20) and above.

Refer:
https://issues.apache.org/jira/browse/MAPREDUCE-3696 (for 2.0, 0.23)

https://issues.apache.org/jira/browse/MAPREDUCE-1938 (1.0, 0.20)


Thanks,
John George



-Original Message-
From: dmaniar 
Reply-To: "common-user@hadoop.apache.org" 
Date: Tue, 10 Apr 2012 21:09:10 -0700
To: "core-u...@hadoop.apache.org" 
Subject: Hadoopp_ClassPath issue.

>
>Hi,
>
>I am new to Hadoop and not very familiar with its internal workings. I had
>some questions about HADOOP_CLASSPATH.
>
>We are currently supposed to use a Hadoop cluster with 4 machines, and its
>HADOOP_CLASSPATH in hadoop-env.sh is as below.
>export
>HADOOP_CLASSPATH="/home/user/app/www/WEB-INF/classes:$HADOOP_CLASSPATH"
>
>Now my,
>/home/user/app/www/WEB-INF/classes has a class called Application.class
>
>From a remote machine I submit a map-reduce job to this cluster, with a
>jar
>called MyJar.jar. [This has an Application.class too, but with some
>modifications]
>
>When the TaskTracker spawns a child Java process for the Mapper the
>classpath I see is as below in that order,
>
>Lets say my hadoop is installed at: /home/user/hadoop/
>/home/user/hadoop/jar1,
>/home/user/hadoop/jar2,
>.
>.
>.
>/home/user/hadoop/jarN,
>/home/user/hadoop/lib/jar1,
>/home/user/hadoop/lib/jar2,
>/home/user/hadoop/lib/jarN,
>1. /home/user/app/www/WEB-INF/classes,
>2. ${mapred.local.dir}/taskTracker/{user}/jobcache/{jobid}/jars/Myjar.jar
>[note:- basically this has the modified class that I need to use for my
>Map-Reduce job]
>
>Well, it's clear from this classpath that I will end up using the
>Application.class from the classes folder, which gives me incorrect
>results.
>
>Now my question is: how do I reverse the order of 1 & 2?
>
>Some pointer that I found was,
>1) if MyJar.jar is not changing much, then I can put it in a shared
>location and modify my hadoop-env.sh to
>export
>HADOOP_CLASSPATH="/some/share/location/lib:/home/user/app/www/WEB-INF/clas
>ses:$HADOOP_CLASSPATH"
>
>2) get rid of /home/user/app/www/WEB-INF/classes, from my hadoop-env.sh
>
>3) is there any property that tells Hadoop to add my jar before the classpath?
>
>Any help is greatly appreciated.
>
>To Summarize,
>If I have HADOOP_CLASSPATH in hadoop-env.sh already set, then how do I add
>my application jar before this classpath?
>
>Also, I looked at DistributedCache.java [Hadoop src] and the code looks
>like:
>
>public static void addFileToClassPath(Path file, Configuration conf)
>   throws IOException {
>   String classpath = conf.get("mapred.job.classpath.files");
>   conf.set("mapred.job.classpath.files", classpath == null ? file
>   .toString() : classpath + System.getProperty("path.separator")
>   + file.toString());
>   .
>}
>
>basically, new files are added to the end of the existing classpath.
>
>
>Thanks,
>Dharin.
>
>
>
>
>-- 
>View this message in context:
>http://old.nabble.com/Hadoopp_ClassPath-issue.-tp33666009p33666009.html
>Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
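The append-only behaviour quoted above from DistributedCache.java can be mimicked with a Hadoop-free sketch of the same string logic (a plain Map stands in for a Hadoop Configuration; the jar paths are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class ClasspathAppendDemo {
    // Mimics DistributedCache.addFileToClassPath: each new file is joined
    // onto the END of the existing value, so earlier entries keep precedence.
    static void addFileToClassPath(Map<String, String> conf, String file) {
        String classpath = conf.get("mapred.job.classpath.files");
        conf.put("mapred.job.classpath.files",
                classpath == null
                        ? file
                        : classpath + System.getProperty("path.separator") + file);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        addFileToClassPath(conf, "/jobcache/jars/MyJar.jar");
        addFileToClassPath(conf, "/jobcache/jars/Other.jar");
        // On Linux the separator is ':', so MyJar.jar stays ahead of
        // Other.jar but behind anything already in the property.
        System.out.println(conf.get("mapred.job.classpath.files"));
    }
}
```

This is why appending alone cannot fix Dharin's problem: whatever hadoop-env.sh already placed on the classpath still resolves first.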




Testing Map reduce code

2012-04-11 Thread madhu phatak
Hi,
 I am working on a Hadoop project where I want an automated build to run
M/R test cases on a real Hadoop cluster. As of now it seems we can only
unit test M/R through MiniDFSCluster/MiniMRCluster/MRUnit, and none of
these runs the test cases on a real Hadoop cluster. Is there any other
framework, or any other way, to make the test cases run on a Hadoop
cluster?

Thanks in Advance

-- 
https://github.com/zinnia-phatak-dev/Nectar