Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster

2011-07-01 Thread Yaozhen Pan
Narayanan,

Regarding the client installation, you should make sure that the client and
the server use the same Hadoop version for submitting jobs and transferring data.
If you use a different user on the client than the one that runs the Hadoop jobs,
configure the Hadoop ugi property (sorry, I forget the exact name).
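
A minimal client-side sketch of the above, assuming a 0.20-era (pre-security)
release; "hadoop.job.ugi" is only a guess at the property referred to, and the
user/group values are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ClientIdentityCheck {
  public static void main(String[] args) throws Exception {
    // Client-side Configuration; the hadoop-site.xml on the client's classpath
    // should already point at the cluster (fs.default.name, mapred.job.tracker).
    Configuration conf = new Configuration();

    // Assumed property name for pre-security releases: act as this user/group on
    // the cluster even though the local OS user differs. Verify for your version.
    conf.set("hadoop.job.ugi", "hadoopuser,hadoopgroup");

    FileSystem fs = FileSystem.get(conf);
    System.out.println("Connected; home directory = " + fs.getHomeDirectory());
    fs.close();
  }
}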

On 2011-7-1 15:28, "Narayanan K" wrote:
> Hi Harsh
>
> Thanks for the quick response...
>
> Have a few clarifications regarding the 1st point :
>
> Let me tell the background first..
>
> We have actually set up a Hadoop cluster with HBase installed. We are
> planning to load HBase with data, perform some computations with the data,
> and present the data in a report format.
> The report should be accessible from outside the cluster, and the report
> accepts certain parameters to show data; it will in turn pass these
> parameters to the Hadoop master server, where a mapreduce job will be run
> that queries HBase to retrieve the data.
>
> So the report will be run from a different machine outside the cluster. So
> we need a way to pass the parameters to the Hadoop cluster (master) and
> initiate a mapreduce job dynamically. Similarly, the output of the mapreduce
> job needs to be tunneled back to the machine from where the report was run.
>
> Some more clarification I need: does the machine (outside of the cluster)
> which runs the report require something like a client installation which
> will talk with the Hadoop master server via TCP? Or can it run a job
> on the Hadoop server by using passwordless scp to the master machine or
> something of the like?
>
>
> Regards,
> Narayanan
>
>
>
>
> On Fri, Jul 1, 2011 at 11:41 AM, Harsh J  wrote:
>
>> Narayanan,
>>
>>
>> On Fri, Jul 1, 2011 at 11:28 AM, Narayanan K 
>> wrote:
>> > Hi all,
>> >
>> > We are basically working on a research project and I require some help
>> > regarding this.
>>
>> Always glad to see research work being done! What're you working on? :)
>>
>> > How do I submit a mapreduce job from outside the cluster, i.e. from a
>> > different machine outside the Hadoop cluster?
>>
>> If you use Java APIs, use the Job#submit(…) method and/or
>> JobClient.runJob(…) method.
>> Basically Hadoop will try to create a jar with all requisite classes
>> within and will push it out to the JobTracker's filesystem (HDFS, if
>> you run HDFS). From there on, it's like a regular operation.
>>
>> This even happens on the Hadoop nodes themselves, so doing so from an
>> external place, as long as that place has access to Hadoop's JT and
>> HDFS, should be no different at all.
>>
>> If you are packing custom libraries along, don't forget to use
>> DistributedCache. If you are packing custom MR Java code, don't forget
>> to use Job#setJarByClass/JobClient#setJarByClass and other appropriate
>> API methods.
>>
>> > If the above can be done, How can I schedule map reduce jobs to run in
>> > hadoop like crontab from a different machine?
>> > Are there any webservice APIs that I can leverage to access a hadoop
>> cluster
>> > from outside and submit jobs or read/write data from HDFS.
>>
>> For scheduling jobs, have a look at Oozie: http://yahoo.github.com/oozie/
>> It is well supported and is very useful in writing MR workflows (which
>> is a common requirement). You also get coordinator features and can
>> schedule jobs with crontab-like functionality.
>>
>> For HDFS r/w over the web, I'm not sure of an existing web app specifically
>> for this purpose without limitations, but there is a contrib/thriftfs
>> you can leverage (if not writing your own webserver in Java, in
>> which case it's as simple as using the HDFS APIs).
>>
>> Also have a look at the pretty mature Hue project which aims to
>> provide a great frontend that lets you design jobs, submit jobs,
>> monitor jobs and upload files or browse the filesystem (among several
>> other things): http://cloudera.github.com/hue/
>>
>> --
>> Harsh J
>>
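
A minimal sketch of the Job#submit / setJarByClass / DistributedCache advice
quoted above, as a driver run from a machine outside the cluster. The hostnames,
ports, library path, and the identity Mapper/Reducer stand-ins are placeholders,
not details from the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoteSubmitter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder addresses: point the client at the cluster's NameNode and JobTracker.
    conf.set("fs.default.name", "hdfs://master.example.com:9000");
    conf.set("mapred.job.tracker", "master.example.com:9001");

    Job job = new Job(conf, "report-query");   // 0.20.x-style constructor
    job.setJarByClass(RemoteSubmitter.class);  // ships the client-built jar to the cluster

    // Identity Mapper/Reducer as stand-ins for the real classes; with the default
    // TextInputFormat they pass (LongWritable, Text) records straight through.
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Extra libraries can be shipped via the DistributedCache; this jar path is a
    // placeholder and must already exist on the cluster's HDFS.
    DistributedCache.addFileToClassPath(new Path("/libs/extra-lib.jar"),
                                        job.getConfiguration());

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.submit();  // returns immediately; job.waitForCompletion(true) would block instead
    System.out.println("Job submitted.");
  }
}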


Re: mapred.tasktracker.map.tasks.maximum is not taking into effect

2011-07-01 Thread Mostafa Gaber
If your datanode has 2 HDFS-chunks (blocks) of the input file, the scheduler
will first prefer to run 2 map tasks on the tasktracker where this datanode
resides.

On Fri, Jul 1, 2011 at 10:33 PM, Juwei Shi  wrote:

> I think that Anthony is right. Task capacity has to be set in
> mapred-site.xml, and the cluster restarted.
>
> Anthony Urso
>
>
>
> 2011/7/2 
>
>> Are you sure? AFAIK all mapred.xxx properties can be set via the job config. I
>> also read in the Yahoo tutorial that this property can be set either in
>> hadoop-site.xml or in the job config. Maybe someone who has really used this
>> property can confirm it.
>>
>> Praveen
>>
>> On Jul 1, 2011, at 4:46 PM, "ext Anthony Urso" 
>> wrote:
>>
>> > On Fri, Jul 1, 2011 at 1:03 PM,   wrote:
>> >> Hi all,
>> >>
>> >> I am using hadoop 0.20.2. I am setting the property
>> >> mapred.tasktracker.map.tasks.maximum = 4 (same for reduce also) on my
>> job
>> >> conf but I am still seeing max of only 2 map and reduce tasks on each
>> node.
>> >> I know my machine can run 4 maps and 4 reduce tasks in parallel. Is
>> this a
>> >> bug in 0.20.2 or am I doing something wrong?
>> >>
>> >>
>> >
>> > If I remember correctly, you have to set this in your hadoop-site.xml
>> > and restart your job tracker and task trackers.
>> >
>> >>
>> >> Configuration conf = new Configuration();
>> >>
>> >>   conf.set("mapred.tasktracker.map.tasks.maximum", "4");
>> >>
>> >>   conf.set("mapred.tasktracker.reduce.tasks.maximum", "4");
>> >>
>> >>
>> >>
>> >> Thanks
>> >>
>> >> Praveen
>>
>
>
>
> --
> - Juwei
>



-- 
Best Regards,
Mostafa Ead


Re: mapred.tasktracker.map.tasks.maximum is not taking into effect

2011-07-01 Thread Juwei Shi
I think that Anthony is right. Task capacity has to be set in
mapred-site.xml, and the cluster restarted.

Anthony Urso



2011/7/2 

> Are you sure? AFAIK all mapred.xxx properties can be set via the job config. I
> also read in the Yahoo tutorial that this property can be set either in
> hadoop-site.xml or in the job config. Maybe someone who has really used this
> property can confirm it.
>
> Praveen
>
> On Jul 1, 2011, at 4:46 PM, "ext Anthony Urso" 
> wrote:
>
> > On Fri, Jul 1, 2011 at 1:03 PM,   wrote:
> >> Hi all,
> >>
> >> I am using hadoop 0.20.2. I am setting the property
> >> mapred.tasktracker.map.tasks.maximum = 4 (same for reduce also) on my
> job
> >> conf but I am still seeing max of only 2 map and reduce tasks on each
> node.
> >> I know my machine can run 4 maps and 4 reduce tasks in parallel. Is this
> a
> >> bug in 0.20.2 or am I doing something wrong?
> >>
> >>
> >
> > If I remember correctly, you have to set this in your hadoop-site.xml
> > and restart your job tracker and task trackers.
> >
> >>
> >> Configuration conf = new Configuration();
> >>
> >>   conf.set("mapred.tasktracker.map.tasks.maximum", "4");
> >>
> >>   conf.set("mapred.tasktracker.reduce.tasks.maximum", "4");
> >>
> >>
> >>
> >> Thanks
> >>
> >> Praveen
>



-- 
- Juwei


Re: mapred.tasktracker.map.tasks.maximum is not taking into effect

2011-07-01 Thread Joey Echeverria
This property applies to a tasktracker rather than an individual job.
Therefore it needs to be set in mapred-site.xml and the daemon
restarted.

-Joey
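
For reference, a sketch of the corresponding mapred-site.xml entries (the values
are examples; the file lives on every tasktracker node and the tasktrackers must
be restarted for the change to take effect):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
</configuration>
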
On Jul 1, 2011 7:01 PM,  wrote:
> Are you sure? AFAIK all mapred.xxx properties can be set via the job config. I
> also read in the Yahoo tutorial that this property can be set either in
> hadoop-site.xml or in the job config. Maybe someone who has really used this
> property can confirm it.
>
> Praveen
>
> On Jul 1, 2011, at 4:46 PM, "ext Anthony Urso" 
wrote:
>
>> On Fri, Jul 1, 2011 at 1:03 PM,  wrote:
>>> Hi all,
>>>
>>> I am using hadoop 0.20.2. I am setting the property
>>> mapred.tasktracker.map.tasks.maximum = 4 (same for reduce also) on my
job
>>> conf but I am still seeing max of only 2 map and reduce tasks on each
node.
>>> I know my machine can run 4 maps and 4 reduce tasks in parallel. Is this
a
>>> bug in 0.20.2 or am I doing something wrong?
>>>
>>>
>>
>> If I remember correctly, you have to set this in your hadoop-site.xml
>> and restart your job tracker and task trackers.
>>
>>>
>>> Configuration conf = new Configuration();
>>>
>>> conf.set("mapred.tasktracker.map.tasks.maximum", "4");
>>>
>>> conf.set("mapred.tasktracker.reduce.tasks.maximum", "4");
>>>
>>>
>>>
>>> Thanks
>>>
>>> Praveen


Re: mapred.tasktracker.map.tasks.maximum is not taking into effect

2011-07-01 Thread praveen.peddi
Are you sure? AFAIK all mapred.xxx properties can be set via the job config. I also
read in the Yahoo tutorial that this property can be set either in hadoop-site.xml
or in the job config. Maybe someone who has really used this property can
confirm it.

Praveen

On Jul 1, 2011, at 4:46 PM, "ext Anthony Urso"  wrote:

> On Fri, Jul 1, 2011 at 1:03 PM,   wrote:
>> Hi all,
>> 
>> I am using hadoop 0.20.2. I am setting the property
>> mapred.tasktracker.map.tasks.maximum = 4 (same for reduce also) on my job
>> conf but I am still seeing max of only 2 map and reduce tasks on each node.
>> I know my machine can run 4 maps and 4 reduce tasks in parallel. Is this a
>> bug in 0.20.2 or am I doing something wrong?
>> 
>> 
> 
> If I remember correctly, you have to set this in your hadoop-site.xml
> and restart your job tracker and task trackers.
> 
>> 
>> Configuration conf = new Configuration();
>> 
>>   conf.set("mapred.tasktracker.map.tasks.maximum", "4");
>> 
>>   conf.set("mapred.tasktracker.reduce.tasks.maximum", "4");
>> 
>> 
>> 
>> Thanks
>> 
>> Praveen


Re: hadoop job is run slow in multicluster configuration

2011-07-01 Thread Laurent Hatier
Check your /etc/hosts. I've had this problem, and I changed the 127.0.1.1 or
127.0.0.1 entry to the real IP of the machine. Just try it :)
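
For example (the hostname and address below are made up), the /etc/hosts entry on
each node would change roughly like this:

# before
127.0.1.1    hadoop-node1
# after: use the machine's real LAN address
192.168.1.11 hadoop-node1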

2011/7/1 Devaraj K 

>  Can you check the logs on the tasktracker machine to see what is happening
> with the task execution and the status of the task?
>
>
>
> Devaraj K 
>
>
> -
> This e-mail and its attachments contain confidential information from
> HUAWEI, which
> is intended only for the person or entity whose address is listed above.
> Any use of the
> information contained herein in any way (including, but not limited to,
> total or partial
> disclosure, reproduction, or dissemination) by persons other than the
> intended
> recipient(s) is prohibited. If you receive this e-mail in error, please
> notify the sender by
> phone or email immediately and delete it!
>
>  
>   --
>
> *From:* ranjith k [mailto:ranjith...@gmail.com]
> *Sent:* Friday, July 01, 2011 1:39 PM
> *To:* mapreduce-user@hadoop.apache.org
> *Subject:* hadoop job is run slow in multicluster configuration
>
>
>
> hello,
>   My mapreduce program is slow in a multi-cluster configuration. The reduce
> task is stuck at 16%. But the same program runs much faster in
> pseudo-distributed mode (single node). What can I do? I have only two machines.
> Please help me.
>
> --
> Ranjith k
>



-- 
Laurent HATIER
Second-year student of the Cycle Ingénieur at EISTI


Re: mapred.tasktracker.map.tasks.maximum is not taking into effect

2011-07-01 Thread Anthony Urso
On Fri, Jul 1, 2011 at 1:03 PM,   wrote:
> Hi all,
>
> I am using hadoop 0.20.2. I am setting the property
> mapred.tasktracker.map.tasks.maximum = 4 (same for reduce also) on my job
> conf but I am still seeing max of only 2 map and reduce tasks on each node.
> I know my machine can run 4 maps and 4 reduce tasks in parallel. Is this a
> bug in 0.20.2 or am I doing something wrong?
>
>

If I remember correctly, you have to set this in your hadoop-site.xml
and restart your job tracker and task trackers.

>
> Configuration conf = new Configuration();
>
>   conf.set("mapred.tasktracker.map.tasks.maximum", "4");
>
>   conf.set("mapred.tasktracker.reduce.tasks.maximum", "4");
>
>
>
> Thanks
>
> Praveen


mapred.tasktracker.map.tasks.maximum is not taking into effect

2011-07-01 Thread praveen.peddi
Hi all,
I am using hadoop 0.20.2. I am setting the property 
mapred.tasktracker.map.tasks.maximum = 4 (same for reduce also) on my job conf 
but I am still seeing max of only 2 map and reduce tasks on each node. I know 
my machine can run 4 maps and 4 reduce tasks in parallel. Is this a bug in 
0.20.2 or am I doing something wrong?

Configuration conf = new Configuration();
  conf.set("mapred.tasktracker.map.tasks.maximum", "4");
  conf.set("mapred.tasktracker.reduce.tasks.maximum", "4");

Thanks
Praveen


Re: Jobs are still in running state after executing "hadoop job -kill jobId"

2011-07-01 Thread Juwei Shi
Harsh,

It works. Thanks a lot!!!

2011/7/2 Harsh J 

> Juwei,
>
> It's odd that a killed job should get "recovered" back into a running
> state. Can you not simply disable the JT recovery feature (I believe
> it's turned off by default)?
>
> On Fri, Jul 1, 2011 at 10:47 PM, Juwei Shi  wrote:
> > Thanks Harsh.
> >
> > The jobs get "recovered" after I reboot mapreduce/hdfs.
> >
> > Is there any other way to delete the status records of the running jobs,
> > so that they will not be recovered after restarting the JT?
> >
> > 2011/7/2 Harsh J 
> >>
> >> Juwei,
> >>
> >> Please do not cross-post to multiple lists. I believe this question
> >> suits the mapreduce-user@ list, so I am replying only there.
> >>
> >> On Fri, Jul 1, 2011 at 9:22 PM, Juwei Shi  wrote:
> >> > Hi,
> >> >
> >> > I faced a problem that the jobs are still running after executing
> >> > "hadoop
> >> > job -kill jobId". I rebooted the cluster but the job still can not be
> >> > killed.
> >>
> >> What do the JT logs say after you attempt to kill a job ID? Does the
> >> same job ID keep running even after, or are you seeing other jobs
> >> continue to launch?
> >>
> >> --
> >> Harsh J
> >
> > --
> > - Juwei
> >
>
> --
> Harsh J
>

-- 
- Juwei


Re: Jobs are still in running state after executing "hadoop job -kill jobId"

2011-07-01 Thread Harsh J
Juwei,

It's odd that a killed job should get "recovered" back into a running
state. Can you not simply disable the JT recovery feature (I believe
it's turned off by default)?
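
If the recovery feature is the culprit, the switch that I believe controls it in
the 0.20.x line is mapred.jobtracker.restart.recover in mapred-site.xml (treat
the property name as an assumption and verify it against your release):

<?xml version="1.0"?>
<configuration>
  <!-- Assumed property name; disables recovering jobs when the JobTracker restarts. -->
  <property>
    <name>mapred.jobtracker.restart.recover</name>
    <value>false</value>
  </property>
</configuration>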

On Fri, Jul 1, 2011 at 10:47 PM, Juwei Shi  wrote:
> Thanks Harsh.
>
> The jobs get "recovered" after I reboot mapreduce/hdfs.
>
> Is there any other way to delete the status records of the running jobs,
> so that they will not be recovered after restarting the JT?
>
> 2011/7/2 Harsh J 
>>
>> Juwei,
>>
>> Please do not cross-post to multiple lists. I believe this question
>> suits the mapreduce-user@ list, so I am replying only there.
>>
>> On Fri, Jul 1, 2011 at 9:22 PM, Juwei Shi  wrote:
>> > Hi,
>> >
>> > I faced a problem that the jobs are still running after executing
>> > "hadoop
>> > job -kill jobId". I rebooted the cluster but the job still can not be
>> > killed.
>>
>> What do the JT logs say after you attempt to kill a job ID? Does the
>> same job ID keep running even after, or are you seeing other jobs
>> continue to launch?
>>
>> --
>> Harsh J
>
> --
> - Juwei
>



-- 
Harsh J


Re: Jobs are still in running state after executing "hadoop job -kill jobId"

2011-07-01 Thread Juwei Shi
Thanks Harsh.

The jobs get "recovered" after I reboot mapreduce/hdfs.

Is there any other way to delete the status records of the running jobs,
so that they will not be recovered after restarting the JT?

2011/7/2 Harsh J 

> Juwei,
>
> Please do not cross-post to multiple lists. I believe this question
> suits the mapreduce-user@ list, so I am replying only there.
>
> On Fri, Jul 1, 2011 at 9:22 PM, Juwei Shi  wrote:
> > Hi,
> >
> > I faced a problem that the jobs are still running after executing "hadoop
> > job -kill jobId". I rebooted the cluster but the job still can not be
> > killed.
>
> What do the JT logs say after you attempt to kill a job ID? Does the
> same job ID keep running even after, or are you seeing other jobs
> continue to launch?
>
> --
> Harsh J
>

-- 
- Juwei


Re: Jobs are still in running state after executing "hadoop job -kill jobId"

2011-07-01 Thread Harsh J
Juwei,

Please do not cross-post to multiple lists. I believe this question
suits the mapreduce-user@ list, so I am replying only there.

On Fri, Jul 1, 2011 at 9:22 PM, Juwei Shi  wrote:
> Hi,
>
> I faced a problem that the jobs are still running after executing "hadoop
> job -kill jobId". I rebooted the cluster but the job still can not be
> killed.

What do the JT logs say after you attempt to kill a job ID? Does the
same job ID keep running even after, or are you seeing other jobs
continue to launch?

-- 
Harsh J


Jobs are still in running state after executing "hadoop job -kill jobId"

2011-07-01 Thread Juwei Shi
Hi,

I faced a problem that the jobs are still running after executing "hadoop
job -kill jobId". I rebooted the cluster but the job still can not be
killed.

The hadoop version is 0.20.2.

Any idea?

Thanks in advance!

-- 
- Juwei


Re: Relation between Mapper and Combiner

2011-07-01 Thread Lucian Iordache
Ok, that is what I wanted to know.

Thank you!

Best Regards,
Lucian

On Fri, Jul 1, 2011 at 2:47 PM, Devaraj K  wrote:

>  Hi Lucian,
>
> For every map task, the combiner may be executed multiple times
> before the map output is written. The combine step is not a separate task; it is
> part of the map task execution. The reducer will copy the output of the map task,
> which has been reduced by the combiner.
>
> >For example:
> >If I have *2 map tasks* ran on the same machine, will I have *1 combine
> >task* on that machine to combine the maps outputs, *or 2 combine tasks*?
>
> In this case, the combiner will be executed for each map task, independently of
> the others. This combiner step may execute multiple times, until one or more runs
> produce the same output.
>
> You can go through the Combiner section here for more info:
> http://wiki.apache.org/hadoop/HadoopMapReduce
>
> Devaraj K 
>
>
> -
> This e-mail and its attachments contain confidential information from
> HUAWEI, which
> is intended only for the person or entity whose address is listed above.
> Any use of the
> information contained herein in any way (including, but not limited to,
> total or partial
> disclosure, reproduction, or dissemination) by persons other than the
> intended
> recipient(s) is prohibited. If you receive this e-mail in error, please
> notify the sender by
> phone or email immediately and delete it!
>
>  
>   --
>
> *From:* Lucian Iordache [mailto:lucian.george.iorda...@gmail.com]
> *Sent:* Friday, July 01, 2011 2:25 PM
> *To:* mapreduce-user@hadoop.apache.org
> *Subject:* Relation between Mapper and Combiner
>
>
>
> Hello guys,
>
> Can anybody tell me what the relation is between map tasks and combine
> tasks?
> I would like to know if there is a 1:1 relation between them, or a *:1
> (many-to-one) relation.
>
> For example:
> If I have *2 map tasks* ran on the same machine, will I have *1 combine
> task* on that machine to combine the maps outputs, *or 2 combine tasks*?
>
> Best Regards,
> --
> Lucian
>


RE: hadoop job is run slow in multicluster configuration

2011-07-01 Thread Devaraj K
Can you check the logs on the tasktracker machine to see what is happening with
the task execution and the status of the task?

 

Devaraj K 


-
This e-mail and its attachments contain confidential information from
HUAWEI, which 
is intended only for the person or entity whose address is listed above. Any
use of the 
information contained herein in any way (including, but not limited to,
total or partial 
disclosure, reproduction, or dissemination) by persons other than the
intended 
recipient(s) is prohibited. If you receive this e-mail in error, please
notify the sender by 
phone or email immediately and delete it!

 

  _  

From: ranjith k [mailto:ranjith...@gmail.com] 
Sent: Friday, July 01, 2011 1:39 PM
To: mapreduce-user@hadoop.apache.org
Subject: hadoop job is run slow in multicluster configuration

 

hello,
  My mapreduce program is slow in a multi-cluster configuration. The reduce
task is stuck at 16%. But the same program runs much faster in
pseudo-distributed mode (single node). What can I do? I have only two machines.
Please help me.

-- 
Ranjith k



RE: Relation between Mapper and Combiner

2011-07-01 Thread Devaraj K
Hi Lucian,

 

For every map task, the combiner may be executed multiple times
before the map output is written. The combine step is not a separate task; it is
part of the map task execution. The reducer will copy the output of the map task,
which has been reduced by the combiner.

 

>For example:
>If I have 2 map tasks ran on the same machine, will I have 1 combine task
>on that machine to combine the maps outputs, or 2 combine tasks?

 

In this case, the combiner will be executed for each map task, independently of
the others. This combiner step may execute multiple times, until one or more runs
produce the same output.



 

You can go through Combiner section here for more info :
http://wiki.apache.org/hadoop/HadoopMapReduce
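
To make the wiring concrete, a minimal word-count style sketch (not from the
original mail) in which the same summing reducer is reused as the combiner;
reusing it is safe here because summing is associative and commutative, so the
framework may run it zero, one, or several times per map task without changing
the final result:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinerExample {

  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1) for every token
      }
    }
  }

  // Used as both combiner and reducer: the combiner runs inside each map task on
  // that task's output, the reducer runs on the merged output of all map tasks.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "combiner-example");  // 0.20.x-style Job constructor
    job.setJarByClass(CombinerExample.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);       // per-map-task, pre-shuffle aggregation
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}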

 

Devaraj K 


-
This e-mail and its attachments contain confidential information from
HUAWEI, which 
is intended only for the person or entity whose address is listed above. Any
use of the 
information contained herein in any way (including, but not limited to,
total or partial 
disclosure, reproduction, or dissemination) by persons other than the
intended 
recipient(s) is prohibited. If you receive this e-mail in error, please
notify the sender by 
phone or email immediately and delete it!

 

  _  

From: Lucian Iordache [mailto:lucian.george.iorda...@gmail.com] 
Sent: Friday, July 01, 2011 2:25 PM
To: mapreduce-user@hadoop.apache.org
Subject: Relation between Mapper and Combiner

 

Hello guys,

Can anybody tell me what the relation is between map tasks and combine
tasks?
I would like to know if there is a 1:1 relation between them, or a *:1
(many-to-one) relation.

For example:
If I have 2 map tasks ran on the same machine, will I have 1 combine task on
that machine to combine the maps outputs, or 2 combine tasks?

Best Regards,
-- 
Lucian



Relation between Mapper and Combiner

2011-07-01 Thread Lucian Iordache
Hello guys,

Can anybody tell me what the relation is between map tasks and combine
tasks?
I would like to know if there is a 1:1 relation between them, or a *:1
(many-to-one) relation.

For example:
If I have *2 map tasks* ran on the same machine, will I have *1 combine task
* on that machine to combine the maps outputs, *or 2 combine tasks*?

Best Regards,
-- 
Lucian


hadoop job is run slow in multicluster configuration

2011-07-01 Thread ranjith k
hello,
  My mapreduce program is slow in a multi-cluster configuration. The reduce
task is stuck at 16%. But the same program runs much faster in
pseudo-distributed mode (single node). What can I do? I have only two machines.
Please help me.

-- 
Ranjith k


Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster

2011-07-01 Thread Harsh J
Narayanan,

On Fri, Jul 1, 2011 at 12:57 PM, Narayanan K  wrote:
> So the report will be run from a different machine outside the cluster. So
> we need a way to pass the parameters to the Hadoop cluster (master) and
> initiate a mapreduce job dynamically. Similarly, the output of the mapreduce
> job needs to be tunneled back to the machine from where the report was run.
>
> Some more clarification I need: does the machine (outside of the cluster)
> which runs the report require something like a client installation which
> will talk with the Hadoop master server via TCP? Or can it run a job
> on the Hadoop server by using passwordless scp to the master machine or
> something of the like?

Regular way is to let the client talk to your nodes over tcp ports.
This is what Hadoop's plain ol' submitter process does for you.

Have you tried running any simple "hadoop jar " from a
remote client machine?

If that works, so should invoking the same from your code (with
appropriate configurations set), because it's basically the plain ol'
runjar submission process either way.
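
As a concrete test of the above (the command is only an illustration, and assumes
the client's Hadoop conf directory already points at the cluster's NameNode and
JobTracker), one could run the bundled examples jar from the remote machine:

hadoop jar $HADOOP_HOME/hadoop-0.20.2-examples.jar pi 4 1000

If that submits and completes, programmatic submission from the same machine
should work as well.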

If not, maybe you need to think of opening ports to let things happen
(if there's a firewall here).

Hadoop does not use SSH/SCP to move code around. Please give this a
read if you believe you're confused about how SSH+Hadoop is integrated
(or not): http://wiki.apache.org/hadoop/FAQ#Does_Hadoop_require_SSH.3F

-- 
Harsh J


Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster

2011-07-01 Thread Narayanan K
Hi Harsh

Thanks for the quick response...

Have a few clarifications regarding the 1st point :

Let me tell the background first..

We have actually set up a Hadoop cluster with HBase installed. We are
planning to load HBase with data, perform some computations with the data,
and present the data in a report format.
The report should be accessible from outside the cluster, and the report
accepts certain parameters to show data; it will in turn pass these
parameters to the Hadoop master server, where a mapreduce job will be run
that queries HBase to retrieve the data.

So the report will be run from a different machine outside the cluster. So
we need a way to pass the parameters to the Hadoop cluster (master) and
initiate a mapreduce job dynamically. Similarly, the output of the mapreduce
job needs to be tunneled back to the machine from where the report was run.

Some more clarification I need: does the machine (outside of the cluster)
which runs the report require something like a client installation which
will talk with the Hadoop master server via TCP? Or can it run a job
on the Hadoop server by using passwordless scp to the master machine or
something of the like?


Regards,
Narayanan




On Fri, Jul 1, 2011 at 11:41 AM, Harsh J  wrote:

> Narayanan,
>
>
> On Fri, Jul 1, 2011 at 11:28 AM, Narayanan K 
> wrote:
> > Hi all,
> >
> > We are basically working on a research project and I require some help
> > regarding this.
>
> Always glad to see research work being done! What're you working on? :)
>
> > How do I submit a mapreduce job from outside the cluster, i.e. from a
> > different machine outside the Hadoop cluster?
>
> If you use Java APIs, use the Job#submit(…) method and/or
> JobClient.runJob(…) method.
> Basically Hadoop will try to create a jar with all requisite classes
> within and will push it out to the JobTracker's filesystem (HDFS, if
> you run HDFS). From there on, it's like a regular operation.
>
> This even happens on the Hadoop nodes themselves, so doing so from an
> external place, as long as that place has access to Hadoop's JT and
> HDFS, should be no different at all.
>
> If you are packing custom libraries along, don't forget to use
> DistributedCache. If you are packing custom MR Java code, don't forget
> to use Job#setJarByClass/JobClient#setJarByClass and other appropriate
> API methods.
>
> > If the above can be done, How can I schedule map reduce jobs to run in
> > hadoop like crontab from a different machine?
> > Are there any webservice APIs that I can leverage to access a hadoop
> cluster
> > from outside and submit jobs or read/write data from HDFS.
>
> For scheduling jobs, have a look at Oozie: http://yahoo.github.com/oozie/
> It is well supported and is very useful in writing MR workflows (which
> is a common requirement). You also get coordinator features and can
> schedule jobs with crontab-like functionality.
>
> For HDFS r/w over the web, I'm not sure of an existing web app specifically
> for this purpose without limitations, but there is a contrib/thriftfs
> you can leverage (if not writing your own webserver in Java, in
> which case it's as simple as using the HDFS APIs).
>
> Also have a look at the pretty mature Hue project which aims to
> provide a great frontend that lets you design jobs, submit jobs,
> monitor jobs and upload files or browse the filesystem (among several
> other things): http://cloudera.github.com/hue/
>
> --
> Harsh J
>