Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster

2011-07-01 Thread Harsh J
Narayanan,


On Fri, Jul 1, 2011 at 11:28 AM, Narayanan K knarayana...@gmail.com wrote:
 Hi all,

 We are basically working on a research project and I require some help
 regarding this.

Always glad to see research work being done! What're you working on? :)

 How do I submit a mapreduce job from outside the cluster, i.e. from a
 different machine outside the Hadoop cluster?

If you use Java APIs, use the Job#submit(…) method and/or
JobClient.runJob(…) method.
Basically Hadoop will try to create a jar with all requisite classes
within and will push it out to the JobTracker's filesystem (HDFS, if
you run HDFS). From there on, it's like a regular operation.

This even happens on the Hadoop nodes themselves, so doing it from an
external machine should be no different at all, as long as that machine
has access to Hadoop's JT and HDFS.

If you are packing custom libraries along, don't forget to use
DistributedCache. If you are packing custom MR Java code, don't forget
to use Job#setJarByClass/JobClient#setJarByClass and other appropriate
API methods.
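
To make that concrete, here is a rough sketch of a driver that could be run
from a machine outside the cluster, assuming the client can reach the JT and
NN ports. The host names, ports and paths are placeholders, and the
word-count mapper/reducer is only there to keep the example self-contained;
adjust everything to your own setup:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoteWordCount {

  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the client at the remote cluster (placeholder host names/ports).
    conf.set("fs.default.name", "hdfs://namenode-host:8020");
    conf.set("mapred.job.tracker", "jobtracker-host:8021");

    Job job = new Job(conf, "remote-wordcount");
    // Ships the jar containing these classes to the cluster, as described above.
    job.setJarByClass(RemoteWordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // waitForCompletion blocks until the job finishes;
    // Job#submit() would return immediately instead.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}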

 If the above can be done, how can I schedule map reduce jobs to run in
 hadoop like crontab from a different machine?
 Are there any webservice APIs that I can leverage to access a hadoop cluster
 from outside and submit jobs or read/write data from HDFS?

For scheduling jobs, have a look at Oozie: http://yahoo.github.com/oozie/
It is well supported and is very useful in writing MR workflows (which
is a common requirement). You also get coordinator features, which let
you schedule jobs much like crontab does.

For HDFS r/w over the web, I'm not sure of an existing web app built
specifically for this purpose without limitations, but there is a
contrib/thriftfs you can leverage (if you're not writing your own
webserver in Java, in which case it's as simple as using the HDFS APIs).
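
If you do go the "write your own webserver / use the HDFS APIs" route, a
bare-bones sketch of remote HDFS read/write via the FileSystem API could
look like the following (the namenode URI and paths are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode-host:8020"); // placeholder

    FileSystem fs = FileSystem.get(conf);

    // Write a small file into HDFS (overwrite if it already exists).
    Path file = new Path("/user/someuser/example.txt");
    FSDataOutputStream out = fs.create(file, true);
    out.writeBytes("hello from outside the cluster\n");
    out.close();

    // Read it back.
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);
    }
    in.close();
    fs.close();
  }
}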

Also have a look at the pretty mature Hue project, which aims to
provide a great frontend that lets you design jobs, submit jobs,
monitor jobs, and upload files or browse the filesystem (among several
other things): http://cloudera.github.com/hue/

-- 
Harsh J


Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster

2011-07-01 Thread Harsh J
Narayanan,

On Fri, Jul 1, 2011 at 12:57 PM, Narayanan K knarayana...@gmail.com wrote:
 So the report will be run from a different machine outside the cluster. So
 we need a way to pass on the parameters to the hadoop cluster (master) and
 initiate a mapreduce job dynamically. Similarly, the output of the mapreduce
 job needs to be tunneled back to the machine from where the report was run.

 Some more clarification I need is: does the machine (outside of the cluster)
 which ran the report require something like a client installation which
 will talk with the Hadoop master server via TCP? Or can it run a job
 on the hadoop server by using passwordless scp to the master machine or
 something of the sort?

The regular way is to let the client talk to your nodes over TCP ports.
This is what Hadoop's plain ol' submitter process does for you.

Have you tried running a simple "hadoop jar <your jar>" from a
remote client machine?

If that works, so should invoking the same from your code (with
appropriate configurations set), because it's basically the plain ol'
runjar submission process either way.

If not, maybe you need to think about opening ports to let things
through (if there's a firewall in between).
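
On the "appropriate configurations" bit, one rough way to sanity-check a
client machine is to load the cluster's own *-site.xml files into a
Configuration and print the two key addresses (the file paths below are
placeholders for wherever you keep a copy of the cluster's config):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ClientConfigCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Reuse the cluster's config files, copied onto the client machine.
    conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));

    // Sanity check: these should resolve to the cluster's NameNode and JobTracker.
    System.out.println("fs.default.name    = " + conf.get("fs.default.name"));
    System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
  }
}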

Hadoop does not use SSH/SCP to move code around. Please give this a
read if you believe you're confused about how SSH+Hadoop is integrated
(or not): http://wiki.apache.org/hadoop/FAQ#Does_Hadoop_require_SSH.3F

-- 
Harsh J


Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster

2011-07-01 Thread Yaozhen Pan
Narayanan,

Regarding the client installation, you should make sure that the client and
server use the same Hadoop version for submitting jobs and transferring data.
If you use a different user on the client than the one that runs the hadoop job,
configure the hadoop ugi property (sorry, I forget the exact name).

On 2011-7-1 at 15:28, Narayanan K knarayana...@gmail.com wrote:
 Hi Harsh

 Thanks for the quick response...

 I have a few clarifications regarding the 1st point:

 Let me give the background first.

 We have set up a Hadoop cluster with HBase installed. We are
 planning to load HBase with data, perform some computations on
 that data, and present the data in a report format.
 The report should be accessible from outside the cluster. It accepts
 certain parameters to select the data, and will in turn pass those
 parameters on to the hadoop master server, where a mapreduce job will
 be run that queries HBase to retrieve the data.

 So the report will be run from a different machine outside the cluster. So
 we need a way to pass on the parameters to the hadoop cluster (master) and
 initiate a mapreduce job dynamically. Similarly, the output of the mapreduce
 job needs to be tunneled back to the machine from where the report was run.

 Some more clarification I need is: does the machine (outside of the cluster)
 which ran the report require something like a client installation which
 will talk with the Hadoop master server via TCP? Or can it run a job
 on the hadoop server by using passwordless scp to the master machine or
 something of the sort?


 Regards,
 Narayanan




[Doubt]: Submission of Mapreduce from outside Hadoop Cluster

2011-06-30 Thread Narayanan K
Hi all,


We are basically working on a research project and I require some help
regarding this.



I had a few basic doubts regarding submission of Map-Reduce jobs in Hadoop.



   1. How do I submit a mapreduce job from outside the cluster, i.e. from a
   different machine outside the Hadoop cluster?
   2. If the above can be done, how can I schedule map reduce jobs to run in
   hadoop like crontab from a different machine?
   3. Are there any webservice APIs that I can leverage to access a hadoop
   cluster from outside and submit jobs or read/write data from HDFS?


Many Thanks,

Narayanan