Re: doubt on Hadoop job submission process

2012-08-13 Thread Harsh J
Hi Manoj,

Reply inline.

On Mon, Aug 13, 2012 at 3:42 PM, Manoj Babu manoj...@gmail.com wrote:
 Hi All,

 Normal Hadoop job submission process involves:

 1. Checking the input and output specifications of the job.
 2. Computing the InputSplits for the job.
 3. Setting up the requisite accounting information for the DistributedCache
 of the job, if necessary.
 4. Copying the job's jar and configuration to the map-reduce system
 directory on the distributed file-system.
 5. Submitting the job to the JobTracker and optionally monitoring its status.

 I have a doubt about the 4th point of the job execution flow; could any of
 you explain it?

 What is the job's jar?

The job's jar is the jar you supply via the "hadoop jar <jar>" command.
Technically, though, it is the jar pointed to by JobConf.getJar() (set via
the setJar or setJarByClass calls).
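
In code, a minimal sketch of how that jar gets attached to the job
configuration (assuming the old mapred API used in this thread; the driver
class name and paths below are made up):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class MyJobDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        conf.setJobName("my-job");

        // This is what makes JobConf.getJar() non-null: Hadoop finds the
        // jar on the classpath that contains MyJobDriver and records it as
        // the job's jar, to be copied to the system directory at submit time.
        conf.setJarByClass(MyJobDriver.class);
        // Equivalent explicit form: conf.setJar("/local/path/to/my-job.jar");

        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }

Running it as "hadoop jar my-job.jar MyJobDriver <in> <out>" supplies that
same jar on the command line.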

 Is the job's jar the one we submitted to Hadoop, or will Hadoop build it
 based on the job configuration object?

It is the former, as explained above.

-- 
Harsh J


Locks in M/R framework

2012-08-13 Thread David Ginzburg
Hi,

I have an HDFS folder and an M/R job that periodically updates it by replacing
the data with newly generated data.

I have a different M/R job that processes the data in the folder, either
periodically or ad hoc.

The second job, naturally, sometimes fails when the data is replaced by newly
generated data after the job plan, including the input paths, has already been
submitted.

Is there an elegant solution?

My current thought is to query the JobTracker for the running jobs and go over
all the input files in each job's XML, so that the swap blocks until the input
path is no longer an input path of any currently executing job.
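
A rough sketch of that check, assuming the old mapred JobClient API; the
property name mapred.input.dir and the simple substring test are only
illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobStatus;
    import org.apache.hadoop.mapred.RunningJob;

    public class InputPathInUse {
      // True if any not-yet-complete job lists the given directory among
      // its input paths (as recorded in its submitted job.xml).
      public static boolean inUse(JobConf clientConf, String dir) throws Exception {
        JobClient jc = new JobClient(clientConf);
        for (JobStatus status : jc.jobsToComplete()) {
          RunningJob job = jc.getJob(status.getJobID());
          if (job == null) continue;
          Path jobXml = new Path(job.getJobFile());   // job.xml in the system dir
          FileSystem fs = jobXml.getFileSystem(clientConf);
          Configuration submitted = new Configuration(false);
          submitted.addResource(fs.open(jobXml));
          String inputs = submitted.get("mapred.input.dir", "");
          if (inputs.contains(dir)) {
            return true;
          }
        }
        return false;
      }
    }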




Re: Locks in M/R framework

2012-08-13 Thread Tim Robertson
How about introducing a distributed coordination and locking mechanism?
ZooKeeper would be a good candidate for that kind of thing.
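
One ready-made way to get such a lock (rather than writing the ZooKeeper
recipe by hand) is the Curator client library's InterProcessMutex; a minimal
sketch, where the quorum address and lock path are made up:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.locks.InterProcessMutex;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class FolderLockExample {
      public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Both the job that swaps the folder contents and the job that
        // reads them acquire the same lock path before touching the data.
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-hdfs-folder");
        lock.acquire();
        try {
          // swap the folder, or submit the reading job and wait, here
        } finally {
          lock.release();
          client.close();
        }
      }
    }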



On Mon, Aug 13, 2012 at 12:52 PM, David Ginzburg ginz...@hotmail.com wrote:

 Hi,

 I have an HDFS folder and an M/R job that periodically updates it by
 replacing the data with newly generated data.

 I have a different M/R job that processes the data in the folder, either
 periodically or ad hoc.

 The second job, naturally, sometimes fails when the data is replaced by
 newly generated data after the job plan, including the input paths, has
 already been submitted.

 Is there an elegant solution?

 My current thought is to query the JobTracker for the running jobs and go
 over all the input files in each job's XML, so that the swap blocks until
 the input path is no longer an input path of any currently executing job.







Re: doubt on Hadoop job submission process

2012-08-13 Thread Manoj Babu
Hi Harsh,

Thanks for your reply.

Consider that in my main program I am doing many activities
(reading/writing/updating, non-Hadoop activities) before invoking
JobClient.runJob(conf). Is there any way to separate the process flow
programmatically instead of going for a workflow engine?

Cheers!
Manoj.



On Mon, Aug 13, 2012 at 4:10 PM, Harsh J ha...@cloudera.com wrote:

 Hi Manoj,

 Reply inline.

 On Mon, Aug 13, 2012 at 3:42 PM, Manoj Babu manoj...@gmail.com wrote:
  Hi All,
 
  Normal Hadoop job submission process involves:
 
  1. Checking the input and output specifications of the job.
  2. Computing the InputSplits for the job.
  3. Setting up the requisite accounting information for the DistributedCache
  of the job, if necessary.
  4. Copying the job's jar and configuration to the map-reduce system
  directory on the distributed file-system.
  5. Submitting the job to the JobTracker and optionally monitoring its
  status.
 
  I have a doubt about the 4th point of the job execution flow; could any of
  you explain it?

  What is the job's jar?

 The job's jar is the jar you supply via the "hadoop jar <jar>" command.
 Technically, though, it is the jar pointed to by JobConf.getJar() (set via
 the setJar or setJarByClass calls).

  Is the job's jar the one we submitted to Hadoop, or will Hadoop build it
  based on the job configuration object?

 It is the former, as explained above.

 --
 Harsh J



Re: doubt on Hadoop job submission process

2012-08-13 Thread Harsh J
Sure, you may separate the logic however you want; just ensure the
configuration object has had a proper setJar or setJarByClass call done on
it before you submit the job.
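
A sketch of that separation, assuming the old mapred API; the helper method
names and the non-Hadoop work are placeholders, and the real mapper/reducer
setup is left at the defaults (identity map and reduce) to keep it short:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MyDriver {
      public static void main(String[] args) throws Exception {
        doNonHadoopWork();                  // reading/writing/updating, etc.
        JobConf conf = buildJobConf(args[0], args[1]);
        JobClient.runJob(conf);             // submit and wait for completion
      }

      private static void doNonHadoopWork() {
        // placeholder for everything that has nothing to do with Hadoop
      }

      private static JobConf buildJobConf(String in, String out) {
        JobConf conf = new JobConf();
        conf.setJobName("my-job");
        conf.setJarByClass(MyDriver.class); // the part that must not be skipped
        // mapper, reducer and formats would be configured here as usual
        FileInputFormat.setInputPaths(conf, new Path(in));
        FileOutputFormat.setOutputPath(conf, new Path(out));
        return conf;
      }
    }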

On Mon, Aug 13, 2012 at 4:43 PM, Manoj Babu manoj...@gmail.com wrote:
 Hi Harsh,

 Thanks for your reply.

 Consider that in my main program I am doing many activities
 (reading/writing/updating, non-Hadoop activities) before invoking
 JobClient.runJob(conf). Is there any way to separate the process flow
 programmatically instead of going for a workflow engine?

 Cheers!
 Manoj.



 On Mon, Aug 13, 2012 at 4:10 PM, Harsh J ha...@cloudera.com wrote:

 Hi Manoj,

 Reply inline.

 On Mon, Aug 13, 2012 at 3:42 PM, Manoj Babu manoj...@gmail.com wrote:
  Hi All,
 
  Normal Hadoop job submission process involves:
 
  1. Checking the input and output specifications of the job.
  2. Computing the InputSplits for the job.
  3. Setting up the requisite accounting information for the DistributedCache
  of the job, if necessary.
  4. Copying the job's jar and configuration to the map-reduce system
  directory on the distributed file-system.
  5. Submitting the job to the JobTracker and optionally monitoring its
  status.
 
  I have a doubt about the 4th point of the job execution flow; could any of
  you explain it?

  What is the job's jar?

 The job's jar is the jar you supply via the "hadoop jar <jar>" command.
 Technically, though, it is the jar pointed to by JobConf.getJar() (set via
 the setJar or setJarByClass calls).

  Is the job's jar the one we submitted to Hadoop, or will Hadoop build it
  based on the job configuration object?

 It is the former, as explained above.

 --
 Harsh J





-- 
Harsh J


Re: Locks in M/R framework

2012-08-13 Thread Harsh J
David,

While ZK can solve this, locking may only make you slower. Let's try to
keep it simple?

Have you considered keeping two directories? One that the older data is
moved to (by the first job, instead of replacing files in place), for
consumption by the second job, which is triggered by watching this
directory?

That is:
MR job #1 (the producer) moves the existing data to /path/b/<timestamp>
and writes the new data to /path/a.
MR job #2 (the consumer) uses the latest /path/b/<timestamp> (or the whole
set of timestamps available under /path/b at that point) as its input, and
deletes it afterwards. Job #2 can therefore monitor this directory to
trigger itself.
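
A rough sketch of that handoff with the plain FileSystem API (the paths and
the timestamp naming convention here are made up):

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FolderSwap {
      private static final Path CURRENT = new Path("/path/a"); // written by job #1
      private static final Path HANDOFF = new Path("/path/b"); // consumed by job #2

      // Producer side: after generating the new data elsewhere, publish the
      // current contents for the consumer instead of deleting them.
      public static Path publish(FileSystem fs) throws Exception {
        Path snapshot = new Path(HANDOFF, String.valueOf(System.currentTimeMillis()));
        fs.mkdirs(HANDOFF);
        fs.rename(CURRENT, snapshot); // a single namespace operation on HDFS
        return snapshot;
      }

      // Consumer side: pick the latest snapshot (or iterate over all of
      // them), run job #2 over it, then delete it when done.
      public static Path latestSnapshot(FileSystem fs) throws Exception {
        FileStatus[] snapshots = fs.listStatus(HANDOFF);
        if (snapshots == null) return null;
        Path latest = null;
        long best = -1L;
        for (FileStatus s : snapshots) {
          long ts = Long.parseLong(s.getPath().getName());
          if (ts > best) { best = ts; latest = s.getPath(); }
        }
        return latest; // null means nothing to consume yet
      }
    }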

On Mon, Aug 13, 2012 at 4:22 PM, David Ginzburg ginz...@hotmail.com wrote:
 Hi,

 I have an HDFS folder and an M/R job that periodically updates it by
 replacing the data with newly generated data.

 I have a different M/R job that processes the data in the folder, either
 periodically or ad hoc.

 The second job, naturally, sometimes fails when the data is replaced by
 newly generated data after the job plan, including the input paths, has
 already been submitted.

 Is there an elegant solution?

 My current thought is to query the JobTracker for the running jobs and go
 over all the input files in each job's XML, so that the swap blocks until
 the input path is no longer an input path of any currently executing job.







-- 
Harsh J