Re: doubt on Hadoop job submission process
Hi Manoj,

Reply inline.

On Mon, Aug 13, 2012 at 3:42 PM, Manoj Babu manoj...@gmail.com wrote:

> Hi All,
>
> The normal Hadoop job submission process involves:
>
> 1. Checking the input and output specifications of the job.
> 2. Computing the InputSplits for the job.
> 3. Setting up the requisite accounting information for the job's DistributedCache, if necessary.
> 4. Copying the job's jar and configuration to the map-reduce system directory on the distributed file system.
> 5. Submitting the job to the JobTracker and optionally monitoring its status.
>
> I have a doubt about the 4th point of the job execution flow. Could any of you explain it? What is the job's jar?

The job.jar is the jar you supply via "hadoop jar <jar>". Technically, though, it is the jar pointed to by JobConf.getJar() (set via the setJar or setJarByClass calls).

> Is the job's jar the one we submitted to Hadoop, or will Hadoop build one based on the job configuration object?

It is the former, as explained above.

--
Harsh J
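To make the setJar/setJarByClass point concrete, here is a minimal driver sketch against the classic org.apache.hadoop.mapred API (the argument paths and the use of IdentityMapper are just placeholders); the jar that step 4 copies to the system directory is whichever one the conf ends up pointing at:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class SubmitExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();
    // Resolve the jar containing this class; that resolved jar is the
    // job.jar copied to the MR system directory at submit time.
    conf.setJarByClass(SubmitExample.class);
    conf.setMapperClass(IdentityMapper.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // Ships job.jar + job.xml to the system dir, submits, and monitors.
    JobClient.runJob(conf);
  }
}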
Locks in M/R framework
Hi,

I have an HDFS folder and an M/R job that periodically updates it by replacing the data with newly generated data. A different M/R job processes the data in the folder, periodically or ad hoc.

The second job naturally fails sometimes, when the data is replaced after the job plan, including the input paths, has already been submitted.

Is there an elegant solution?

My current thought is to query the JobTracker for running jobs and go over the input files listed in each job's XML, so that the swap blocks until the folder is no longer an input path of any currently executing job.
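For concreteness, a rough sketch of that JobTracker query against the classic org.apache.hadoop.mapred client API (an illustration of the idea only; the mapred.input.dir property assumes FileInputFormat-based jobs):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

public class InputPathInUse {
  /** Returns true if any running job lists 'folder' among its input dirs. */
  static boolean inUse(JobClient client, Configuration conf, String folder)
      throws Exception {
    for (JobStatus status : client.jobsToComplete()) {
      RunningJob job = client.getJob(status.getJobID());
      if (job == null) continue;
      // job.getJobFile() points at the submitted job.xml in the system dir.
      Path jobFile = new Path(job.getJobFile());
      Configuration jobXml = new Configuration(false);
      jobXml.addResource(jobFile.getFileSystem(conf).open(jobFile));
      String inputs = jobXml.get("mapred.input.dir", "");
      if (inputs.contains(folder)) return true;
    }
    return false;
  }
}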
Re: Locks in M/R framework
How about introducing a distributed coordination and locking mechanism? ZooKeeper would be a good candidate for that kind of thing.

On Mon, Aug 13, 2012 at 12:52 PM, David Ginzburg ginz...@hotmail.com wrote:

> I have an HDFS folder and an M/R job that periodically updates it by
> replacing the data with newly generated data. A different M/R job
> processes the data in the folder, and naturally fails sometimes when
> the data is replaced after the input paths have already been submitted.
> Is there an elegant solution?
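If going the ZooKeeper route, a minimal sketch using the Curator client library's InterProcessMutex recipe (Curator itself, the connection string, and the lock path are all assumptions here; any ZooKeeper lock recipe works the same way), where the producer holds the lock while swapping the folder and the consumer holds it while its job reads:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class FolderLock {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "zkhost:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();
    // Producer (before swapping the folder) and consumer (before
    // submitting the M/R job) both acquire the same lock path.
    InterProcessMutex lock = new InterProcessMutex(zk, "/locks/data-folder");
    lock.acquire();
    try {
      // producer: swap the folder; consumer: submit the job and wait
    } finally {
      lock.release();
    }
    zk.close();
  }
}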
Re: doubt on Hadoop job submission process
Hi Harsh,

Thanks for your reply. Consider that from my main program I am doing many activities (reading/writing/updating, non-Hadoop activities) before invoking JobClient.runJob(conf). Is there any way to separate the process flow programmatically instead of going for a workflow engine?

Cheers!
Manoj.

On Mon, Aug 13, 2012 at 4:10 PM, Harsh J ha...@cloudera.com wrote:

> The job.jar is the jar you supply via "hadoop jar <jar>". Technically,
> though, it is the jar pointed to by JobConf.getJar() (set via the
> setJar or setJarByClass calls).
Re: doubt on Hadoop job submission process
Sure, you may separate the logic as you want it to be, but just ensure the configuration object has a proper setJar or setJarByClass done on it before you submit the job.

On Mon, Aug 13, 2012 at 4:43 PM, Manoj Babu manoj...@gmail.com wrote:

> Consider that from my main program I am doing many activities
> (reading/writing/updating, non-Hadoop activities) before invoking
> JobClient.runJob(conf). Is there any way to separate the process flow
> programmatically instead of going for a workflow engine?

--
Harsh J
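A small sketch of what that separation might look like (the helper method names are hypothetical; the only hard requirement, per the above, is that setJar or setJarByClass runs on the conf before submission):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class Driver {
  public static void main(String[] args) throws Exception {
    doNonHadoopWork();             // hypothetical: reads/writes/updates
    JobConf conf = buildJobConf(); // hypothetical: all job wiring lives here
    JobClient.runJob(conf);        // blocks until the job completes
  }

  static JobConf buildJobConf() {
    JobConf conf = new JobConf();
    conf.setJarByClass(Driver.class); // must be set before submission
    // ... set mapper, input/output paths, etc.
    return conf;
  }

  static void doNonHadoopWork() { /* pre-processing outside Hadoop */ }
}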
Re: Locks in M/R framework
David,

While ZK can solve this, locking may only make you slower. Let's try to keep it simple.

Have you considered keeping two directories: one where the older data is moved to (by the first job, instead of replacing files), for consumption by the second job, which triggers by watching this directory?

That is, MR Job #1 (the producer) moves existing data to /path/b/<timestamp> and writes new data to /path/a. MR Job #2 (the consumer) uses the latest /path/b/<timestamp> (or the whole set of timestamps available under /path/b at that point) as its input, and deletes it afterwards. Hence #2 can monitor this directory to trigger itself.

On Mon, Aug 13, 2012 at 4:22 PM, David Ginzburg ginz...@hotmail.com wrote:

> I have an HDFS folder and an M/R job that periodically updates it by
> replacing the data with newly generated data. A different M/R job
> processes the data in the folder, and naturally fails sometimes when
> the data is replaced after the input paths have already been submitted.
> Is there an elegant solution?

--
Harsh J
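A sketch of that hand-off with the HDFS FileSystem API (following the /path/a and /path/b convention above; the timestamp naming is illustrative):

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Rotation {
  // Producer side: retire the current data, then write fresh data to /path/a.
  static void rotate(FileSystem fs) throws Exception {
    Path retired = new Path("/path/b/" + System.currentTimeMillis());
    fs.mkdirs(new Path("/path/b"));
    fs.rename(new Path("/path/a"), retired); // atomic move within HDFS
    fs.mkdirs(new Path("/path/a"));          // fresh target for new data
  }

  // Consumer side: pick up every batch under /path/b; delete after the job.
  static Path[] snapshots(FileSystem fs) throws Exception {
    FileStatus[] batches = fs.listStatus(new Path("/path/b"));
    Path[] inputs = new Path[batches.length];
    for (int i = 0; i < batches.length; i++) {
      inputs[i] = batches[i].getPath();
    }
    return inputs; // feed to FileInputFormat.setInputPaths
  }
}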