[ 
https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879567#action_12879567
 ] 

Dick King commented on MAPREDUCE-323:
-------------------------------------

I believe that it is agreed that we need a directory structure other than a 
single directory holding all of the history files.

That being said, the question is how the directory tree should be organized.

The use cases are:

1: There is a job history web API, implemented by {{jobhistory.jsp}}, that 
allows users to search the job history files to retrieve information on single 
or multiple jobs meeting certain criteria.  In particular, web users can search 
for jobs submitted by a certain user, and for jobs whose job name contains a 
certain substring.  

After a search, the current API allows the user to page through the data.  They 
are told the total number of matching jobs, and they can browse pages of data, 
with 100 jobs per page.  They can access the first and last page from any page, 
and from any page they can access any of the previous or following five pages 
[if there are that many].

2: During restart, we perform searches for specific quadruples of jobtracker 
ID, job ID, username and jobname.  This may be redundant, but that's what we do 
in the current code base.

3: I understand that some installations archive tranches of job history files 
periodically, usually by date.
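
As a rough illustration of the paging behavior described in use case 1, here is 
a minimal sketch of which page links are reachable from a given page.  The 
class and method names are hypothetical and are not taken from 
{{jobhistory.jsp}}; only the 100 jobs per page and the five-page window come 
from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of the paging rules; not the jobhistory.jsp code. */
class PagingSketch {
    static final int JOBS_PER_PAGE = 100; // from the description above
    static final int WINDOW = 5;          // previous/following pages offered

    /** Number of pages needed for totalJobs matches (at least one page). */
    public static int pageCount(int totalJobs) {
        return Math.max(1, (totalJobs + JOBS_PER_PAGE - 1) / JOBS_PER_PAGE);
    }

    /** Pages reachable from the current page: first, last, and up to five either side. */
    public static List<Integer> linkedPages(int current, int totalJobs) {
        int last = pageCount(totalJobs);
        TreeSet<Integer> pages = new TreeSet<>();
        pages.add(1);
        pages.add(last);
        int from = Math.max(1, current - WINDOW);
        int to = Math.min(last, current + WINDOW);
        for (int p = from; p <= to; p++) {
            pages.add(p);
        }
        pages.remove(current); // no link to the page being viewed
        return new ArrayList<>(pages);
    }

    public static void main(String[] args) {
        // 2500 matching jobs, viewing page 7
        System.out.println(linkedPages(7, 2500));
    }
}
```

Note that computing the last page requires the total number of matching jobs, 
which is exactly the figure that becomes expensive when there are many jobs.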

Here is why I claim that we can support these use cases, with considerable 
scaling and responsiveness improvements:

1: If I use a subdirectory structure based on jobtracker IDs and then dates and 
then high order digits of the jobid serial number, then the performance of each 
of these three use cases can be improved.  I described potential improvements 
to use case 1 on 14/Jun/10 at 09:38 PM.  To summarize, you will be able to 
browse by dates and time ranges as well as by the other criteria, and 
performance will be improved as we only search the subset of the directories we 
need to satisfy the query or to present the first page of the results.  

If we make changes along these lines we will no longer present to the user the 
total number of matching jobs.  One of the complaints that led to this jira 
is, after all, the possibility of a scaling problem if there are too many jobs.

2: Because searches are restricted to specific directories, the namenode will 
have to generate a lot less data, and there will be a lot less client side 
filtering as well if you have directories consisting of only 1000 jobs 
[2000 files].

3: We could archive a day's results by harchiving a date subdirectory.
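
To make the proposed layout concrete, here is a minimal sketch of how a job 
could be mapped to its subdirectory: jobtracker ID, then date, then the high 
order digits of the serial number.  The class name, the 
{{job_<jobtrackerId>_<serial>}} ID form, the {{yyyy/MM/dd}} date layout and the 
six-digit padding are assumptions made for illustration, not the committed 
design; the bucket size of 1000 jobs comes from the figure above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper illustrating the proposed layout; not Hadoop code. */
class HistoryPathSketch {
    // 1000 jobs per leaf directory, per the figure quoted in the comment.
    static final int JOBS_PER_DIR = 1000;

    // Assumed date layout: one directory level each for year, month and day.
    private static final DateTimeFormatter DATE_DIRS =
        DateTimeFormatter.ofPattern("yyyy/MM/dd");

    /**
     * Returns jobtrackerId/yyyy/MM/dd/bucket for a job ID of the assumed
     * form job_<jobtrackerId>_<serial>.
     */
    public static String subdirFor(String jobId, LocalDate date) {
        String[] parts = jobId.split("_");
        if (parts.length != 3 || !parts[0].equals("job")) {
            throw new IllegalArgumentException("unexpected job ID: " + jobId);
        }
        int serial = Integer.parseInt(parts[2]);
        int bucket = serial / JOBS_PER_DIR; // keeps only the high order digits
        return parts[1] + "/" + date.format(DATE_DIRS) + "/"
            + String.format("%06d", bucket);
    }

    public static void main(String[] args) {
        System.out.println(
            subdirFor("job_201006141234_0042", LocalDate.of(2010, 6, 14)));
        System.out.println(
            subdirFor("job_201006141234_12345", LocalDate.of(2010, 6, 15)));
    }
}
```

With a layout like this, a restart-time lookup for a known jobtracker ID and 
job ID would only need to list the matching serial-number buckets under that 
jobtracker's date directories, and a whole day could be archived by handing a 
single date directory to the archiver.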

> Improve the way job history files are managed
> ---------------------------------------------
>
>                 Key: MAPREDUCE-323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Amar Kamat
>            Assignee: Dick King
>            Priority: Critical
>
> Today all the jobhistory files are dumped in one _job-history_ folder. This 
> can cause problems when there is a need to search the history folder 
> (job-recovery etc). It would be nice if we group all the jobs under a _user_ 
> folder. So all the jobs for user _amar_ will go in _history-folder/amar/_. 
> Jobs can be categorized using various features like _jobid, date, jobname_ 
> etc., but using _username_ will make the search much more efficient and also 
> will not result in namespace explosion. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
