[ 
https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891928#action_12891928
 ] 

Dick King commented on MAPREDUCE-323:
-------------------------------------

                                   A PROPOSAL

                                   introduction

The completed job history file system currently works as follows.  When a job 
is started, the job tracker creates an empty history file.  The name of the 
file contains enough information about the job to let an application tell 
whether the file documents a job that satisfies a search criterion.  In 
particular, it includes the job tracker instance ID, the job ID, the user name, 
and the job name.

As the job progresses, records get added to the file, and when it's finished 
[either successfully or failed] the file is moved to another directory, the 
completed job history files directory [the "DONE directory"]. Currently this 
directory has a simple flat structure.  If an application [in particular, the 
job history browser] wants some job histories, it reads this directory and 
chooses the files with names that indicate that the files will meet the 
criteria.  In practical cases this can include hundreds of thousands or even a 
million files.  Note that each job is represented by two files, the history 
file and the config file, doubling the burden on the name node.

                                     proposal

I would like to implement a simple database to solve this problem.  My 
proposal has the following features:

1: The DONE directory will contain subdirectories, each containing a few 
hundred or a thousand files.

2: At any time, the job tracker will be filling one of the DONE directory's 
subdirectories.  All the rest are closed out, never to be added to again.

3: The subdirectories have a naming scheme so they're created in 
lexicographical order.  We would like to use subdirectory names like 
2010-07-23--0000, etc [the four digits are a serial number, not an HHMM field].

4: When the job tracker decides to bind off a subdirectory and start a new one, 
it creates a new index file in the subdirectory it's closing out.  That index 
is a simple list of the history files the directory contains.

4a: The job tracker starts a new subdirectory whenever the first history file 
is copied on a given day, and whenever the current subdirectory would otherwise 
contain more than a certain number of files. 

4b: Perhaps the files can be renamed?  These files' names are a few dozen 
characters each, and in a system that has run a half million jobs the names 
collectively occupy 100+ megabytes in the name node.  Significant, but not 
decisive. 

4b1: 4b would require that rumen understand indices.

5: The processing is:

5a: [optional] create a new short name for every file in the subdirectory 
that's being closed out

5a1: The job tracker keeps this information in memory.  It doesn't need to read 
the directory.

5b: Write out the index file in a temporary location {{temp-index}} within the 
directory it's indexing.

5b1: The index contains all of the names in text form [if 5a is not used] or all 
pairs of { long name, short name } in text form, if we are shortening the names.

5c: rename the temp-index file to {{index}} when it's done

5d: [optional] If we chose file renaming, delete all of the long names.

6: When doing a search, we 

6a: determine all subdirectories of the DONE directory

6b: see which ones have an index

6c: read each index that exists, and

6d: list the files directly, for the subdirectories that don't have indices yet.

7: To aid retirement of old job history files, the job tracker always binds off 
the current subdirectory when the date changes, even if it doesn't have very 
many files, and we retire files on date boundaries, a subdirectory at a time. 
The relevant date is the date that the file is being moved, which is normally a 
short time after the job is completed.

8: [optional] We may want to consolidate the indices of a completed day in a 
per-day index written as a file directly under the done directory. 
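
Items 3 through 6 above can be sketched roughly as follows.  This is only an 
illustration under stated assumptions: it is written in Python rather than the 
job tracker's Java, it uses the local file system as a stand-in for HDFS, and 
the function names and the MAX_FILES value are hypothetical.  The optional 
renaming in 5a and 5d is left out.

```python
import datetime
import os

# Assumed threshold; the proposal only says "a few hundred or a thousand".
MAX_FILES = 1000


def subdir_name(day, serial):
    """Item 3: build a name such as 2010-07-23--0000.

    The zero-padded serial keeps names in lexicographical creation order,
    as long as a single day needs fewer than 10000 subdirectories.
    """
    return f"{day.isoformat()}--{serial:04d}"


def next_subdir(current_day, current_serial, current_count, today):
    """Items 4a and 7: decide which subdirectory the next file goes into.

    Returns (day, serial, rolled), where rolled is True when the current
    subdirectory has to be closed out first.
    """
    if today != current_day:
        return today, 0, True  # first file moved on a new day
    if current_count >= MAX_FILES:
        return current_day, current_serial + 1, True  # one more would overflow
    return current_day, current_serial, False


def close_out(subdir, names):
    """Items 5b and 5c: write temp-index, then rename it to index.

    names comes from the caller's memory (5a1), not from a directory read.
    """
    tmp = os.path.join(subdir, "temp-index")
    with open(tmp, "w") as f:
        for name in names:
            f.write(name + "\n")
    # Readers only ever see a complete index, never a partial one.
    os.replace(tmp, os.path.join(subdir, "index"))


def list_history_files(done_dir):
    """Item 6: collect history file names from every subdirectory."""
    found = []
    for sub in sorted(os.listdir(done_dir)):          # 6a
        path = os.path.join(done_dir, sub)
        if not os.path.isdir(path):
            continue
        index = os.path.join(path, "index")
        if os.path.exists(index):                     # 6b
            with open(index) as f:                    # 6c
                names = [line.rstrip("\n") for line in f if line.strip()]
        else:                                         # 6d
            names = sorted(n for n in os.listdir(path) if n != "temp-index")
        found.extend(os.path.join(sub, n) for n in names)
    return found
```

In this sketch a search touches one index file per closed subdirectory, and 
falls back to a directory listing only for the subdirectory that is still 
being filled.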

> Improve the way job history files are managed
> ---------------------------------------------
>
>                 Key: MAPREDUCE-323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Amar Kamat
>            Assignee: Dick King
>            Priority: Critical
>
> Today all the jobhistory files are dumped in one _job-history_ folder. This 
> can cause problems when there is a need to search the history folder 
> (job-recovery etc). It would be nice if we group all the jobs under a _user_ 
> folder. So all the jobs for user _amar_ will go in _history-folder/amar/_. 
> Jobs can be categorized using various features like _jobid, date, jobname_ 
> etc but using _username_ will make the search much more efficient and also 
> will not result into namespace explosion. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
