[jira] Commented: (MAPREDUCE-323) Improve the way job history files are managed

Dick King (JIRA) Thu, 10 Jun 2010 11:06:44 -0700

    [ https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877513#action_12877513 ]


Dick King commented on MAPREDUCE-323:
-------------------------------------

I've given this some more thought and I've devised a new design.

I don't think that the subdirectory _per se_ is the important issue, except to 
keep the directory sizes manageable.  However, the important operations should 
be supported, with good performance, preferably in the {{jobhistory.jsp}} 
interface.  We have to support reasonable searches in the {{jsp}}.  To that 
end, I would like to do the following:

1: Let the done jobs' directory structure be 
{{DONE/jobtracker-timestamp/123/456/789}} where {{123456789}} is the job ID 
serial number.  The directory names are zero-padded, so leading zeros appear 
in the path even if they're not in the serial number.  Perhaps 
{{jobtracker-timestamp}} should be {{jobtracker-id}}?

2: In the {{jsp}}, we could present newest jobs first.  This is probably what 
people want, and in common cases it speeds up the presentation when the user 
displays an early page.  With the current naming convention, these are the jobs 
with the lexicographically latest file names.

3: All the URLs in the {{jsp}} pages [including those behind forms] would have 
a starting job tracker ID and serial number encoded, so we can continue from 
where we left off, even though we keep adding new jobs to the beginning because 
of item 2.  Subsequent pages will not overlap previous pages just because new 
jobs have been added at the beginning.

4: When we do searches, we work back through the directories in reverse order, 
so we can stop when we populate a page rather than reading all of the history 
files' names.

5: For low-yield searches we'll consider offering to stop after, say, 10K 
non-matching jobs have been ignored.  This lets us process mistyped queries in 
a reasonable time.

6: The start time is of interest.  Inside the {{JobHistory}} code, as the 
cached history files are being copied to the {{DONE}} directory, an 
approximation of the start time is available in the modification time of the 
{{conf.xml}} file.  We can copy that, either to the modification time of the 
new job history file [using {{setTime}}], or encode it into the filename in 
some manner [as we do with the job name].  Either way, we can then present it 
in the {{jsp}} result, or filter based on time ranges.  What does the community 
think?

7: Perhaps there needs to be a programmatic API as well, reducing the need for 
people to read directories.
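
To make item 1 concrete, here is a minimal sketch of the serial-number-to-path 
mapping.  The class and method names are hypothetical and not part of Hadoop, 
and it assumes serial numbers fit in nine digits, as in the {{123456789}} 
example.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not an existing Hadoop class.
final class DoneDirLayout {
    private DoneDirLayout() {}

    /**
     * Maps a job serial number to doneRoot/jobtrackerId/ddd/ddd/ddd.
     * Assumption: serial numbers have at most nine digits.
     */
    static String doneSubdir(String doneRoot, String jobtrackerId, int serial) {
        if (serial < 0 || serial > 999_999_999) {
            throw new IllegalArgumentException("serial out of range: " + serial);
        }
        // Zero-pad to nine digits so every path has exactly three levels.
        String padded = String.format(Locale.ROOT, "%09d", serial);
        return String.join("/", doneRoot, jobtrackerId,
                padded.substring(0, 3),
                padded.substring(3, 6),
                padded.substring(6, 9));
    }
}
```

With this layout no leaf directory holds more than 1000 jobs, and serial 42 
lands in {{000/000/042}} rather than in a shorter path.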
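
Items 2 through 5 can be sketched together as a newest-first scan with a 
cursor.  This is only an illustration under stated assumptions: an in-memory 
sorted set stands in for the directory walk, the cursor is the last name 
examined (which is what would be encoded in the "next page" URL), and the 
names {{HistoryPager}}, {{Page}} and {{maxMisses}} are all made up for the 
example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.function.Predicate;

// Hypothetical sketch; not the jobhistory.jsp implementation.
final class HistoryPager {
    private HistoryPager() {}

    /** One page of results plus the cursor to embed in the next-page URL. */
    static final class Page {
        final List<String> jobs = new ArrayList<>();
        String nextCursor;   // null when nothing older remains
        boolean gaveUp;      // true when the miss limit ended the scan
    }

    static Page page(NavigableSet<String> names, String cursor,
                     Predicate<String> filter, int pageSize, int maxMisses) {
        Page page = new Page();
        // Only names strictly older than the cursor, so jobs added at the
        // newest end since the previous request cannot shift this page.
        NavigableSet<String> older =
                (cursor == null) ? names : names.headSet(cursor, false);
        Iterator<String> it = older.descendingIterator();
        int misses = 0;
        String last = null;
        // Stop as soon as the page is full instead of reading every name.
        while (page.jobs.size() < pageSize && it.hasNext()) {
            last = it.next();
            if (filter.test(last)) {
                page.jobs.add(last);
            } else if (++misses >= maxMisses) {
                page.gaveUp = true;   // low-yield search: offer to stop here
                break;
            }
        }
        page.nextCursor = it.hasNext() ? last : null;
        return page;
    }
}
```

Because the cursor names a job rather than a page offset, a request for the 
second page returns the same jobs whether or not new jobs finished in the 
meantime.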
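
For the first option in item 6, the copy of the modification time might look 
roughly like the following.  To keep the sketch self-contained it uses 
{{java.nio.file}} on local files rather than the Hadoop {{FileSystem}} API 
that {{JobHistory}} would actually go through, and the class and method names 
are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Hypothetical sketch on local files; the real code would use Hadoop's FileSystem.
final class StartTimeCopy {
    private StartTimeCopy() {}

    /**
     * Stamps the history file with conf.xml's modification time, which
     * approximates the job start time, and returns that time in milliseconds.
     */
    static long copyStartTime(Path confXml, Path historyFile) throws IOException {
        FileTime approxStart = Files.getLastModifiedTime(confXml);
        Files.setLastModifiedTime(historyFile, approxStart);
        return approxStart.toMillis();
    }
}
```

The trade-off against encoding the time in the filename is that a time-range 
filter then needs a file status lookup per job rather than just a directory 
listing.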



> Improve the way job history files are managed
> ---------------------------------------------
>
>                 Key: MAPREDUCE-323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Amar Kamat
>            Assignee: Dick King
>            Priority: Critical
>
> Today all the jobhistory files are dumped in one _job-history_ folder. This 
> can cause problems when there is a need to search the history folder 
> (job-recovery etc). It would be nice if we grouped all the jobs under a _user_ 
> folder. So all the jobs for user _amar_ will go in _history-folder/amar/_. 
> Jobs can be categorized using various features like _jobid, date, jobname_ 
> etc but using _username_ will make the search much more efficient and also 
> will not result in namespace explosion. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
