[ https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877513#action_12877513 ]

Dick King commented on MAPREDUCE-323:
-------------------------------------

I've given this some more thought and I've devised a new design.

I don't think that the subdirectory _per se_ is the important issue, except to keep the directory sizes manageable. However, the important operations should be supported, with good performance, preferably in the {{jobhistory.jsp}} interface. We have to support reasonable searches in the {{jsp}}.

To that end, I would like to do the following:

1: Let the done jobs' directory structure be {{DONE/jobtracker-timestamp/123/456/789}}, where {{123456789}} is the job ID serial number. Leading zeros appear in the directory names even if they're not in the serial number. Perhaps {{jobtracker-timestamp}} should be {{jobtracker-id}}?

2: In the {{jsp}}, we could present the newest jobs first. This is probably what people want, and in common cases it speeds up the presentation when the user displays an early page. With the current naming convention, these are the jobs with the lexicographically latest file names.

3: All the URLs in the {{jsp}} pages [including those behind forms] would have a starting job tracker ID and serial number encoded, so we can continue from where we left off even though, because of item 2, we keep adding new jobs at the beginning. Subsequent pages will not overlap previous pages just because new jobs have been added at the beginning.

4: When we do searches, we work back through the directories in reverse order, so we can stop once we have populated a page rather than reading all of the history files' names.

5: For low-yield searches we'll consider offering to stop after, say, 10K non-matching jobs have been skipped. This lets us process mistyped queries in a reasonable time.

6: The start time is of interest. Inside the {{JobHistory}} code, as the cached history files are being copied to the {{DONE}} directory, an approximation of the start time is available in the modification time of the {{conf.xml}} file. We can copy that either to the modification time of the new job history file [using {{setTime}}] or encode it into the filename in some manner [as we do with the job name]. Either way, we can then present it in the {{jsp}} result, or filter based on time ranges. What does the community think?

7: Perhaps there needs to be a programmatic API as well, reducing the need for people to read directories.

> Improve the way job history files are managed
> ---------------------------------------------
>
>                 Key: MAPREDUCE-323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Amar Kamat
>            Assignee: Dick King
>            Priority: Critical
>
> Today all the jobhistory files are dumped in one _job-history_ folder. This
> can cause problems when there is a need to search the history folder
> (job-recovery etc). It would be nice if we group all the jobs under a _user_
> folder. So all the jobs for user _amar_ will go in _history-folder/amar/_.
> Jobs can be categorized using various features like _jobid, date, jobname_
> etc but using _username_ will make the search much more efficient and also
> will not result into namespace explosion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
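
To make items 1 to 3 of the comment concrete, here is a minimal Java sketch under stated assumptions. It is not taken from the Hadoop code base: the class {{HistoryPathSketch}}, the methods {{donePath}} and {{page}}, the fixed nine-digit padding, and the use of an in-memory set in place of a real directory listing are all hypothetical choices made for illustration, and only a single job tracker is modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in Hadoop.
class HistoryPathSketch {

    // Item 1: split a zero-padded serial number into three directory levels.
    // Assumption: serial numbers fit in nine digits (three levels of three).
    static String donePath(String doneRoot, String jobTrackerId, long serial) {
        if (serial < 0 || serial > 999_999_999L) {
            throw new IllegalArgumentException("serial out of range: " + serial);
        }
        String padded = String.format("%09d", serial);
        return doneRoot + "/" + jobTrackerId + "/"
                + padded.substring(0, 3) + "/"
                + padded.substring(3, 6) + "/"
                + padded.substring(6, 9);
    }

    // Items 2 and 3: return one page of serial numbers, newest first.
    // The cursor is the last serial shown on the previous page (exclusive),
    // or null for the first page. Jobs added after the first page was served
    // have larger serials, so they never shift or duplicate later pages.
    static List<Long> page(NavigableSet<Long> serials, Long cursor, int pageSize) {
        NavigableSet<Long> older =
                (cursor == null) ? serials : serials.headSet(cursor, false);
        List<Long> result = new ArrayList<>();
        for (Long s : older.descendingSet()) {
            if (result.size() == pageSize) {
                break; // stop as soon as the page is full (item 4)
            }
            result.add(s);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(donePath("DONE", "jobtracker-id", 42L));

        NavigableSet<Long> serials = new TreeSet<>();
        for (long i = 1; i <= 5; i++) {
            serials.add(i);
        }
        List<Long> first = page(serials, null, 2);
        serials.add(6L); // a new job finishes between page requests
        List<Long> second = page(serials, first.get(first.size() - 1), 2);
        System.out.println(first + " then " + second);
    }
}
```

Carrying the last-seen serial number in the URL, rather than a page offset, is what keeps later pages stable while new jobs arrive; a real implementation would also have to carry the job tracker ID and walk the directory tree instead of a set.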