[ https://issues.apache.org/jira/browse/MAPREDUCE-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832608#action_12832608 ]
Aaron Kimball commented on MAPREDUCE-1434:
------------------------------------------

Owen,

The {{getNewInputSplits}} method proposed above requires the InputFormat to maintain state containing the previously-enumerated InputSplits. The proposed command-line tools suggest that independent user-side processes perform the addition of files to the job, which makes this challenging. Given that splits are calculated on the client, but the "true" list of input splits is held by the JobTracker (or is/could the splits file be written to HDFS?), calculating just the delta might be difficult.

I think it would be more reasonable if one of the following were true:

* The client code simply calls {{getInputSplits()}} again. The same algorithm runs as during "initial" job submission, but the output list may be longer than the list previously returned by this method. The InputFormat is responsible for ensuring that it never returns fewer splits than it did before (i.e., it doesn't drop inputs).
* For that matter, if the input queue for a job is dynamic, I don't see why this same mechanism couldn't be used to drop splits that are, for whatever reason, irrelevant.
* {{getNewInputSplits()}} has the signature {{InputSplit[] getNewInputSplits(JobContext job, List<InputSplit> existingSplits) throws IOException, InterruptedException}}. This variant would present the user with the list of existing inputs, read from the job's current 'splits' file. That way state-tracking is unnecessary; you can just use (e.g.) a PathFilter to disregard anything already in {{existingSplits}}.

A final proposition is that users must manually specify the new paths (or other arbitrary arguments such as database table names, URLs, etc.) to include, in addition to the InputFormat. In that case, it might look saner to have:

* {{getNewInputSplits()}} with the signature {{InputSplit[] getNewInputSplits(JobContext job, String... newSplitHints) throws IOException, InterruptedException}}.
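To make the {{existingSplits}} variant concrete, here is a rough sketch of the delta computation it would enable. This is only an illustration of the proposed (not existing) API: plain path strings stand in for the real {{Path}}/{{InputSplit}} types, and the class and method names below are made up.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the filtering that getNewInputSplits(JobContext, List<InputSplit>)
// would allow: since the existing splits are passed in, the InputFormat needs
// no state of its own -- it just skips anything already enumerated, much like
// applying a PathFilter. Plain strings stand in for Paths/InputSplits here.
public class SplitDelta {

    // Return only the inputs from the current listing that are not
    // already covered by existingSplits.
    public static List<String> newSplits(List<String> currentListing,
                                         List<String> existingSplits) {
        Set<String> seen = new HashSet<>(existingSplits);
        List<String> delta = new ArrayList<>();
        for (String input : currentListing) {
            if (!seen.contains(input)) {
                delta.add(input);
            }
        }
        return delta;
    }

    public static void main(String[] args) {
        List<String> existing = List.of("/in/part-00000", "/in/part-00001");
        List<String> current  = List.of("/in/part-00000", "/in/part-00001",
                                        "/in/part-00002");
        // Only the newly uploaded file becomes a new split.
        System.out.println(newSplits(current, existing)); // [/in/part-00002]
    }
}
```

A real implementation would list the input directory via the FileSystem and build FileSplits for the delta, but the monotonicity argument above (never return fewer splits than before) reduces to exactly this kind of set difference.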
The {{newSplitHints}} array is effectively a user-specified argv; it can be decoded as a list of Paths, database tables, etc., and used as appropriate by the InputFormat to generate new splits.

Another question: what are the semantics of a doubly-specified split? (I'm especially curious about the inexact-match case, where the same file in HDFS is enumerated twice but the splits are at different offsets.) Can/should the same file be processed twice in a job?

Finally: why does a user-disconnect timeout kill the job? That's different from the usual case in MapReduce, where a user disconnect is not noticed by the server-side processes at all. I would think that a user-disconnect timeout should declare that all the input has been added and that the reduce phase can begin -- not kill the job.

> Dynamic add input for one job
> -----------------------------
>
>         Key: MAPREDUCE-1434
>         URL: https://issues.apache.org/jira/browse/MAPREDUCE-1434
>     Project: Hadoop Map/Reduce
>  Issue Type: New Feature
> Environment: 0.19.0
>    Reporter: Xing Shi
>
> We always have to upload the data to HDFS first; only then can we analyze it with Hadoop MapReduce.
> Sometimes the upload takes a long time, so if we could add input while a job is running, that time could be saved.
>
> WHAT?
> Client:
> a) hadoop job -add-input jobId inputFormat ...
>    Add the input to jobId.
> b) hadoop job -add-input done
>    Tell the JobTracker that the input preparation is finished.
> c) hadoop job -add-input status jobId
>    Show how much input the jobId has.
>
> HOWTO?
> Mainly, I think we should do three things:
> 1. JobClient: the JobClient should support adding input to a job; the JobClient generates the splits and submits them to the JobTracker.
> 2. JobTracker: the JobTracker should support addInput and add the new tasks to the original map tasks. Because the uploaded data will be processed quickly, the scheduler should also be updated to support holding a map task pending until the client declares the job's input done.
> 3. Reducer: the reducer should also update the number of maps (mapNums) so that the shuffle works correctly.
>
> This is the rough idea, and I will update it.
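For the {{newSplitHints}} variant from the comment above, the InputFormat would decode the argv-style hints itself. A minimal sketch under stated assumptions: the {{path=}}/{{table=}} hint syntax is invented for illustration, the class name is hypothetical, and plain strings again stand in for real InputSplits.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the getNewInputSplits(JobContext, String... newSplitHints)
// variant: the hints arrive as a user-specified argv that the InputFormat
// decodes into paths, database tables, etc. The "path="/"table=" syntax
// is made up for this example.
public class HintDecoder {

    // Decode each hint into an input source the InputFormat could split.
    public static List<String> decode(String... newSplitHints) {
        List<String> inputs = new ArrayList<>();
        for (String hint : newSplitHints) {
            if (hint.startsWith("path=")) {
                // An HDFS path to enumerate for new files.
                inputs.add(hint.substring("path=".length()));
            } else if (hint.startsWith("table=")) {
                // A database table to turn into splits.
                inputs.add("db://" + hint.substring("table=".length()));
            } else {
                throw new IllegalArgumentException("unknown hint: " + hint);
            }
        }
        return inputs;
    }

    public static void main(String[] args) {
        System.out.println(decode("path=/logs/today", "table=clicks"));
    }
}
```

The point of the varargs form is that no existing-split state is needed at all: the user tells the InputFormat exactly which new inputs to split, and the InputFormat only has to interpret the hints.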