[ https://issues.apache.org/jira/browse/MAPREDUCE-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832608#action_12832608 ]

Aaron Kimball commented on MAPREDUCE-1434:
------------------------------------------

Owen,

The {{getNewInputSplits}} method proposed above requires the InputFormat to 
maintain state about the previously-enumerated InputSplits. The proposed 
command-line tools suggest that independent user-side processes add files to 
the job, which makes keeping that state difficult. And since splits are 
calculated on the client, while the "true" list of input splits is held by the 
JobTracker (or is/could the splits file be written to HDFS?), calculating just 
the delta could be tricky.

I think it might be more reasonable if one of the following things were true:
* The client code just calls {{getInputSplits()}} again. The same algorithm is 
run as in "initial" job submission, but the output list may be longer than the 
previous list returned by this method. The InputFormat is responsible for 
ensuring that it doesn't return any fewer splits than it did before (i.e., 
don't drop inputs)
* For that matter, if the input queue for a job is dynamic, I don't see why 
this same mechanism couldn't be used to drop splits that are, for whatever 
reason, irrelevant.
* {{getNewInputSplits()}} should have the signature: {{InputSplit [] 
getNewInputSplits(JobContext job, List<InputSplit> existingSplits) throws 
IOException, InterruptedException}}.

The latter case would present to the user a list of the existing inputs read 
from the existing 'splits' file for the job. That way state-tracking is 
unnecessary; you can just use (e.g.) a PathFilter to disregard things already 
in {{existingSplits}}.
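As a rough, Hadoop-free sketch of that dedup idea (the record type and method
here are plain-Java stand-ins for InputSplit and the proposed signature, not
the real MapReduce API), the {{existingSplits}} variant might filter like this:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NewSplitsSketch {
    // Simplified stand-in for an InputSplit: a path plus a byte range.
    record Split(String path, long offset, long length) {}

    // Model of getNewInputSplits(job, existingSplits): enumerate all current
    // inputs, then drop anything the JobTracker already knows about.
    static List<Split> getNewInputSplits(List<Split> allCurrent,
                                         List<Split> existing) {
        Set<Split> seen = new HashSet<>(existing);
        List<Split> fresh = new ArrayList<>();
        for (Split s : allCurrent) {
            if (!seen.contains(s)) {
                fresh.add(s);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        List<Split> existing = List.of(new Split("/in/a", 0, 64));
        List<Split> current = List.of(
            new Split("/in/a", 0, 64),    // already submitted
            new Split("/in/b", 0, 128));  // newly uploaded file
        // Only the /in/b split is new.
        System.out.println(getNewInputSplits(current, existing).size());
    }
}
```

In a real InputFormat the filter would more likely be a PathFilter applied
before split calculation, but the monotonic "never drop, only add" property is
the same.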

A final proposition is that users must manually specify new paths (or other 
arbitrary arguments like database table names, URLs, etc.) to include, in 
addition to the InputFormat. In that case, it might look saner to have:
* {{getNewInputSplits()}} should have the signature: {{InputSplit [] 
getNewInputSplits(JobContext job, String... newSplitHints) throws IOException, 
InterruptedException}}.

{{newSplitHints}} is effectively a user-specified argv; it can be decoded as a 
list of Paths, database tables, etc., and used appropriately by the 
InputFormat to generate new splits.
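A hedged sketch of how an InputFormat might decode such hints (the
`kind:value` convention and the class/method names here are invented for
illustration; nothing below is a proposed or existing API):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitHintsSketch {
    // A decoded hint: an input kind plus its identifier (path, table, ...).
    record Hint(String kind, String value) {}

    // Model of getNewInputSplits(job, String... newSplitHints): treat each
    // hint as "<kind>:<value>", defaulting to a filesystem path when no
    // kind prefix is present.
    static List<Hint> decode(String... newSplitHints) {
        List<Hint> out = new ArrayList<>();
        for (String h : newSplitHints) {
            int colon = h.indexOf(':');
            if (colon > 0) {
                out.add(new Hint(h.substring(0, colon), h.substring(colon + 1)));
            } else {
                out.add(new Hint("path", h));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Hint h : decode("table:users", "/data/new-batch")) {
            System.out.println(h.kind() + " -> " + h.value());
        }
    }
}
```

The InputFormat would then turn each decoded hint into splits with whatever
logic it already uses for that input kind.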

Another question: what are the semantics of a doubly-specified split? (I'm 
especially curious about the inexact-match case, where the same file in HDFS 
is enumerated twice but the splits are at different offsets.) Can/should the 
same file be processed twice in a job?

Finally: why does a user-disconnect timeout kill the job? That's different 
from the usual case in MapReduce, where a user disconnect is not noticed by 
the server-side processes at all. I would think that a user-disconnect timeout 
should declare that all the input has been added and that the reduce phase can 
begin -- not kill the job.

> Dynamic add input for one job
> -----------------------------
>
>                 Key: MAPREDUCE-1434
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1434
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>         Environment: 0.19.0
>            Reporter: Xing Shi
>
> Usually we must first upload the data to HDFS before we can analyze it 
> with Hadoop MapReduce.
> Sometimes the upload takes a long time, so if we could add input while a 
> job is running, that time could be saved.
> WHAT?
> Client:
> a) hadoop job -add-input jobId inputFormat ...
> Add the input to the given job.
> b) hadoop job -add-input done
> Tell the JobTracker that the input is fully prepared.
> c) hadoop job -add-input status jobid
> Show how many inputs the job has.
> HOWTO?
> Mainly, I think we should do three things:
> 1. JobClient: the JobClient should support adding input to a job; the 
> JobClient generates the splits and submits them to the JobTracker.
> 2. JobTracker: the JobTracker should support addInput, adding the new tasks 
> to the original map tasks. Because the newly uploaded data will be 
> processed quickly, the scheduler should also be updated to support pending 
> a map task until the client tells the JobTracker the input is done.
> 3. Reducer: the reducer should also update the number of map tasks, so the 
> shuffle works correctly.
> This is a rough idea, and I will update it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
