[ https://issues.apache.org/jira/browse/MAPREDUCE-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888332#action_12888332 ]
Steven Lewis commented on MAPREDUCE-1928: ----------------------------------------- Another possible use has to do with adjusting parameters to avoid failures. I have an issue where a reducer is running out of memory. If I was aware that certain keys lead to this failure I could take steps such as sampling data rather than processing the whole set do I would add access to data about failures > Dynamic information fed into Hadoop for controlling execution of a submitted > job > -------------------------------------------------------------------------------- > > Key: MAPREDUCE-1928 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-1928 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components: job submission, jobtracker, tasktracker > Affects Versions: 0.20.3 > Reporter: Raman Grover > Original Estimate: 2016h > Remaining Estimate: 2016h > > Currently the job submission protocol requires the job provider to put every > bit of information inside an instance of JobConf. The submitted information > includes the input data (hdfs path) , suspected resource requirement, number > of reducers etc. This information is read by JobTracker as part of job > initialization. Once initialized, job is moved into a running state. From > this point, there is no mechanism for any additional information to be fed > into Hadoop infrastructure for controlling the job execution. > The execution pattern for the job looks very much > static from this point. Using the size of input data and a few settings > inside JobConf, number of mappers is computed. Hadoop attempts at reading the > whole of data in parallel by launching parallel map tasks. Once map phase is > over, a known number of reduce tasks (supplied as part of JobConf) are > started. > Parameters that control the job execution were set in JobConf prior to > reading the input data. As the map phase progresses, useful information based > upon the content of the input data surfaces and can be used in controlling > the further execution of the job. Let us walk through some of the examples > where additional information can be fed to Hadoop subsequent to job > submission for optimal execution of the job. > I) "Process a part of the input , based upon the results decide if reading > more input is required " > In a huge data set, user is interested in finding 'k' records that > satisfy a predicate, essentially sampling the data. In current > implementation, as the data is huge, a large no of mappers would be launched > consuming a significant fraction of the available map slots in the cluster. > Each map task would attempt at emitting a max of 'k' records. With N > mappers, we get N*k records out of which one can pick any k to form the final > result. > This is not optimal as: > 1) A larger number of map slots get occupied initially, affecting other > jobs in the queue. > 2) If the selectivity of input data is very low, we essentially did not > need scanning the whole of data to form our result. > we could have finished by reading a fraction of input data, > monitoring the cardinality of the map output and determining if > more input needs to be processed. > > Optimal way: If reading the whole of input requires N mappers, launch only > 'M' initially. Allow them to complete. Based upon the statistics collected, > decide additional number of mappers to be launched next and so on until the > whole of input has been processed or enough records have been collected to > for the results, whichever is earlier. > > > II) "Here is some data, the remaining is yet to arrive, but you may start > with it, and receive more input later" > Consider a chain of 2 M-R jobs chained together such that the latter > reads the output of the former. The second MR job cannot be started until the > first has finished completely. This is essentially because Hadoop needs to be > told the complete information about the input before beginning the job. > The first M-R has produced enough data ( not finished yet) that can be > processed by another MR job and hence the other MR need not wait to grab the > whole of input before beginning. Input splits could be supplied later , but > ofcourse before the copy/shuffle phase. > > III) " Input data has undergone one round of processing by map phase, have > some stats, can now say of the resources > required further" > Mappers can produce useful stats about of their output, like the > cardinality or produce a histogram describing distribution of output . These > stats are available to the job provider (Hive/Pig/End User) who can > now determine with better accuracy of the resources (memory > requirements ) required in reduction phase, and even the number of > reducers or may even alter the reduction logic by altering the reducer class > parameter. > > In a nut shell, certain parameters about a job are governed by the input data > and the intermediate results produced and hence need to be overridden as job > progresses. Hadoop does not allow such information to be fed dynamically. > Hence job execution may not always be optimal. > I would like to get feedback from the Hadoop community about the above > proposal and if any similar effort is already underway. > If we agree, as a next step I would like to discuss the implementation > details that I have worked out end-to-end. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.