I'm largely at fault for the "user code running in the JobTracker" that
exists.

I support this change, but I might reformulate it. Why not make this a
special kind of Job? It could be formulated roughly like this:

input<JobDescription,FilePaths> -> map(Job,FilePath) ->
reduce(Job,FileSplits) -> SchedulableJob

It might even make sense to do an extra run that pre-computes cached
locations of FileSplits, although I think that is still bottlenecked by the
NameNode.
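
As a rough illustration of the formulation above, here is a minimal sketch
in plain Java. It is not Hadoop code: the Split and SchedulableJob classes,
the plan/map/reduce method names, and the fixed split size are all
assumptions made up for the example, and file lengths are passed in
directly rather than looked up through the NameNode.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these types are real Hadoop classes.
public class SplitPlanningSketch {

    // A byte range of one input file; stands in for a FileSplit.
    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    // Output of the planning job: a job name plus its precomputed splits.
    static final class SchedulableJob {
        final String job;
        final List<Split> splits;

        SchedulableJob(String job, List<Split> splits) {
            this.job = job;
            this.splits = splits;
        }
    }

    // "map(Job, FilePath)": cut one file into splits of at most splitSize bytes.
    static List<Split> map(String path, long fileLength, long splitSize) {
        List<Split> out = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            out.add(new Split(path, start, Math.min(splitSize, fileLength - start)));
        }
        return out;
    }

    // "reduce(Job, FileSplits)": gather the per-file splits into one job.
    static SchedulableJob reduce(String job, List<List<Split>> perFile) {
        List<Split> all = new ArrayList<>();
        for (List<Split> splits : perFile) {
            all.addAll(splits);
        }
        return new SchedulableJob(job, all);
    }

    // Runs the map step per file, then the reduce step, in a single process.
    static SchedulableJob plan(String job, Map<String, Long> fileLengths, long splitSize) {
        List<List<Split>> perFile = new ArrayList<>();
        for (Map.Entry<String, Long> e : fileLengths.entrySet()) {
            perFile.add(map(e.getKey(), e.getValue(), splitSize));
        }
        return reduce(job, perFile);
    }

    public static void main(String[] args) {
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("/data/a", 250L);
        files.put("/data/b", 100L);
        SchedulableJob planned = plan("wordcount", files, 100L);
        System.out.println(planned.splits.size() + " splits for " + planned.job);
    }
}
```

In a real cluster the map step would run as tasks, so the user's split
logic would execute there rather than in the JobTracker.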

On 9/28/06, Doug Cutting <[EMAIL PROTECTED]> wrote:

Benjamin Reed wrote:
> One of the things that bothers me about the JobTracker is that it is
> running user code when it creates the FileSplits. In the long term this
> puts the JobTracker JVM at risk due to errors in the user code.

JVMs are supposed to be able to do this kind of stuff securely.  Still,
we don't currently leverage this much, and the JVM's security is
limited, so it is a valid concern.

Note that, while we do avoid running user code in tasktrackers (mapping,
sorting and reducing are done in a subprocess) they're still run as a
system user id.  So security issues are to some degree unavoidable.

But in terms of inadvertent denial of service, running user code in the
job tracker, a single point of failure, does make the system more fragile.

> The JobTracker uses the InputFormat to create a set of tasks that it
> then schedules. The task creation does not need to happen at the
> JobTracker. If we allowed the clients to create the set of tasks, the
> JobTracker would not need to load and run any user generated code. It
> would also remove some of the processing load from the JobTracker. On
> the downside it does greatly increase the amount of information sent to
> the JobTracker when a job is submitted.

Right, so JobSubmissionProtocol.submitJob(String jobFile) could be
altered to be submitJob(String jobFile, Split[]).  The RPC system can
handle reasonably large values like this, so I don't think that would be
a problem.  But the memory impact on the JobTracker could become
significant, since the splits for queued jobs would now be around.  This
could be mitigated by writing the splits to a temporary file.
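
A minimal sketch of that mitigation, assuming a hypothetical Split type
with just a path, start offset and length (not the real Hadoop class) and
made-up method names spill and load. It only shows the idea that a queued
job could hold a temporary file name in memory instead of the splits
themselves.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch only: not the JobTracker's actual code.
public class SplitSpillSketch {

    // Stand-in for a client-computed split.
    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    // Write the splits to a temporary file and return its location.
    static Path spill(Split[] splits) throws IOException {
        Path tmp = Files.createTempFile("job-splits-", ".bin");
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(tmp))) {
            out.writeInt(splits.length);
            for (Split s : splits) {
                out.writeUTF(s.path);
                out.writeLong(s.start);
                out.writeLong(s.length);
            }
        }
        return tmp;
    }

    // Read the splits back when the job is about to be scheduled.
    static Split[] load(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            Split[] splits = new Split[in.readInt()];
            for (int i = 0; i < splits.length; i++) {
                String path = in.readUTF();
                long start = in.readLong();
                long length = in.readLong();
                splits[i] = new Split(path, start, length);
            }
            return splits;
        }
    }
}
```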

The semantics would be subtly different: if you queue a job now, the
file listing is done just before the job is executed, not when it's
submitted.  But programs shouldn't rely on that, so I don't think this
is a big worry.

Overall, I don't see any major problems with this.  It won't simplify
things much.  We can remove the code which computes splits in a separate
thread, but we'd have to add code to store splits to temporary files, so
code size is a wash.  And it would remove a potential reliability problem.

Doug




--
Bryan A. P. Pendleton
Ph: (877) geek-1-bp
