[ https://issues.apache.org/jira/browse/HADOOP-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554586 ]
musepwizard edited comment on HADOOP-1622 at 12/27/07 8:11 AM:
----------------------------------------------------------------
I have only gotten a chance to design, not to develop, this as I have been launching the Search Wikia site. Here is what I have come up with in terms of a more generalized design after talking with both Doug and Owen about this enhancement:

1. A runjob utility. runjar is not affected as it is made to only run a single jar.

2. The options parser will be extended to support resources, upload, classpath, noclasspath, compress, decompress, and cache.
- Items that are cached are added to the distributed cache.
- Items uploaded are by default not added to the classpath.
- Items cached are by default added to the classpath.
- Resources are by default added to the classpath.
- Compress will choose resources to compress before adding them to the job.jar file.
- Decompress will choose resources to be decompressed before adding them to the job.jar file.
- Compress and decompress will only act on resources being added to the job. This will include non-local resources and will need to be handled in slave-local job resources.
- Classpath is ignored for any resource that is being uploaded, as it will already be added to the classpath by virtue of being in resources.
- All options support multiple elements in comma-separated format.
- Noclasspath will remove cached and non-cached resources from the classpath. For example, a jar can be added to resources and included in the local job.jar resources but not included in its local classpath. (I don't know if this functionality is useful?)

3. Resources
- Resources are one or more items that are jarred up into the single job.jar file.
- Resources can be files (compressed or uncompressed) or directories.
- Resources can be from any file system.
- Resource paths support relative and absolute paths.
- Resources support URL-type paths to support multiple file systems.
- If the path is not in a URL format, it is assumed to be on the local file system as either an absolute or relative path.
- Only resources that exist will be included. This is true for any file system. The resource must exist at the beginning of the job to be uploaded. If the resource exists at the beginning of the job but not when the local job starts its processing, an error will be thrown and that task will cease operation.
- A global configuration variable exists to choose to decompress any compressed file that is added as a resource.
- Non-local resources will be pulled down into the local job resources from the resource's given file system. This can include DFS and S3 resources added dynamically.
- Local resources that are added to the job.jar will be resources from the resources configuration variable passed to the local jobs. Remaining resources will be the non-local resources that need to be added to local job resources.

4. Uploads
- Uploads by default are put into the user's home directory on the jobtracker file system.
- Upload directories can be set either through a configuration variable for a global default upload folder or through a colon path structure in the upload, something like path:uploadto.
- Upload resources can be added to the classpath by the classpath option.
- If upload resources are added to the classpath, they will be pulled down into the resources for each job and added to the local job classpath.
- Uploads are independent of resources. An upload doesn't have to be a resource. A resource can be an uploaded element; in this case it would be uploaded (not included in the local job.jar) and then pulled down from the jobtracker file system as a resource.
- Uploads will check modified date/time and size before uploading elements.
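The resource path rules above (a URL-type path keeps the file system named by its scheme; a bare path is treated as local, absolute or relative) could be sketched as follows. The class and method names here are hypothetical, for illustration only, and are not part of any attached patch:

```java
import java.net.URI;

// Hypothetical helper illustrating the path-resolution rule: paths carrying
// a URI scheme (hdfs://, s3://, file://) belong to that file system, while
// scheme-less paths are assumed to be on the local file system.
public class ResourcePaths {
    /** Returns the scheme of the file system a resource path refers to. */
    public static String schemeOf(String path) {
        URI uri = URI.create(path);
        // A null scheme means a plain local path (absolute or relative).
        return uri.getScheme() == null ? "local" : uri.getScheme();
    }

    public static void main(String[] args) {
        System.out.println(schemeOf("s3://bucket/lib/dep.jar"));   // s3
        System.out.println(schemeOf("hdfs://nn:9000/user/x.jar")); // hdfs
        System.out.println(schemeOf("lib/dep.jar"));               // local (relative)
        System.out.println(schemeOf("/opt/jars/dep.jar"));         // local (absolute)
    }
}
```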
If the upload is a directory, the upload will recursively check all files in that directory before upload and only upload modified files. This should give rsync-type functionality to uploading resources and should decrease bandwidth consumption.
- Upload will support URL-type paths as well. This will allow transferring resources from one type of file system (i.e. S3) to the jobtracker's file system. Again, resources without a URL-type structure will be considered local file system and will support relative and absolute paths. Only absolute paths will be supported on non-local file systems.

> Hadoop should provide a way to allow the user to specify jar file(s) the user
> job depends on
> --------------------------------------------------------------------------------------------
>
> Key: HADOOP-1622
> URL: https://issues.apache.org/jira/browse/HADOOP-1622
> Project: Hadoop
> Issue Type: Improvement
> Reporter: Runping Qi
> Assignee: Dennis Kubes
> Fix For: 0.16.0
>
> Attachments: hadoop-1622-4-20071008.patch, HADOOP-1622-5.patch,
> HADOOP-1622-6.patch, HADOOP-1622-7.patch, HADOOP-1622-8.patch,
> HADOOP-1622-9.patch, multipleJobJars.patch, multipleJobResources.patch,
> multipleJobResources2.patch
>
> More likely than not, a user's job may depend on multiple jars.
> Right now, when submitting a job through bin/hadoop, there is no way for the
> user to specify that.
> A workaround is to re-package all the dependent jars into a new jar
> or put the dependent jar files in the lib dir of the new jar.
> This workaround causes unnecessary inconvenience to the user.
> Furthermore,
> if the user does not own the main function
> (like the case when the user uses Aggregate, datajoin, or streaming), the
> user has to re-package those system jar files too.
> It is much desired that Hadoop provide a clean and simple way for the user
> to specify a list of dependent jar files at the time
> of job submission. Something like:
> bin/hadoop .... --depending_jars j1.jar:j2.jar

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
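The rsync-style freshness check described in the uploads design, comparing modified date/time and size before transferring, could be sketched as below. RemoteStat and needsUpload are hypothetical names; a real implementation would compare against file metadata from the jobtracker's file system rather than plain numbers:

```java
// Sketch of the modified-time/size check from the uploads design: a file is
// re-uploaded only when its local length differs from the copy already on
// the jobtracker file system, or when the local copy is newer. For a
// directory upload, this check would be applied recursively per file.
public class UploadCheck {
    /** Metadata of the already-uploaded copy (hypothetical). */
    public static class RemoteStat {
        public final long length;
        public final long modTime;
        public RemoteStat(long length, long modTime) {
            this.length = length;
            this.modTime = modTime;
        }
    }

    /** True if the local file differs from the remote copy and must be sent. */
    public static boolean needsUpload(long localLen, long localMod, RemoteStat remote) {
        if (remote == null) return true;   // never uploaded before
        return localLen != remote.length   // size changed
            || localMod > remote.modTime;  // local copy is newer
    }

    public static void main(String[] args) {
        RemoteStat remote = new RemoteStat(1024, 1000L);
        System.out.println(needsUpload(1024, 1000L, remote)); // false: unchanged
        System.out.println(needsUpload(2048, 1000L, remote)); // true: size differs
        System.out.println(needsUpload(1024, 2000L, remote)); // true: modified locally
        System.out.println(needsUpload(1024, 1000L, null));   // true: not yet uploaded
    }
}
```

Skipping unchanged files this way is what gives the bandwidth savings the design aims for, at the cost of trusting timestamps to reflect real changes.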