[ https://issues.apache.org/jira/browse/HADOOP-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554586 ]
musepwizard edited comment on HADOOP-1622 at 12/27/07 8:11 AM:
----------------------------------------------------------------
I have only gotten a chance to design, not to develop, this as I have been launching the Search Wikia site. Here is what I have come up with in terms of a more generalized design after talking with both Doug and Owen about this enhancement:

1. A runjob utility. runjar is not affected as it is made to only run a single jar.

2. The options parser will be extended to support resources, upload, classpath, noclasspath, compress, decompress, and cache.
- Items that are cached are added to the distributed cache.
- Items uploaded are by default not added to the classpath.
- Items cached are by default added to the classpath.
- Resources are by default added to the classpath.
- Compress will choose resources to compress before adding them to the job.jar file.
- Decompress will choose resources to be decompressed before adding them to the job.jar file.
- Compress and decompress will only act on resources being added to the job. This will include non-local resources and will need to be handled in slave-local job resources.
- Classpath is ignored for any resource that is being uploaded, as it will already be added to the classpath by virtue of being in resources.
- All options support multiple elements in comma-separated format.
- Noclasspath will remove cached and non-cached resources from the classpath. For example, a jar can be added to resources and included in the local job.jar resources but not included in its local classpath. (I don't know if this functionality is useful?)

3. Resources
- Resources are one or more items that are jarred up into the single job.jar file.
- Resources can be files (compressed or uncompressed) or directories.
- Resources can be from any file system.
- Resource paths support relative and absolute paths.
- Resources support URL-type paths to support multiple file systems.
- If the path is not in a URL format, it is assumed to be on the local file system as either an absolute or relative path.
- Only resources that exist will be included. This is true for any file system. The resource must exist at the beginning of the job to be uploaded. If the resource exists at the beginning of the job but not when the local job starts its processing, an error will be thrown and that task will cease operation.
- A global configuration variable exists to choose to decompress any compressed file that is added as a resource.
- Non-local resources will be pulled down into the local job resources from the resource's given file system. This can include DFS and S3 resources added dynamically.
- Local resources that are added to the job.jar will be resources from the resources configuration variable passed to the local jobs. Remaining resources will be the non-local resources that need to be added to local job resources.

4. Uploads
- Uploads by default are put into the user's home directory on the jobtracker file system.
- Upload directories can be set either through a configuration variable for a global default upload folder or through a colon path structure in the upload, something like path:uploadto.
- Upload resources can be added to the classpath by the classpath option.
- If upload resources are added to the classpath, they will be pulled down into the resources for each job and added to the local job classpath.
- Uploads are independent of resources. An upload doesn't have to be a resource. A resource can be an uploaded element; in this case it would be uploaded (not included in the local job.jar) and then pulled down from the jobtracker file system as a resource.
- Uploads will check modified date/time and size before uploading elements.
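The resource path rules above (a URL-type path keeps the file system named by its scheme; a bare path is treated as local, absolute or relative) could be sketched as follows. The class and method names here are hypothetical, for illustration only, and are not part of any attached patch:

```java
import java.net.URI;

// Hypothetical helper illustrating the path-resolution rule: paths carrying
// a URI scheme (hdfs://, s3://, file://) belong to that file system, while
// scheme-less paths are assumed to be on the local file system.
public class ResourcePaths {
    /** Returns the scheme of the file system a resource path refers to. */
    public static String schemeOf(String path) {
        URI uri = URI.create(path);
        // A null scheme means a plain local path (absolute or relative).
        return uri.getScheme() == null ? "local" : uri.getScheme();
    }

    public static void main(String[] args) {
        System.out.println(schemeOf("s3://bucket/lib/dep.jar"));   // s3
        System.out.println(schemeOf("hdfs://nn:9000/user/x.jar")); // hdfs
        System.out.println(schemeOf("lib/dep.jar"));               // local (relative)
        System.out.println(schemeOf("/opt/jars/dep.jar"));         // local (absolute)
    }
}
```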
If the upload is a directory, the upload will recursively check all files in that directory before upload and only upload modified files. This should give rsync-type functionality to uploading resources and should decrease bandwidth consumption.
- Upload will support URL-type paths as well. This will allow transferring resources from one type of file system (i.e. S3) to the jobtracker's file system. Again, resources without a URL-type structure will be considered local file system and will support relative and absolute paths. Only absolute paths will be supported on non-local file systems.

> Hadoop should provide a way to allow the user to specify jar file(s) the user
> job depends on
> --------------------------------------------------------------------------------------------
>
> Key: HADOOP-1622
> URL: https://issues.apache.org/jira/browse/HADOOP-1622
> Project: Hadoop
> Issue Type: Improvement
> Reporter: Runping Qi
> Assignee: Dennis Kubes
> Fix For: 0.16.0
>
> Attachments: hadoop-1622-4-20071008.patch, HADOOP-1622-5.patch,
> HADOOP-1622-6.patch, HADOOP-1622-7.patch, HADOOP-1622-8.patch,
> HADOOP-1622-9.patch, multipleJobJars.patch, multipleJobResources.patch,
> multipleJobResources2.patch
>
> More likely than not, a user's job may depend on multiple jars.
> Right now, when submitting a job through bin/hadoop, there is no way for the
> user to specify that.
> A workaround is to re-package all the dependent jars into a new jar
> or put the dependent jar files in the lib dir of the new jar.
> This workaround causes unnecessary inconvenience to the user.
> Furthermore,
> if the user does not own the main function
> (like the case when the user uses Aggregate, datajoin, or streaming), the
> user has to re-package those system jar files too.
> It is much desired that Hadoop provide a clean and simple way for the user
> to specify a list of dependent jar files at the time
> of job submission. Something like:
> bin/hadoop .... --depending_jars j1.jar:j2.jar

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
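The rsync-style freshness check described in the uploads design, comparing modified date/time and size before transferring, could be sketched as below. RemoteStat and needsUpload are hypothetical names; a real implementation would compare against file metadata from the jobtracker's file system rather than plain numbers:

```java
// Sketch of the modified-time/size check from the uploads design: a file is
// re-uploaded only when its local length differs from the copy already on
// the jobtracker file system, or when the local copy is newer. For a
// directory upload, this check would be applied recursively per file.
public class UploadCheck {
    /** Metadata of the already-uploaded copy (hypothetical). */
    public static class RemoteStat {
        public final long length;
        public final long modTime;
        public RemoteStat(long length, long modTime) {
            this.length = length;
            this.modTime = modTime;
        }
    }

    /** True if the local file differs from the remote copy and must be sent. */
    public static boolean needsUpload(long localLen, long localMod, RemoteStat remote) {
        if (remote == null) return true;   // never uploaded before
        return localLen != remote.length   // size changed
            || localMod > remote.modTime;  // local copy is newer
    }

    public static void main(String[] args) {
        RemoteStat remote = new RemoteStat(1024, 1000L);
        System.out.println(needsUpload(1024, 1000L, remote)); // false: unchanged
        System.out.println(needsUpload(2048, 1000L, remote)); // true: size differs
        System.out.println(needsUpload(1024, 2000L, remote)); // true: modified locally
        System.out.println(needsUpload(1024, 1000L, null));   // true: not yet uploaded
    }
}
```

Skipping unchanged files this way is what gives the bandwidth savings the design aims for, at the cost of trusting timestamps to reflect real changes.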