[ 
https://issues.apache.org/jira/browse/PIG-2672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773507#comment-13773507
 ] 

Aniket Mokashi commented on PIG-2672:
-------------------------------------

I have attached a patch that adds two configuration parameters: 
cluster.cache.location and user.cache.location.

Jars are copied to <cache.location>/a/b/c/checksum-jarname.jar, where a, b, and 
c are the first three characters of the checksum. When a new jar is registered, 
its checksum is calculated and we check whether a jar with the same name and 
checksum already exists in the cache. If it does, the copy to HDFS is skipped.
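
To make the path layout concrete, here is a minimal sketch of the scheme 
described above. It is not the code from the patch: the function names are 
hypothetical, SHA-1 is an assumption (the comment only says "checksum"), and a 
plain set of paths stands in for an HDFS existence check.

```python
import hashlib
from pathlib import PurePosixPath


def cached_jar_path(cache_location: str, jar_name: str, jar_bytes: bytes) -> str:
    """Build <cache.location>/a/b/c/checksum-jarname.jar for a jar's contents."""
    # Assumption: SHA-1 hex digest; the actual checksum algorithm is not stated.
    checksum = hashlib.sha1(jar_bytes).hexdigest()
    # a, b, c are the first three characters of the checksum.
    a, b, c = checksum[0], checksum[1], checksum[2]
    return str(PurePosixPath(cache_location) / a / b / c / f"{checksum}-{jar_name}")


def needs_copy(existing_paths: set, cache_location: str,
               jar_name: str, jar_bytes: bytes) -> bool:
    """Return True when no jar with the same name and checksum is cached yet."""
    # existing_paths is a stand-in for asking HDFS whether the path exists.
    return cached_jar_path(cache_location, jar_name, jar_bytes) not in existing_paths
```

Because the checksum is part of the file name, a changed jar always maps to a 
different path, so an existing cache entry is never overwritten.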

Permission to write to the cache is managed by HDFS permissions. Also, it's not 
possible to overwrite a jar using this mechanism. If a jar changes, its checksum 
will also change and it will be stored as a new jar in the cache. Removal of old 
jars is a manual step: admins/users can list jars under the cache location and 
remove the ones that are very old. Alternatively, you can delete all the jars in 
the cache or change the jar cache location, and the cache will be repopulated by 
running jobs.
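
As a rough illustration of the manual cleanup step, the sketch below lists jars 
older than a given age. It is only an assumption of how one might do this: it 
walks a local directory tree as a stand-in for the HDFS cache location, uses 
file modification time as the age, and find_stale_jars is a hypothetical helper 
that is not part of the patch.

```python
import os
import time
from typing import List, Optional


def find_stale_jars(cache_root: str, max_age_days: float,
                    now: Optional[float] = None) -> List[str]:
    """Return paths of .jar files under cache_root older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        for name in filenames:
            if not name.endswith(".jar"):
                continue
            path = os.path.join(dirpath, name)
            # Assumption: modification time is a good enough proxy for age.
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)
```

The function only reports candidates; deleting them is left to the admin or 
user, in keeping with removal being a manual step.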

If this approach looks reasonable, I can add a few more tests. Comments welcome!
                
> Optimize the use of DistributedCache
> ------------------------------------
>
>                 Key: PIG-2672
>                 URL: https://issues.apache.org/jira/browse/PIG-2672
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.12
>
>         Attachments: PIG-2672.patch
>
>
> Pig currently copies jar files to a temporary location in HDFS and then adds 
> them to DistributedCache for each job launched. This is inefficient in terms 
> of:
>    * Space - The jars are distributed to task trackers for every job, taking 
> up a lot of local temporary space on the tasktrackers.
>    * Performance - The jar distribution impacts the job launch time.

