[ https://issues.apache.org/jira/browse/PIG-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates updated PIG-1752:
----------------------------

    Status: Patch Available  (was: Open)

This patch changes the EvalFunc interface to allow UDFs to declare a list of files they want placed in the distributed cache. It adds a new method:

{code}
    /**
     * Allow a UDF to specify a list of files it would like placed in the distributed
     * cache.  These files will be put in the cache for every job the UDF is used in.
     * The default implementation returns null.
     * @return A list of files
     */
    public List<String> getCacheFiles() {
        return null;
    }
{code}

This change is backward compatible, since EvalFunc is an abstract class and the default implementation returns null.

Any files returned by getCacheFiles are captured and placed in the physical plan during logical-to-physical translation. The JobControlCompiler then visits each UDF and adds the returned files to the list of files to load into the distributed cache for this job.

No special handling is provided for the files. Users must ensure the files are already on HDFS. Each filename should be of the form:

    hdfs://namenode/path#symlink

where symlink is the name under which the file will be linked in the task's local directory. The UDF can then access the file in the backend by opening that symlink as a local file.

> UDFs should be able to indicate files to load in the distributed cache
> ----------------------------------------------------------------------
>
>                 Key: PIG-1752
>                 URL: https://issues.apache.org/jira/browse/PIG-1752
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>            Priority: Minor
>         Attachments: PIG-1752.patch
>
>
> Currently there is no way for a UDF to load a file into the distributed cache.
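
For illustration only, here is a minimal sketch of how a UDF might use the new method and how the hdfs://namenode/path#symlink form splits into an HDFS path and a symlink name. It deliberately does not depend on Pig: EvalFuncSketch is a hypothetical stand-in for EvalFunc reduced to the method described above, and LookupUdf, CacheFileSpec, and the lookup.txt path are made-up names, not part of the patch.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for Pig's EvalFunc, reduced to the method the patch adds.
// The real class works on Pig Tuples; a String is used here to stay self-contained.
abstract class EvalFuncSketch<T> {
    public abstract T exec(String input);

    // Mirrors the default described in the patch: no cache files requested.
    public List<String> getCacheFiles() {
        return null;
    }
}

// Hypothetical UDF that asks for one file to be placed in the distributed cache.
class LookupUdf extends EvalFuncSketch<String> {
    @Override
    public List<String> getCacheFiles() {
        // Assumed example path; the file must already exist on HDFS.
        return Arrays.asList("hdfs://namenode/data/lookup.txt#lookup");
    }

    @Override
    public String exec(String input) {
        // In a backend task the UDF would open the symlink "lookup"
        // as an ordinary local file; that step is omitted in this sketch.
        return input;
    }
}

// Hypothetical helper showing how the two parts of the filename separate.
final class CacheFileSpec {
    private CacheFileSpec() {}

    // The part after '#': the name linked into the task's local directory.
    static String symlinkOf(String spec) {
        return URI.create(spec).getFragment();
    }

    // The path of the file on HDFS, without scheme, host, or fragment.
    static String hdfsPathOf(String spec) {
        return URI.create(spec).getPath();
    }
}
```

With the real EvalFunc, the override of getCacheFiles would look the same, and exec would read the file through the symlink name rather than through the HDFS path.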