[ https://issues.apache.org/jira/browse/PIG-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates updated PIG-1752:
----------------------------
Status: Patch Available (was: Open)
This patch changes the EvalFunc interface to allow UDFs to declare a list of
files they want to put in the distributed cache. It adds a new method:
{code}
/**
 * Allow a UDF to specify a list of files it would like placed in the
 * distributed cache. These files will be put in the cache for every job
 * the UDF is used in. The default implementation returns null.
 * @return A list of files
 */
public List<String> getCacheFiles() {
    return null;
}
{code}
This change is backward compatible since EvalFunc is an abstract class and the
default implementation returns null.
Any files returned by getCacheFiles are captured and placed in the physical
plan during logical->physical translation. The JobControlCompiler then visits
each UDF and adds the files returned to the list of files to load into the
distributed cache for this job.
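To illustrate the two points above (existing UDFs keep working, and files are gathered per job), here is a minimal, self-contained sketch. It does not use Pig's classes: SketchEvalFunc is a hypothetical stand-in for EvalFunc reduced to the one new method, collectCacheFiles is a hypothetical stand-in for the gathering step and not the JobControlCompiler code in the patch, and the HDFS path is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CacheFilesSketch {

    // Hypothetical stand-in for Pig's EvalFunc, reduced to the new method.
    // It only mirrors the "default implementation returns null" contract.
    abstract static class SketchEvalFunc {
        public List<String> getCacheFiles() {
            return null;
        }
    }

    // A UDF written before the change: it does not override getCacheFiles().
    static class LegacyUdf extends SketchEvalFunc {
    }

    // A UDF that requests one file; the URI is an invented example.
    static class LookupUdf extends SketchEvalFunc {
        @Override
        public List<String> getCacheFiles() {
            return Arrays.asList("hdfs://namenode/data/lookup.txt#lookup");
        }
    }

    // Hypothetical collector: gathers the files requested by every UDF in a
    // job and skips UDFs that return null (the default).
    static List<String> collectCacheFiles(List<SketchEvalFunc> udfs) {
        List<String> files = new ArrayList<>();
        for (SketchEvalFunc udf : udfs) {
            List<String> requested = udf.getCacheFiles();
            if (requested != null) {
                files.addAll(requested);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        List<SketchEvalFunc> udfs =
                Arrays.<SketchEvalFunc>asList(new LegacyUdf(), new LookupUdf());
        System.out.println(collectCacheFiles(udfs));
    }
}
```

Because the default returns null rather than an empty list, whatever code gathers the files needs a null check like the one in the sketch.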
No special handling is provided for the files. Users must ensure the files are
already on HDFS. Each filename should be of the form:
hdfs://namenode/path#symlink
where symlink is the name under which the file will be linked in the task's
local directory. The UDF can then access the file in the backend by opening
that symlink as a local file.
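As a rough sketch of how the filename form breaks down, the snippet below uses java.net.URI to separate the HDFS path from the symlink name and shows one way a UDF might open the symlink. The class and method names and the example path are hypothetical. Using a relative path in firstLine assumes the task's working directory is the local directory that holds the symlink.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CacheSymlinkSketch {

    // Returns the part after '#', i.e. the symlink name the task will see.
    static String symlinkName(String cacheFile) {
        return URI.create(cacheFile).getFragment();
    }

    // Returns the path portion of the URI, i.e. the file's location on HDFS.
    static String hdfsPath(String cacheFile) {
        return URI.create(cacheFile).getPath();
    }

    // Reads the first line of the cached file through its symlink, treating
    // it as an ordinary local file (assumes the symlink is in the working
    // directory of the task).
    static String firstLine(String symlink) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get(symlink), StandardCharsets.UTF_8)) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) {
        // Invented example following the hdfs://namenode/path#symlink form.
        String cacheFile = "hdfs://namenode/data/lookup.txt#lookup";
        System.out.println(hdfsPath(cacheFile));
        System.out.println(symlinkName(cacheFile));
    }
}
```

In a real UDF, the string returned from getCacheFiles and the name opened in the backend need to agree on the symlink part. Keeping the symlink name in a single constant is one way to avoid a mismatch.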
> UDFs should be able to indicate files to load in the distributed cache
> ----------------------------------------------------------------------
>
> Key: PIG-1752
> URL: https://issues.apache.org/jira/browse/PIG-1752
> Project: Pig
> Issue Type: New Feature
> Components: impl
> Reporter: Alan Gates
> Assignee: Alan Gates
> Priority: Minor
> Attachments: PIG-1752.patch
>
>
> Currently there is no way for a UDF to load a file into the distributed cache.