Hi all, just updating this thread with the solution, since a working solution doesn't seem to be documented anywhere.
The use case, to summarize again: you need to import non-standard-lib Python modules into a Pig streaming_python UDF, and you want to manage dependencies by packaging a Python installation into an archive that you distribute at runtime by placing the archive into the DistributedCache. You want to use the streaming_python feature because it is superior to the STREAM operator, and you do not want to manage external dependencies by installing a Python executable on every node of the cluster. (You probably want to do this because you have nice new PySpark functions that you'd like to reuse from your older Pig code while avoiding a forced port from Pig to Spark.)

The relevant Pig configuration properties are found here: http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/PigConfiguration.java?view=markup#l284

To understand what Pig does when it calls streaming_python: it runs your Python script in a subshell, passing data to it via stdin/stdout while handling the data serialization minutiae for you. It invokes your script with 'bash -c exec python [yourscript.py]'. Your goal is to redirect 'python' away from the default interpreter (usually /usr/bin/python) to the Python inside the archive package you provide in the distributed cache. To do that, you need to prepend the PATH variable in that subshell with the path to your packaged Python.

First, to get your archive and your Python UDF file into the distributed cache, you will have to use the configuration appropriate to your execution environment. I'll assume here that you just want to call Pig from the command line. In that case you could use the following Pig settings:

    set mapreduce.cache.archives hdfs://path/to/python-distributable.zip#venv;
    set pig.streaming.ship.files my_udf.py;

This assumes that a full installation of Python with the required dependencies is available on HDFS, and that the UDF which requires the Python executable in that zip is in the local directory from which you launch the pig client.
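Putting those settings together with the UDF registration, a minimal script preamble might look like the sketch below. The HDFS path, archive name, UDF file, and function names are all placeholders; the register line uses the standard streaming_python syntax.

```pig
-- Ship the packaged Python environment (unpacked at ./venv) and the UDF file.
set mapreduce.cache.archives hdfs://path/to/python-distributable.zip#venv;
set pig.streaming.ship.files my_udf.py;

-- Register the CPython UDF via streaming_python.
register 'my_udf.py' using streaming_python as my_udfs;

-- Hypothetical usage: 'input' and my_function are placeholders.
data = load 'input' as (field:chararray);
result = foreach data generate my_udfs.my_function(field);
```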
Note that we use the fragment #venv to alias the zip file: the DistributedCache will unzip our archive for us on the worker nodes, and its contents will be available in the local job context at ./venv.

Now to redirect the 'python' alias in the subprocess where your UDF will run. You can set the pig.streaming.environment property in order to pass environment variables into the streaming_python subshell. The key realization is that pig.streaming.environment takes a comma-delimited list of names of other Pig job settings (which are key:value pairs) to propagate from Pig's job configuration into the subshell where Python will run. You ask that a job setting named PATH be passed from Pig to the subshell like this in your Pig script:

    set pig.streaming.environment PATH;

Then pass the value for PATH from outside the script, for example as a property on the command line:

    pig -DPATH=./venv/bin:$PATH -f test.pig

Once you have done this, you should get the execution environment you need for my_udf.py, complete with the dependencies you packaged in your Python archive.

Notes:

Here are the key classes involved in the streaming_python feature; see StreamingUtil and PigStreamingUDF: http://svn.apache.org/viewvc?view=revision&revision=1525632

You can learn more about packaging Python for distribution from some of the projects that exist specifically for this purpose: https://conda.github.io/conda-pack/ and https://jcrist.github.io/venv-pack/index.html Packaging turns out to be one of the more difficult pieces of getting this to work in a distributed environment, but you can save yourself lots of time by just using Anaconda (or Miniconda) to generate the interpreter you use in your package, and zipping that. We use Miniconda and zip the resulting relocatable Python environment ourselves, since conda-pack generates large files. Alternatively, you could build a standalone Python from source with make altinstall, if you like pain.
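The PATH mechanism at work here is just ordinary executable lookup: the first directory on PATH containing a matching executable wins. A self-contained sketch of that resolution, using a throwaway directory standing in for ./venv/bin from the unpacked archive:

```python
import os
import shutil
import stat
import tempfile

# Create a fake "python" launcher in a temp dir, standing in for the
# ./venv/bin directory that DistributedCache unpacks from the archive.
venv_bin = tempfile.mkdtemp(prefix="venv-bin-")
fake_python = os.path.join(venv_bin, "python")
with open(fake_python, "w") as f:
    f.write("#!/bin/sh\necho packaged-python\n")
os.chmod(fake_python, os.stat(fake_python).st_mode | stat.S_IEXEC)

# Prepend it, exactly as -DPATH=./venv/bin:$PATH does for the
# streaming_python subshell.
search_path = venv_bin + os.pathsep + os.environ.get("PATH", "")

# Resolution now finds the packaged launcher, not /usr/bin/python.
resolved = shutil.which("python", path=search_path)
print(resolved)  # -> the fake launcher inside venv_bin
```

This is only an illustration of the lookup rule; in the real job the subshell's shell does the same first-match-wins search when 'exec python' runs.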
A good description of the PySpark analogue of this use case is found here: https://henning.kropponline.de/2016/09/17/running-pyspark-with-virtualenv/

A good description of streaming a CPython script with the STREAM operator, whose mechanism is a little more obvious, is found here: https://ihadanny.wordpress.com/2014/12/01/python-virtualenv-with-pig-streaming/ The downside is that you have to handle many more of the details of the stdin/stdout data flow yourself. I would avoid STREAM where possible.

From: Jeff Stephens <jsteph...@ten-x.com>
Date: Saturday, June 8, 2019 at 2:53 PM
To: "user@pig.apache.org" <user@pig.apache.org>
Subject: Changing python executable referred by streaming_python udf

Hey all, our team has an interesting problem. We have a set of Pig code we developed a few years ago that, for various reasons, I'd prefer not to convert over to PySpark immediately. I would like to share some UDF code between Pig and PySpark for a little while. We can do this if we wrap our pure Python functions with shim scripts for Spark and Pig. Where we ran into issues was using a specific Python version and Python libs in a virtualenv.

Does anyone know how to influence the Python executable that will be called by Pig's streaming_python? We know how to ship a Python installation with virtualenv around the cluster with Oozie, so it is just a matter of figuring out how to point Pig to run our UDF wrapper script using the venv's executable, instead of whatever is in /usr/bin/python on the datanodes. Will update the thread for posterity if we figure it out. Thanks!

Notes: The examples of streaming_python I have seen seem to use Python with dependencies installed directly on each node of the cluster. That would work, but it is definitely not how we want to distribute Python code. For reasons, I don't want to use STREAM. I got this working with the STREAM operator, and it is trivial to do what I want using STREAM since you explicitly control the invocation of the script.
But it is not much fun to lose out on all the features of streaming_python, and it makes me sad to have to manually join script results back to the parent relation every time I need to send a field into my Python script.
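For completeness, the share-one-function-between-Pig-and-PySpark shim layout asked about in the original question can be sketched as below. The function, module, and schema names are made up; the pig_util import shown in the comment is the helper that streaming_python ships alongside the UDF.

```python
# Sketch of the shared-code layout: one pure-Python function, wrapped by
# thin shims on each side. normalize_price and my_udf.py are illustrative.

def normalize_price(raw):
    """Pure function shared by the Pig and PySpark shims."""
    if raw is None:
        return None
    return round(float(str(raw).replace("$", "").replace(",", "")), 2)

# In my_udf.py, the Pig streaming_python shim would wrap it, e.g.:
#
#   from pig_util import outputSchema
#
#   @outputSchema('price:double')
#   def udf_normalize_price(raw):
#       return normalize_price(raw)
#
# while the PySpark side registers the same normalize_price as a Spark UDF.

print(normalize_price("$1,234.50"))  # -> 1234.5
```

Keeping the pure function free of pig_util and pyspark imports is what lets both shims stay one-liners.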