Hi all, just updating this thread with the solution, since a working solution 
doesn’t seem to be documented anywhere.

The use case, to summarize again: you need to import non-standard-lib Python 
modules into a Pig streaming_python UDF, and you want to manage dependencies by 
packaging a python installation into an archive that you distribute at runtime 
by placing the archive into DistributedCache. You want to use the 
streaming_python feature because it is superior to the STREAM operator, and you 
do not want to manage external dependencies by installing a python executable 
on every node of the cluster. (You probably want to do this because you have 
nice new PySpark functions that you’d like to extend to your older Pig code 
while avoiding being forced to port from Pig to Spark.)

The interesting pig configurations are found here: 
http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/PigConfiguration.java?view=markup#l284

To understand what Pig does when it calls streaming_python: it runs your python 
script in a subshell, passing data to it via stdin/stdout while handling the 
data serialization minutiae for you. It invokes your script via ‘bash -c exec 
python [yourscript.py]’.

Your goal is to redirect ‘python’ away from the default python (usually 
/usr/bin/python) to the python inside the archive you provide via the 
distributed cache. You do that by prepending the path to your packaged python 
to the PATH variable in that subshell.

First, to get your archive and python UDF file into distributed cache, you will 
have to use the configuration appropriate to your running environment. I’ll 
assume here that you just want to call pig from the command line. In such a 
case you could use the following pig settings to achieve this:

set mapreduce.cache.archives hdfs://path/to/python-distributable.zip#venv;
set pig.streaming.ship.files my_udf.py;

Above we assume that a full installation of python with the required 
dependencies is available on HDFS, and that the UDF which needs the python 
executable in that zip sits in the local directory from which you will invoke 
the pig client. Note that we use the #venv suffix to create a symlink aliasing 
the zipfile: DistributedCache will unzip the archive for us on the worker 
nodes, and its contents will be available in the local job context at ./venv.
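
For reference, my_udf.py here is an ordinary streaming_python UDF. A minimal 
sketch, assuming your packaged environment provides numpy (the function name, 
output schema, and the numpy dependency are placeholders; pig_util is the 
helper module that comes with Pig’s streaming_python support):

# my_udf.py -- minimal streaming_python UDF sketch; the function name,
# output schema, and numpy dependency are illustrative placeholders
from pig_util import outputSchema

import numpy as np  # non-standard-lib module supplied by the packaged python


@outputSchema('scaled:double')
def scale(value):
    # value arrives as a single field from the Pig relation
    if value is None:
        return None
    return float(np.float64(value) * 2.0)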

Now to redirect the python alias in the subprocess where your UDF will run. You 
can set the pig.streaming.environment property to pass environment variables 
into the streaming_python subshell. The key realization is that 
pig.streaming.environment takes a comma-delimited list of other pig job 
settings (which are key:value pairs) to propagate from pig’s job configuration 
into the subshell where python will run. You ask that a job setting named PATH 
be passed from pig to the subshell like this in your pig script:

set pig.streaming.environment PATH;

And then pass the value for PATH from outside the script, for example by 
supplying it as a property on the command line:

pig -DPATH=./venv/bin:$PATH -f test.pig

Once you have done this you should get the execution environment that you need 
for my_udf.py, complete with dependencies that you packaged in your python 
archive.
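
Putting the pieces together, a test.pig along these lines should work. This is 
an illustrative sketch, not our exact job: the load/store paths, relation 
names, and the my_udf.scale call are placeholders that assume a UDF like the 
one sketched above.

-- test.pig: illustrative sketch; paths, relation and field names, and the
-- UDF call are placeholders
set mapreduce.cache.archives hdfs://path/to/python-distributable.zip#venv;
set pig.streaming.ship.files my_udf.py;
set pig.streaming.environment PATH;

REGISTER 'my_udf.py' USING streaming_python AS my_udf;

raw    = LOAD 'input_data' AS (value:double);
scored = FOREACH raw GENERATE my_udf.scale(value) AS scaled;
STORE scored INTO 'output_data';

Invoke it with the command line shown above, so that PATH in the subshell is 
prepended with ./venv/bin.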



Notes:
Here are the key objects involved in the streaming_python feature. See 
StreamingUtil and PigStreamingUDF. 
http://svn.apache.org/viewvc?view=revision&revision=1525632

You can learn more about packaging python for distribution from some of the 
projects that exist specifically for this purpose: 
https://conda.github.io/conda-pack/ and 
https://jcrist.github.io/venv-pack/index.html. This turns out to be one of the 
more difficult pieces of getting this to work in a distributed environment, but 
you can save yourself a lot of time by just using anaconda (or miniconda) to 
generate the environment you package, and zipping that. We use miniconda and 
zip the resulting distributable python environment ourselves, since conda-pack 
generates large files. Alternatively, you could build a standalone python from 
source with make altinstall, if you like pain.
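
As a rough sketch of that miniconda-and-zip approach (environment path, python 
version, and package list are placeholders; how relocatable a plain zip of a 
conda environment is depends on your dependencies, which is exactly the problem 
conda-pack and venv-pack exist to solve):

# build an environment with miniconda, then zip its contents so that
# unpacking under the #venv alias yields ./venv/bin/python
conda create -y -p ./pyenv python=3.6 numpy
cd pyenv && zip -qr ../python-distributable.zip . && cd ..
hdfs dfs -put python-distributable.zip /path/to/python-distributable.zip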

A good description of the PySpark analogue for this use case is found here: 
https://henning.kropponline.de/2016/09/17/running-pyspark-with-virtualenv/

A good description of streaming a CPython script with the STREAM operator, 
which has a somewhat more obvious mechanism, is found here: 
https://ihadanny.wordpress.com/2014/12/01/python-virtualenv-with-pig-streaming/ 
The downside is that you have to handle many more of the details of the stdin 
/ stdout data flow yourself. I would avoid STREAM where possible.

From: Jeff Stephens <jsteph...@ten-x.com>
Date: Saturday, June 8, 2019 at 2:53 PM
To: "user@pig.apache.org" <user@pig.apache.org>
Subject: Changing python executable referred by streaming_python udf

Hey all, our team has an interesting problem.

We have a set of Pig code we developed a few years ago that for various reasons 
I’d prefer not to convert over to pyspark immediately. I would like to share 
some UDF code between pig and pyspark for a little while. We can do this if we 
wrap our pure python functions with shim scripts for spark and pig. Where we 
ran into issues was using a specific python version / python libs in a 
virtualenv.

Does anyone know how to influence the python executable that will be called by 
Pig’s streaming_python? We know how to ship a python installation with 
virtualenv around the cluster with Oozie, so it is now just a matter of 
figuring out how to point Pig to run our UDF wrapper script with the venv’s 
executable, instead of whatever is in /usr/bin/python on the datanodes.

Will update this thread for posterity if we figure it out.

Thanks!


Notes:

The examples of streaming_python I have seen appear to use python with 
dependencies installed directly on each node of the cluster. That would work, 
but it is definitely not how we want to distribute python code.

For reasons, I don’t want to use STREAM. I got this working with the STREAM 
operator, and it is trivial to do what I want using STREAM since you explicitly 
control the invocation of the script. But it is not much fun to lose out on 
all the features of streaming_python, and it makes me sad to have to manually 
join the script results back to the parent relation every time I need to send 
a field into my python script.
