Hey all, our team has an interesting problem.

We have a body of Pig code, developed a few years ago, that for various reasons 
I’d prefer not to convert to pyspark immediately. In the meantime I would like 
to share some UDF code between pig and pyspark. We can do this by wrapping our 
pure python functions with shim scripts for spark and pig. Where we ran into 
trouble was using a specific python version and specific python libs from a 
virtualenv.
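
To make the shim idea concrete, here is a minimal sketch of the layout we have 
in mind. All names (normalize_name, pig_normalize_name, make_spark_udf) are 
made up for illustration, and the fallback decorator is only a stand-in so the 
file can be imported outside of Pig; it is not Pig's own implementation.

```python
# Pure Python logic: no Pig or Spark imports, so both engines can share it.
def normalize_name(raw):
    """Trim, collapse whitespace, and lowercase a name; None passes through."""
    if raw is None:
        return None
    return " ".join(raw.split()).lower()


# --- Pig shim (for REGISTER ... USING streaming_python) ---
try:
    # pig_util is put on the path by Pig when it runs the UDF script.
    from pig_util import outputSchema
except ImportError:
    # Stand-in (assumption) so the module also imports outside of Pig.
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


@outputSchema("name:chararray")
def pig_normalize_name(raw):
    return normalize_name(raw)


# --- Spark shim ---
def make_spark_udf():
    """Build the Spark UDF lazily so pyspark is only needed on the Spark side."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    return udf(normalize_name, StringType())
```

The point of the split is that only the thin wrappers know about Pig or Spark, 
so the shared function can be unit tested with plain python.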

Does anyone know how to influence which python executable is called by Pig’s 
streaming_python? We know how to ship a python installation with a virtualenv 
around the cluster with Oozie, so it is just a matter of figuring out how to 
point Pig at the venv’s executable when it runs our UDF wrapper script, 
instead of whatever is at /usr/bin/python on the datanodes.

Will update this thread for posterity if we figure it out.

Thanks!


Notes:

The examples of streaming_python I have seen appear to use a python with 
dependencies installed directly on each node of the cluster. This would work, 
but it is definitely not how we want to distribute python code.

For reasons, I don’t want to use STREAM. I got this working with the STREAM 
operator, and it is trivial to do what I want that way, since you explicitly 
control the invocation of the script. But it is not much fun to lose out on all 
the features of streaming_python, and it makes me sad to have to manually join 
the script results back to the parent relation every time I need to send a 
field into my python script.
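
For comparison, the STREAM variant looks roughly like the sketch below. It 
assumes tab-separated input of (id, name); the function names and the 
normalization logic are hypothetical. The script has to echo the id so the 
output can be joined back to the parent relation, which is exactly the manual 
step I would like to avoid.

```python
import sys


def process_lines(lines):
    """Yield 'id<TAB>normalized_name' for each 'id<TAB>name' input line."""
    for line in lines:
        # Assumption: exactly two tab-separated fields per record.
        record_id, name = line.rstrip("\n").split("\t", 1)
        # Carry the id through so the result can be joined back in Pig.
        yield record_id + "\t" + " ".join(name.split()).lower()


def main():
    """Read records from stdin and write results to stdout."""
    for out in process_lines(sys.stdin):
        sys.stdout.write(out + "\n")

# When this file is the streamed script, call main() at module level.
```

With STREAM the command line is ours to write, so the script can be launched 
with the venv's python directly.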
