To demonstrate that this is not necessarily a path issue - but rather an issue with the archive not being unpacked - I have created a zip file containing a python script in its root directory. The archive is added to hive and then an attempt is made to invoke the python script within a transform query. But we get a "file not found" from the map task - indicating that the archive is not being exploded.
Show that the python script "classifier_wf.py" is resident in the *root* directory of the zip file:

$ jar -tvf py.zip | grep classifier_wf.py
 11241 Tue Jun 18 19:37:02 UTC 2013 classifier_wf.py

Add the archive to hive:

hive> add archive /opt/am/ver/1.0/hive/py.zip;
Added resource: /opt/am/ver/1.0/hive/py.zip

Run a transform query:

hive> from (select transform (aappname, qappname)
      using 'classifier_wf.py' as (aappname2 string, qappname2 string)
      from eqx ) o
      insert overwrite table c select o.aappname2, o.qappname2;

Get an error, then check the logs:

Caused by: java.io.IOException: Cannot run program "classifier_wf.py": java.io.IOException: error=2, No such file or directory


2013/6/20 Stephen Boesch <java...@gmail.com>

> @Stephen: given the 'relative' path for hive is from a local downloads
> directory on each local tasktracker in the cluster, it was my thought that
> if the archive were actually being expanded then
> somedir/somefileinthearchive should work. I will go ahead and test this
> assumption.
>
> In the meantime, is there any facility available in hive for making
> archived files available to hive jobs? archive or hadoop archive ("har")
> etc?
>
>
> 2013/6/20 Stephen Sprague <sprag...@gmail.com>
>
>> What would be interesting would be to run a little experiment and find
>> out what the default PATH is on your data nodes. How much of a pain would
>> it be to run a little python script to print to stderr the values of the
>> environment variables $PATH and $PWD (or the shell command 'pwd')?
>>
>> That's of course going through the normal channels of "add file".
>>
>> The thing is, given you're using a relative path "hive/parse_qx.py", you
>> need to know what the "current directory" is when the process runs on the
>> data nodes.
>>
>>
>> On Thu, Jun 20, 2013 at 5:32 AM, Stephen Boesch <java...@gmail.com> wrote:
>>
>>> We have a few dozen files that need to be made available to all
>>> mappers/reducers in the cluster while running hive transformation steps.
>>>
>>> It seems that "add archive" does not unarchive the entries and thus
>>> make them available directly on the default file path - and that is what
>>> we are looking for.
>>>
>>> To illustrate:
>>>
>>> add file modelfile.1;
>>> add file modelfile.2;
>>> ..
>>> add file modelfile.N;
>>>
>>> Then, our model that is invoked during the transformation step *does*
>>> have correct access to its model files on the default path.
>>>
>>> But those model files take low *minutes* to all load.
>>>
>>> Instead, when we try:
>>>
>>> add archive modelArchive.tgz;
>>>
>>> the problem is that the archive apparently does not get exploded.
>>>
>>> I have an archive, for example, that contains shell scripts under the
>>> "hive" directory stored inside. I am *not* able to access
>>> hive/my-shell-script.sh after adding the archive. Specifically, the
>>> following fails:
>>>
>>> $ tar -tvf appm*.tar.gz | grep launch-quixey_to_xml
>>> -rwxrwxr-x stephenb/stephenb 664 2013-06-18 17:46 appminer/bin/launch-quixey_to_xml.sh
>>>
>>> from (select transform (aappname, qappname)
>>> *using* '*hive/parse_qx.py*' as (aappname2 string, qappname2 string)
>>> from eqx ) o insert overwrite table c select o.aappname2, o.qappname2;
>>>
>>> Cannot run program "hive/parse_qx.py": java.io.IOException: error=2, No
>>> such file or directory
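
For reference, a minimal probe along the lines Stephen Sprague suggested could look like the sketch below. The script name env_probe.py and its exact output format are illustrative assumptions, not something from the thread. Shipped via "add file" and invoked with "using 'env_probe.py'" in the transform clause, it writes $PATH, the working directory, and a directory listing to stderr (which ends up in the task logs) and echoes its input so the query still completes:

#!/usr/bin/env python
# env_probe.py - hypothetical diagnostic sketch, not part of the original thread.
# Writes $PATH, the task's working directory, and the contents of that directory
# to stderr so they appear in the map task logs, then passes input rows through
# unchanged so the transform still produces output.
import os
import sys

sys.stderr.write("PATH=%s\n" % os.environ.get("PATH", ""))
sys.stderr.write("PWD=%s\n" % os.getcwd())
sys.stderr.write("CWD contents: %s\n" % ", ".join(sorted(os.listdir("."))))

# Echo stdin to stdout so the transform step completes normally.
for line in sys.stdin:
    sys.stdout.write(line)

If the archive is being unpacked at all, the directory listing from such a probe would typically show it: Hadoop's distributed cache generally exposes an unpacked archive under a link named after the archive file in the task's working directory, so that is the first place worth checking in the stderr output.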