Thanks for the tip on "add <file>" where <file> is a directory. I will try that.
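In the meantime, here is roughly the probe I plan to push through "add file" to answer the $PATH / $PWD question. Just a sketch (the name probe_env.py is mine): it dumps the environment and the working-directory listing to stderr, where it should show up in the task logs, and echoes its input through so the transform step still runs:

  #!/usr/bin/env python
  # probe_env.py (hypothetical name): report the task-side environment.
  import os
  import sys

  # stderr ends up in the task logs rather than in the query output
  print >> sys.stderr, "PATH=%s" % os.environ.get("PATH", "")
  print >> sys.stderr, "PWD=%s" % os.getcwd()

  # listing the working directory should also reveal whether anything
  # shipped via 'add file' / 'add archive' is visible here
  for entry in sorted(os.listdir(".")):
      print >> sys.stderr, "cwd entry: %s" % entry

  # pass rows through unchanged so the transform still produces output
  for line in sys.stdin:
      sys.stdout.write(line)

I would "add file probe_env.py" and swap it in for the real script in the transform clause.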
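For the untar-inside-the-script idea, I am picturing something like the sketch below. It assumes the tarball shipped via "add file modelArchive.tgz" lands in the task's working directory, which is exactly what I want to verify first; the marker file name is also just illustrative:

  #!/usr/bin/env python
  # sketch of a transform script that unpacks its own model data first
  import os
  import sys
  import tarfile

  MODEL_TARBALL = "modelArchive.tgz"  # shipped via 'add file modelArchive.tgz'
  MARKER = "modelfile.1"              # hypothetical: any file inside the tarball

  # extract only once per task; later invocations see the marker and skip
  if not os.path.exists(MARKER):
      tar = tarfile.open(MODEL_TARBALL, "r:gz")
      tar.extractall(".")
      tar.close()

  # ... the real parsing logic would go here; echo input for the sketch
  for line in sys.stdin:
      sys.stdout.write(line)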
2013/6/20 Stephen Sprague <sprag...@gmail.com>

> I personally only know of adding a .jar file via "add archive", and my
> experience there is very limited. I believe that if you "add file" and the
> file is a directory, it will recursively take everything underneath, but I
> know of nothing that inflates or untars things on the remote end
> automatically.
>
> I would "add file" your python script and then, within that script, untar
> your tarball to get at your model data. It's just a matter of figuring out
> the path to that tarball, which is kind of up in the air when it's added
> via "add file". Yeah, "local downloads directory". What the literal path
> is, that's what I'd like to know. :)
>
> On Thu, Jun 20, 2013 at 8:37 AM, Stephen Boesch <java...@gmail.com> wrote:
>
>> @Stephen: given that the "relative" path for hive starts from a local
>> downloads directory on each tasktracker in the cluster, my thought was
>> that if the archive were actually being expanded, then
>> somedir/somefileinthearchive should work. I will go ahead and test this
>> assumption.
>>
>> In the meantime, is there any facility available in hive for making
>> archived files available to hive jobs? "archive", or hadoop archive
>> ("har"), etc.?
>>
>> 2013/6/20 Stephen Sprague <sprag...@gmail.com>
>>
>>> What would be interesting would be to run a little experiment and find
>>> out what the default PATH is on your data nodes. How much of a pain
>>> would it be to run a little python script that prints to stderr the
>>> values of the environment variables $PATH and $PWD (or the output of
>>> the shell command 'pwd')?
>>>
>>> That would, of course, go through the normal channels of "add file".
>>>
>>> The thing is, given that you're using the relative path
>>> "hive/parse_qx.py", you need to know what the "current directory" is
>>> when the process runs on the data nodes.
>>>
>>> On Thu, Jun 20, 2013 at 5:32 AM, Stephen Boesch <java...@gmail.com> wrote:
>>>
>>>> We have a few dozen files that need to be made available to all
>>>> mappers/reducers in the cluster while running hive transformation
>>>> steps.
>>>>
>>>> It seems that "add archive" does not unarchive the entries and make
>>>> them available directly on the default file path, and that is what we
>>>> are looking for.
>>>>
>>>> To illustrate:
>>>>
>>>>   add file modelfile.1;
>>>>   add file modelfile.2;
>>>>   ..
>>>>   add file modelfile.N;
>>>>
>>>> Then our model, when it is invoked during the transformation step,
>>>> *does* have correct access to its model files on the default path.
>>>>
>>>> But those model files take low *minutes* to load.
>>>>
>>>> Instead, when we try:
>>>>
>>>>   add archive modelArchive.tgz;
>>>>
>>>> the archive apparently does not get exploded.
>>>>
>>>> For example, I have an archive that contains shell scripts stored
>>>> under a "hive" directory inside it. I am *not* able to access
>>>> hive/my-shell-script.sh after adding the archive. Specifically, the
>>>> following fails:
>>>>
>>>>   $ tar -tvf appm*.tar.gz | grep launch-quixey_to_xml
>>>>   -rwxrwxr-x stephenb/stephenb 664 2013-06-18 17:46 appminer/bin/launch-quixey_to_xml.sh
>>>>
>>>>   from (select transform (aappname, qappname)
>>>>   using 'hive/parse_qx.py' as (aappname2 string, qappname2 string)
>>>>   from eqx) o insert overwrite table c select o.aappname2, o.qappname2;
>>>>
>>>>   Cannot run program "hive/parse_qx.py": java.io.IOException: error=2,
>>>>   No such file or directory