To demonstrate that this is not necessarily a path issue, but rather that
the archive is not being unpacked, I created a zip file containing a python
script in its root directory. The archive is added to hive, and then an
attempt is made to invoke the python script within a transform query. The
map task reports "file not found", indicating that the archive is not
being exploded.
Show that the python script "classifier_wf.py" is resident in the *root*
directory of the zip file:
e$ jar -tvf py.zip | grep classifier_wf.py
11241 Tue Jun 18 19:37:02 UTC 2013 classifier_wf.py
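
Since grep would also match the same file name inside a subdirectory, an exact check that the entry sits in the archive root can be sketched in python. This is only an illustration: the helper name in_archive_root is made up here and is not part of hive or hadoop.

```python
import zipfile


def in_archive_root(zip_path, member):
    """Return True if `member` is stored at the top level of the zip.

    Zip entry names carry their full path inside the archive, so an
    exact match on the bare file name means the entry is in the root.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return member in archive.namelist()
```

For the archive above this would be called as in_archive_root("py.zip", "classifier_wf.py").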
Add the archive to hive:
hive> add archive /opt/am/ver/1.0/hive/py.zip;
Added resource: /opt/am/ver/1.0/hive/py.zip
Run a transform query:
hive> from (select transform (aappname,qappname) using
'classifier_wf.py' as (aappname2 string, qappname2 string) from eqx ) o
insert overwrite table c select o.aappname2, o.qappname2;
Get an error: ;)
Check the logs:
Caused by: java.io.IOException: Cannot run program "classifier_wf.py":
java.io.IOException: error=2, No such file or directory
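
The experiment suggested further down the thread (printing PATH and the working directory from inside the transform) could be sketched roughly as follows. This is a minimal sketch, not something that has been run on the cluster: the function names are made up for illustration, and the two-level directory listing is an extra added on the assumption that an unpacked archive, if present, would show up as a subdirectory of the task's working directory.

```python
#!/usr/bin/env python
"""Diagnostic transform script: report where a hive transform task runs."""
import os
import sys


def describe_environment(root="."):
    """Return lines describing PATH, the working directory, and its contents."""
    lines = [
        "PATH=" + os.environ.get("PATH", ""),
        "PWD=" + os.getcwd(),
    ]
    # List two levels so that an unpacked archive directory, if any, shows up.
    for name in sorted(os.listdir(root)):
        full = os.path.join(root, name)
        is_dir = os.path.isdir(full)  # follows symlinks
        lines.append("entry: " + name + ("/" if is_dir else ""))
        if is_dir:
            for child in sorted(os.listdir(full)):
                lines.append("entry: " + name + "/" + child)
    return lines


def main(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr):
    # Diagnostics go to stderr so they land in the task logs.
    for line in describe_environment():
        stderr.write(line + "\n")
    # Pass the input rows through unchanged so the query still completes.
    for row in stdin:
        stdout.write(row)


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as whereami.py, shipped through the normal "add file" channel and named in the using clause of the transform query, its stderr output should end up in the task logs next to the error above.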
2013/6/20 Stephen Boesch <[email protected]>
>
> @Stephen: given the 'relative' path for hive is from a local downloads
> directory on each local tasktracker in the cluster, it was my thought that
> if the archive were actually being expanded then
> somedir/somefileinthearchive should work. I will go ahead and test this
> assumption.
>
> In the meantime, is there any facility available in hive for making
> archived files available to hive jobs? archive or hadoop archive ("har")
> etc?
>
>
> 2013/6/20 Stephen Sprague <[email protected]>
>
>> what would be interesting would be to run a little experiment and find
>> out what the default PATH is on your data nodes. How much of a pain would
>> it be to run a little python script to print to stderr the value of the
>> environmental variable $PATH and $PWD (or the shell command 'pwd') ?
>>
>> that's of course going through normal channels of "add file".
>>
>> the thing is given you're using a relative path "hive/parse_qx.py" you
>> need to know what the "current directory" is when the process runs on the
>> data nodes.
>>
>>
>>
>>
>> On Thu, Jun 20, 2013 at 5:32 AM, Stephen Boesch <[email protected]> wrote:
>>
>>>
>>> We have a few dozen files that need to be made available to all
>>> mappers/reducers in the cluster while running hive transformation steps.
>>>
>>> It seems that "add archive" does not unpack the archive's entries and
>>> make them available directly on the default file path, which is what we
>>> are looking for.
>>>
>>> To illustrate:
>>>
>>> add file modelfile.1;
>>> add file modelfile.2;
>>> ..
>>> add file modelfile.N;
>>>
>>> Then, our model that is invoked during the transformation step *does* have
>>> correct access to its model files in the default path.
>>>
>>> But all of those model files together take a few *minutes* to load.
>>>
>>> Instead, when we try:
>>> add archive modelArchive.tgz;
>>>
>>> the problem is that the archive apparently does not get exploded.
>>>
>>> I have an archive for example that contains shell scripts under the
>>> "hive" directory stored inside. I am *not* able to access
>>> hive/my-shell-script.sh after adding the archive. Specifically the
>>> following fails:
>>>
>>> $ tar -tvf appm*.tar.gz | grep launch-quixey_to_xml
>>> -rwxrwxr-x stephenb/stephenb 664 2013-06-18 17:46
>>> appminer/bin/launch-quixey_to_xml.sh
>>>
>>> from (select transform (aappname,qappname)
>>> using 'hive/parse_qx.py' as (aappname2 string, qappname2 string)
>>> from eqx ) o insert overwrite table c select o.aappname2, o.qappname2;
>>>
>>> Cannot run program "hive/parse_qx.py": java.io.IOException: error=2, No
>>> such file or directory
>>>
>>>
>>>
>>>
>>
>