To demonstrate that this is not necessarily a path issue - but rather an issue with the archive not being unpacked - I have created a zip file containing a python script in its root directory. The archive is added to hive and then an attempt is made to invoke the python script within a transform query. But we get a "file not found" from the map task - indicating that the archive is not being exploded.
Show that the python script "classifier_wf.py" is resident in the *root* directory of the zip file:

$ jar -tvf py.zip | grep classifier_wf.py
 11241 Tue Jun 18 19:37:02 UTC 2013 classifier_wf.py

Add the archive to hive:

hive> add archive /opt/am/ver/1.0/hive/py.zip;
Added resource: /opt/am/ver/1.0/hive/py.zip

Run a transform query:

hive> from (select transform (aappname, qappname)
      using 'classifier_wf.py' as (aappname2 string, qappname2 string)
      from eqx ) o
      insert overwrite table c select o.aappname2, o.qappname2;

Get an error, then check the logs:

Caused by: java.io.IOException: Cannot run program "classifier_wf.py": java.io.IOException: error=2, No such file or directory


2013/6/20 Stephen Boesch <java...@gmail.com>

> @Stephen: given the 'relative' path for hive is from a local downloads
> directory on each local tasktracker in the cluster, it was my thought that
> if the archive were actually being expanded then
> somedir/somefileinthearchive should work. I will go ahead and test this
> assumption.
>
> In the meantime, is there any facility available in hive for making
> archived files available to hive jobs? archive or hadoop archive ("har")
> etc?
>
>
> 2013/6/20 Stephen Sprague <sprag...@gmail.com>
>
>> What would be interesting would be to run a little experiment and find
>> out what the default PATH is on your data nodes. How much of a pain would
>> it be to run a little python script to print to stderr the values of the
>> environment variables $PATH and $PWD (or the shell command 'pwd')?
>>
>> That's of course going through the normal channels of "add file".
>>
>> The thing is, given you're using a relative path "hive/parse_qx.py", you
>> need to know what the "current directory" is when the process runs on the
>> data nodes.
>>
>>
>> On Thu, Jun 20, 2013 at 5:32 AM, Stephen Boesch <java...@gmail.com> wrote:
>>
>>> We have a few dozen files that need to be made available to all
>>> mappers/reducers in the cluster while running hive transformation steps.
>>>
>>> It seems that "add archive" does not unarchive the entries and thus
>>> make them available directly on the default file path - and that is what
>>> we are looking for.
>>>
>>> To illustrate:
>>>
>>> add file modelfile.1;
>>> add file modelfile.2;
>>> ..
>>> add file modelfile.N;
>>>
>>> Then, our model that is invoked during the transformation step *does*
>>> have correct access to its model files on the default path.
>>>
>>> But those model files take low *minutes* to all load.
>>>
>>> Instead, when we try:
>>>
>>> add archive modelArchive.tgz;
>>>
>>> the problem is that the archive apparently does not get exploded.
>>>
>>> I have an archive, for example, that contains shell scripts under the
>>> "hive" directory stored inside. I am *not* able to access
>>> hive/my-shell-script.sh after adding the archive. Specifically, the
>>> following fails:
>>>
>>> $ tar -tvf appm*.tar.gz | grep launch-quixey_to_xml
>>> -rwxrwxr-x stephenb/stephenb 664 2013-06-18 17:46 appminer/bin/launch-quixey_to_xml.sh
>>>
>>> from (select transform (aappname, qappname)
>>> *using* '*hive/parse_qx.py*' as (aappname2 string, qappname2 string)
>>> from eqx ) o insert overwrite table c select o.aappname2, o.qappname2;
>>>
>>> Cannot run program "hive/parse_qx.py": java.io.IOException: error=2, No
>>> such file or directory
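
For reference, a minimal probe along the lines Stephen Sprague suggested could look like the sketch below. The script name env_probe.py and its exact output format are illustrative assumptions, not something from the thread. Shipped via "add file" and invoked with "using 'env_probe.py'" in the transform clause, it writes $PATH, the working directory, and a directory listing to stderr (which ends up in the task logs) and echoes its input so the query still completes:

#!/usr/bin/env python
# env_probe.py - hypothetical diagnostic sketch, not part of the original thread.
# Writes $PATH, the task's working directory, and the contents of that directory
# to stderr so they appear in the map task logs, then passes input rows through
# unchanged so the transform still produces output.
import os
import sys

sys.stderr.write("PATH=%s\n" % os.environ.get("PATH", ""))
sys.stderr.write("PWD=%s\n" % os.getcwd())
sys.stderr.write("CWD contents: %s\n" % ", ".join(sorted(os.listdir("."))))

# Echo stdin to stdout so the transform step completes normally.
for line in sys.stdin:
    sys.stdout.write(line)

If the archive is being unpacked at all, the directory listing from such a probe would typically show it: Hadoop's distributed cache generally exposes an unpacked archive under a link named after the archive file in the task's working directory, so that is the first place worth checking in the stderr output.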