Re: Is there a mechanism similar to hadoop -archive in hive (add archive is not apparently)

2013-06-20 Thread Stephen Sprague
What would be interesting would be to run a little experiment and find out
what the default PATH is on your data nodes. How much of a pain would it
be to run a little python script that prints to stderr the values of the
environment variables $PATH and $PWD (or the output of the shell command
'pwd')?

That's going through the normal channels of "add file", of course.

The thing is, given you're using a relative path (hive/parse_qx.py), you
need to know what the current directory is when the process runs on the
data nodes.
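
A minimal sketch of such a probe script (the name probe_env.py is just
illustrative):

    #!/usr/bin/env python
    # probe_env.py: print the task's working directory and PATH to
    # stderr, then pass stdin through unchanged so the transform
    # still emits rows.
    import os
    import sys

    sys.stderr.write("PWD=%s\n" % os.getcwd())
    sys.stderr.write("PATH=%s\n" % os.environ.get("PATH", ""))
    for line in sys.stdin:
        sys.stdout.write(line)

Shipped with "add file probe_env.py;" and invoked via using 'probe_env.py'
in a transform clause, its stderr output should land in the task logs.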




On Thu, Jun 20, 2013 at 5:32 AM, Stephen Boesch java...@gmail.com wrote:


 We have a few dozen files that need to be made available to all
 mappers/reducers in the cluster while running hive transformation steps.

 It seems "add archive" does not unarchive the entries and make them
 available directly on the default file path, and that is what we are
 looking for.

 To illustrate:

add file modelfile.1;
add file modelfile.2;
..
 add file modelfile.N;

   Then, our model that is invoked during the transformation step does have
 correct access to its model files in the default path.

 But those model files take low minutes to all load.

 Instead, when we try:

add archive modelArchive.tgz;

 the archive apparently does not get exploded.

 I have an archive, for example, that contains shell scripts stored under
 the hive directory inside it. I am not able to access
 hive/my-shell-script.sh after adding the archive. Specifically, the
 following fails:

 $ tar -tvf appm*.tar.gz | grep launch-quixey_to_xml
 -rwxrwxr-x stephenb/stephenb 664 2013-06-18 17:46 appminer/bin/launch-quixey_to_xml.sh

 from (select transform (aappname,qappname)
 using 'hive/parse_qx.py' as (aappname2 string, qappname2 string) from
 eqx ) o insert overwrite table c select o.aappname2, o.qappname2;

 Cannot run program hive/parse_qx.py: java.io.IOException: error=2,
 No such file or directory






Re: Is there a mechanism similar to hadoop -archive in hive (add archive is not apparently)

2013-06-20 Thread Stephen Boesch
@Stephen: given that the 'relative' path for hive is from a local downloads
directory on each local tasktracker in the cluster, my thought was that
if the archive were actually being expanded, then
somedir/somefileinthearchive should work. I will go ahead and test this
assumption.

In the meantime, is there any facility available in hive for making
archived files available to hive jobs? add archive, or hadoop archive
(har), etc.?



Re: Is there a mechanism similar to hadoop -archive in hive (add archive is not apparently)

2013-06-20 Thread Stephen Sprague
I personally only know of adding a .jar file via "add archive", but my
experience there is very limited. I believe if you 'add file' and the file
is a directory, it'll recursively take everything underneath, but I know of
nothing that inflates or untars things on the remote end automatically.

I would 'add file' your python script and then, within that script, untar
your tarball to get at your model data. It's just a matter of figuring out
the path to that tarball, which is kinda up in the air when it's added via
'add file'. Yeah, local downloads directory. What the literal path is, is
what I'd like to know. :)
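
A minimal sketch of that wrapper idea (the wrapper name is hypothetical;
the archive name is the one from this thread, and the "appminer" top-level
directory is an assumption based on the tar listing above):

    #!/usr/bin/env python
    # parse_qx_wrapper.py (hypothetical): unpack the tarball shipped
    # via 'add file' into the task's working directory, then do the
    # real work against the extracted files.
    import os
    import sys
    import tarfile

    ARCHIVE = "modelArchive.tgz"  # assumed to land in the task's cwd

    # skip extraction if a previous pass already unpacked the tree
    if os.path.exists(ARCHIVE) and not os.path.isdir("appminer"):
        tf = tarfile.open(ARCHIVE, "r:gz")
        tf.extractall(".")
        tf.close()

    # ... load model files from the extracted tree, then stream rows ...
    for line in sys.stdin:
        sys.stdout.write(line)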



Re: Is there a mechanism similar to hadoop -archive in hive (add archive is not apparently)

2013-06-20 Thread Stephen Boesch
Thanks for the tip on "add file" where the file is a directory. I will try that.




Re: Is there a mechanism similar to hadoop -archive in hive (add archive is not apparently)

2013-06-20 Thread Stephen Sprague
Yeah, the archive isn't unpacked on the remote side. I think "add archive"
is mostly used for shipping java packages, since CLASSPATH will reference
the archive itself (and as such there is no need to expand it).
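
For comparison, a sketch of the jar route, where nothing ever needs
unpacking (the jar path and class name below are hypothetical):

    hive> add jar /opt/am/udfs/my_udfs.jar;
    hive> create temporary function my_fn as 'com.example.hive.MyFn';

The jar rides along to the tasks and is referenced in place via the
classpath, which is why the no-unpack behavior is a non-issue for java
code but a problem for a tarball of model files.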



Re: Is there a mechanism similar to hadoop -archive in hive (add archive is not apparently)

2013-06-20 Thread Stephen Boesch
Stephen: would you be willing to share an example of specifying a
directory as the "add file" target? I have not seen this working.

I have attempted to use it as follows:

We will access a script within the hivetry directory, located here:

hive> ! ls -l /opt/am/ver/1.0/hive/hivetry/classifier_wf.py;
-rwxrwxr-x 1 hadoop hadoop 11241 Jun 18 19:37 /opt/am/ver/1.0/hive/hivetry/classifier_wf.py

Add the directory to hive:

hive> add file /opt/am/ver/1.0/hive/hivetry;
Added resource: /opt/am/ver/1.0/hive/hivetry

Attempt to run the transform query using that script.

Attempt one: use the script name unqualified:

hive> from (select transform (aappname,qappname) using
'classifier_wf.py' as (aappname2 string, qappname2 string) from eqx )
o insert overwrite table c select o.aappname2, o.qappname2;

(Failed: Caused by: java.io.IOException: Cannot run program
classifier_wf.py: java.io.IOException: error=2, No such file or
directory)

Attempt two: use the script name with the directory name prefix:

hive> from (select transform (aappname,qappname) using
'hive/classifier_wf.py' as (aappname2 string, qappname2 string) from
eqx ) o insert overwrite table c select o.aappname2, o.qappname2;

(Failed: Caused by: java.io.IOException: Cannot run program
hive/classifier_wf.py: java.io.IOException: error=2, No such file or
directory)





Re: Is there a mechanism similar to hadoop -archive in hive (add archive is not apparently)

2013-06-20 Thread Ramki Palle
In Attempt two, are you not supposed to use hivetry as the
directory?

Maybe you should try giving the full path
/opt/am/ver/1.0/hive/hivetry/classifier_wf.py and see if it works.
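
A sketch of that full-path variant (same query as above; whether an
absolute gateway path resolves on the task nodes is exactly the open
question):

hive> from (select transform (aappname,qappname) using
'/opt/am/ver/1.0/hive/hivetry/classifier_wf.py' as (aappname2 string,
qappname2 string) from eqx ) o insert overwrite table c
select o.aappname2, o.qappname2;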

Regards,
Ramki.


Re: Is there a mechanism similar to hadoop -archive in hive (add archive is not apparently)

2013-06-20 Thread Stephen Boesch
Good eyes, Ramki! Thanks, this "directory in place of filename" approach
appears to be working. The script is getting loaded now using the Attempt
two form, i.e. hivetry/classifier_wf.py as the script path.
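
In recap, the working combination (reconstructed from the attempts above)
presumably looks like this, with the shipped directory referenced by its
basename:

hive> add file /opt/am/ver/1.0/hive/hivetry;
hive> from (select transform (aappname,qappname) using
'hivetry/classifier_wf.py' as (aappname2 string, qappname2 string)
from eqx ) o insert overwrite table c
select o.aappname2, o.qappname2;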

thanks again.

stephenb

