RE: GSS initiate failed exception

2015-02-26 Thread Bogala, Chandra Reddy
I have the code below. This handles the renewal, right? What I am not getting is how 
DriverManager knows which configuration/user to use when establishing the connection.

UserGroupInformation.setConfiguration(conf);
if (loginUser == null) {
    // First run: log in from the keytab and cache the login user.
    UserGroupInformation.loginUserFromKeytab(keytabPrincipal, keytabLocation);
    loginUser = UserGroupInformation.getLoginUser();
} else {
    // Later runs: re-login from the keytab if the TGT is about to expire.
    loginUser.checkTGTAndReloginFromKeytab();
}
con = DriverManager.getConnection("jdbc:hive2://x:1/abc;principal=hive/@");
stmt = con.createStatement();

Thanks,
Chandra
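
A related pattern, sketched below, is to open the JDBC connection inside 
loginUser.doAs(...) so that the SASL/GSSAPI layer unambiguously uses the 
keytab-authenticated identity rather than whatever happens to be on the calling 
thread. Whether this is strictly required depends on the driver version, so treat it 
as a sketch built on the snippet above (the redacted JDBC URL is carried over as-is):

import java.security.PrivilegedExceptionAction;
import java.sql.Connection;
import java.sql.DriverManager;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberizedHiveConnect {
    // Open the HiveServer2 connection as the keytab-authenticated login user.
    public static Connection open(final UserGroupInformation loginUser) throws Exception {
        return loginUser.doAs(new PrivilegedExceptionAction<Connection>() {
            @Override
            public Connection run() throws Exception {
                return DriverManager.getConnection(
                        "jdbc:hive2://x:1/abc;principal=hive/@");
            }
        });
    }
}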

From: Vaibhav Gumashta [mailto:vgumas...@hortonworks.com]
Sent: Thursday, February 26, 2015 2:03 AM
To: user@hive.apache.org
Subject: Re: GSS initiate failed exception

Looks like your Kerberos authentication failed. Did you renew the ticket? 
Typical expiry is 24 hours.

From: Bogala, Chandra Reddy <chandra.bog...@gs.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Wednesday, February 25, 2015 at 3:42 AM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: GSS initiate failed exception

Hi,
  My Hive JDBC client queries (against HiveServer2) to a secured cluster fail with 
the exception below after running fine from Tomcat for one or two days. Any idea 
what the issue might be? Is it a known issue?

2015-02-25 04:49:43,174 ERROR [org.apache.thrift.transport.TSaslTransport] SASL 
negotiation failure
javax.security.sasl.SaslException: GSS initiate failed
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
 ~[na:1.7.0_06]
at 
org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
 [hive-exec-0.13.0.jar:0.13.0]

--
-
Caused by: org.apache.thrift.transport.TTransportException: GSS initiate failed
at 
org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:221)
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:297)
at 
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
at 
org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203)

Thanks,
Chandra



GSS initiate failed exception

2015-02-25 Thread Bogala, Chandra Reddy
Hi,
  My Hive JDBC client queries (against HiveServer2) to a secured cluster fail with 
the exception below after running fine from Tomcat for one or two days. Any idea 
what the issue might be? Is it a known issue?

2015-02-25 04:49:43,174 ERROR [org.apache.thrift.transport.TSaslTransport] SASL 
negotiation failure
javax.security.sasl.SaslException: GSS initiate failed
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
 ~[na:1.7.0_06]
at 
org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
 [hive-exec-0.13.0.jar:0.13.0]

--
-
Caused by: org.apache.thrift.transport.TTransportException: GSS initiate failed
at 
org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:221)
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:297)
at 
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
at 
org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203)

Thanks,
Chandra



beeline client issue

2014-12-14 Thread Bogala, Chandra Reddy
Hi,
  I am running an HQL script in the background using beeline (i.e. nohup beeline -u 
"" -d org.apache.hive.jdbc.HiveDriver --fastConnect=true -f .hql &). I have also set 
HADOOP_CLIENT_OPTS="-Djline.terminal=jline.UnsupportedTerminal" in my shell.
The job completes successfully and inserts data into HBase. But nohup.out/the console 
fills up with lines like the ones below, and because of this, I guess, the beeline 
process keeps running on my client machine for a long time even after the job 
completes. How can I stop these messages and make sure the background process goes 
away as soon as the job completes?

jdbc:hive2://:1/> 0: jdbc:hive2:// :1/> 0: 
jdbc:hive2:// :1/>

Thanks,
Chandra
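
One way to sidestep the jline prompt echo for scripted runs is to skip beeline 
altogether and execute the HQL file through the Hive JDBC driver; the JVM then exits 
as soon as the last statement finishes. A rough sketch, not from the thread (the URL 
placeholder and the naive split-on-semicolon parsing are assumptions that only hold 
for scripts without semicolons inside literals or comments):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HqlFileRunner {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // args[0] is the .hql file; statements are assumed to be separated by ';'.
        String script = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        try (Connection con = DriverManager.getConnection("jdbc:hive2://<host>:<port>/default");
             Statement stmt = con.createStatement()) {
            for (String sql : script.split(";")) {
                if (!sql.trim().isEmpty()) {
                    stmt.execute(sql.trim());
                }
            }
        }
        // Nothing is left running on the client machine once main() returns.
    }
}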



RE: beeline client

2014-07-13 Thread Bogala, Chandra Reddy
I submit 5 queries from 5 different scripts. When I use the Hive CLI, I can see the 
jobs for all 5 queries in the application tracker in different states 
(running/accepted, etc.). When I submit through beeline, I see only one job in the 
application tracker; the next job appears only after the first one finishes.

Ex: hive -f script1.hql &
    hive -f script2.hql &
    hive -f script3.hql &
    hive -f script4.hql &
    hive -f script5.hql &


From: Xuefu Zhang [mailto:xzh...@cloudera.com]
Sent: Friday, July 11, 2014 7:29 PM
To: user@hive.apache.org
Subject: Re: beeline client

Chandra,
The difference you saw between Hive CLI and Beeline might indicate a bug. However, 
before making such a conclusion, could you give an example of your queries? Are the 
jobs you expect to run in parallel produced by a single query? Please note that your 
script file is executed line by line in either case.
--Xuefu

On Thu, Jul 10, 2014 at 11:55 PM, Bogala, Chandra Reddy <chandra.bog...@gs.com> wrote:
Hi,
   Currently I submit multiple Hive jobs using the Hive CLI with "hive -f" from 
different scripts. I can see all of these jobs in the application tracker and they 
are processed in parallel.
Now I plan to switch to HiveServer2 and submit the jobs using the beeline client from 
multiple scripts, for example: "nohup beeline -u 
jdbc:hive2://:#=$epoch -n  -p  -d 
org.apache.hive.jdbc.HiveDriver -f .hql &".
But the jobs get submitted serially: only one job appears in the application tracker 
at a time even though cluster resources are available, and the next job is submitted 
only after that one finishes. Why is that? Is there any setting that needs to be set 
to submit jobs in parallel?

Thanks,
Chandra
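
One quick way to check whether HiveServer2 itself is serialising the work is to 
submit the queries concurrently over independent JDBC connections: each connection 
gets its own session, and if the scheduler queue has capacity the jobs should then 
show up in the application tracker at the same time. A rough sketch (the URL and the 
placeholder statements stand in for the redacted connection string and the contents 
of script1.hql .. script5.hql):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelSubmit {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholders for the statements the individual scripts would run.
        List<String> queries = Arrays.asList(
                "INSERT OVERWRITE TABLE out1 SELECT COUNT(*) FROM src",
                "INSERT OVERWRITE TABLE out2 SELECT COUNT(*) FROM src",
                "INSERT OVERWRITE TABLE out3 SELECT COUNT(*) FROM src");
        ExecutorService pool = Executors.newFixedThreadPool(queries.size());
        for (final String q : queries) {
            pool.submit(new Runnable() {
                public void run() {
                    // One connection per task, i.e. one HiveServer2 session per query.
                    try (Connection con = DriverManager.getConnection("jdbc:hive2://<host>:<port>/default");
                         Statement stmt = con.createStatement()) {
                        stmt.execute(q);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
    }
}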




beeline client

2014-07-10 Thread Bogala, Chandra Reddy
Hi,
   Currently I submit multiple Hive jobs using the Hive CLI with "hive -f" from 
different scripts. I can see all of these jobs in the application tracker and they 
are processed in parallel.
Now I plan to switch to HiveServer2 and submit the jobs using the beeline client from 
multiple scripts, for example: "nohup beeline -u 
jdbc:hive2://:#=$epoch -n  -p  -d 
org.apache.hive.jdbc.HiveDriver -f .hql &".
But the jobs get submitted serially: only one job appears in the application tracker 
at a time even though cluster resources are available, and the next job is submitted 
only after that one finishes. Why is that? Is there any setting that needs to be set 
to submit jobs in parallel?

Thanks,
Chandra



beeline client

2014-06-22 Thread Bogala, Chandra Reddy
Hi,
  I was using the Hive client to submit jobs. I used to submit multiple jobs (i.e. 
multiple Hive clients, e.g. nohup hive -f job1.hql, nohup hive -f job2.hql, etc.). 
All of these jobs used to appear in the application tracker immediately and ran in 
parallel.

I am trying to do the same with the beeline client and submitted multiple beeline 
jobs (examples below), but I can see only one job running/submitted in the 
application tracker, even though on my client machine I can see multiple beeline 
processes. How can I submit multiple parallel jobs using the beeline client, so that 
my cluster resources are better utilized?


Examples:
nohup beeline -u jdbc:hive2://:1#ptotoCurrentTS=1403456400 -n 
 -p hive -d org.apache.hive.jdbc.HiveDriver   -f 
./hql-hbase-v2/job1_beeline.hql &
nohup beeline -u jdbc:hive2://:1#ptotoCurrentTS=1403457300 -n 
 -p hive -d org.apache.hive.jdbc.HiveDriver   -f 
./hql-hbase-v2/job1_beeline.hql &
nohup beeline -u jdbc:hive2://:1#ptotoCurrentTS=1403458200 -n 
 -p hive -d org.apache.hive.jdbc.HiveDriver   -f 
./hql-hbase-v2/job1_beeline.hql &

Thanks,
Chandra




hive variables

2014-06-20 Thread Bogala, Chandra Reddy
How do Hive variables work if I have multiple Hive jobs running simultaneously? Will 
they end up picking up values from each other?
In automation I construct an HQL file by prepending it with some SET statements. I 
want to make sure that if I submit two jobs at the same time that use the same 
variable names, one job won't pick up values from the other.

Same question on Stack Overflow: 
http://stackoverflow.com/questions/12464636/how-to-set-variables-in-hive-scripts

Thanks,
Chandra
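
For what it's worth, variables set with SET hivevar:... (or passed with --hivevar on 
the command line) live only in the session that set them, so two jobs started from 
separate hive/beeline invocations or separate JDBC connections should not see each 
other's values even if the names collide. A small sketch of that per-session 
behaviour over JDBC (the URL, table and variable names are made up for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HivevarScope {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection("jdbc:hive2://<host>:<port>/default");
             Statement stmt = con.createStatement()) {
            // This variable exists only inside this connection's HiveServer2 session.
            stmt.execute("SET hivevar:cutoff=20140620");
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT COUNT(*) FROM mytable WHERE ts >= ${hivevar:cutoff}")) {
                while (rs.next()) {
                    System.out.println("rows since cutoff: " + rs.getLong(1));
                }
            }
            // A second connection opened by another job would not see hivevar:cutoff.
        }
    }
}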



performance and cluster size required

2014-06-05 Thread Bogala, Chandra Reddy
Hi,
  I get a 300 MB compressed file (structured CSV data) in a spool directory every 3 
minutes from a collector, and I have around 6 collectors. I move the data from the 
spool directory to an HDFS directory and add it as a Hive partition for every 15 
minutes of data. Then I run different aggregation queries and post the data to HBase 
and Mongo, so each query reads around 9 GB of compressed data. For this much data I 
need to evaluate how many cluster nodes are required to finish all the aggregation 
queries in time (within the 15-minute partition window). What is the best way to 
evaluate this?

Also, is there any way I can post aggregated data to both Mongo and HBase (the same 
query result posted to multiple tables, instead of running the same query multiple 
times and inserting into only a single table at a time)?

Thanks,
Chandra


RE: Predicate pushdown optimisation not working for ORC

2014-04-03 Thread Bogala, Chandra Reddy
I thought an ORC file could be generated only by running a Hive query on a staging 
table and inserting into an ORC table. If there is an option to generate an ORC file 
on the client side using Java code, could you share that code or links related to it?
Thanks,
Chandra
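
For reference, ORC files can indeed be written from a plain Java client. The thread 
is on Hive 0.12/0.13, where this is done through 
org.apache.hadoop.hive.ql.io.orc.OrcFile and an ObjectInspector; the sketch below 
instead uses the later standalone ORC library (org.apache.orc), which has a simpler 
API, and borrows the column names from Abhay's query further down. Treat the path, 
schema and values as placeholders:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        TypeDescription schema = TypeDescription.fromString(
                "struct<sourceipv4address:string,sessionid:string,url:string>");
        Writer writer = OrcFile.createWriter(new Path("/tmp/test.orc"),
                OrcFile.writerOptions(conf).setSchema(schema));
        VectorizedRowBatch batch = schema.createRowBatch();
        BytesColumnVector ip = (BytesColumnVector) batch.cols[0];
        BytesColumnVector session = (BytesColumnVector) batch.cols[1];
        BytesColumnVector url = (BytesColumnVector) batch.cols[2];
        // One example row; real code would loop and flush the batch when it fills up.
        int row = batch.size++;
        byte[] ipBytes = "dummy".getBytes(StandardCharsets.UTF_8);
        byte[] sessBytes = "session-1".getBytes(StandardCharsets.UTF_8);
        byte[] urlBytes = "http://example.com".getBytes(StandardCharsets.UTF_8);
        ip.setRef(row, ipBytes, 0, ipBytes.length);
        session.setRef(row, sessBytes, 0, sessBytes.length);
        url.setRef(row, urlBytes, 0, urlBytes.length);
        writer.addRowBatch(batch);
        writer.close();
    }
}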

From: Abhay Bansal [mailto:abhaybansal.1...@gmail.com]
Sent: Thursday, April 03, 2014 11:06 AM
To: user@hive.apache.org
Subject: Predicate pushdown optimisation not working for ORC

I am new to Hive; apologies for asking such a basic question.

The following exercise was done with Hive 0.12 and Hadoop 0.20.203.

I created an ORC file from Java and pushed it into a table with the same schema. I 
checked that the conf property hive.optimize.ppd is true, which should ideally 
enable the PPD optimisation.

I ran a query "select sourceipv4address,sessionid,url from test where 
sourceipv4address="dummy";"

Just to see if the ppd optimization is working I checked the hadoop logs where 
I found

./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
 05:01:39,913 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included 
column ids = 3,8,13
./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
 05:01:39,914 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included 
columns names = sourceipv4address,sessionid,url
./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
 05:01:39,914 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: No ORC 
pushdown predicate

I am not sure which part of it I missed. Any help would be appreciated.

Thanks,
-Abhay


RE: Hadoop streaming with insert dynamic partition generate many small files

2014-02-26 Thread Bogala, Chandra Reddy
Hi,
  I tried with the absolute path and it works fine now. But I have another issue: I 
have the data below in string format and am trying to convert/cast it to another 
format using transform, but the conversion is not done properly. What might be the 
issue?


10347   
[{"protocol":"udp","sum_bytes":20897,"sum_packets":61,"sum_flows":35,"rank":1}, 
{"protocol":"tcp","sum_bytes":20469,"sum_packets":229,"sum_flows":10,"rank":2}, 
{"protocol":"icmp","sum_bytes":828,"sum_packets":13,"sum_flows":9,"rank":3}]


Transform Query:
from mytable2 select transform(x1,x2) using '/bin/cat' as (x1 int, x2 
ARRAY<STRUCT<...>>);

10347   
[{"protocol":"[{\"protocol\":\"udp\",\"sum_bytes\":20897,\"sum_packets\":61,\"sum_flows\":35,\"rank\":1},
 
{\"protocol\":\"tcp\",\"sum_bytes\":20469,\"sum_packets\":229,\"sum_flows\":10,\"rank\":2},
 
{\"protocol\":\"icmp\",\"sum_bytes\":828,\"sum_packets\":13,\"sum_flows\":9,\"rank\":3}]","sum_bytes":null,"sum_packets":null,"sum_flows":null,"rank":null}]


From: Bogala, Chandra Reddy [Tech]
Sent: Monday, February 17, 2014 11:15 PM
To: 'user@hive.apache.org'
Subject: RE: Hadoop streaming with insert dynamic partition generate many small 
files

As I suspected, Hive expects java to be on the path, but in our clusters java is not 
on the path. So I tried specifying the absolute path ('/xxx/yyy/jre/bin/java -cp 
.:embeddedDoc.jar com.yy.xx.mapreduce.NestedDocReduce'), but that throws a different 
exception (script error). Is it a bug? Why is the absolute path not accepted in the 
stream reduce below?

From: Bogala, Chandra Reddy [Tech]
Sent: Thursday, February 13, 2014 10:42 PM
To: 'user@hive.apache.org'
Subject: RE: Hadoop streaming with insert dynamic partition generate many small 
files

Thanks Wang. I have implemented the reducer in Java and am trying to run it with the 
job below, but it fails with "java.io.IOException: error=2, No such file or 
directory".
I am thinking it may be because jar/java cannot be found on the path. Right?

Hive Job:
add jar /home/x/embeddedDoc.jar;

from (from xxx_aggregation_struct_type as mytable
map mytable.tag, mytable.proto_agg
using '/bin/cat' as c1,c2
cluster by c1) mo
insert overwrite table mytable2
reduce mo.c1, mo.c2
using 'java -cp .:embeddedDoc.jar com.yy.xx.mapreduce.NestedDocReduce'
as x1, x2;



Exception:

at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:258)
... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 2]: 
Unable to initialize custom script.
at 
org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:367)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
at 
org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:249)
... 7 more
Caused by: java.io.IOException: Cannot run program "java": error=2, No such 
file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
at 
org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:326)
... 15 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1021)
... 16 more


FAILED: Execution Error, return code 2 from 
org.apache.hadoop.hive.ql.exec.mr.MapRedTask. Unable to initialize custom 
script.

Thanks,
Chandra

From: Chen Wang [mailto:chen.apache.s...@gmail.com]
Sent: Tuesday, February 04, 2014 3:00 AM
To: user@hive.apache.org
Subject: Re: Hadoop streaming with insert dynamic partition generate many small 
files

Chandra,
You don't necessarily need Java to implement the mapper/reducer. Check out the 
answer in this post:
http://stackoverflow.com/questions/6178614/custom-map-reduce-program-on-hi

RE: Hadoop streaming with insert dynamic partition generate many small files

2014-02-17 Thread Bogala, Chandra Reddy
As I suspected, Hive expects java to be on the path, but in our clusters java is not 
on the path. So I tried specifying the absolute path ('/xxx/yyy/jre/bin/java -cp 
.:embeddedDoc.jar com.yy.xx.mapreduce.NestedDocReduce'), but that throws a different 
exception (script error). Is it a bug? Why is the absolute path not accepted in the 
stream reduce below?

From: Bogala, Chandra Reddy [Tech]
Sent: Thursday, February 13, 2014 10:42 PM
To: 'user@hive.apache.org'
Subject: RE: Hadoop streaming with insert dynamic partition generate many small 
files

Thanks Wang. I have implemented the reducer in Java and am trying to run it with the 
job below, but it fails with "java.io.IOException: error=2, No such file or 
directory".
I am thinking it may be because jar/java cannot be found on the path. Right?

Hive Job:
add jar /home/x/embeddedDoc.jar;

from (from xxx_aggregation_struct_type as mytable
map mytable.tag, mytable.proto_agg
using '/bin/cat' as c1,c2
cluster by c1) mo
insert overwrite table mytable2
reduce mo.c1, mo.c2
using 'java -cp .:embeddedDoc.jar com.yy.xx.mapreduce.NestedDocReduce'
as x1, x2;



Exception:

at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:258)
... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 2]: 
Unable to initialize custom script.
at 
org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:367)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
at 
org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:249)
... 7 more
Caused by: java.io.IOException: Cannot run program "java": error=2, No such 
file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
at 
org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:326)
... 15 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1021)
... 16 more


FAILED: Execution Error, return code 2 from 
org.apache.hadoop.hive.ql.exec.mr.MapRedTask. Unable to initialize custom 
script.

Thanks,
Chandra

From: Chen Wang [mailto:chen.apache.s...@gmail.com]
Sent: Tuesday, February 04, 2014 3:00 AM
To: user@hive.apache.org
Subject: Re: Hadoop streaming with insert dynamic partition generate many small 
files

Chandra,
You don't necessarily need Java to implement the mapper/reducer. Check out the 
answer in this post:
http://stackoverflow.com/questions/6178614/custom-map-reduce-program-on-hive-whats-the-rulehow-about-input-and-output

Also, in my sample, A.column1, A.column2 ==> mymapper ==> key, value; mymapper 
simply reads from stdin and converts to key,value.
Chen

On Mon, Feb 3, 2014 at 5:51 AM, Bogala, Chandra Reddy <chandra.bog...@gs.com> wrote:
Hi Wang,

This is my first time trying MAP & REDUCE inside a Hive query. Is it possible to 
share the mymapper and myreducer code, so that I can understand how the columns 
(A.column1, A) are converted to key, value? Also, can you point me to some documents 
to read more about this?
Thanks,
Chandra


From: Chen Wang [mailto:chen.apache.s...@gmail.com]
Sent: Monday, February 03, 2014 12:26 PM
To: user@hive.apache.org
Subject: Re: Hadoop streaming with insert dynamic partition generate many small 
files

It seems that hive.exec.reducers.bytes.per.reducer was still not big enough: I added 
another 0, and now I get only one file under each partition.

On Sun, Feb 2, 2014 at 10:14 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
Hi,
I am using a Java reducer to read from a table and then write to another one:

  FROM (
    FROM (
      SELECT column1,...
      FROM table1
      WHERE ( partition > 6 and partition < 12 )
    ) A
    MAP A.column1, A
    USING 'java -cp .my.jar  mymapper.mymapper'
    AS key, value
    CLUSTER BY key

RE: Hadoop streaming with insert dynamic partition generate many small files

2014-02-13 Thread Bogala, Chandra Reddy
Thanks Wang. I have implemented the reducer in Java and am trying to run it with the 
job below, but it fails with "java.io.IOException: error=2, No such file or 
directory".
I am thinking it may be because jar/java cannot be found on the path. Right?

Hive Job:
add jar /home/x/embeddedDoc.jar;

from (from xxx_aggregation_struct_type as mytable
map mytable.tag, mytable.proto_agg
using '/bin/cat' as c1,c2
cluster by c1) mo
insert overwrite table mytable2
reduce mo.c1, mo.c2
using 'java -cp .:embeddedDoc.jar com.yy.xx.mapreduce.NestedDocReduce'
as x1, x2;



Exception:

at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:258)
... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 2]: 
Unable to initialize custom script.
at 
org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:367)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
at 
org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:249)
... 7 more
Caused by: java.io.IOException: Cannot run program "java": error=2, No such 
file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
at 
org.apache.hadoop.hive.ql.exec.ScriptOperator.processOp(ScriptOperator.java:326)
... 15 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1021)
... 16 more


FAILED: Execution Error, return code 2 from 
org.apache.hadoop.hive.ql.exec.mr.MapRedTask. Unable to initialize custom 
script.

Thanks,
Chandra

From: Chen Wang [mailto:chen.apache.s...@gmail.com]
Sent: Tuesday, February 04, 2014 3:00 AM
To: user@hive.apache.org
Subject: Re: Hadoop streaming with insert dynamic partition generate many small 
files

Chandra,
You don't necessarily need Java to implement the mapper/reducer. Check out the 
answer in this post:
http://stackoverflow.com/questions/6178614/custom-map-reduce-program-on-hive-whats-the-rulehow-about-input-and-output

Also, in my sample, A.column1, A.column2 ==> mymapper ==> key, value; mymapper 
simply reads from stdin and converts to key,value.
Chen

On Mon, Feb 3, 2014 at 5:51 AM, Bogala, Chandra Reddy <chandra.bog...@gs.com> wrote:
Hi Wang,

This is my first time trying MAP & REDUCE inside a Hive query. Is it possible to 
share the mymapper and myreducer code, so that I can understand how the columns 
(A.column1, A) are converted to key, value? Also, can you point me to some documents 
to read more about this?
Thanks,
Chandra


From: Chen Wang [mailto:chen.apache.s...@gmail.com]
Sent: Monday, February 03, 2014 12:26 PM
To: user@hive.apache.org
Subject: Re: Hadoop streaming with insert dynamic partition generate many small 
files

It seems that hive.exec.reducers.bytes.per.reducer was still not big enough: I added 
another 0, and now I get only one file under each partition.

On Sun, Feb 2, 2014 at 10:14 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
Hi,
I am using a Java reducer to read from a table and then write to another one:

  FROM (
    FROM (
      SELECT column1,...
      FROM table1
      WHERE ( partition > 6 and partition < 12 )
    ) A
    MAP A.column1, A
    USING 'java -cp .my.jar  mymapper.mymapper'
    AS key, value
    CLUSTER BY key
  ) map_output
  INSERT OVERWRITE TABLE target_table PARTITION(partition)
  REDUCE
    map_output.key,
    map_output.value
  USING 'java -cp .:myjar.jar  myreducer.myreducer'
  AS column1, column2;

It's all working fine, except that there are many (20-30) small files generated 
under each partition. I am setting SET 
hive.exec.reducers.bytes.per.reducer=1280,000,000; hoping to get one big enough file 
for each partition, but it does not seem to have any effect. I still get 20-30 small 
files under each folder, and each file size is around 7 kb.

distributed cache

2014-02-06 Thread Bogala, Chandra Reddy
Hi,
  All of my Hive jobs require the "mongodb-java-driver, mongo-hadoop-core, 
mongo-hadoop-hive" jars to execute successfully. I don't have cluster access to copy 
these jars, so I use the distributed cache (add jar ) for every job to make the jars 
available to the M/R tasks. Because of this approach, my user file cache (tmp 
folder) fills up very quickly. Is there a way to copy these jars to the distributed 
cache only once and reuse them for all jobs, so that I can avoid the file cache 
filling up?

Thanks,
Chandra


RE: Hadoop streaming with insert dynamic partition generate many small files

2014-02-03 Thread Bogala, Chandra Reddy
Hi Wang,

This is my first time trying MAP & REDUCE inside a Hive query. Is it possible to 
share the mymapper and myreducer code, so that I can understand how the columns 
(A.column1, A) are converted to key, value? Also, can you point me to some documents 
to read more about this?
Thanks,
Chandra
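
The mapper/reducer themselves never made it into the thread, but as a rough 
illustration: with the default row format, Hive streams each row to the script as 
one tab-separated line on stdin (rows arrive grouped because of CLUSTER BY) and 
reads tab-separated lines back from stdout. A minimal Java reducer along those lines 
could look like the sketch below; the class name echoes the NestedDocReduce 
mentioned elsewhere in this thread, and the comma-joined aggregation is made up 
purely for illustration:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class NestedDocReduce {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String currentKey = null;
        StringBuilder values = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            String[] cols = line.split("\t", 2);       // column 0 = key, column 1 = value
            String key = cols[0];
            String value = cols.length > 1 ? cols[1] : "";
            if (currentKey != null && !currentKey.equals(key)) {
                System.out.println(currentKey + "\t" + values);   // emit the finished group
                values.setLength(0);
            }
            if (values.length() > 0) {
                values.append(',');
            }
            values.append(value);
            currentKey = key;
        }
        if (currentKey != null) {
            System.out.println(currentKey + "\t" + values);       // emit the last group
        }
    }
}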


From: Chen Wang [mailto:chen.apache.s...@gmail.com]
Sent: Monday, February 03, 2014 12:26 PM
To: user@hive.apache.org
Subject: Re: Hadoop streaming with insert dynamic partition generate many small 
files

It seems that hive.exec.reducers.bytes.per.reducer was still not big enough: I added 
another 0, and now I get only one file under each partition.

On Sun, Feb 2, 2014 at 10:14 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
Hi,
I am using a Java reducer to read from a table and then write to another one:

  FROM (
    FROM (
      SELECT column1,...
      FROM table1
      WHERE ( partition > 6 and partition < 12 )
    ) A
    MAP A.column1, A
    USING 'java -cp .my.jar  mymapper.mymapper'
    AS key, value
    CLUSTER BY key
  ) map_output
  INSERT OVERWRITE TABLE target_table PARTITION(partition)
  REDUCE
    map_output.key,
    map_output.value
  USING 'java -cp .:myjar.jar  myreducer.myreducer'
  AS column1, column2;

It's all working fine, except that there are many (20-30) small files generated 
under each partition. I am setting SET 
hive.exec.reducers.bytes.per.reducer=1280,000,000; hoping to get one big enough file 
for each partition, but it does not seem to have any effect. I still get 20-30 small 
files under each folder, and each file size is around 7 kb.

How can I force it to generate only one big file per partition? Does this have 
anything to do with the streaming? I recall that in the past, when I was reading 
directly from a table with a UDF and writing to another table, it generated only one 
big file for the target partition. Not sure why that is.



Any help appreciated.

Thanks,

Chen







RE: casting complex data types for outputs of custom scripts

2014-01-20 Thread Bogala, Chandra Reddy
Would it be possible to share the python script that does the conversion?

Thanks,
Chandra
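
The python script itself is not in the thread; as a rough Java illustration of the 
delimiter point Stephen makes further down, a transform script emitting an 
array<struct<...>> column would separate the array elements with ^B (\002) and the 
struct fields inside each element with ^C (\003), with the top-level output columns 
separated by the usual field delimiter. The exact delimiters depend on the ROW 
FORMAT in use, so treat this purely as a sketch with made-up values:

public class EmitNestedColumn {
    public static void main(String[] args) {
        char field = '\t';      // between the top-level columns of the script output
        char item = '\u0002';   // ^B: between the elements of the array
        char nested = '\u0003'; // ^C: between the fields of each struct element
        String arrayColumn = "udp" + nested + "20897"
                + item + "tcp" + nested + "20469";
        // One output row: an int key followed by one array<struct<...>> column.
        System.out.println("10347" + field + arrayColumn);
    }
}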

-Original Message-
From: rohan monga [mailto:monga.ro...@gmail.com] 
Sent: Monday, January 20, 2014 6:08 AM
To: user@hive.apache.org
Subject: Re: casting complex data types for outputs of custom scripts

sorry for the delayed response.

yes the python script follows that.

--
Rohan Monga


On Tue, Jan 14, 2014 at 4:31 PM, Stephen Sprague  wrote:
> @OP - first thing i'd ask is does your python script obey the
> ^A,^B,^C,^D etc. nesting delimiter pattern.  given that your create
> table does not specify delimiters those are the defaults.  nb. ^A ==
> control-A == \001
>
> Cheers,
> Stephen.
>
>
> On Tue, Jan 14, 2014 at 3:11 PM, Andre Araujo  wrote:
>>
>> I had a similar issue in the past when trying to cast an empty array 
>> to array(). By default Hive assumes it's an array().
>> I don't think there's currently a Hive syntax to cast values to 
>> complex data types. If there's one, I'd love to know what it is :)
>>
>>
>> On 14 January 2014 10:22, rohan monga  wrote:
>>>
>>> Hi,
>>>
>>> I have a table that is of the following format
>>>
>>> create table t1 ( f1 int, f2 array<struct<...>> );
>>>
>>> Now I have a custom script that does some computation and generates 
>>> the value for f2 like so
>>>
>>> from (
>>> from randomtable r
>>> map r.g1, r.g2, r.g3
>>> using '/bin/cat' as g1, g2, g3
>>> cluster by g1 ) m
>>> insert overwrite table t1
>>> reduce m.g1, m.g2, m.g3
>>> using 'python customScript.py' as ( f1 , f2 );
>>>
>>> however f2 is not being loaded properly into t1, it comes up broken 
>>> or null. What should I do so that f2 is loaded as an array of structs.
>>>
>>>
>>> Thanks,
>>>
>>> --
>>> Rohan Monga
>>
>>
>>
>>
>> --
>> André Araújo
>> Big Data Consultant/Solutions Architect The Pythian Group - Australia 
>> - www.pythian.com
>>
>> Office (calls from within Australia): 1300 366 021 x1270
>> Office (international): +61 2 8016 7000  x270 OR +1 613 565 8696   x1270
>> Mobile: +61 410 323 559
>> Fax: +61 2 9805 0544
>> IM: pythianaraujo @ AIM/MSN/Y! or ara...@pythian.com @ GTalk
>>
>> "Success is not about standing at the top, it's the steps you leave 
>> behind." - Iker Pou (rock climber)
>>
>> --
>>
>>
>>
>


RE: complex datatypes filling

2014-01-16 Thread Bogala, Chandra Reddy
Thanks for the quick reply. I will take a look at the stream job and the transform 
functions.
One more question:
I have multiple CSV files (same structure, with the directory added as a partition) 
mapped to a Hive table. I then run different GROUP BY jobs on the same data, like 
the ones below. All of these are spawned as separate jobs, so multiple mappers 
read/fetch the same data from disk and then compute the different group/aggregation 
results.
Each job below fetches the same data from disk. Can this be avoided by reading each 
split only once, with the mapper computing the different GROUP BYs itself? That way 
the number of mappers would come down drastically and, more importantly, multiple 
disk seeks over the same data would be avoided. Do I need to write a custom 
MapReduce job to do this?


1)  Insert into temptable1 select TAG,col2,SUM(col5) as SUM_col5,SUM(col6) 
as SUM_col6,SUM(col7) as SUM_col7,ts  from raw_data_by_epoch where 
ts=${hivevar:collectiontimestamp} group by TAG,col2,TS


2)  Insert into temptable2 select TAG,col2,col3,SUM(col5) as 
SUM_col5,SUM(col6) as SUM_col6,SUM(col7) as SUM_col7,ts  from raw_data_by_epoch 
where ts=${hivevar:collectiontimestamp} group by TAG,col2,col3,TS


3)  Insert into temptable3 select TAG,col2,col3,col4,SUM(col5) as 
SUM_col5,SUM(col6) as SUM_col6,SUM(col7) as SUM_col7,ts  from raw_data_by_epoch 
where ts=${hivevar:collectiontimestamp} group by TAG,col2,col3,col4,TS


Thanks,
Chandra
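
One way to read the partition only once for all three aggregations is Hive's 
multi-table insert: a single FROM clause feeding several INSERT ... GROUP BY 
clauses, so the source is scanned once. A rough sketch submitted over JDBC, reusing 
the table and column names from the three queries above (the URL and the example 
timestamp value are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MultiInsertSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // One scan of raw_data_by_epoch feeds all three aggregation tables.
        String hql =
              "FROM raw_data_by_epoch "
            + "INSERT INTO TABLE temptable1 "
            + "  SELECT tag, col2, SUM(col5), SUM(col6), SUM(col7), ts "
            + "  WHERE ts = ${hivevar:collectiontimestamp} GROUP BY tag, col2, ts "
            + "INSERT INTO TABLE temptable2 "
            + "  SELECT tag, col2, col3, SUM(col5), SUM(col6), SUM(col7), ts "
            + "  WHERE ts = ${hivevar:collectiontimestamp} GROUP BY tag, col2, col3, ts "
            + "INSERT INTO TABLE temptable3 "
            + "  SELECT tag, col2, col3, col4, SUM(col5), SUM(col6), SUM(col7), ts "
            + "  WHERE ts = ${hivevar:collectiontimestamp} GROUP BY tag, col2, col3, col4, ts";
        try (Connection con = DriverManager.getConnection("jdbc:hive2://<host>:<port>/default");
             Statement stmt = con.createStatement()) {
            stmt.execute("SET hivevar:collectiontimestamp=1389900000");
            stmt.execute(hql);
        }
    }
}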

From: Stephen Sprague [mailto:sprag...@gmail.com]
Sent: Friday, January 17, 2014 11:39 AM
To: user@hive.apache.org
Subject: Re: complex datatypes filling

Remember you can always set up a stream job to do any wild and crazy custom thing 
you want. See the transform() function documentation. It's really quite easy. 
Honest.

On Thu, Jan 16, 2014 at 9:39 PM, Bogala, Chandra Reddy <chandra.bog...@gs.com> wrote:
Hi,
  I have found a lot of examples for mapping JSON data into Hive complex data types 
(map, array, struct, etc.). But I don't see anywhere how to fill complex data types 
from a nested SQL query (i.e. group by a few columns (the key), with an array of 
structs (the remaining columns) containing the result values), so that it is easy to 
map the result back into an embedded/nested JSON document.

Thanks,
Chandra



complex datatypes filling

2014-01-16 Thread Bogala, Chandra Reddy
Hi,
  I have found a lot of examples for mapping JSON data into Hive complex data types 
(map, array, struct, etc.). But I don't see anywhere how to fill complex data types 
from a nested SQL query (i.e. group by a few columns (the key), with an array of 
structs (the remaining columns) containing the result values), so that it is easy to 
map the result back into an embedded/nested JSON document.

Thanks,
Chandra


Hive setup

2014-01-08 Thread Bogala, Chandra Reddy
Hi,
   Is there any performance difference between running jobs with the Hive client 
(the hive -f option inside shell scripts) and configuring HiveServer2 and running 
jobs through its Thrift service from Java programs?
Which is the preferred option, and why?

Thanks,
Chandra


RE: merge columns and count no of records

2014-01-08 Thread Bogala, Chandra Reddy
Or is it a good idea to pull the data into a shell variable/file and do the 
processing there, or to use a Pig script?

hive -e 'select distinct(columnA), distinct(columnB)  from blah' | sed 
's/[\t]/,/g' >/tmp/test

Thanks,
Chandra


From: Bogala, Chandra Reddy [Tech]
Sent: Wednesday, January 08, 2014 5:49 PM
To: 'user@hive.apache.org'
Subject: merge columns and count no of records

Hi,
My requirement is to merge (not concat) two columns and count the number of distinct 
records. I can use a self-join on column A and column B and count the number of 
records, but that does not look like an optimal way of doing it. Is there a better 
way to do this?

Ex: Original table

Column A | Column B
---------+---------
    1    |    2
    2    |    3
    5    |    6
    4    |    7
    1    |    2
    4    |    2

Logic is something like this: Count(Distinct(Merge(distinct(A), distinct(B))))
Query OUTPUT should be: 7
Values {1,2,3,4,5,6,7}

Thanks,
Chandra
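
A single query can do the merge without a self-join: UNION ALL the two columns into 
one and count the distinct values of the result. A rough sketch over JDBC, using the 
column and table names from the hive -e example above (the URL is a placeholder):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MergedDistinctCount {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // columnA and columnB are stacked into one column, then counted distinctly.
        String hql =
              "SELECT COUNT(DISTINCT v) FROM ("
            + "  SELECT columnA AS v FROM blah"
            + "  UNION ALL"
            + "  SELECT columnB AS v FROM blah"
            + ") merged";
        try (Connection con = DriverManager.getConnection("jdbc:hive2://<host>:<port>/default");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(hql)) {
            if (rs.next()) {
                System.out.println("distinct merged values: " + rs.getLong(1));
            }
        }
    }
}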


merge columns and count no of records

2014-01-08 Thread Bogala, Chandra Reddy
Hi,
My requirement is to merge (not concat) two columns and count the number of distinct 
records. I can use a self-join on column A and column B and count the number of 
records, but that does not look like an optimal way of doing it. Is there a better 
way to do this?

Ex: Original table

Column A | Column B
---------+---------
    1    |    2
    2    |    3
    5    |    6
    4    |    7
    1    |    2
    4    |    2

Logic is something like this: Count(Distinct(Merge(distinct(A), distinct(B))))
Query OUTPUT should be: 7
Values {1,2,3,4,5,6,7}

Thanks,
Chandra


How to generate json/complex object type from hive table

2014-01-07 Thread Bogala, Chandra Reddy
Hi,
How do I generate JSON data from table data that is in Hive? For example, I have 
data in the table format below and want to generate data in the JSON format further 
below.
I want to group by person name and fill the STRUCT and ARRAY for that person, so 
that I finally get one row per person. I tried NAMED_STRUCT('NAME', NAME, 'AREA', 
AREA) etc., but the struct object is filled with only one entry, so I am getting 
multiple rows per person. I need just one row per person with all the embedded data.

Name  | CITY.Name    | City.AREA | Child.Name
------+--------------+-----------+-----------
Rok   | Grosuplje    | 12544     | Matej-1
Rok   | Grosuplje    | 12544     | Matej-2
Rok   | Grosuplje    | 12544     | Matej-3
Simon | Spodnji Breg | 362.2354  | Simonca-1
Simon | Spodnji Breg | 362.2354  | Simonca-2
Simon | Spodnji Breg | 362.2354  | Simonca-3
Simon | Spodnji Breg | 362.2354  | Simonca-4



Create table person (
   Name STRING,
   CITY STRUCT<NAME:STRING, AREA:DOUBLE>,
   CHILDREN ARRAY<STRUCT<NAME:STRING>>)

   STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'

 WITH SERDEPROPERTIES('mongo.columns.mapping'=',.')

 TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/test.persons')




{ "name":"Rok", "city":{"name":"Grosuplje", "area":12544}, 
"children":[{"name":"Matej"}]}
{ "name":"Melanija", "children":[]}
{ "name":"Simon", "city":{"name":"Spodnji Breg", "area":362.2354}, 
"children":[{"name":"Simonca"},{"name

Thanks,
Chandra
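
For the grouping step, one approach on reasonably recent Hive is to build the struct 
with named_struct and the array with collect_list, grouped by person (older releases 
reject non-primitive arguments to collect_list/collect_set and typically use a UDF 
such as Brickhouse's collect instead). A sketch over JDBC; the flat source table 
person_flat and its column names are assumptions standing in for the tabular data 
above, and the URL is a placeholder:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class NestedPersonInsert {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // One output row per person: a city struct plus an array of child structs.
        String hql =
              "INSERT OVERWRITE TABLE person "
            + "SELECT name, "
            + "       named_struct('name', city_name, 'area', city_area) AS city, "
            + "       collect_list(named_struct('name', child_name)) AS children "
            + "FROM person_flat "
            + "GROUP BY name, city_name, city_area";
        try (Connection con = DriverManager.getConnection("jdbc:hive2://<host>:<port>/default");
             Statement stmt = con.createStatement()) {
            stmt.execute(hql);
        }
    }
}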