Cannot insert into a bucketized table from the same table
Hive version: 0.10.0-cdh4.2.1

Trying to insert into a bucketized table from the same table throws an exception:

FAILED: SemanticException [Error 10122]: Bucketized tables do not support INSERT INTO: Table:

To test the scenario, I created the following 3 test tables:

create table temp1 (a int) PARTITIONED BY (b string, c string) clustered by (a) into 2 buckets stored as RCFILE;
create table temp2 (a int) stored as RCFILE;
create table temp3 (a int) PARTITIONED BY (b string, c string) stored as RCFILE;

When I run the insert queries below, queries 1, 3 and 4 work fine, while query 2 fails with the above-mentioned exception.

1. insert into table temp1 partition(b='1', c='1') select * from temp2;
2. insert into table temp1 partition(b='1', c='1') select * from temp2 t2 join temp1 t1 on (t2.a = t1.a);
3. insert into table temp1 partition(b='1', c='1') select t3.a from (select t1.a from temp2 t2 join temp1 t1 on (t2.a = t1.a)) tt join temp3 t3 on (tt.a = t3.a);
4. insert into table temp3 partition(b='1', c='1') select t2.a from temp2 t2 join temp3 t1 on (t2.a = t1.a);

So the exception occurs only if the target table is bucketed and it uses itself directly in the join tables. For some reason case 3 works fine, where the final join is with some other table. Has anyone faced this issue before? It looks like a Hive bug. Is there any workaround?
Re: Cannot insert into a bucketized table from the same table
Neha, it's not a bug. Hive does not support INSERT INTO a bucketed table as of now. There is a patch available for this at HIVE-3244 (https://issues.apache.org/jira/browse/HIVE-3244). You can also refer to HIVE-3077 (https://issues.apache.org/jira/browse/HIVE-3077).

On Thu, Aug 29, 2013 at 2:32 PM, Neha Sood neha.sood.agar...@gmail.com wrote:
[...]

--
Nitin Pawar
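One possible workaround until that patch lands (a hedged sketch using the test tables from this thread; `temp1_stage` is a hypothetical staging table name, and this has not been verified against 0.10.0-cdh4.2.1) is to stage the self-join result in a plain, non-bucketed table first, so the final insert no longer references the bucketed target in its own source — the same shape as query 1, which works:

```sql
-- Stage the self-join result in a non-bucketed table first...
create table temp1_stage (a int) stored as RCFILE;

insert into table temp1_stage
  select t2.a from temp2 t2 join temp1 t1 on (t2.a = t1.a);

-- ...then insert into the bucketed table from the staging table only,
-- which matches the pattern of query 1 above.
insert into table temp1 partition(b='1', c='1')
  select a from temp1_stage;
```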
Problem with min function in HiveQL
Hi everybody,

I am coding some HiveQL script to do some calculations. I have a problem with the min() function. My hive script is below:

INSERT INTO TABLE default.THM_CA_RGRP_PRODUITS_SEM
SELECT
  '${hiveconf:in_co_societe}' as co_societe,
  '${hiveconf:in_co_an_semaine}' as co_an_semaine,
  a.type_entite as type_entite,
  a.code_entite as code_entite,
  a.type_rgrp_produits as type_rgrp_produits,
  a.co_rgrp_produits as co_rgrp_produits,
  SUM(a.MT_CA_NET_TTC) as MT_CA_NET_TTC,
  SUM(a.MT_OBJ_CA_NET_TTC) as MT_OBJ_CA_NET_TTC,
  SUM(a.NB_CLIENTS) as NB_CLIENTS,
  SUM(a.MT_CA_NET_TTC_COMP) as MT_CA_NET_TTC_COMP,
  SUM(a.MT_OBJ_CA_NET_TTC_COMP) as MT_OBJ_CA_NET_TTC_COMP,
  SUM(a.NB_CLIENTS_COMP) as NB_CLIENTS_COMP
FROM default.THM_CA_RGRP_PRODUITS_JOUR a
JOIN default.CALENDRIER b
-- to be verified
WHERE CAST(a.dt_jour AS TIMESTAMP) >= CAST(min(UNIX_TIMESTAMP(b.dt_jour)) AS BIGINT)
  AND CAST(a.dt_jour AS TIMESTAMP) < CAST(min(b.dt_jour)+1 AS TIMESTAMP)
  AND a.co_societe = '${hiveconf:in_co_societe}'
  AND ('${hiveconf:flg_reprise_entite}' = 0 OR a.type_entite <> 'MAG')
GROUP BY a.type_entite, a.code_entite, a.type_rgrp_produits, a.co_rgrp_produits;

And when I try to launch this, I get this error:

FAILED: SemanticException [Error 10128]: Line 20:62 Not yet supported place for UDAF 'min'

Obviously, there is a problem with the min() function. How can I get past this error?

Thanks for your help
Re: Problem with min function in HiveQL
The min function at column 62 is on the column b.dt_jour. What datatype is that? If it's of type 'timestamp', that might explain it.

On Thu, Aug 29, 2013 at 3:01 AM, Jérôme Verdier verdier.jerom...@gmail.com wrote:
[...]
Re: Problem with min function in HiveQL
Hi Stephen,

Thanks for your reply. Indeed, dt_jour is in timestamp format. What is the problem with this?

-----Original Message-----
From: Stephen Sprague sprag...@gmail.com
Date: Thu, 29 Aug 2013 09:24:27
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Problem with min function in HiveQL
[...]
Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors
Hi All:

Put some debugging code in TUGIContainingTransport.getTransport() and I tracked it down to:

@Override
public TUGIContainingTransport getTransport(TTransport trans) {
    // UGI information is not available at connection setup time,
    // it will be set later via set_ugi() rpc.
    transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));
    // return transMap.get(trans);  <-- change
    TUGIContainingTransport retTrans = transMap.get(trans);
    if ( retTrans == null ) {

On Wed, Jul 31, 2013 at 9:48 AM, agateaaa agate...@gmail.com wrote:

Thanks Nitin

There aren't too many connections in CLOSE_WAIT state, only one or two when we run into this. Most likely it's because of a dropped connection. I could not find any read or write timeouts we can set for the Thrift server which will tell Thrift to hold on to the client connection. See https://issues.apache.org/jira/browse/HIVE-2006, but it doesn't seem to have been implemented yet. We have set a client connection timeout but cannot find an equivalent setting for the server.

We have a suspicion that this happens when we run two client processes which modify two distinct partitions of the same Hive table. We put in a workaround so that the two Hive client processes never run together, and so far things look OK, but we will keep monitoring.

Could it be that, because the Hive metastore server is not thread safe, running two ALTER TABLE statements on two distinct partitions of the same table using two client connections causes problems like these, where the metastore server closes or drops the wrong client connection and leaves the other hanging?

Agateaaa

On Tue, Jul 30, 2013 at 12:49 AM, Nitin Pawar nitinpawar...@gmail.com wrote:

The mentioned flow is called when you have unsecure mode of thrift metastore client-server connection. So one way to avoid this is to have a secure way.

public boolean process(final TProtocol in, final TProtocol out) throws TException {
    setIpAddress(in);
    ...

@Override
protected void setIpAddress(final TProtocol in) {
    TUGIContainingTransport ugiTrans = (TUGIContainingTransport) in.getTransport();
    Socket socket = ugiTrans.getSocket();
    if (socket != null) {
        setIpAddress(socket);

From the above code snippet, it looks like the null pointer exception is not handled if getSocket() returns null.

Can you check what the ulimit setting is on the server? If it's set to the default, can you set it to unlimited and restart the hcat server? (This is just a wild guess.) Also, the getSocket() method suggests: if the underlying TTransport is an instance of TSocket, it returns the Socket object which it contains; otherwise it returns null. So someone from the Thrift gurus needs to tell us what's happening. I have no knowledge of this depth; maybe Ashutosh or Thejas will be able to help on this.

From the netstat CLOSE_WAIT, it looks like the hive metastore server has not closed the connection (do not know why yet); maybe the Hive dev guys can help. Are there too many connections in CLOSE_WAIT state?

On Tue, Jul 30, 2013 at 5:52 AM, agateaaa agate...@gmail.com wrote:

Looking at the hive metastore server logs, we see errors like these:

2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer (TThreadPoolServer.java:run(182)) - Error occurred during processing of message.
java.lang.NullPointerException
    at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
    at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

at approximately the same time as we see the timeout or connection reset errors. Don't know if this is the cause or a side effect of the connection timeout/connection reset errors.

Does anybody have any pointers or suggestions? Thanks

On Mon, Jul 29, 2013 at 11:29 AM, agateaaa agate...@gmail.com wrote:

Thanks Nitin! We have a similar setup (identical hcatalog and hive server versions) in another production environment and don't see any errors (it's been running OK for a few months). Unfortunately we won't be able to move to hcat 0.5 and hive 0.11 or hive 0.10 soon. I did see that the last time we ran into this problem, doing a netstat -ntp | grep :1 showed that the server was holding on to one socket connection in CLOSE_WAIT state for a long time (the hive metastore server is running on port 1). Don't know if that's relevant here or not. Can you suggest any hive configuration settings we can tweak, or networking tools/tips we can use to
Re: Problem with min function in HiveQL
Well, I would suggest you test whether or not min() works on the timestamp datatype. It seems like something one should rule out first before going further down the rabbit hole. My opinion only!

On Thu, Aug 29, 2013 at 9:28 AM, Jérôme Verdier verdier.jerom...@gmail.com wrote:
[...]
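A minimal check along the lines Stephen suggests (a hedged sketch; the table and column names are the ones from this thread) would be to run min() over the timestamp column by itself, outside any WHERE clause:

```sql
-- If this standalone aggregation fails, the timestamp datatype is the
-- problem; if it succeeds, the issue is *where* min() is being used
-- (i.e. inside the WHERE clause of the original query).
SELECT min(b.dt_jour) FROM default.CALENDRIER b;
```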
Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors
Sorry, hit send too soon ...

Hi All:

Put some debugging code in TUGIContainingTransport.getTransport() and I tracked it down to:

@Override
public TUGIContainingTransport getTransport(TTransport trans) {
    // UGI information is not available at connection setup time,
    // it will be set later via set_ugi() rpc.
    transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));
    // return transMap.get(trans);  <-- change
    TUGIContainingTransport retTrans = transMap.get(trans);
    if ( retTrans == null ) {
        LOGGER.error("cannot find transport that was in map !!");
    } else {
        LOGGER.debug("found transport in map");
        return retTrans;
    }
}

When we run this in our test environment, we see that we run into the problem just after GC runs, and the "cannot find transport that was in map !!" message gets logged. Could the GC be collecting entries from transMap just before we get them?

Tried a minor change which seems to work:

public TUGIContainingTransport getTransport(TTransport trans) {
    TUGIContainingTransport retTrans = transMap.get(trans);
    if ( retTrans == null ) {
        // UGI information is not available at connection setup time,
        // it will be set later via set_ugi() rpc.
        retTrans = new TUGIContainingTransport(trans);
        transMap.putIfAbsent(trans, retTrans);
    }
    return retTrans;
}

My questions for the Hive and Thrift experts:

1.) Do we need to use a ConcurrentMap?

ConcurrentMap<TTransport, TUGIContainingTransport> transMap =
    new MapMaker().weakKeys().weakValues().makeMap();

It uses == to compare keys (which might be the problem); also in this case we can't rely on the trans always being in the transMap, even after a put, so in that case the change above probably makes sense.

2.) Is it a better idea to use a WeakHashMap with WeakReference values instead? (I was looking at org.apache.thrift.transport.TSaslServerTransport, esp. the change made by THRIFT-1468.) E.g.:

private static Map<TTransport, WeakReference<TUGIContainingTransport>> transMap3 =
    Collections.synchronizedMap(new WeakHashMap<TTransport, WeakReference<TUGIContainingTransport>>());

getTransport() would be something like:

public TUGIContainingTransport getTransport(TTransport trans) {
    WeakReference<TUGIContainingTransport> ret = transMap3.get(trans);
    if (ret == null || ret.get() == null) {
        ret = new WeakReference<TUGIContainingTransport>(new TUGIContainingTransport(trans));
        transMap3.put(trans, ret); // No need for putIfAbsent();
        // concurrent calls to getTransport() will pass in different TTransports.
    }
    return ret.get();
}

I did try 1.) above in our test environment and it does seem to resolve the problem, though I am not sure if I am introducing any other problem. Can someone help?

Thanks
Agatea

On Thu, Aug 29, 2013 at 10:57 AM, agateaaa agate...@gmail.com wrote:
[...]
Re: Problem with min function in HiveQL
Looks like the issue is the use of min() within the WHERE clause - the place where the exception is being thrown has the following comment:

// UDAF in filter condition, group-by caluse, param of funtion, etc.

On Aug 29, 2013, at 3:01 AM, Jérôme Verdier verdier.jerom...@gmail.com wrote:
[...]
Re: Problem with min function in HiveQL
Aggregate functions need to go in a HAVING clause instead of the WHERE clause. WHERE clauses are applied prior to aggregation; HAVING is applied post-aggregation.

select ...
from ...
where <some row level filter>
group by ...
having <some aggregate level filter>

On Thu, Aug 29, 2013 at 2:49 PM, Jason Dere jd...@hortonworks.com wrote:
[...]
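Applied to the query in this thread, one way around the restriction (a hedged sketch; it assumes the intent is to keep only the rows for the earliest CALENDRIER date, and the table/column names and the `min_jour` alias are taken from or invented for this thread) is to compute min(dt_jour) once in a subquery and join against it, so no aggregate ever appears in the WHERE clause:

```sql
-- Compute the minimum calendar date in a subquery, then treat it as an
-- ordinary value in the outer WHERE clause instead of calling min() there.
INSERT INTO TABLE default.THM_CA_RGRP_PRODUITS_SEM
SELECT
  a.type_entite, a.code_entite, a.type_rgrp_produits, a.co_rgrp_produits,
  SUM(a.MT_CA_NET_TTC) AS MT_CA_NET_TTC
  -- ... remaining SUM() columns as in the original query ...
FROM default.THM_CA_RGRP_PRODUITS_JOUR a
JOIN (SELECT min(dt_jour) AS min_jour FROM default.CALENDRIER) b
WHERE CAST(a.dt_jour AS TIMESTAMP) >= CAST(b.min_jour AS TIMESTAMP)
  AND CAST(a.dt_jour AS TIMESTAMP) < CAST(b.min_jour + 1 AS TIMESTAMP)
GROUP BY a.type_entite, a.code_entite, a.type_rgrp_produits, a.co_rgrp_produits;
```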
Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors
Thanks Agatea for digging in. Seems like you have hit a bug. Would you mind opening a JIRA and adding your findings to it?

Thanks,
Ashutosh

On Thu, Aug 29, 2013 at 11:22 AM, agateaaa agate...@gmail.com wrote:
[...]
Re: Problem with min function in HiveQL
Indeed. You nailed it.

On Thu, Aug 29, 2013 at 11:53 AM, John Meagher john.meag...@gmail.com wrote:
[...]
Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors
Thanks Ashutosh. Filed https://issues.apache.org/jira/browse/HIVE-5172

On Thu, Aug 29, 2013 at 11:53 AM, Ashutosh Chauhan hashut...@apache.org wrote:

Thanks Agatea for digging in. Seems like you have hit a bug. Would you mind opening a jira and adding your findings to it?

Thanks,
Ashutosh

On Thu, Aug 29, 2013 at 11:22 AM, agateaaa agate...@gmail.com wrote:

Sorry, hit send too soon ...

Hi All:

I put some debugging code in TUGIContainingTransport.getTransport() and tracked it down to:

    @Override
    public TUGIContainingTransport getTransport(TTransport trans) {
        // UGI information is not available at connection setup time, it will be set later
        // via set_ugi() rpc.
        transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));
        // return transMap.get(trans);  // <- change
        TUGIContainingTransport retTrans = transMap.get(trans);
        if (retTrans == null) {
            LOGGER.error("cannot find transport that was in map !!");
        } else {
            LOGGER.debug("found transport in map");
        }
        return retTrans;
    }

When we run this in our test environment, we see that we run into the problem just after GC runs, and the "cannot find transport that was in map !!" message gets logged. Could the GC be collecting entries from transMap just before we get them?

I tried a minor change which seems to work:

    public TUGIContainingTransport getTransport(TTransport trans) {
        TUGIContainingTransport retTrans = transMap.get(trans);
        if (retTrans == null) {
            // UGI information is not available at connection setup time, it will be set later
            // via set_ugi() rpc.
            retTrans = new TUGIContainingTransport(trans);
            transMap.putIfAbsent(trans, retTrans);
        }
        return retTrans;
    }

My questions for hive and thrift experts:

1.) Do we need to use a ConcurrentMap?

    ConcurrentMap<TTransport, TUGIContainingTransport> transMap =
        new MapMaker().weakKeys().weakValues().makeMap();

It uses == to compare keys (which might be the problem); also, in this case we can't rely on trans always being in transMap, even after a put, so the change above probably makes sense.

2.) Is it a better idea to use a WeakHashMap with WeakReference values instead? (I was looking at org.apache.thrift.transport.TSaslServerTransport, esp. the change made by THRIFT-1468.) E.g.:

    private static Map<TTransport, WeakReference<TUGIContainingTransport>> transMap3 =
        Collections.synchronizedMap(
            new WeakHashMap<TTransport, WeakReference<TUGIContainingTransport>>());

getTransport() would be something like:

    public TUGIContainingTransport getTransport(TTransport trans) {
        WeakReference<TUGIContainingTransport> ret = transMap3.get(trans);
        if (ret == null || ret.get() == null) {
            ret = new WeakReference<TUGIContainingTransport>(new TUGIContainingTransport(trans));
            transMap3.put(trans, ret); // No need for putIfAbsent().
            // Concurrent calls to getTransport() will pass in different TTransports.
        }
        return ret.get();
    }

I did try 1.) above in our test environment and it does seem to resolve the problem, though I am not sure if I am introducing any other problem. Can someone help?

Thanks
Agatea

On Wed, Jul 31, 2013 at 9:48 AM, agateaaa agate...@gmail.com wrote:

Thanks Nitin

There aren't too many connections in CLOSE_WAIT state, only one or two when we run into this; most likely it's because of a dropped connection. I could not find any read or write timeouts we can set for the thrift server that would tell thrift to hold on to the client connection. See https://issues.apache.org/jira/browse/HIVE-2006, but it doesn't seem to have been implemented yet. We do have a client connection timeout set, but cannot find an equivalent setting for the server.

We have a suspicion that this happens when we run two client processes which modify two distinct partitions of the same hive table. We put in a workaround so that the two hive client processes never run together, and so far things look ok, but we will keep monitoring.

Could it be that the hive metastore server is not thread safe? Would running two alter table statements on two distinct partitions of the same table, over two client connections, cause problems like these, where the hive metastore server closes or drops the wrong client connection and leaves the other hanging?

Agateaaa

On Tue, Jul 30,
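The weak-reference pattern from question 2 can be sketched in isolation with stdlib types. This is a toy model, not Thrift's actual classes: `Transport` and `UgiTransport` below are hypothetical stand-ins for `TTransport` and `TUGIContainingTransport`. The key point is that the caller builds the wrapper and holds a strong reference to it for the whole call, so GC cannot clear it between the put and the return, unlike the original putIfAbsent-then-get on a weak-valued map:

```java
import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical stand-ins for Thrift's TTransport and TUGIContainingTransport.
class Transport {}

class UgiTransport extends Transport {
    final Transport wrapped;
    UgiTransport(Transport wrapped) { this.wrapped = wrapped; }
}

public class TransportCache {
    // Weak keys: an entry disappears once the underlying transport is no longer
    // strongly reachable elsewhere. Values are WeakReferences because a wrapper
    // holds its key strongly; a strongly-held value would pin the WeakHashMap
    // entry forever (the rationale behind THRIFT-1468).
    private static final Map<Transport, WeakReference<UgiTransport>> transMap =
            Collections.synchronizedMap(new WeakHashMap<>());

    public static UgiTransport getTransport(Transport trans) {
        WeakReference<UgiTransport> ref = transMap.get(trans);
        UgiTransport cached = (ref == null) ? null : ref.get();
        if (cached == null) {
            // "cached" is a strong reference for the rest of this call, so the
            // collector cannot clear the new entry before we return it.
            cached = new UgiTransport(trans);
            transMap.put(trans, new WeakReference<>(cached));
        }
        return cached;
    }

    public static void main(String[] args) {
        Transport t = new Transport();
        UgiTransport a = getTransport(t);
        System.out.println(a == getTransport(t)); // same wrapper while t is reachable
        System.out.println(a.wrapped == t);
    }
}
```

Note the remaining race: two threads may each build a wrapper for the same transport and one overwrites the other, which is harmless here for the same reason the email gives for skipping putIfAbsent: concurrent callers pass in different transports.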