Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors
Uploaded a patch for HIVE-5172. Can someone please review? Do I have to be a contributor before I submit a patch? I ran tests with the patch in our test environment (about 1000 Pig jobs reading from and inserting into a Hive table via HCatalog, along with an equal number of alter table statements) over the past 4 days and haven't seen any errors on the client or the server.

On Thu, Aug 29, 2013 at 2:39 PM, agateaaa agate...@gmail.com wrote:

Thanks Ashutosh. Filed https://issues.apache.org/jira/browse/HIVE-5172

On Thu, Aug 29, 2013 at 11:53 AM, Ashutosh Chauhan hashut...@apache.org wrote:

Thanks Agatea for digging in. Seems like you have hit a bug. Would you mind opening a jira and adding your findings to it? Thanks, Ashutosh

On Thu, Aug 29, 2013 at 11:22 AM, agateaaa agate...@gmail.com wrote:

Sorry, hit send too soon ...

Hi All: I put some debugging code in TUGIContainingTransport.getTransport() and tracked it down to:

    @Override
    public TUGIContainingTransport getTransport(TTransport trans) {
      // UGI information is not available at connection setup time,
      // it will be set later via the set_ugi() rpc.
      transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));
      // return transMap.get(trans); // <- change
      TUGIContainingTransport retTrans = transMap.get(trans);
      if (retTrans == null) {
        LOGGER.error("cannot find transport that was in map!");
      } else {
        LOGGER.debug("found transport in map");
      }
      return retTrans;
    }

When we run this in our test environment, we see that we run into the problem just after GC runs, and the "cannot find transport that was in map!" message gets logged. Could the GC be collecting entries from transMap just before we get them?

I tried a minor change which seems to work:

    public TUGIContainingTransport getTransport(TTransport trans) {
      TUGIContainingTransport retTrans = transMap.get(trans);
      if (retTrans == null) {
        // UGI information is not available at connection setup time,
        // it will be set later via the set_ugi() rpc.
        retTrans = new TUGIContainingTransport(trans);
        transMap.put(trans, retTrans);
      }
      return retTrans;
    }

My questions for the Hive and Thrift experts:

1.) Do we need to use a ConcurrentMap?

    ConcurrentMap<TTransport, TUGIContainingTransport> transMap =
        new MapMaker().weakKeys().weakValues().makeMap();

It uses == to compare keys (which might be the problem), and in this case we can't rely on trans always being in the transMap, even right after a put, so the change above probably makes sense.

2.) Is it a better idea to use a WeakHashMap with WeakReference values instead? (I was looking at org.apache.thrift.transport.TSaslServerTransport, esp. the change made by THRIFT-1468.) E.g.:

    private static Map<TTransport, WeakReference<TUGIContainingTransport>> transMap3 =
        Collections.synchronizedMap(
            new WeakHashMap<TTransport, WeakReference<TUGIContainingTransport>>());

getTransport() would then be something like:

    public TUGIContainingTransport getTransport(TTransport trans) {
      WeakReference<TUGIContainingTransport> ret = transMap3.get(trans);
      if (ret == null || ret.get() == null) {
        ret = new WeakReference<TUGIContainingTransport>(new TUGIContainingTransport(trans));
        transMap3.put(trans, ret); // No need for putIfAbsent():
        // concurrent calls to getTransport() will pass in different TTransports.
      }
      return ret.get();
    }

I did try 1.) above in our test environment and it does seem to resolve the problem, though I am not sure if I am introducing any other problems. Can someone help? Thanks, Agatea
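The race discussed above can be avoided without re-reading the map at all. The sketch below uses plain JDK types (`Transport` and `UgiTransport` are hypothetical stand-ins for Thrift's TTransport and TUGIContainingTransport, and a strong-reference ConcurrentHashMap stands in for the Guava weak-valued map), to show the get-or-create shape where the locally created strong reference is returned, so the caller can never observe null even if a GC clears map entries concurrently:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-ins for Thrift's TTransport / TUGIContainingTransport,
// only so the get-or-create pattern is runnable here.
class Transport {}

class UgiTransport extends Transport {
    final Transport wrapped;
    UgiTransport(Transport wrapped) { this.wrapped = wrapped; }
}

class TransportCache {
    // Strong-reference map for the demo; the real code uses a weak-valued map,
    // which is exactly why a second get() could observe null after a GC.
    private final ConcurrentMap<Transport, UgiTransport> transMap = new ConcurrentHashMap<>();

    // Never re-reads the map for the return value: the locally created object
    // (a strong reference) is returned, so the caller cannot see null.
    UgiTransport getTransport(Transport trans) {
        UgiTransport ret = transMap.get(trans);
        if (ret == null) {
            UgiTransport candidate = new UgiTransport(trans);
            UgiTransport prior = transMap.putIfAbsent(trans, candidate);
            ret = (prior != null) ? prior : candidate; // lose the race gracefully
        }
        return ret;
    }

    public static void main(String[] args) {
        TransportCache cache = new TransportCache();
        Transport t = new Transport();
        UgiTransport first = cache.getTransport(t);
        if (first == null || first != cache.getTransport(t)) {
            throw new AssertionError("expected a stable, non-null cached wrapper");
        }
        System.out.println("ok");
    }
}
```

Note the `putIfAbsent` return value is consulted so that two threads racing on the same key agree on one wrapper; with a weak-valued map the entry may still vanish later, but each caller holds a usable strong reference for the duration of its request.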
Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors
Hi All: I put some debugging code in TUGIContainingTransport.getTransport() and tracked it down to:

    @Override
    public TUGIContainingTransport getTransport(TTransport trans) {
      // UGI information is not available at connection setup time,
      // it will be set later via the set_ugi() rpc.
      transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));
      // return transMap.get(trans); // <- change
      TUGIContainingTransport retTrans = transMap.get(trans);
      if (retTrans == null) {
      }

On Wed, Jul 31, 2013 at 9:48 AM, agateaaa agate...@gmail.com wrote:

Thanks Nitin. There aren't too many connections in CLOSE_WAIT state, only one or two when we run into this. Most likely it's because of a dropped connection. I could not find any read or write timeouts we can set for the Thrift server which would tell Thrift to hold on to the client connection. See https://issues.apache.org/jira/browse/HIVE-2006, but it doesn't seem to have been implemented yet. We do have a client connection timeout set, but cannot find an equivalent setting for the server.

We have a suspicion that this happens when we run two client processes which modify two distinct partitions of the same Hive table. We put in a workaround so that the two Hive client processes never run together, and so far things look OK, but we will keep monitoring. Could it be because the Hive metastore server is not thread safe? Would running two alter table statements on two distinct partitions of the same table, using two client connections, cause problems like these, where the Hive metastore server closes or drops the wrong client connection and leaves the other hanging?

Agateaaa

On Tue, Jul 30, 2013 at 12:49 AM, Nitin Pawar nitinpawar...@gmail.com wrote:

The mentioned flow is called when you have an insecure mode of Thrift metastore client-server connection, so one way to avoid this is to use a secure connection.

    public boolean process(final TProtocol in, final TProtocol out) throws TException {
      setIpAddress(in);
      ...

    @Override
    protected void setIpAddress(final TProtocol in) {
      TUGIContainingTransport ugiTrans = (TUGIContainingTransport) in.getTransport();
      Socket socket = ugiTrans.getSocket();
      if (socket != null) {
        setIpAddress(socket);

From the above code snippet, it looks like the null pointer exception is not handled if getSocket returns null.

Can you check what the ulimit setting is on the server? If it's set to the default, can you set it to unlimited and restart the hcat server? (This is just a wild guess.)

Also, the getSocket documentation says: if the underlying TTransport is an instance of TSocket, it returns the Socket object which it contains; otherwise it returns null. So someone from the Thrift gurus needs to tell us what's happening; I have no knowledge of this depth. Maybe Ashutosh or Thejas will be able to help.

From the netstat CLOSE_WAIT output, it looks like the Hive metastore server has not closed the connection (don't know why yet); maybe the Hive dev guys can help. Are there too many connections in CLOSE_WAIT state?

On Tue, Jul 30, 2013 at 5:52 AM, agateaaa agate...@gmail.com wrote:

Looking at the Hive metastore server logs, we see errors like these:

    2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer (TThreadPoolServer.java:run(182)) - Error occurred during processing of message.
    java.lang.NullPointerException
        at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
        at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

These appear at approximately the same time as the timeout or connection reset errors. Don't know if this is the cause or a side effect of the connection timeout/connection reset errors. Does anybody have any pointers or suggestions? Thanks

On Mon, Jul 29, 2013 at 11:29 AM, agateaaa agate...@gmail.com wrote:

Thanks Nitin! We have a similar setup (identical HCatalog and Hive server versions) on another production environment and don't see any errors (it's been running OK for a few months). Unfortunately we won't be able to move to hcat 0.5 and hive 0.11 or hive 0.10 soon.

I did see that the last time we ran into this problem, doing a netstat -ntp | grep :1 showed that the server was holding on to one socket connection in CLOSE_WAIT state for a long time (the hive metastore server is running on port 1). Don't know if that's relevant here or not. Can you suggest any Hive configuration settings we can tweak, or networking tools/tips we can use to narrow this down? Thanks, Agateaaa
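Nitin's observation above, that getSocket() returns null whenever the underlying transport is not a TSocket, points at a simple defensive fix for the NullPointerException in setIpAddress. The sketch below is a minimal illustration with a hypothetical stand-in class (`UgiTrans` is not the real TUGIContainingTransport, and `clientAddress` is an invented helper), showing the null guard rather than Hive's actual patch:

```java
import java.net.Socket;

// Hypothetical stand-in for TUGIContainingTransport: getSocket() returns null
// when the underlying transport is not a TSocket, mirroring the Thrift javadoc.
class UgiTrans {
    private final Socket socket;
    UgiTrans(Socket socket) { this.socket = socket; }
    Socket getSocket() { return socket; }
}

class IpAddressGuard {
    // Guarded version of the setIpAddress() path: skip the bookkeeping
    // instead of dereferencing a null Socket and killing the worker thread.
    static String clientAddress(UgiTrans trans) {
        Socket socket = trans.getSocket();
        if (socket == null || socket.getInetAddress() == null) {
            return "unknown"; // not a TSocket (or not connected); nothing to record
        }
        return socket.getInetAddress().getHostAddress();
    }

    public static void main(String[] args) {
        // A transport without an underlying TSocket no longer throws an NPE.
        System.out.println(clientAddress(new UgiTrans(null))); // prints "unknown"
    }
}
```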
Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors
Is there any chance you can do an update on a test environment with hcat-0.5 and hive-0.11 (or 0.10) and see if you can reproduce the issue? We used to see this error when there was load on the hcat server, or some network issue connecting to the server (the second one was a rare occurrence).

On Mon, Jul 29, 2013 at 11:13 PM, agateaaa agate...@gmail.com wrote:

Hi All: We are running into a frequent problem using HCatalog 0.4.1 (Hive Metastore Server 0.9) where we get connection reset or connection timeout errors. The Hive metastore server has been allocated enough (12G) memory. This is a critical problem for us and we would appreciate it if anyone has any pointers. We did add retry logic to our client, which seems to help, but I am just wondering how we can narrow down to the root cause of this problem. Could this be a hiccup in networking which causes the Hive server to get into an unresponsive state? Thanks, Agateaaa

Example connection reset error:

    org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:157)
        at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
        at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
        at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
        at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeAlterTableAddParts(DDLSemanticAnalyzer.java:1817)
        at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:297)
        at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
    Caused by: java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:168)
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
        ... 30 more

Example connection timeout error:

    org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:157)
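The "retry logic" mentioned in the report above is not shown in the thread; a generic sketch of what such a client-side wrapper might look like follows (the `withRetries` helper, attempt counts, and backoff values are all illustrative assumptions, not Hive code; a real client would catch TTransportException rather than a broad Exception):

```java
import java.util.concurrent.Callable;

class RetryingCall {
    // Retries a call on failure with exponential backoff: baseDelayMs,
    // then 2x, 4x, ... between attempts; rethrows the last failure.
    static <T> T withRetries(Callable<T> call, int maxAttempts, long baseDelayMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) { // real code: catch TTransportException
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(baseDelayMs << (attempt - 1));
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulate a metastore call that fails twice, then succeeds.
        int[] calls = {0};
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new RuntimeException("Connection reset");
            return "connected";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "connected after 3 attempts"
    }
}
```

Retries mask the symptom rather than fix the root cause, which is consistent with the report that they "seem to help" while the underlying transport bug remained.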
Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors
Thanks Nitin! We have simiar setup (identical hcatalog and hive server versions) on a another production environment and dont see any errors (its been running ok for a few months) Unfortunately we wont be able to move to hcat 0.5 and hive 0.11 or hive 0.10 soon. I did see that the last time we ran into this problem doing a netstat-ntp | grep :1 see that server was holding on to one socket connection in CLOSE_WAIT state for a long time (hive metastore server is running on port 1). Dont know if thats relevant here or not Can you suggest any hive configuration settings we can tweak or networking tools/tips, we can use to narrow this down ? Thanks Agateaaa On Mon, Jul 29, 2013 at 11:02 AM, Nitin Pawar nitinpawar...@gmail.comwrote: Is there any chance you can do a update on test environment with hcat-0.5 and hive-0(11 or 10) and see if you can reproduce the issue? We used to see this error when there was load on hcat server or some network issue connecting to the server(second one was rare occurrence) On Mon, Jul 29, 2013 at 11:13 PM, agateaaa agate...@gmail.com wrote: Hi All: We are running into frequent problem using HCatalog 0.4.1 (HIve Metastore Server 0.9) where we get connection reset or connection timeout errors. The hive metastore server has been allocated enough (12G) memory. This is a critical problem for us and would appreciate if anyone has any pointers. We did add a retry logic in our client, which seems to help, but I am just wondering how can we narrow down to the root cause of this problem. Could this be a hiccup in networking which causes the hive server to get into a unresponsive state ? 
Thanks
Agateaaa

Example Connection reset error:
===
org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:157)
    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
    at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeAlterTableAddParts(DDLSemanticAnalyzer.java:1817)
    at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:297)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:168)
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
    ... 30 more

Example Connection timeout error:
==
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
    at
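As an aside, the client-side retry mentioned above ("We did add retry logic in our client") is roughly the following shape. This is a hypothetical sketch, not the actual client code; the class name, attempt count, and backoff values are all made up for illustration.

```java
import java.util.concurrent.Callable;

// Hypothetical retry wrapper of the kind described above: retry an
// operation (e.g. a metastore call) a few times with linear backoff when
// it fails with a presumably transient error such as a connection reset
// or a read timeout.
public class MetastoreRetry {
    static <T> T withRetries(Callable<T> op, int maxAttempts, long backoffMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e; // remember the failure and retry
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs * attempt);
                }
            }
        }
        throw last; // all attempts failed; surface the final error
    }
}
```

Retries mask the symptom rather than fix the root cause, which is why narrowing down the server-side failure still matters.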
Re: Hive Metastore Server 0.9 Connection Reset and Connection Timeout errors
Looking at the hive metastore server logs, we see errors like these:

2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer (TThreadPoolServer.java:run(182)) - Error occurred during processing of message.
java.lang.NullPointerException
    at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
    at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

at approximately the same time as we see the timeout or connection reset errors. I don't know if this is the cause or a side effect of the connection timeout/connection reset errors. Does anybody have any pointers or suggestions?

Thanks
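For anyone skimming the thread: the NullPointerException in TUGIBasedProcessor.setIpAddress is consistent with the TUGIContainingTransport.getTransport() race described elsewhere in this thread, where transMap.get(trans) can return null right after a putIfAbsent(). Below is a simplified, self-contained sketch of the buggy and fixed shapes. The Wrapper type and the map here are stand-ins, not the real Thrift/Hive classes, and a plain ConcurrentHashMap is used only to keep the sketch compilable; in the real code the map apparently holds its entries weakly, which is what would let GC evict one between the putIfAbsent() and the get().

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class TransportCache {
    // Stand-in for TUGIContainingTransport.
    static class Wrapper {
        final Object wrapped;
        Wrapper(Object wrapped) { this.wrapped = wrapped; }
    }

    private final ConcurrentMap<Object, Wrapper> transMap =
            new ConcurrentHashMap<>();

    // Buggy shape: ignores putIfAbsent's return value and re-reads the map,
    // leaving a window in which a weakly held entry can disappear, so the
    // second get() can hand a null back to the caller.
    Wrapper getTransportBuggy(Object trans) {
        transMap.putIfAbsent(trans, new Wrapper(trans));
        return transMap.get(trans);
    }

    // Fixed shape (the idea behind the change discussed in this thread):
    // check get() first, and on insert keep a strong local reference to the
    // wrapper so null is never returned.
    Wrapper getTransportFixed(Object trans) {
        Wrapper ret = transMap.get(trans);
        if (ret == null) {
            ret = new Wrapper(trans);
            Wrapper raced = transMap.putIfAbsent(trans, ret);
            if (raced != null) {
                ret = raced; // another thread inserted first; use its wrapper
            }
        }
        return ret;
    }
}
```

This would also explain why the failures cluster just after a GC run, as reported later in the thread: the eviction window only opens when the collector actually clears the weak references.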