Hi All: I put some debugging code in TUGIContainingTransport.getTransport() and tracked it down to this:
@Override
public TUGIContainingTransport getTransport(TTransport trans) {

  // UGI information is not available at connection setup time, it will be
  // set later via the set_ugi() rpc.
  transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));

  // return transMap.get(trans);  <- change
  TUGIContainingTransport retTrans = transMap.get(trans);

  if ( retTrans == null ) { }

On Wed, Jul 31, 2013 at 9:48 AM, agateaaa <agate...@gmail.com> wrote:

> Thanks Nitin
>
> There aren't too many connections in CLOSE_WAIT state, only one or two
> when we run into this. Most likely it's because of a dropped connection.
>
> I could not find any read or write timeouts we can set for the Thrift
> server which would tell Thrift to hold on to the client connection.
> See https://issues.apache.org/jira/browse/HIVE-2006, but it doesn't seem
> to have been implemented yet. We do have a client connection timeout set,
> but cannot find an equivalent setting for the server.
>
> We have a suspicion that this happens when we run two client processes
> which modify two distinct partitions of the same Hive table. We put in a
> workaround so that the two Hive client processes never run together, and
> so far things look OK, but we will keep monitoring.
>
> Could it be that the Hive metastore server is not thread safe? Would
> running two ALTER TABLE statements on two distinct partitions of the same
> table using two client connections cause problems like these, where the
> Hive metastore server closes or drops the wrong client connection and
> leaves the other hanging?
>
> Agateaaa
>
>
> On Tue, Jul 30, 2013 at 12:49 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>
>> The mentioned flow is called when you have the unsecure mode of Thrift
>> metastore client-server connection, so one way to avoid this is to use
>> a secure connection.
>>
>> <code>
>> public boolean process(final TProtocol in, final TProtocol out) throws TException {
>>   setIpAddress(in);
>>   ...
>>   ...
>>   ...
>>
>> @Override
>> protected void setIpAddress(final TProtocol in) {
>>   TUGIContainingTransport ugiTrans = (TUGIContainingTransport) in.getTransport();
>>   Socket socket = ugiTrans.getSocket();
>>   if (socket != null) {
>>     setIpAddress(socket);
>>
>> </code>
>>
>> From the above code snippet, it looks like the null pointer exception
>> is not handled if getSocket() returns null.
>>
>> Can you check what the ulimit setting on the server is? If it's set to
>> the default, can you set it to unlimited and restart the hcat server?
>> (This is just a wild guess.)
>>
>> Also, the getSocket() documentation says: "If the underlying TTransport
>> is an instance of TSocket, it returns the Socket object which it
>> contains. Otherwise it returns null."
>>
>> So someone from the Thrift gurus needs to tell us what's happening; I
>> have no knowledge of this depth. Maybe Ashutosh or Thejas will be able
>> to help on this.
>>
>> From the netstat CLOSE_WAIT output, it looks like the hive metastore
>> server has not closed the connection (don't know why yet); maybe the
>> hive dev guys can help. Are there too many connections in CLOSE_WAIT
>> state?
>>
>>
>> On Tue, Jul 30, 2013 at 5:52 AM, agateaaa <agate...@gmail.com> wrote:
>>
>> > Looking at the hive metastore server logs, we see errors like these:
>> >
>> > 2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer
>> > (TThreadPoolServer.java:run(182)) - Error occurred during processing of
>> > message.
>> > java.lang.NullPointerException
>> >     at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
>> >     at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
>> >     at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
>> >     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >     at java.lang.Thread.run(Thread.java:662)
>> >
>> > at approximately the same time as we see the timeout or connection
>> > reset errors.
>> >
>> > Don't know if this is the cause or a side effect of the connection
>> > timeout/connection reset errors. Does anybody have any pointers or
>> > suggestions?
>> >
>> > Thanks
>> >
>> >
>> > On Mon, Jul 29, 2013 at 11:29 AM, agateaaa <agate...@gmail.com> wrote:
>> >
>> > > Thanks Nitin!
>> > >
>> > > We have a similar setup (identical HCatalog and Hive server versions)
>> > > on another production environment and don't see any errors (it's been
>> > > running OK for a few months).
>> > >
>> > > Unfortunately we won't be able to move to HCat 0.5 and Hive 0.11 or
>> > > Hive 0.10 soon.
>> > >
>> > > I did see that the last time we ran into this problem, a
>> > > netstat -ntp | grep ":10000" showed that the server was holding on to
>> > > one socket connection in CLOSE_WAIT state for a long time (the hive
>> > > metastore server is running on port 10000). Don't know if that's
>> > > relevant here or not.
>> > >
>> > > Can you suggest any Hive configuration settings we can tweak, or
>> > > networking tools/tips we can use to narrow this down?
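The NullPointerException above lands in TUGIBasedProcessor.setIpAddress(), which is consistent with Nitin's reading of the snippet quoted earlier: getSocket() can return null and the result is dereferenced unchecked. A minimal, self-contained sketch of the defensive check (DemoTransport and IpAddressSetter are illustrative stand-ins, not Hive classes):

```java
import java.net.Socket;

// Illustrative stand-in, not a Hive class: DemoTransport mirrors
// TUGIContainingTransport.getSocket(), which returns null unless the
// underlying transport is a TSocket.
class DemoTransport {
    private final Socket socket;
    DemoTransport(Socket socket) { this.socket = socket; }
    Socket getSocket() { return socket; }
}

class IpAddressSetter {
    String lastIp;  // last recorded client address, null if none

    // Defensive version of the setIpAddress flow: skip, rather than
    // dereference null, when getSocket() returns null.
    void setIpAddress(DemoTransport trans) {
        Socket socket = trans.getSocket();
        if (socket == null) {
            return;  // non-TSocket transport, or a torn-down connection
        }
        lastIp = socket.getInetAddress().getHostAddress();
    }
}
```

With a guard like this the worker thread would log no client IP instead of dying mid-request, though it would not by itself explain why the socket disappeared.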
>> > >
>> > > Thanks
>> > > Agateaaa
>> > >
>> > >
>> > > On Mon, Jul 29, 2013 at 11:02 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>> > >
>> > >> Is there any chance you can do an update on a test environment with
>> > >> hcat-0.5 and hive-0.11 (or 0.10) and see if you can reproduce the
>> > >> issue?
>> > >>
>> > >> We used to see this error when there was load on the hcat server, or
>> > >> some network issue connecting to the server (the second one was a
>> > >> rare occurrence).
>> > >>
>> > >>
>> > >> On Mon, Jul 29, 2013 at 11:13 PM, agateaaa <agate...@gmail.com> wrote:
>> > >>
>> > >>> Hi All:
>> > >>>
>> > >>> We are running into a frequent problem using HCatalog 0.4.1 (Hive
>> > >>> Metastore Server 0.9) where we get connection reset or connection
>> > >>> timeout errors.
>> > >>>
>> > >>> The hive metastore server has been allocated enough (12G) memory.
>> > >>>
>> > >>> This is a critical problem for us and we would appreciate it if
>> > >>> anyone has any pointers.
>> > >>>
>> > >>> We did add retry logic in our client, which seems to help, but I am
>> > >>> just wondering how we can narrow down the root cause of this
>> > >>> problem. Could this be a hiccup in networking which causes the hive
>> > >>> server to get into an unresponsive state?
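The client-side retry mentioned above can be sketched roughly as follows. This is a hypothetical helper for illustration: MetastoreCall, RetryUtil, and the backoff parameters are invented names, not part of the Hive/HCatalog API.

```java
// Hypothetical client-side retry wrapper of the kind described in the
// thread; not part of the Hive/HCatalog API.
interface MetastoreCall<T> {
    T run() throws Exception;
}

class RetryUtil {
    // Retries the call up to maxAttempts times, sleeping between
    // attempts with a simple linear backoff. Rethrows the last failure
    // if every attempt fails.
    static <T> T withRetries(MetastoreCall<T> call, int maxAttempts, long backoffMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.run();
            } catch (Exception e) {  // e.g. TTransportException: Connection reset
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs * attempt);  // linear backoff
                }
            }
        }
        throw last;
    }
}
```

A wrapper like this masks transient resets but, as the poster notes, does not narrow down the root cause; it is a mitigation, not a fix.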
>> > >>>
>> > >>> Thanks
>> > >>>
>> > >>> Agateaaa
>> > >>>
>> > >>>
>> > >>> Example Connection reset error:
>> > >>> =======================
>> > >>>
>> > >>> org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
>> > >>>     at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
>> > >>>     at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
>> > >>>     at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
>> > >>>     at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
>> > >>>     at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
>> > >>>     at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
>> > >>>     at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
>> > >>>     at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
>> > >>>     at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
>> > >>>     at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
>> > >>>     at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:157)
>> > >>>     at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
>> > >>>     at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
>> > >>>     at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
>> > >>>     at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeAlterTableAddParts(DDLSemanticAnalyzer.java:1817)
>> > >>>     at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:297)
>> > >>>     at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
>> > >>>     at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
>> > >>>     at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
>> > >>>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
>> > >>>     at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
>> > >>>     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
>> > >>>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
>> > >>>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
>> > >>>     at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
>> > >>>     at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
>> > >>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > >>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> > >>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> > >>>     at java.lang.reflect.Method.invoke(Method.java:597)
>> > >>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>> > >>> Caused by: java.net.SocketException: Connection reset
>> > >>>     at java.net.SocketInputStream.read(SocketInputStream.java:168)
>> > >>>     at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
>> > >>>     ... 30 more
>> > >>>
>> > >>>
>> > >>> Example Connection timeout error:
>> > >>> ==========================
>> > >>>
>> > >>> org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
>> > >>>     at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
>> > >>>     at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
>> > >>>     at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
>> > >>>     at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
>> > >>>     at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
>> > >>>     at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
>> > >>>     at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
>> > >>>     at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
>> > >>>     at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
>> > >>>     at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
>> > >>>     at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:157)
>> > >>>     at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
>> > >>>     at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
>> > >>>     at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
>> > >>>     at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:830)
>> > >>>     at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:954)
>> > >>>     at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7524)
>> > >>>     at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
>> > >>>     at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
>> > >>>     at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
>> > >>>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
>> > >>>     at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
>> > >>>     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
>> > >>>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
>> > >>>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
>> > >>>     at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
>> > >>>     at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
>> > >>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > >>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> > >>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> > >>>     at java.lang.reflect.Method.invoke(Method.java:597)
>> > >>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>> > >>> Caused by: java.net.SocketTimeoutException: Read timed out
>> > >>>     at java.net.SocketInputStream.socketRead0(Native Method)
>> > >>>     at java.net.SocketInputStream.read(SocketInputStream.java:129)
>> > >>>     at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
>> > >>>     ... 31 more
>> > >>
>> > >>
>> > >>
>> > >> --
>> > >> Nitin Pawar
>> > >
>> > >
>> >
>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>
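Following up on the getTransport() debugging note at the top of the thread: calling putIfAbsent() and then a separate get() leaves a window in which the entry can vanish between the two calls (for example through concurrent removal, or weakly referenced entries being collected), so get() can come back null. A generic sketch of an atomic get-or-create that never observes null, assuming a ConcurrentHashMap-backed cache like transMap (TransportCache is an illustrative name, not a Hive class):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Generic sketch, assuming a ConcurrentHashMap-backed cache like the
// transMap in the snippet at the top of the thread.
class TransportCache<K, V> {
    private final ConcurrentMap<K, V> map = new ConcurrentHashMap<>();

    // putIfAbsent() atomically returns the previous value (or null if
    // there was none), so the caller always gets a non-null result
    // without a second get() that could race with a removal.
    V getOrCreate(K key, V fresh) {
        V prev = map.putIfAbsent(key, fresh);
        return prev != null ? prev : fresh;
    }
}
```

Using putIfAbsent()'s return value directly, instead of a follow-up get(), removes the null-return window the debugging code at the top of the thread appears to have caught.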