[
https://issues.apache.org/jira/browse/HIVE-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801584#comment-13801584
]
Navis commented on HIVE-5172:
-----------------------------
As mentioned above by Ashutosh, current patch is not complete fix for this
problem. So I've made separate issue and committed to trunk for a temporal
mending.
> TUGIContainingTransport returning null transport, causing intermittent
> SocketTimeoutException on hive client and NullPointerException in
> TUGIBasedProcessor on the server
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-5172
> URL: https://issues.apache.org/jira/browse/HIVE-5172
> Project: Hive
> Issue Type: Bug
> Components: Metastore
> Affects Versions: 0.9.0, 0.10.0, 0.11.0
> Reporter: agate
> Attachments: HIVE-5172.1.patch.txt
>
>
> We are running into frequent problem using HCatalog 0.4.1 (Hive Metastore
> Server 0.9) where we get connection reset or connection timeout errors on the
> client and NullPointerException in TUGIBasedProcessor on the server.
> {code}
> hive client logs:
> =================
> org.apache.thrift.transport.TTransportException:
> java.net.SocketTimeoutException: Read timed out
> at
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
> at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
> at
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136)
> at
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:157)
> at
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092)
> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102)
> at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888)
> at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:830)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:954)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7524)
> at
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> Caused by: java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
> ... 31 more
> {code}
> {code}
> hive metastore server logs:
> ===============================
> 2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer
> (TThreadPoolServer.java:run(182)) - Error occurred during processing of
> message.
> java.lang.NullPointerException
> at
> org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183)
> at
> org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79)
> at
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> {code}
> Adding some extra debug log messages in TUGIBasedProcessor, noticed that the
> TUGIContainingTransport is null which results in NullPointerException on the
> server.
> Further drilling into TUGIContainingTransport noticed that getTransport()
> returns a null which causes the above
> error.
> Further corelating with GC logs observed that that error always hits when the
> CMS GC has just kicked in
> (but does not happen after every GC)
> Put some debugging code in TUGIContainingTransport.getTransport() and I
> tracked it down to
> {code}
> @Override
> public TUGIContainingTransport getTransport(TTransport trans) {
> // UGI information is not available at connection setup time, it will
> be set later
> // via set_ugi() rpc.
> transMap.putIfAbsent(trans, new TUGIContainingTransport(trans));
> //return transMap.get(trans); //<-change
> TUGIContainingTransport retTrans = transMap.get(trans);
> if ( retTrans == null ) {
> LOGGER.error ("cannot find transport that was in map !!")
> return retTrans;
> } else {
> LOGGER.debug ("found transport in map")
> return retTrans;
> }
> }
> {code}
> When we run this in our test environment, see that we run into the problem
> just after GC runs,
> and "cannot find transport that was in the map!!" message gets logged.
> Could the GC be collecting entries from transMap, just before the we get it
> Tried a minor change which seems to work
> {code}
> public TUGIContainingTransport getTransport(TTransport trans) {
> TUGIContainingTransport tugiTrans = transMap.get(trans);
> if ( tugiTrans == null ) {
> // UGI information is not available at connection setup time, it will
> be set later
> // via set_ugi() rpc.
> tugiTrans = new TUGIContainingTransport(trans); //get a new
> TUGIContainingTransport
> transMap.putIfAbsent(trans, tugiTrans);
> }
> return tugiTrans;
> }
> {code}
> My questions for hive and thrift experts
> 1.) Do we need to use a ConcurrentMap
> {code}
> ConcurrentMap<TTransport, TUGIContainingTransport> transMap = new
> MapMaker().weakKeys().weakValues().makeMap();
> {code}
> It does use == to compare keys (which might be the problem), also in this
> case we cant rely on the trans to be always there in the transMap, even after
> a put, so in that case change above probably makes sense
> 2.) Is it better idea to use WeakHashMap with WeakReference value instead ?
> (was looking at org.apache.thrift.transport.TSaslServerTransport, especially
> change made by THRIFT-1468 as an example)
> e.g.
> {code}
> private static Map<TTransport, WeakReference<TUGIContainingTransport>>
> transMap = Collections.synchronizedMap(new WeakHashMap<TTransport,
> WeakReference<TUGIContainingTransport>>());
> {code}
> getTransport() would be something like
> {code}
> public TUGIContainingTransport getTransport(TTransport trans) {
> WeakReference<TUGIContainingTransport> ret = transMap.get(trans);
> if (ret == null || ret.get() == null) {
> ret = new WeakReference<TUGIContainingTransport>(new
> TUGIContainingTransport(trans));
> transMap.put(trans, ret); // No need for putIfAbsent().
> // Concurrent calls to getTransport() will pass in different
> TTransports.
> }
> return ret.get();
> }
> {code}
> 3.) Is it absolutely essential that successive calls to get getTransport()
> need to retrieve the same TUGIContainingTransport object or its ok to create
> a new one, if its not available in the map ?
--
This message was sent by Atlassian JIRA
(v6.1#6144)