Jecarm opened a new issue #1994:
URL: https://github.com/apache/iceberg/issues/1994
### Background
We have started two Hive Metastore (HMS) servers named _Server1_ and _Server2_,
and the Iceberg 0.9.1 configuration is:
* spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
* spark.sql.catalog.hive_prod.type = hive
* spark.sql.catalog.hive_prod.uri = thrift://_Server1_, thrift://_Server2_
* spark.sql.catalog.hive_prod.clients = 2

It is assumed that the client pool has already established two connections,
to _Server1_ and _Server2_.
### Operation and Problem
1. We send a request to HMS through the client pool, using a 'table listing'
or 'namespace listing' operation. In Spark SQL, this is `show tables in
hive_prod.db` or `show namespaces in hive_prod`. It returns the correct
result.
2. Stop HMS _Server1_ and send the same request. The expected result is
the same as in step 1, because _Server2_ is still active, but the actual
result is 'null'.
3. In general, when one HMS stops and the other is still running, the client
should automatically connect to the active one. However, in Iceberg, the
client connection pool does not release failed connections.
### Analysis
1. `ClientPool` is implemented as follows:
```
public <R> R run(Action<R, C, E> action) throws E, InterruptedException {
  C client = get();
  try {
    return action.run(client);
  } catch (Exception exc) {
    if (reconnectExc.isInstance(exc)) {
      try {
        client = reconnect(client);
      } catch (Exception ignored) {
        // if reconnection throws any exception, rethrow the original failure
        throw reconnectExc.cast(exc);
      }
      return action.run(client);
    }
    throw exc;
  } finally {
    release(client);
  }
}
```
In `HiveClientPool`, `reconnectExc` is `TTransportException`, which extends
`TException`. Therefore, when the client throws a `TTransportException`, the
pool reconnects to the HMS.
```
protected HiveMetaStoreClient reconnect(HiveMetaStoreClient client) {
  try {
    client.close();
    client.reconnect();
  } catch (MetaException e) {
    throw new RuntimeMetaException(e, "Failed to reconnect to Hive Metastore");
  }
  return client;
}
```
For any exception other than `TTransportException`, the pool does not
reconnect; it keeps the previously established connection.
However, in the `HiveMetaStoreClient` class, some methods such as
`getAllTables` and `getAllDatabases` throw a `MetaException`, which also
extends `TException`, when an HMS is down.
```
public List<String> getAllTables(String dbname) throws MetaException {
  try {
    return filterHook.filterTableNames(dbname, client.get_all_tables(dbname));
  } catch (Exception e) {
    MetaStoreUtils.logAndThrowMetaException(e);
  }
  return null;
}

/**
 * Catches exceptions that can't be handled and bundles them to MetaException
 *
 * @param e
 * @throws MetaException
 */
static void logAndThrowMetaException(Exception e) throws MetaException {
  String exInfo = "Got exception: " + e.getClass().getName() + " "
      + e.getMessage();
  LOG.error(exInfo, e);
  LOG.error("Converting exception to MetaException");
  throw new MetaException(exInfo);
}
```
The `MetaException` carries the `TTransportException` only as text in its
message, so `ClientPool` does not treat it as a reconnectable failure. The
obsolete connection is not released, and an error message is returned to
the user.
> ERROR - Got exception: org.apache.thrift.transport.TTransportException null
> org.apache.thrift.transport.TTransportException: null
> ...
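
The effect described above can be reproduced in isolation. The following is a minimal sketch, not the Iceberg or Hive code: the nested exception classes are stand-ins that only mirror the inheritance relationship described in this issue, and `toMetaException` and `poolWouldReconnect` are hypothetical helper names that imitate `logAndThrowMetaException` and the `reconnectExc.isInstance(exc)` check.

```java
// Stand-in types (assumptions): they only mirror the class hierarchy
// described in the issue, not the real Thrift/Hive classes.
final class ReconnectCheckDemo {
  static class TException extends Exception {
    TException(String message) { super(message); }
  }

  static class TTransportException extends TException {
    TTransportException(String message) { super(message); }
  }

  static class MetaException extends TException {
    MetaException(String message) { super(message); }
  }

  // Imitates logAndThrowMetaException: only the class name and message
  // survive as text; the original exception is not attached as a cause.
  static MetaException toMetaException(Exception e) {
    return new MetaException(
        "Got exception: " + e.getClass().getName() + " " + e.getMessage());
  }

  // Imitates the type check that ClientPool.run performs before reconnecting.
  static boolean poolWouldReconnect(
      Class<? extends Exception> reconnectExc, Exception exc) {
    return reconnectExc.isInstance(exc);
  }

  public static void main(String[] args) {
    Exception raw = new TTransportException(null);
    Exception converted = toMetaException(raw);

    // The raw transport failure passes the check.
    System.out.println(poolWouldReconnect(TTransportException.class, raw));
    // The converted exception is a MetaException, so the check fails.
    System.out.println(poolWouldReconnect(TTransportException.class, converted));
    // No cause is attached, so walking the cause chain would not help either.
    System.out.println(converted.getCause());
  }
}
```

Since the converted exception has no cause, the only remaining trace of the transport failure is the class name embedded in the message string.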
### Solution
The solution we can think of at present is to add a special check for a
`MetaException` whose error message contains
`org.apache.thrift.transport.TTransportException`. Once such a message is
detected, the current client connection is closed and re-established.
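
A minimal sketch of such a check is shown below. It is an illustration under assumptions, not a patch against Iceberg: the nested exception classes are stand-ins for the Thrift and Hive types, and `isConnectionFailure` is a hypothetical helper that a pool could consult in place of the plain `reconnectExc.isInstance(exc)` test.

```java
// Stand-in types (assumptions) so the sketch compiles without Thrift or Hive.
final class TransportFailureDetector {
  static class TException extends Exception {
    TException(String message) { super(message); }
  }

  static class TTransportException extends TException {
    TTransportException(String message) { super(message); }
  }

  static class MetaException extends TException {
    MetaException(String message) { super(message); }
  }

  // Class name that logAndThrowMetaException embeds in the message text.
  static final String TRANSPORT_EXC_NAME =
      "org.apache.thrift.transport.TTransportException";

  // Hypothetical helper: true for a direct transport failure, or for a
  // MetaException whose message shows it was converted from one.
  static boolean isConnectionFailure(Exception exc) {
    if (exc instanceof TTransportException) {
      return true;
    }
    return exc instanceof MetaException
        && exc.getMessage() != null
        && exc.getMessage().contains(TRANSPORT_EXC_NAME);
  }

  public static void main(String[] args) {
    Exception converted = new MetaException(
        "Got exception: org.apache.thrift.transport.TTransportException null");
    Exception unrelated = new MetaException("Database does not exist");

    System.out.println(isConnectionFailure(new TTransportException(null)));
    System.out.println(isConnectionFailure(converted));
    System.out.println(isConnectionFailure(unrelated));
  }
}
```

Matching on the message text is brittle, because it depends on the exact wording produced by `logAndThrowMetaException`, so this should be read as a workaround rather than a robust fix.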