database connection resilience

2013-07-06 Thread Marcus Sorensen
I've noticed that the CloudStack management server creates persistent
connections to the database and crashes if the database connection is
lost. I haven't looked at the code yet, but I was wondering if anyone
knows what is going on here: is it simply not set up to handle
reconnects gracefully, or is it something else? We have a multi-master
database setup, but CloudStack doesn't take advantage of it because it
doesn't attempt a graceful reconnect. If the particular node it
connected to on startup goes down, it simply crashes.


Re: database connection resilience

2013-07-06 Thread Marcus Sorensen
I see that my db.properties has db.cloud.autoReconnect=true, which
translates to setting autoReconnect in the jdbc driver connection in
utils/src/com/cloud/utils/db/Transaction.java. I also see that if I
manually trigger the issue I get:

2013-07-07 00:42:50,502 ERROR [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) Runtime DB exception
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet successfully received from the server was 1,503 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.
    at sun.reflect.GeneratedConstructorAccessor159.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
    at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1117)
    at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3567)
    at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3456)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3997)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2468)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2629)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2719)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2155)
    at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2318)
    at org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
    at org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
    at com.cloud.utils.db.GenericDaoBase.searchIncludingRemoved(GenericDaoBase.java:409)
    at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
    at com.cloud.utils.db.GenericDaoBase.searchIncludingRemoved(GenericDaoBase.java:350)
    at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
    at com.cloud.utils.db.GenericDaoBase.listIncludingRemovedBy(GenericDaoBase.java:907)
    at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
    at com.cloud.utils.db.GenericDaoBase.listIncludingRemovedBy(GenericDaoBase.java:912)
    at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
    at com.cloud.cluster.dao.ManagementServerHostDaoImpl.getActiveList(ManagementServerHostDaoImpl.java:158)
    at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
    at com.cloud.cluster.ClusterManagerImpl.peerScan(ClusterManagerImpl.java:1057)
    at com.cloud.cluster.ClusterManagerImpl.access$1200(ClusterManagerImpl.java:95)
    at com.cloud.cluster.ClusterManagerImpl$4.run(ClusterManagerImpl.java:789)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:679)
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
    ... 55 more
2013-07-07 00:42:50,505 ERROR [cloud.cluster.ClusterManagerImpl] (Cluster-Heartbeat-1:null) DB communication problem detected, fence it

I only have to restart cloudstack-management so that it can connect to
another member of the load-balanced multi-master database to get things
running again.


Re: database connection resilience

2013-07-06 Thread Wido den Hollander

Hi,

On 07/07/2013 08:45 AM, Marcus Sorensen wrote:
> I see that my db.properties has db.cloud.autoReconnect=true, which
> translates to setting autoReconnect in the jdbc driver connection in
> utils/src/com/cloud/utils/db/Transaction.java. I also see that if I
> manually trigger the issue I get:

Just to confirm, I see the same issues. I haven't looked into this yet, 
but this is also one of the things I want to have fixed.


Maybe create an issue for it?

Wido


Re: database connection resilience

2013-07-07 Thread Marcus Sorensen
Ok. After a cursory look, I've seen that autoReconnect is kind of a
bad option for JDBC. I've also found this, which seems kind of hairy
for what I want to do:

http://dev.mysql.com/doc/refman/5.0/en/connector-j-usagenotes-j2ee-concepts-managing-load-balanced-connections.html

I don't necessarily want to hand off the load-balancing management to
the Java code. I just want CloudStack to automatically reinitialize
the database connection when this 'communications link failure'
occurs, maybe with a db.cloud.connection.retry.count property or
similar.
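
To make the retry idea concrete, here is a minimal sketch, not CloudStack code. It assumes the hypothetical db.cloud.connection.retry.count property suggested above, and it treats any SQLException whose SQLSTATE starts with "08" (the standard connection-exception class; Connector/J reports a communications link failure as 08S01) as retryable. The class and method names (DbRetrySketch, DbWork, withRetry) are made up for illustration.

```java
import java.sql.SQLException;
import java.util.Properties;

class DbRetrySketch {
    /** Hypothetical property, taken from the suggestion in this thread; not an existing setting. */
    static final String RETRY_COUNT_KEY = "db.cloud.connection.retry.count";

    /** A unit of database work. It should obtain a fresh connection on every call. */
    interface DbWork<T> {
        T run() throws SQLException;
    }

    /** Reads the retry count, defaulting to 3 (an arbitrary choice for this sketch). */
    static int retryCount(Properties props) {
        return Integer.parseInt(props.getProperty(RETRY_COUNT_KEY, "3"));
    }

    /** SQLSTATE class "08" means connection exception. */
    static boolean isConnectionFailure(SQLException e) {
        String state = e.getSQLState();
        return state != null && state.startsWith("08");
    }

    /** Runs the work, retrying up to maxRetries times on connection failures only. */
    static <T> T withRetry(DbWork<T> work, int maxRetries, long backoffMillis) throws SQLException {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        SQLException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return work.run();
            } catch (SQLException e) {
                if (!isConnectionFailure(e)) {
                    throw e; // not a link failure: do not retry
                }
                last = e;
                if (attempt < maxRetries) {
                    try {
                        Thread.sleep(backoffMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw e;
                    }
                }
            }
        }
        throw last; // every attempt failed with a connection error
    }
}
```

A retry like this is only safe for work that can be repeated from the start. A statement that fails in the middle of a multi-statement transaction cannot simply be re-run on a new connection.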


Re: database connection resilience

2013-07-07 Thread Marcus Sorensen
Oh, and I should correct myself, it doesn't crash, it seems that the
management server fences itself because it can't talk to the database.


Re: database connection resilience

2013-07-07 Thread Marcus Sorensen
I think there are two separate issues here.

1) The management server uses the database to determine cluster
membership, and if no database connection can be made, the management
server fences itself (shuts down). This is good for a cluster, but
where there's only one management server (no cluster intended), it
seems like a problem. However, it may still be better to shut down; I'm
not sure how the management server will react after a temporary
database outage. Some opinions would be appreciated. My preference
would be for a single management server to pick back up where it left
off rather than dying.

2) There is no support for the MySQL JDBC driver's built-in
load-balancing features. I have a patch that fixes this, but I noticed
a few things that I'd like some feedback on. Namely, the awsapi
database connection doesn't have its own settings; it uses the same
host connection settings as the cloud db and the autoReconnect setting
from the usage database settings. Was this a shortcut, or is there a
reason for it? My current version of the patch keeps the same approach,
but since I'm adding properties to db.properties anyway, we could allow
separate db.awsapi.host and db.awsapi.port settings.
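
For readers unfamiliar with the driver feature in point 2: Connector/J accepts a jdbc:mysql:loadbalance:// URL with a comma-separated host list. The sketch below shows one way such a URL could be assembled from db.properties. It is not the patch discussed here. Allowing a comma-separated list in db.cloud.host is an assumption, and the db.cloud.port, db.cloud.name and db.cloud.url.params keys are modelled on the db.cloud.* naming used in this thread.

```java
import java.util.Properties;

class LoadBalancedUrlSketch {
    /**
     * Builds a JDBC URL from db.properties-style settings.
     * Assumption: db.cloud.host may hold several comma-separated hosts.
     */
    static String buildUrl(Properties props) {
        String hosts = props.getProperty("db.cloud.host", "localhost");
        String port = props.getProperty("db.cloud.port", "3306");
        String name = props.getProperty("db.cloud.name", "cloud");
        String params = props.getProperty("db.cloud.url.params", "");

        // Append the shared port to every host in the list.
        StringBuilder hostList = new StringBuilder();
        for (String h : hosts.split(",")) {
            String host = h.trim();
            if (host.isEmpty()) {
                continue;
            }
            if (hostList.length() > 0) {
                hostList.append(',');
            }
            hostList.append(host).append(':').append(port);
        }

        // Only switch to the load-balancing scheme when more than one host is listed.
        boolean multiHost = hostList.indexOf(",") >= 0;
        String scheme = multiHost ? "jdbc:mysql:loadbalance://" : "jdbc:mysql://";
        String url = scheme + hostList + "/" + name;
        return params.isEmpty() ? url : url + "?" + params;
    }
}
```

With db.cloud.host=db1,db2 this yields jdbc:mysql:loadbalance://db1:3306,db2:3306/cloud, while a single host keeps the plain jdbc:mysql:// form, so existing single-node configurations would be unaffected.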


Re: database connection resilience

2013-07-07 Thread Marcus Sorensen
Looks like there's no "db.usage.url.params", either. Is there a reason
for it, or was it just implemented quickly?


Re: database connection resilience

2013-07-08 Thread Kelven Yang


On 7/7/13 3:36 PM, "Marcus Sorensen"  wrote:

>I think there are two separate issues here.
>
>1) The management server uses the database to determine cluster
>membership, and if no database connection can be made, the management
>server fences itself (shuts down). This is good, but in the case where
>there's only one management server (no cluster intended), it seems
>like an issue. However, it may be better to shut down, I'm not sure
>how the management server will react after a temporary database
>outage. Some opinions would be appreciated, my preference would be
>that a single-management server would just be able to pick back up
>where it left off rather than dying.

In a management server cluster setup with multiple management servers, we
avoid a split-brain situation by actively self-fencing a management server
as soon as it detects an inconsistent view of the cluster.

Because the clustering logic relies heavily on the DB, loss of DB
connectivity is also treated as a fatal event that triggers self-fencing, in
addition to inconsistent-view detection. A multi-master DB setup therefore
only works if the switch between database instances is transparent to
CloudStack. That means automatic database fail-over should be handled
completely at the DB connectivity layer, and CloudStack should not be aware
of it. Most of the current CloudStack logic is built on that assumption. It
may be possible to relax this requirement, but we need to investigate the
impact and test how resilient CloudStack would be to unexpected DB
connectivity exceptions in the middle of various orchestration workflows.
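
To make the self-fence behaviour described above concrete, here is a minimal
sketch in Java. It is not the actual ClusterManagerImpl code; the class name
SelfFenceSketch, the HeartbeatProbe interface and the fence callback are all
hypothetical names chosen for illustration. It only shows the idea that any
DB failure during a heartbeat is treated as fatal.

```java
import java.sql.SQLException;

/** Hypothetical sketch of "DB loss during heartbeat => self-fence". */
public class SelfFenceSketch {

    /** One heartbeat's worth of DB work (assumed: update own row, read peers). */
    public interface HeartbeatProbe {
        void beat() throws SQLException;
    }

    private final HeartbeatProbe probe;
    private final Runnable fenceAction;
    private boolean fenced = false;

    public SelfFenceSketch(HeartbeatProbe probe, Runnable fenceAction) {
        this.probe = probe;
        this.fenceAction = fenceAction;
    }

    /**
     * Runs a single heartbeat. Returns true if the heartbeat succeeded.
     * Any SQLException is treated as fatal: the fence action runs once
     * and all later heartbeats are refused.
     */
    public boolean heartbeatOnce() {
        if (fenced) {
            return false;
        }
        try {
            probe.beat();
            return true;
        } catch (SQLException e) {
            fenced = true;
            fenceAction.run(); // e.g. stop serving requests / shut down
            return false;
        }
    }

    public static void main(String[] args) {
        SelfFenceSketch node = new SelfFenceSketch(
            () -> { throw new SQLException("Communications link failure", "08S01"); },
            () -> System.out.println("self-fencing management server"));
        System.out.println("heartbeat ok: " + node.heartbeatOnce());
    }
}
```

Note that there is no retry here at all, which matches the point that a
failover has to be hidden below this layer to avoid triggering the fence.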

>
>2) There is no support for JDBC's built-in loadbalancing features. I
>have a patch that fixes this, however I noticed a few things that I'd
>like some feedback on. Namely, the awsapi database connection doesn't
>have its own settings, rather it uses the same host connection
>settings as the cloud db and the autoReconnect setting from the usage
>database settings. Was this a shortcut, or is there a reason for it?
>My current version of the patch just keeps the same methodology, but
>it seems that while I'm at adding properties to db.properties we could
>allow true db.awsapi.host and db.awsapi.port.
>
>On Sun, Jul 7, 2013 at 1:02 AM, Marcus Sorensen wrote:
>> Oh, and I should correct myself, it doesn't crash, it seems that the
>> management server fences itself because it can't talk to the database.
>>
>> On Sun, Jul 7, 2013 at 12:59 AM, Marcus Sorensen wrote:
>>> Ok. After a cursory look, I've seen that the autoReconnect is kind of
>>> a bad option for jdbc. I've also found this, which seems kind of hairy
>>> for what I want to do:
>>>
>>> http://dev.mysql.com/doc/refman/5.0/en/connector-j-usagenotes-j2ee-concepts-managing-load-balanced-connections.html
>>>
>>> I don't necessarily want to hand off the loadbalancing management to
>>> the java code, I just want cloudstack to automatically reinitialize
>>> the database connection when this 'communications link failure'
>>> occurs, maybe with a db.cloud.connection.retry.count property or
>>> similar.
>>>
>>> On Sun, Jul 7, 2013 at 12:54 AM, Wido den Hollander wrote:
>>>> Hi,
>>>>
>>>> On 07/07/2013 08:45 AM, Marcus Sorensen wrote:
>>>>>
>>>>> I see that my db.properties has db.cloud.autoReconnect=true, which
>>>>> translates to setting autoReconnect in the jdbc driver connection in
>>>>> utils/src/com/cloud/utils/db/Transaction.java. I also see that if I
>>>>> manually trigger the issue I get:
>>>>>
>>>>
>>>> Just to confirm, I see the same issues. I haven't looked into this
>>>> yet, but this is also one of the things I want to have fixed.
>>>>
>>>> Maybe create an issue for it?
>>>>
>>>> Wido
>>>>
>>>>> 013-07-07 00:42:50,502 ERROR [cloud.cluster.ClusterManagerImpl]
>>>>> (Cluster-Heartbeat-1:null) Runtime DB exception
>>>>> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
>>>>> Communications link failure
>>>>>
>>>>> The last packet successfully received from the server was 1,503
>>>>> milliseconds ago.  The last packet sent successfully to the server was
>>>>> 0 milliseconds ago.
>>>>> at sun.reflect.GeneratedConstructorAccessor159.newInstance(Unknown Source)
>>>>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
>>>>> at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
>>>>> at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1117)
>>>>> at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3567)
>>>>> at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3456)
>>>>> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3997)
>>>>> at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2468)
>>>>> at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2629)
>>>>> at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImp

Re: database connection resilience

2013-07-08 Thread Kelven Yang


On 7/7/13 9:08 PM, "Marcus Sorensen"  wrote:

>Looks like there's no "db.usage.url.params", either. Is there a reason
>for it, or was it just implemented quickly?
>
>On Sun, Jul 7, 2013 at 4:36 PM, Marcus Sorensen 
>wrote:
>> I think there are two separate issues here.
>>
>> 1) The management server uses the database to determine cluster
>> membership, and if no database connection can be made, the management
>> server fences itself (shuts down). This is good, but in the case where
>> there's only one management server (no cluster intended), it seems
>> like an issue. However, it may be better to shut down, I'm not sure
>> how the management server will react after a temporary database
>> outage. Some opinions would be appreciated, my preference would be
>> that a single-management server would just be able to pick back up
>> where it left off rather than dying.

Since a temporary database outage may occur at any time, the logic needed to
make CloudStack orchestration flows resilient to such an outage would be the
same for a single management server as for a multi-management-server setup.
This is the impact that we haven't investigated or tested.

I also don't think the code was written with DB resilience in mind. The
assumption so far has been to let the DB connectivity layer handle it
transparently for CloudStack. If that assumption can't hold due to lack of
support from database vendors, now is a good time to consider and estimate
the effort to add support in CloudStack. We've seen similar requests from
customers recently.



Re: database connection resilience

2013-07-08 Thread Marcus Sorensen
Yes, I have a sample set up. It took some looking through the code,
but for cloud db one can set up multi-database at the JDBC driver
level with no modifications. For usage and awsapi databases, this
isn't currently possible because there are some parameters missing for
db.properties, the aforementioned ones.

For example, to do multi-master database with failover, one simply has
to provide a comma-delimited list of hosts in the db.cloud.host parameter
and append "&failOverReadOnly=false&connectTimeout=200" to
db.cloud.url.params. Doing this allowed me to stop the mysql service on
any one of my database masters and cloudstack kept on running, switching
connections between them as needed.

I didn't find any cloudstack documentation on this, so I think it's
just taking advantage of how the jdbc connection string is put
together in the code, rather than intentional support.
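
As a rough illustration of why the comma-delimited host list works, here is a
sketch of how such a JDBC URL could be assembled from db.properties values.
This is not the code in Transaction.java; the class and method names are made
up, the db.cloud.port and db.cloud.name keys and their defaults are
assumptions, and the host names are placeholders.

```java
import java.util.Properties;

/** Hypothetical sketch: assembling a MySQL JDBC URL from db.properties values. */
public class JdbcUrlSketch {

    public static String buildCloudUrl(Properties p) {
        // Property keys other than db.cloud.host, db.cloud.autoReconnect and
        // db.cloud.url.params are assumptions made for this example.
        String host = p.getProperty("db.cloud.host", "localhost");
        String port = p.getProperty("db.cloud.port", "3306");
        String name = p.getProperty("db.cloud.name", "cloud");
        String autoReconnect = p.getProperty("db.cloud.autoReconnect", "true");
        String params = p.getProperty("db.cloud.url.params", "");

        StringBuilder url = new StringBuilder("jdbc:mysql://")
            .append(host).append(':').append(port)
            .append('/').append(name)
            .append("?autoReconnect=").append(autoReconnect);
        if (!params.isEmpty()) {
            if (!params.startsWith("&")) {
                url.append('&'); // tolerate params given without a leading '&'
            }
            url.append(params);
        }
        return url.toString();
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        // A comma-delimited host list passes straight through into the URL,
        // where the MySQL driver reads it as a list of failover hosts.
        p.setProperty("db.cloud.host", "db1.example.com,db2.example.com");
        p.setProperty("db.cloud.url.params",
                      "&failOverReadOnly=false&connectTimeout=200");
        System.out.println(buildCloudUrl(p));
    }
}
```

With plain string concatenation like this, only the last host carries the
explicit port; as far as I know the driver falls back to its default port for
the hosts listed without one.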

Loadbalance options could be added to take advantage of JDBC
loadbalance functionality, but many multi-master clusters require
special transaction failure handling when writing to multiple masters
that I don't think cloudstack is doing.


Re: database connection resilience

2013-07-08 Thread Marcus Sorensen
So to the original question, is it your opinion that a single
management server (non-clustered) should also fence itself, or wait
for the database connection to be restored?


Re: database connection resilience

2013-07-10 Thread Kelven Yang


On 7/8/13 1:40 PM, "Marcus Sorensen"  wrote:

>So to the original question, is it your opinion that a single
>management server (non-clustered) should also fence itself, or wait
>for the database connection to be restored?

Yes, my opinion is that a single management server should also fence
itself (if you like, use a monitoring script to bring it up automatically).

If a connection failure has already surfaced to CloudStack as a runtime
exception, it is too late, because current CloudStack code may not be able
to gracefully handle these unexpected, out-of-flow-context exceptions in
every scenario (at least we haven't really tested it). We need the DB
connection layer to cover the multi-master failover case and make it
transparent.
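
For illustration only, a retry wrapper at the connection layer might look
something like the sketch below. Nothing like this exists in CloudStack as far
as this thread establishes; DbRetrySketch, DbCall and withRetry are
hypothetical names, and retryCount stands in for the
db.cloud.connection.retry.count idea floated earlier. It retries only on
SQLSTATE class 08 (connection exception) and says nothing about what to do
with a transaction that was in flight when the link dropped, which is the hard
part.

```java
import java.sql.SQLException;

/** Hypothetical sketch: retry a DB call when the connection itself failed. */
public class DbRetrySketch {

    /** A unit of DB work; assumed to be safe to re-run from the start. */
    public interface DbCall<T> {
        T run() throws SQLException;
    }

    /** SQLSTATE class "08" means connection exception (08S01: link failure). */
    public static boolean isConnectionFailure(SQLException e) {
        String state = e.getSQLState();
        return state != null && state.startsWith("08");
    }

    /**
     * Runs the call, retrying up to retryCount extra times on connection
     * failures. Any other SQLException is rethrown immediately.
     */
    public static <T> T withRetry(DbCall<T> call, int retryCount, long backoffMillis)
            throws SQLException, InterruptedException {
        if (retryCount < 0) {
            throw new IllegalArgumentException("retryCount must be >= 0");
        }
        SQLException last = null;
        for (int attempt = 0; attempt <= retryCount; attempt++) {
            try {
                return call.run();
            } catch (SQLException e) {
                if (!isConnectionFailure(e)) {
                    throw e;
                }
                last = e;
                if (attempt < retryCount) {
                    Thread.sleep(backoffMillis);
                }
            }
        }
        throw last; // every attempt hit a connection failure
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = withRetry(() -> {
            if (calls[0]++ < 2) {
                throw new SQLException("Communications link failure", "08S01");
            }
            return "ok";
        }, 3, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Only when every attempt fails does the exception reach the caller, at which
point the self-fence logic would still apply.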

Kelven 


