Thanks all for the quick comments!
If i understand the discussion correctly, I will just change all the added log
printing statements to debug verbosity. I will upload a new patch for that
shortly.
Now a bit back story: the reason we are doing this is that we recently did an
analysis on many bugs collected from JIRA to understand why today’s cloud
system fails. And we found that almost all of the cluster-wide failures are
caused by buggy exception handling, which would often turn a component failure
into a cluster-wide one. One of the common bug pattern is ignoring some
exceptions — allowing them to propagate and finally turn into disaster.
Therefore we built a simple static checking tool just to check some simple
rules for exception handling, such as if an exception is ignored. Admittedly,
it would be much harder to reason about the potential consequences caused for
ignoring a certain exception, that’s why without much more domain knowledge I
can only recommend to 1) avoid over-catching an exception, especially when the
handling logic will swallow it, and 2) log them, as what this patch does.
Nevertheless, it seems the four cases I mentioned in JIRA-6242 are particularly
suspicious. It might be worthwhile to double check their correctness if you
have time. I am reposting them below.
Thanks!
Ding
=========================
Case 1:
Line: 375, File: "com/cloud/network/ovs/OvsTunnelManagerImpl.java"
326: try {
327: String myIp = getGreEndpointIP(dest.getHost(), nw);
.. .. ..
373: } catch (Exception e) {
374: // I really thing we should do a better handling of these exceptions
375: s_logger.warn("Ovs Tunnel network created tunnel failed", e);
376: }
Comment seems to suggest for a better handling.
=========================
=========================
Case 2:
Line: 2295, File: "com/cloud/resource/ResourceManagerImpl.java"
2284: try {
2285: /*
2286: * FIXME: this is a buggy logic, check with alex. Shouldn't
2287: * return if propagation return non null
2288: */
2289: Boolean result = _clusterMgr.propagateResourceEvent(h.getId(),
ResourceState.Event.UpdatePassword);
2290: if (result != null) {
2291: return result;
2292: }
2293:
2294: doUpdateHostPassword(h.getId());
2295: } catch (AgentUnavailableException e) {
2296: }
Seem from the comment the logic should be fixed.
A similar code snipeet is at:
Line: 2276, File: "com/cloud/resource/ResourceManagerImpl.java"
=========================
=========================
Case 3:
Line: 184, File:
"org/apache/cloudstack/api/command/user/autoscale/CreateAutoScaleVmGroupCmd.java"
174: try
175: {
176: success = _autoScaleService.configureAutoScaleVmGroup(this);
177: if (success) {
178: vmGroup = _entityMgr.findById(AutoScaleVmGroup.class, getEntityId());
179: AutoScaleVmGroupResponse responseObject =
_responseGenerator.createAutoScaleVmGroupResponse(vmGroup);
180: setResponseObject(responseObject);
181: responseObject.setResponseName(getCommandName());
182: }
183: } catch (Exception ex) {
184: // TODO what will happen if Resource Layer fails in a step inbetween
185: s_logger.warn("Failed to create autoscale vm group", ex);
186: }
The comment, "TODO", seems to suggest for better handling.
=========================
=========================
Case 4:
Line: 222, File: "com/cloud/api/ApiDispatcher.java"
This snippet is moved to ParamProcessWorker.java in the trunk.
203: try {
204: setFieldValue(field, cmd, paramObj, parameterAnnotation);
.. ..
220: } catch (CloudRuntimeException cloudEx) {
221: s_logger.error("CloudRuntimeException", cloudEx);
222: // FIXME: Better error message? This only happens if the API command is
not executable, which typically
223: //means
224: // there was
225: // and IllegalAccessException setting one of the parameters.
226: throw new ServerApiException(ApiErrorCode.INTERNAL_ERROR, "Internal error
executing API command " + cmd.getCommandName().substring(0,
cmd.getCommandName().length() - 8));
227: }
The "FIXME" comment seems to suggest for getter error message.
=========================
On Apr 2, 2014, at 4:37 PM, Daan Hoogland <[email protected]> wrote:
> Ding Yuan,
>
> Any objections to that?
>
> On Wed, Apr 2, 2014 at 10:32 PM, Alena Prokharchyk
> <[email protected]> wrote:
>>
>>
>> On 4/2/14, 1:27 PM, "Daan Hoogland" <[email protected]> wrote:
>>
>>> I think we agree indeed. Doesn't mean we should start this discuss
>>> thread or write a arch guideline on the wiki somewhere. Maybe Ding
>>> Yuan wants to do a preliminary version of it?
>>
>> Wiki guide would be useful indeed.
>>
>>
>>>
>>> In the meantime I don't think that it hurts for the present patch to
>>> do everything in debug and decide about higher levels needed later.
>>
>> Agree.
>>
>>
>>>
>>> regards,
>>>
>>> On Wed, Apr 2, 2014 at 10:11 PM, Alena Prokharchyk
>>> <[email protected]> wrote:
>>>> Daan,
>>>>
>>>> Correct me if I¹m wrong, but all of the logging added by Ding, fall
>>>> under
>>>> "to go with it or to indicate passing a certain code path². I¹ve just
>>>> noticed that some of them were added with DEBUG, and some with WARN
>>>> level,
>>>> and wanted to correct that.
>>>>
>>>> So we should:
>>>>
>>>> 1) For sure: never print them out in WARN as there is nothing admin
>>>> should
>>>> do in this case, because the code just handles them by ignoring.
>>>> 2) Figure out what would be the correct level to log them with: INFO or
>>>> DEBUG
>>>>
>>>> From ³Logging best practices² articles, I can see that people use INFO
>>>> as
>>>> a ³storyline² of normal application behavior, and DEBUG for sort of
>>>> information that helps you to track down the failure cases scenarios.
>>>> To
>>>> me, stuff added by Ding, falls under second category. But I might be
>>>> wrong
>>>> as I don¹t recall on the spot any discussions happening on the debug
>>>> topic, from the mailing list.
>>>>
>>>> -Alena.
>>>>
>>>>
>>>> On 4/2/14, 12:57 PM, "Daan Hoogland" <[email protected]> wrote:
>>>>
>>>>> Alena,
>>>>>
>>>>> What I read in your comment is a description of INFO vs WARN. Debug
>>>>> would be only for outputting stacktraces to go with it or to indicate
>>>>> passing a certain code path.
>>>>>
>>>>> Agree?
>>>>>
>>>>> On Wed, Apr 2, 2014 at 8:31 PM, Alena Prokharchyk
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> -----------------------------------------------------------
>>>>>> This is an automatically generated e-mail. To reply, visit:
>>>>>> https://reviews.apache.org/r/19917/#review39324
>>>>>> -----------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> Is there a reason why logs for some exceptions are being logged in
>>>>>> DEBUG mode, and some in WARN? From my point of view, if the code only
>>>>>> catches it and doesn't error out, it should be logged in DEBUG. Lots of
>>>>>> Admins are seeking for WARN statements in the log, and they might be
>>>>>> confused seeing WARN w/o further failure or retry.
>>>>>>
>>>>>> - Alena Prokharchyk
>>>>>>
>>>>>>
>>>>>> On April 2, 2014, 1:55 p.m., Ding Yuan wrote:
>>>>>>>
>>>>>>> -----------------------------------------------------------
>>>>>>> This is an automatically generated e-mail. To reply, visit:
>>>>>>> https://reviews.apache.org/r/19917/
>>>>>>> -----------------------------------------------------------
>>>>>>>
>>>>>>> (Updated April 2, 2014, 1:55 p.m.)
>>>>>>>
>>>>>>>
>>>>>>> Review request for cloudstack.
>>>>>>>
>>>>>>>
>>>>>>> Repository: cloudstack-git
>>>>>>>
>>>>>>>
>>>>>>> Description
>>>>>>> -------
>>>>>>>
>>>>>>> This is the patch for JIRA-6242. See
>>>>>>> https://issues.apache.org/jira/browse/CLOUDSTACK-6242 for more
>>>>>>> details.
>>>>>>> Thanks!
>>>>>>>
>>>>>>>
>>>>>>> Diffs
>>>>>>> -----
>>>>>>>
>>>>>>>
>>>>>>> engine/orchestration/src/com/cloud/agent/manager/AgentManagerImpl.java
>>>>>>> 0d41bc1
>>>>>>>
>>>>>>> engine/orchestration/src/com/cloud/agent/manager/ClusteredAgentManager
>>>>>>> Im
>>>>>>> pl.java 01508a4
>>>>>>>
>>>>>>> engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java
>>>>>>> 3e088db
>>>>>>>
>>>>>>> engine/orchestration/src/org/apache/cloudstack/engine/datacenter/entit
>>>>>>> y/
>>>>>>> api/db/dao/EngineDataCenterDaoImpl.java 4b6818e
>>>>>>> engine/schema/src/com/cloud/dc/dao/DataCenterDaoImpl.java ea5039f
>>>>>>> engine/schema/src/com/cloud/host/dao/HostDaoImpl.java 426c90d
>>>>>>> engine/schema/src/com/cloud/storage/dao/StoragePoolHostDaoImpl.java
>>>>>>> e42eaf4
>>>>>>> engine/schema/src/com/cloud/storage/dao/VMTemplateDaoImpl.java
>>>>>>> 34fdca5
>>>>>>> engine/schema/src/com/cloud/upgrade/dao/Upgrade2214to30.java
>>>>>>> 58dd916
>>>>>>> engine/schema/src/com/cloud/vm/dao/ConsoleProxyDaoImpl.java 5e9c2f0
>>>>>>> engine/schema/src/com/cloud/vm/dao/SecondaryStorageVmDaoImpl.java
>>>>>>> 1f382d6
>>>>>>>
>>>>>>> engine/storage/src/org/apache/cloudstack/storage/datastore/DataObjectM
>>>>>>> an
>>>>>>> agerImpl.java 6ed1274
>>>>>>>
>>>>>>> framework/ipc/src/org/apache/cloudstack/framework/serializer/OnwireCla
>>>>>>> ss
>>>>>>> Registry.java 83c8a42
>>>>>>>
>>>>>>> plugins/hypervisors/xen/src/com/cloud/hypervisor/xen/discoverer/XcpSer
>>>>>>> ve
>>>>>>> rDiscoverer.java 0ad6dc4
>>>>>>>
>>>>>>> plugins/hypervisors/xen/src/com/cloud/hypervisor/xen/resource/XenServe
>>>>>>> rC
>>>>>>> onnectionPool.java b779085
>>>>>>>
>>>>>>> plugins/hypervisors/xen/src/com/cloud/hypervisor/xen/resource/XenServe
>>>>>>> rS
>>>>>>> torageProcessor.java e512046
>>>>>>>
>>>>>>> plugins/storage/volume/solidfire/src/org/apache/cloudstack/storage/dat
>>>>>>> as
>>>>>>> tore/lifecycle/SolidFirePrimaryDataStoreLifeCycle.java af6a77a
>>>>>>> server/src/com/cloud/resource/ResourceManagerImpl.java f9a59ba
>>>>>>> server/src/com/cloud/server/ConfigurationServerImpl.java b8da4c8
>>>>>>>
>>>>>>> services/console-proxy/server/src/com/cloud/consoleproxy/ConsoleProxyT
>>>>>>> hu
>>>>>>> mbnailHandler.java 06f21d3
>>>>>>> utils/src/com/cloud/utils/net/NetUtils.java 6350986
>>>>>>>
>>>>>>> Diff: https://reviews.apache.org/r/19917/diff/
>>>>>>>
>>>>>>>
>>>>>>> Testing
>>>>>>> -------
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Ding Yuan
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Daan
>>>>
>>>
>>>
>>>
>>> --
>>> Daan
>>
>
>
>
> --
> Daan
>