Re: SystemFailure and VirtualMachineErrors

Anthony Baker Wed, 07 Sep 2016 10:42:23 -0700

I’m a little concerned about trying to continue when a thread has died.
What if the thread was important? Can we prevent the StackOverflowError in
the first place?


Anthony

> On Sep 7, 2016, at 10:00 AM, Kirk Lund <kl...@pivotal.io> wrote:
> 
> Here's the current behavior: the stack trace for a StackOverflowError is
> logged, the thread dies, and then Geode closes its Cache and
> DistributedSystem. If it's a Server process, then the process exits.
> 
> The proposal is to still have Geode log the stack trace and the thread will
> die. The change is that we will not close the Cache and DistributedSystem
> so that the Server process does not exit.
> 
> The only way we would hit this situation today that I know of involves
> writing a query that exposes a recursive JSON parsing bug in the TypedJson
> class. I would alter SystemFailure to not shut down for a StackOverflowError
> AND also fix the underlying bug in TypedJson which results in a
> StackOverflowError. The fix for TypedJson may involve removing it in favor
> of using Jackson for JSON parsing. I'll work on writing tests that expose
> both of these issues. In the meantime, please let me know if anyone has
> any feedback or opinions.
> 
> Thanks,
> Kirk
> 
> 
> On Wed, Sep 7, 2016 at 9:44 AM, Kirk Lund <kl...@pivotal.io> wrote:
> 
>> I'd like to change SystemFailure and calling code to not shut down for a
>> java.lang.StackOverflowError.
>> 
>> The existing behavior would be unchanged for these VirtualMachineErrors:
>> 
>> java.lang.InternalError
>> java.lang.OutOfMemoryError
>> java.lang.UnknownError
>> java.util.zip.ZipError
>> 
>> Thoughts or concerns?
>> 
>> Thanks,
>> Kirk
>> 
>> 
>> On Fri, Sep 2, 2016 at 2:55 PM, Kirk Lund <kl...@apache.org> wrote:
>> 
>>> The Geode codebase currently includes the component SystemFailure which
>>> is initiated by any instance of VirtualMachineError:
>>> 
>>>      } catch (VirtualMachineError e) {
>>>        SystemFailure.initiateFailure(e);
>>>        throw e;
>>>      }
>>> 
>>> SystemFailure will ultimately react by closing the DistributedSystem and
>>> Cache (i.e., shutting down the server). The original reason was to close
>>> the Cache in the event of an OutOfMemoryError to prevent Cache
>>> inconsistency from one member to another.
>>> 
>>> There are additional types of VirtualMachineError besides
>>> OutOfMemoryError. Does it really make sense to initiate SystemFailure for
>>> all other types including StackOverflowError?
>>> 
>>> GFSH starts all processes with a flag indicating that OutOfMemoryError
>>> should result in shutdown. It specifies "-XX:OnOutOfMemoryError=taskkill
>>> /F /PID %p" for HotSpot on Windows, "-XX:OnOutOfMemoryError=kill -KILL %p"
>>> for HotSpot on all other platforms, "-Xcheck:memory" on J9
>>> or "-XXexitOnOutOfMemory" on JRockit.
>>> 
>>> Given that the above flag should terminate the process on
>>> OutOfMemoryError, are we now able to remove SystemFailure from Geode?
>>> Opinions?
>>> 
>>> Thanks,
>>> Kirk
>>> 
>>> 
>>
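
For reference, the change proposed in the quoted messages could be sketched
roughly as below. This is only an illustration under stated assumptions, not
Geode's actual code: FailureRegistry is a hypothetical stand-in for
SystemFailure that merely records the failure, and VmErrorPolicy.run is a
hypothetical wrapper around the catch block quoted above. In both branches
the error is rethrown, so the calling thread still dies, which is the concern
raised at the top of this message.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for SystemFailure: it only remembers the first
// failure instead of closing the Cache and DistributedSystem.
final class FailureRegistry {
  private static final AtomicReference<VirtualMachineError> FAILURE =
      new AtomicReference<>();

  static void initiateFailure(VirtualMachineError e) {
    FAILURE.compareAndSet(null, e); // keep the first recorded failure
  }

  static VirtualMachineError getFailure() {
    return FAILURE.get();
  }
}

// Hypothetical wrapper showing the two code paths under discussion.
final class VmErrorPolicy {
  static void run(Runnable task) {
    try {
      task.run();
    } catch (StackOverflowError e) {
      // Proposed behavior: log, let the thread die, but do NOT initiate
      // a system-wide shutdown.
      System.err.println("StackOverflowError in thread "
          + Thread.currentThread().getName());
      throw e;
    } catch (VirtualMachineError e) {
      // Unchanged behavior for InternalError, OutOfMemoryError,
      // UnknownError and ZipError: initiate failure, then rethrow.
      FailureRegistry.initiateFailure(e);
      throw e;
    }
  }
}
```

The more specific catch clause has to come first, because StackOverflowError
is a subclass of VirtualMachineError.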
