It seems that this.getClass().hashCode() executed on different VMs can produce different results (though it always produces the same result within a single VM, which doesn't violate the JVM specification). Ignite requires the hashCode of a key to be consistent cluster-wide, so Ignite imposes an even stronger requirement than the JVM spec.
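For illustration, here is a minimal sketch of the two variants on a hypothetical key class (not the actual class from this thread). Class.hashCode() is an identity hash that can differ from one JVM instance to another, whereas the hash of the class *name* is computed from the string's characters and is therefore identical on every node:

import java.io.Serializable;

// Hypothetical cache key, used only to illustrate the pitfall; not the class from this thread.
public class SampleKey implements Serializable {
    private final long id;

    public SampleKey(long id) {
        this.id = id;
    }

    @Override public boolean equals(Object o) {
        if (this == o)
            return true;
        if (o == null || getClass() != o.getClass())
            return false;
        return id == ((SampleKey)o).id;
    }

    @Override public int hashCode() {
        // Fragile: Class.hashCode() is an identity hash and may differ between JVMs,
        // so two nodes can compute different hashes for the same key.
        //   return 31 * getClass().hashCode() + Long.hashCode(id);

        // Robust: the class name is a String, and String.hashCode() is defined purely
        // by its characters, so it is the same on every node.
        return 31 * getClass().getName().hashCode() + Long.hashCode(id);
    }
}

The general rule is that a key's hashCode should depend only on value-based state that survives serialization, never on identity hashes of runtime objects.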
— Denis

> On Jun 23, 2016, at 11:30 AM, Kristian Rosenvold <krosenv...@apache.org> wrote:
>
> We think the issue may concern the transportability of the hashCode across nodes, because the hashCode in question included the hash code of a class (in other words this.getClass().hashCode(), as opposed to the more robust this.getClass().getName().hashCode()).
>
> Does Ignite require the hashCode of a key to be cluster-wide consistent?
>
> (This would actually be a violation of the javadoc contract for hashCode, which states "This integer need not remain consistent from one execution of an application to another execution of the same application.". But it should be possible to actually test for this if it is a constraint required by Ignite.)
>
> If this does not appear to be the problem, I can supply the code in question.
>
> Kristian
>
> 2016-06-23 10:05 GMT+02:00 Denis Magda <dma...@gridgain.com>:
>> Hi Kristian,
>>
>> Could you share the source of a class that has an inconsistent equals/hashCode implementation? Probably we will be able to detect your case internally somehow and print a warning.
>>
>> —
>> Denis
>>
>>> On Jun 17, 2016, at 10:27 PM, Kristian Rosenvold <krosenv...@apache.org> wrote:
>>>
>>> This whole issue was caused by an inconsistent equals/hashCode on a cache key, which apparently has the capability of stopping replication dead in its tracks. Nailing this one after 3-4 days of a very nagging "select is broken" feeling was great. You guys helping us here might want to be particularly aware of this, since it undeniably gives a newbie the impression that Ignite is broken while it's my code :)
>>>
>>> Thanks for the help!
>>>
>>> Kristian
>>>
>>> 2016-06-17 20:00 GMT+02:00 Alexey Goncharuk <alexey.goncha...@gmail.com>:
>>>> Kristian,
>>>>
>>>> Are you sure you are using the latest 1.7-SNAPSHOT for your production data? Did you build the binaries yourself? Can you confirm the commit# of the binaries you are using? The issue you are reporting seems to be the same as IGNITE-3305 and, since the fix was committed only a couple of days ago, it might not have made it into the nightly snapshot.
>>>>
>>>> 2016-06-17 9:06 GMT-07:00 Kristian Rosenvold <krosenv...@apache.org>:
>>>>>
>>>>> Sigh, this has all the hallmarks of a thread-safety issue or race condition.
>>>>>
>>>>> I had a perfect testcase that replicated the problem 100% of the time, but only when running on distinct nodes (it never occurs on the same box), with 2 distinct caches and with Ignite 1.5; I just expanded the testcase I posted initially. Typically I'd be missing the last 10-20 elements in the cache. I was about 2 seconds from reporting an issue, and then I switched to yesterday's 1.7-SNAPSHOT version and it went away. Unfortunately 1.7-SNAPSHOT exhibits the same behaviour with my production data; it just broke my testcase :( Presumably I just need to tweak the cache sizes or element counts to hit some kind of non-sweet spot, and then it probably fails on my machine too.
>>>>>
>>>>> The testcase always worked on a single box, which led me to think about socket-related issues. But it also required 2 caches to fail, which led me to think about race conditions like the rebalance terminating once the first node finishes.
>>>>>
>>>>> I'm no stranger to reading bug reports like this myself, and I must admit this seems pretty tough to diagnose.
>>>>>
>>>>> Kristian
>>>>>
>>>>> 2016-06-17 14:57 GMT+02:00 Denis Magda <dma...@gridgain.com>:
>>>>>> Hi Kristian,
>>>>>>
>>>>>> Your test looks absolutely correct to me. However, I didn't manage to reproduce this issue on my side either.
>>>>>>
>>>>>> Alex G., do you have any ideas on what could be the reason for that? Can you recommend that Kristian enable DEBUG/TRACE log levels for particular modules? Probably advanced logging will let us pinpoint the issue that happens in Kristian's environment.
>>>>>>
>>>>>> —
>>>>>> Denis
>>>>>>
>>>>>> On Jun 17, 2016, at 10:02 AM, Kristian Rosenvold <krosenv...@apache.org> wrote:
>>>>>>
>>>>>> For Ignite 1.5, 1.6 and 1.7-SNAPSHOT, I see the same behaviour. Since REPLICATED caches seem to be broken on 1.6 and beyond, I am testing this on 1.5:
>>>>>>
>>>>>> I can reliably start two nodes and get consistent, correct results; let's say each node has 1.5 million elements in a given cache.
>>>>>>
>>>>>> Once I start a third or fourth node in the same cluster, it consistently gets a random, incorrect number of elements in the same cache, typically 1.1 million or so.
>>>>>>
>>>>>> I tried to create a testcase to reproduce this on my local machine (https://github.com/krosenvold/ignite/commit/4fb3f20f51280d8381e331b7bcdb2bae95b76b95), but it fails to reproduce the problem.
>>>>>>
>>>>>> I have two nodes in 2 different datacenters, so there will invariably be some differences in latencies/response times between the existing 2 nodes and the newly started node.
>>>>>>
>>>>>> This sounds like some kind of timing-related bug, any tips? Is there any way I can skew the timing in the testcase?
>>>>>>
>>>>>> Kristian