You’re seeing an OOM, not a socket error or timeout, so raising streaming_socket_timeout_in_ms won’t help here.

-- 
Jeff Jirsa


> On Aug 29, 2018, at 10:56 PM, Jai Bheemsen Rao Dhanwada 
> <jaibheem...@gmail.com> wrote:
> 
> Jeff,
> 
> Any idea if this is somehow related to 
> https://issues.apache.org/jira/browse/CASSANDRA-11840?
> Would increasing the value of streaming_socket_timeout_in_ms to a higher 
> value help?
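> 
> For reference, this is the knob I'd be looking at -- a sketch only; the 
> config path and the 24h example value below are my assumptions, not a 
> recommendation:
> 
>   # check the current value, then edit cassandra.yaml and restart the node
>   grep streaming_socket_timeout_in_ms /etc/cassandra/cassandra.yaml
>   # e.g. streaming_socket_timeout_in_ms: 86400000   (24 hours, in ms)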
> 
>> On Wed, Aug 29, 2018 at 10:52 PM Jai Bheemsen Rao Dhanwada 
>> <jaibheem...@gmail.com> wrote:
>> I have 72 nodes in the cluster, across 8 datacenters. The moment I try to 
>> grow the cluster above 84 nodes or so, the issue starts.
>> 
>> I am still using the CMS collector, assuming it will do more harm than good 
>> if I increase the heap size beyond the recommended 8G.
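>> 
>> For reference, the heap is set along these lines -- a sketch; the file and 
>> the new-gen value are from memory and may not match our exact deployment:
>> 
>>   # conf/cassandra-env.sh (a bash script in the Cassandra conf dir)
>>   MAX_HEAP_SIZE="8G"
>>   HEAP_NEWSIZE="800M"   # often sized around 100MB per physical core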
>> 
>>> On Wed, Aug 29, 2018 at 6:53 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>> Given the size of your schema, the new node is probably getting flooded 
>>> with huge schema mutations as it joins gossip and tries to pull the schema 
>>> from every host it sees. You say 8 DCs but you don’t say how many nodes - 
>>> I’m guessing it’s a lot?
>>> 
>>> This is something that’s incrementally better in 3.0, but a real proper fix 
>>> has been talked about a few times  - 
>>> https://issues.apache.org/jira/browse/CASSANDRA-11748 and 
>>> https://issues.apache.org/jira/browse/CASSANDRA-13569 for example 
>>> 
>>> In the short term, you may be able to work around this by increasing your 
>>> heap size. If that doesn’t work, there’s an ugly hack that works on 2.1: 
>>> limit the number of schema blobs the new node can pull at a time. In 
>>> practice, that means firewalling off all but a few nodes in your cluster 
>>> for 10-30 seconds, making sure the new node gets the schema (watch the 
>>> logs or the filesystem for the tables to be created), and then removing 
>>> the firewall rules so it can start the bootstrap process. It needs the 
>>> schema to set up the streaming plan, and it needs all the hosts up in 
>>> gossip to stream successfully, so this is just an ugly hack to buy time to 
>>> get the schema and then heal the cluster so it can bootstrap.
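>>> 
>>> Roughly, on the joining node, before starting Cassandra -- a sketch only; 
>>> the IP and the default unencrypted storage port 7000 are placeholders / 
>>> assumptions (use 7001 if internode SSL is on), and only a single allowed 
>>> peer is shown:
>>> 
>>>   # drop internode/gossip traffic from everything except one existing node
>>>   iptables -A INPUT -p tcp --dport 7000 ! -s 10.0.0.11 -j DROP
>>>   # start the node, wait until the schema has been pulled
>>>   # (watch the logs / data directories for the tables to appear), then
>>>   # remove the rule so the whole cluster is visible for streaming:
>>>   iptables -D INPUT -p tcp --dport 7000 ! -s 10.0.0.11 -j DROP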
>>> 
>>> Yea that’s awful. Hopefully either of the two above JIRAs lands to make 
>>> this less awful. 
>>> 
>>> 
>>> 
>>> -- 
>>> Jeff Jirsa
>>> 
>>> 
>>>> On Aug 29, 2018, at 6:29 PM, Jai Bheemsen Rao Dhanwada 
>>>> <jaibheem...@gmail.com> wrote:
>>>> 
>>>> It fails before bootstrap (before streaming starts).
>>>> 
>>>> Streaming throughput on the existing nodes is set to 400 Mb/s.
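>>>> 
>>>> For reference, that was set roughly like this (sketch; yaml key name from 
>>>> memory):
>>>> 
>>>>   # cassandra.yaml on the existing nodes:
>>>>   #   stream_throughput_outbound_megabits_per_sec: 400
>>>>   # or applied at runtime:
>>>>   nodetool setstreamthroughput 400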
>>>> 
>>>>> On Wednesday, August 29, 2018, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>> Is the bootstrap plan succeeding (does streaming start or does it crash 
>>>>> before it logs messages about streaming starting)?
>>>>> 
>>>>> Have you capped the stream throughput on the existing hosts? 
>>>>> 
>>>>> -- 
>>>>> Jeff Jirsa
>>>>> 
>>>>> 
>>>>>> On Aug 29, 2018, at 5:02 PM, Jai Bheemsen Rao Dhanwada 
>>>>>> <jaibheem...@gmail.com> wrote:
>>>>>> 
>>>>>> Hello All,
>>>>>> 
>>>>>> We are seeing an issue when we add more nodes to the cluster: the new 
>>>>>> node is not able to stream the entire schema metadata and fails to 
>>>>>> bootstrap. Eventually the process dies with an OOM 
>>>>>> (java.lang.OutOfMemoryError: Java heap space).
>>>>>> 
>>>>>> But if I remove a few nodes from the cluster, we don't see this issue.
>>>>>> 
>>>>>> Cassandra Version: 2.1.16
>>>>>> # of keyspaces / column families: ~100 / ~3000
>>>>>> # of DC: 8
>>>>>> # of Vnodes per node: 256
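>>>>>> 
>>>>>> The counts above are approximate, pulled with something like the 
>>>>>> following -- the 2.1 system schema table names are from memory:
>>>>>> 
>>>>>>   cqlsh -e "SELECT count(*) FROM system.schema_keyspaces;"
>>>>>>   cqlsh -e "SELECT count(*) FROM system.schema_columnfamilies;"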
>>>>>> 
>>>>>> Not sure what is causing this behavior; has anyone come across this 
>>>>>> scenario?
>>>>>> thanks in advance.
