Re: What is data-streamer-stripe thread?

2022-09-12 Thread Zhenya Stanilovsky via user

John, it seems all you can do here is set this pool size to «1»; setting it
to «0» leads to an error.
 
https://ignite.apache.org/docs/latest/data-streaming#configuring-data-streamer-thread-pool-size
 
One thread will still be frozen in such a case.
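For reference, the setting behind the linked docs can also be applied programmatically; a minimal sketch (the class name is made up, but `setDataStreamerThreadPoolSize` and `setClientMode` are the standard `IgniteConfiguration` properties):

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClientWithSmallStreamerPool {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        cfg.setClientMode(true);              // thick client, as in John's setup
        cfg.setDataStreamerThreadPoolSize(1); // minimum allowed; 0 is rejected

        Ignition.start(cfg);
    }
}
```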
 
> 
>> 
>>>Hi, I'm profiling my application with YourKit and it indicates that a 
>>>bunch of these threads (data-streamer-stripe) are "frozen" for 21 days. This 
>>>
>>>I'm not using data streaming; is there a way to disable it or just ignore 
>>>the messages? The application is configured as a thick client (client = true) 

Re[8]: Checkpointing threads

2022-09-12 Thread Zhenya Stanilovsky via user

Not throttling, but: «Thread dump is hidden due to throttling settings». There 
is extensive documentation about persistence tuning in Apache Ignite.



 
>Hi,
>Throttling is disabled in the Ignite config as mentioned in the previous reply. What do 
>you suggest to make Ignite catch up with SSD limits on checkpointing?   
>On Mon, 12 Sept 2022, 11:32 Zhenya Stanilovsky via user, < 
>user@ignite.apache.org > wrote:
>>
>>
>>
>> 
>>>We have observed one interesting issue with checkpointing. We are using 64G 
>>>RAM, 12 CPU, with 3K IOPS / 128 MBps SSDs. Our application fills up the WAL 
>>>directory really fast and hence the RAM. We made the following observations:
>>>
>>>0. The not-so-bad news first: it resumes processing after getting stuck for 
>>>several minutes.
>>>
>>>1. WAL and WAL archive writes are a lot faster than writes to the work 
>>>directory through checkpointing. Very curious to know why this is the case. 
>>>Checkpointing writes never exceed 15 MBps, while WAL and WAL archive go 
>>>really high, up to the max limits of the SSD.
>> 
>>A very simple example: sequentially changing one key. In the WAL you record 
>>every change, while the checkpoint (in your terms, checkpointing) writes only 
>>the one final key change.
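Zhenya's point can be sketched as a toy model (not Ignite code; the page size and offsets are made up): every update appends one WAL record, but a page that was dirtied many times is written by the checkpoint only once.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of why WAL traffic can dwarf checkpoint traffic:
// each update appends a WAL record, but the checkpoint writes
// each dirty page once, no matter how often it changed.
public class WalVsCheckpoint {
    static final int PAGE_SIZE = 4096; // illustrative page size

    // Returns {walRecords, pagesToCheckpoint} after `updates` writes
    // that all land at the same byte offset (i.e. the same key).
    public static long[] updateSameKey(int updates, long keyOffset) {
        long walRecords = 0;
        Set<Long> dirtyPages = new HashSet<>();
        for (int i = 0; i < updates; i++) {
            walRecords++;                          // one WAL record per update
            dirtyPages.add(keyOffset / PAGE_SIZE); // page marked dirty (deduplicated)
        }
        return new long[] { walRecords, dirtyPages.size() };
    }

    public static void main(String[] args) {
        long[] r = updateSameKey(10_000, 12345L);
        System.out.println("WAL records: " + r[0] + ", pages checkpointed: " + r[1]);
    }
}
```

So 10,000 updates to one key produce 10,000 WAL records but only one checkpointed page, which is why WAL throughput can saturate the SSD while checkpoint throughput stays low.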
>> 
>>>
>>>2. We observed that when offheap memory usage tends to zero, checkpointing 
>>>takes minutes to complete, sometimes 30+ minutes, which stalls application 
>>>writes completely on all nodes. It means the whole cluster 
>>>freezes. 
>> 
>>It seems Ignite enables write throttling in such a case; you need some system 
>>and cluster tuning. 
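The knobs usually involved in that tuning sit on `DataStorageConfiguration` and `DataRegionConfiguration`. A hedged sketch follows; the values are illustrative, not recommendations, and the right numbers depend on the workload and the SSD:

```java
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class CheckpointTuningSketch {
    public static IgniteConfiguration configure() {
        DataRegionConfiguration region = new DataRegionConfiguration()
            .setName("default")
            .setPersistenceEnabled(true)
            // A larger checkpoint buffer delays the "too many dirty pages" trigger.
            .setCheckpointPageBufferSize(2L * 1024 * 1024 * 1024); // 2 GiB, illustrative

        DataStorageConfiguration storage = new DataStorageConfiguration()
            .setDefaultDataRegionConfiguration(region)
            .setCheckpointThreads(8)          // more parallel checkpoint writers
            .setCheckpointFrequency(30_000)   // checkpoint more often, in smaller bursts
            .setWriteThrottlingEnabled(true); // smooth page writes instead of hard stalls

        return new IgniteConfiguration().setDataStorageConfiguration(storage);
    }
}
```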
>> 
>>>
>>>3. The checkpointing thread gets stuck at a checkpointing page future's get() and 
>>>after several minutes it logs this error and the grid resumes processing:
>>>
>>>"sys-stripe-0-#1" #19 prio=5 os_prio=0 cpu=86537.69ms elapsed=2166.63s 
>>>tid=0x7fa52a6f1000 nid=0x3b waiting on condition  [0x7fa4c58be000]
>>>   java.lang.Thread.State: WAITING (parking)
>>>at jdk.internal.misc.Unsafe.park( java.base@11.0.14.1/Native Method)
>>>at java.util.concurrent.locks.LockSupport.park( java.base@11.0.14.1/Unknown 
>>>Source)
>>>at 
>>>org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
>>>at 
>>>org.apache.ignite.internal.util.future.GridFutureAdapter.getUninterruptibly(GridFutureAdapter.java:146)
>>>at 
>>>org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointTimeoutLock.checkpointReadLock(CheckpointTimeoutLock.java:144)
>>>at 
>>>org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1613)
>>>at 
>>>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processDhtAtomicUpdateRequest(GridDhtAtomicCache.java:3313)
>>>at 
>>>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$600(GridDhtAtomicCache.java:143)
>>>at 
>>>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:322)
>>>at 
>>>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:317)
>>>at 
>>>org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1151)
>>>at 
>>>org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:592)
>>>at 
>>>org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:393)
>>>at 
>>>org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:319)
>>>at 
>>>org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:110)
>>>at 
>>>org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:309)
>>>at 
>>>org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1908)
>>>at 
>>>org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1529)
>>>at 
>>>org.apache.ignite.internal.managers.communication.GridIoManager.access$5300(GridIoManager.java:242)
>>>at 
>>>org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1422)
>>>at 
>>>org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
>>>at 
>>>org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:569)
>>>at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>>>at java.lang.Thread.run( java.base@11.0.14.1/Unknown Source)   
>>>CheckpointProgress pages = checkpointer.scheduleCheckpoint(0, "too many 
>>>dirty pages");
>>>
>>>checkpointReadWriteLock.readUnlock();
>>>
>>>if (timeout > 0 && U.currentTimeMillis() - start >= timeout)
>>>    failCheckpointReadLock();
>>>
>>>try {
>>>    pages
>>>        .futureFor(LOCK_RELEASED)
>>>        .getUninterruptibly();
>>>}
>>> [2022-09-09 

Re: Re[6]: Checkpointing threads

2022-09-12 Thread Surinder Mehra
Hi,
Throttling is disabled in the Ignite config as mentioned in the previous reply. What do
you suggest to make Ignite catch up with SSD limits on checkpointing?

On Mon, 12 Sept 2022, 11:32 Zhenya Stanilovsky via user, <
user@ignite.apache.org> wrote:


Re[6]: Checkpointing threads

2022-09-12 Thread Zhenya Stanilovsky via user




 
>We have observed one interesting issue with checkpointing. We are using 64G 
>RAM, 12 CPU, with 3K IOPS / 128 MBps SSDs. Our application fills up the WAL 
>directory really fast and hence the RAM. We made the following observations:
>
>0. The not-so-bad news first: it resumes processing after getting stuck for 
>several minutes.
>
>1. WAL and WAL archive writes are a lot faster than writes to the work 
>directory through checkpointing. Very curious to know why this is the case. 
>Checkpointing writes never exceed 15 MBps, while WAL and WAL archive go really 
>high, up to the max limits of the SSD.
 
A very simple example: sequentially changing one key. In the WAL you record 
every change, while the checkpoint (in your terms, checkpointing) writes only 
the one final key change.
 
>
>2. We observed that when offheap memory usage tends to zero, checkpointing 
>takes minutes to complete, sometimes 30+ minutes, which stalls application 
>writes completely on all nodes. It means the whole cluster freezes. 
 
It seems Ignite enables write throttling in such a case; you need some system 
and cluster tuning. 
 
>
>3. The checkpointing thread gets stuck at a checkpointing page future's get() and 
>after several minutes it logs this error and the grid resumes processing:
>
>"sys-stripe-0-#1" #19 prio=5 os_prio=0 cpu=86537.69ms elapsed=2166.63s 
>tid=0x7fa52a6f1000 nid=0x3b waiting on condition  [0x7fa4c58be000]
>   java.lang.Thread.State: WAITING (parking)
>at jdk.internal.misc.Unsafe.park( java.base@11.0.14.1/Native Method)
>at java.util.concurrent.locks.LockSupport.park( java.base@11.0.14.1/Unknown 
>Source)
>at 
>org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
>at 
>org.apache.ignite.internal.util.future.GridFutureAdapter.getUninterruptibly(GridFutureAdapter.java:146)
>at 
>org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointTimeoutLock.checkpointReadLock(CheckpointTimeoutLock.java:144)
>at 
>org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1613)
>at 
>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processDhtAtomicUpdateRequest(GridDhtAtomicCache.java:3313)
>at 
>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$600(GridDhtAtomicCache.java:143)
>at 
>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:322)
>at 
>org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$7.apply(GridDhtAtomicCache.java:317)
>at 
>org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1151)
>at 
>org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:592)
>at 
>org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:393)
>at 
>org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:319)
>at 
>org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:110)
>at 
>org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:309)
>at 
>org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1908)
>at 
>org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1529)
>at 
>org.apache.ignite.internal.managers.communication.GridIoManager.access$5300(GridIoManager.java:242)
>at 
>org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1422)
>at 
>org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
>at 
>org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:569)
>at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>at java.lang.Thread.run( java.base@11.0.14.1/Unknown Source)   
>CheckpointProgress pages = checkpointer.scheduleCheckpoint(0, "too many 
>dirty pages");
>
>checkpointReadWriteLock.readUnlock();
>
>if (timeout > 0 && U.currentTimeMillis() - start >= timeout)
>    failCheckpointReadLock();
>
>try {
>    pages
>        .futureFor(LOCK_RELEASED)
>        .getUninterruptibly();
>}
>
> [2022-09-09 18:58:35,148][ERROR][sys-stripe-9-#10][CheckpointTimeoutLock] 
> Checkpoint read lock acquisition has been timed out.
>class 
>org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointTimeoutLock$CheckpointReadLockTimeoutException:
> Checkpoint read lock acquisition has been timed out.
>at 
>org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointTimeoutLock.failCheckpointReadLock(CheckpointTimeoutLock.java:210)
>at 
>org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointTimeoutLock.checkpointReadLock(CheckpointTimeoutLock.java:108)
>at