[jira] [Comment Edited] (CASSANDRA-16681) org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky

2022-10-07 Thread Piotr Kolaczkowski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613928#comment-17613928
 ] 

Piotr Kolaczkowski edited comment on CASSANDRA-16681 at 10/7/22 7:41 AM:
-

The `testPoolAllocateWithRecyclePartially` fails quite reliably  if I exclude 
the last patch:

{noformat}
INFO  [main] 2022-10-07 09:31:39,811 LongBufferPoolTest.java:329 - 2022/10/07 
09:31:39 - testing 16 threads for 2m
INFO  [main] 2022-10-07 09:31:39,811 LongBufferPoolTest.java:330 - Testing 
BufferPool with memoryUsageThreshold=16777216 and enabling BufferPool.DEBUG
INFO  [test:5] 2022-10-07 09:31:39,829 NoSpamLogger.java:92 - Maximum memory 
usage reached (16,000MiB), cannot allocate chunk of 8,000MiB

java.lang.AssertionError: Last recycled 5 < last acquired 6
at 
org.apache.cassandra.utils.memory.LongBufferPoolTest$Debug.check(LongBufferPoolTest.java:122)
at 
org.apache.cassandra.utils.memory.LongBufferPoolTest.testAllocate(LongBufferPoolTest.java:353)
at 
org.apache.cassandra.utils.memory.LongBufferPoolTest.testPoolAllocate(LongBufferPoolTest.java:159)
at 
org.apache.cassandra.utils.memory.LongBufferPoolTest.testPoolAllocateWithRecyclePartially(LongBufferPoolTest.java:147)
{noformat}
 

The other test `testPoolAllocateWithoutRecyclePartially` works fine with only 6 
patches.

So the 7-th patch fixes partial recycling. Not sure how serious this is - looks 
like some chunks would simply be not recycled, so that could cause a memory 
leak worst case? 





was (Author: pkolaczk):
The {noformat}testPoolAllocateWithRecyclePartially{noformat} fails quite 
reliably  if I exclude the last patch:

{noformat}
INFO  [main] 2022-10-07 09:31:39,811 LongBufferPoolTest.java:329 - 2022/10/07 
09:31:39 - testing 16 threads for 2m
INFO  [main] 2022-10-07 09:31:39,811 LongBufferPoolTest.java:330 - Testing 
BufferPool with memoryUsageThreshold=16777216 and enabling BufferPool.DEBUG
INFO  [test:5] 2022-10-07 09:31:39,829 NoSpamLogger.java:92 - Maximum memory 
usage reached (16,000MiB), cannot allocate chunk of 8,000MiB

java.lang.AssertionError: Last recycled 5 < last acquired 6
at 
org.apache.cassandra.utils.memory.LongBufferPoolTest$Debug.check(LongBufferPoolTest.java:122)
at 
org.apache.cassandra.utils.memory.LongBufferPoolTest.testAllocate(LongBufferPoolTest.java:353)
at 
org.apache.cassandra.utils.memory.LongBufferPoolTest.testPoolAllocate(LongBufferPoolTest.java:159)
at 
org.apache.cassandra.utils.memory.LongBufferPoolTest.testPoolAllocateWithRecyclePartially(LongBufferPoolTest.java:147)
{noformat}
 

The other test ({noformat}testPoolAllocateWithoutRecyclePartially{noformat}) 
works fine with only 6 patches.

So the 7-th patch fixes partial recycling. Not sure how serious this is - looks 
like some chunks would simply be not recycled, so that could cause a memory 
leak worst case? 




> org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky
> --
>
> Key: CASSANDRA-16681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16681
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Assignee: Piotr Kolaczkowski
>Priority: Normal
> Fix For: 3.11.x, 4.0.x, 4.x
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jenkins history:
> [https://jenkins-cm4.apache.org/job/Cassandra-4.0/50/testReport/junit/org.apache.cassandra.utils.memory/LongBufferPoolTest/testPoolAllocateWithRecyclePartially/history/]
> Fails being run in a loop in CircleCI:
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/844/workflows/945011f4-00ac-4678-89f6-5c0db0a40169/jobs/5008
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-16681) org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky

2022-10-07 Thread Piotr Kolaczkowski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613911#comment-17613911
 ] 

Piotr Kolaczkowski edited comment on CASSANDRA-16681 at 10/7/22 7:22 AM:
-

[~aleksey] I'm not sure if the the last patch didn't fix something actually. 
Let me try to rerun the test with only 6 changes and see if it works and I get 
back to you. There was simply many back-and-forth already on this and I don't 
remember exactly at the moment, but the fact that the original control flow 
didn't go through the same CAS could be a correctness issue, not just code 
style issue.

// edit:
Looking at the description, there was a race fixed there, not just code style:
{quote}
   2. When a chunk was evicted, it was first released and later marked as
       EVICTED. The actual order of operations with partial recycling
       enabled was as follows:
       a) release: clear owner -> tryRecycle -> try change status
          EVICTED to IN_USE (fail)
       b) change status to EVICTED
    
       This not only didn't really recycle the chunk, but could also race
       with an other thread trying to recycle the very same chunk and
       accidentally mark an already recycled chunk as EVICTED.
       The order has been fixed and now chunk is marked as evicted before
       attempting to recycle it.

{quote}


was (Author: pkolaczk):
[~aleksey] I'm not sure if the the last patch didn't fix something actually. 
Let me try to rerun the test with only 6 changes and see if it works and I get 
back to you. There was simply many back-and-forth already on this and I don't 
remember exactly at the moment, but the fact that the original control flow 
didn't go through the same CAS could be a correctness issue, not just code 
style issue.

> org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky
> --
>
> Key: CASSANDRA-16681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16681
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Assignee: Piotr Kolaczkowski
>Priority: Normal
> Fix For: 3.11.x, 4.0.x, 4.x
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jenkins history:
> [https://jenkins-cm4.apache.org/job/Cassandra-4.0/50/testReport/junit/org.apache.cassandra.utils.memory/LongBufferPoolTest/testPoolAllocateWithRecyclePartially/history/]
> Fails being run in a loop in CircleCI:
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/844/workflows/945011f4-00ac-4678-89f6-5c0db0a40169/jobs/5008
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-16681) org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky

2022-05-31 Thread Piotr Kolaczkowski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544326#comment-17544326
 ] 

Piotr Kolaczkowski edited comment on CASSANDRA-16681 at 5/31/22 11:25 AM:
--

> I was proposing removing the count entirely, as it is a minor optimisation. 
> Simply use the three fields and their null/non-null status.

I agree, this makes sense as well. So that's actually not that complex as I 
initially thought. The fact we have a fixed number of items (3) and if we 
remove the counter, we could allow "add from one thread, remove from any" and 
probably even volatile would be enough.

Well, now I actually like this idea, and you seem to have (almost) convinced 
me. :)

There is also some nice symmetry with chunks, which can also only be allocated 
from by a single thread, yet any thread can return buffers to them.

One minor concern is though, that what about eviction racing with recycle. I 
mean, the owner evicts a chunk and at the same time the application returns the 
last buffer into that chunk and triggers recycling (on another thread). I guess 
one must be careful with first nulling the owner and then removing from the 
queue to prevent this

Anyway, that's slightly more complex than the isOwner param / thread check, so 
let's leave it for the future.

 

>  Some time in future, we should consider refactoring more fully, to perhaps 
> distinguish the `LocalPool` and `TinyPool` concepts more fully so that we do 
> not make this kind of mistake, as the behaviour is subtly different between 
> the two, and it is easy to forget this and fail to account for the 
> special-casing.

 

+1

 


was (Author: pkolaczk):
> I was proposing removing the count entirely, as it is a minor optimisation. 
> Simply use the three fields and their null/non-null status.

I agree, this makes sense as well. So that's actually not that complex as I 
initially thought. The fact we have a fixed number of items (3) and if we 
remove the counter, we could allow "add from one thread, remove from any" and 
probably even volatile would be enough (no CAS needed, because removers would 
never race for the same item, as you can't recycle the same chunk more than 
once).

Well, now I actually like this idea, and you seem to have convinced me. :)

There is also some nice symmetry with chunks, which can also only be allocated 
from by a single thread, yet any thread can return buffers to them.

 

>  Some time in future, we should consider refactoring more fully, to perhaps 
> distinguish the `LocalPool` and `TinyPool` concepts more fully so that we do 
> not make this kind of mistake, as the behaviour is subtly different between 
> the two, and it is easy to forget this and fail to account for the 
> special-casing.

 

+1

 

> org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky
> --
>
> Key: CASSANDRA-16681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16681
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Assignee: Piotr Kolaczkowski
>Priority: Normal
> Fix For: 3.11.x, 4.0.x, 4.x
>
> Attachments: 0001-Fix-memoryInUse-counter-in-BufferPool.patch, 
> 0002-Multiple-fixes-in-BufferPool-and-LongBufferPoolTest.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jenkins history:
> [https://jenkins-cm4.apache.org/job/Cassandra-4.0/50/testReport/junit/org.apache.cassandra.utils.memory/LongBufferPoolTest/testPoolAllocateWithRecyclePartially/history/]
> Fails being run in a loop in CircleCI:
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/844/workflows/945011f4-00ac-4678-89f6-5c0db0a40169/jobs/5008
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-16681) org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky

2022-05-31 Thread Piotr Kolaczkowski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544326#comment-17544326
 ] 

Piotr Kolaczkowski edited comment on CASSANDRA-16681 at 5/31/22 11:18 AM:
--

> I was proposing removing the count entirely, as it is a minor optimisation. 
> Simply use the three fields and their null/non-null status.

I agree, this makes sense as well. So that's actually not that complex as I 
initially thought. The fact we have a fixed number of items (3) and if we 
remove the counter, we could allow "add from one thread, remove from any" and 
probably even volatile would be enough (no CAS needed, because removers would 
never race for the same item, as you can't recycle the same chunk more than 
once).

Well, now I actually like this idea, and you seem to have convinced me. :)

There is also some nice symmetry with chunks, which can also only be allocated 
from by a single thread, yet any thread can return buffers to them.

 

>  Some time in future, we should consider refactoring more fully, to perhaps 
> distinguish the `LocalPool` and `TinyPool` concepts more fully so that we do 
> not make this kind of mistake, as the behaviour is subtly different between 
> the two, and it is easy to forget this and fail to account for the 
> special-casing.

 

+1

 


was (Author: pkolaczk):
> I was proposing removing the count entirely, as it is a minor optimisation. 
> Simply use the three fields and their null/non-null status.

I agree, this makes sense as well. So that's actually not that complex as I 
initially thought. The fact we have a fixed number of items (3) and if we 
remove the counter, we could allow "add from one thread, remove from any" and 
probably even volatile would be enough (no CAS needed, because removers would 
never race for the same item, as you can't recycle the same chunk more than 
once).

Well, now I actually like this idea, and you seem to have convinced me. :)

There is also some nice symmetry with chunks, which can also only be allocated 
from by a single thread, yet any thread can return buffers to them.

 

 

> org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky
> --
>
> Key: CASSANDRA-16681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16681
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Assignee: Piotr Kolaczkowski
>Priority: Normal
> Fix For: 3.11.x, 4.0.x, 4.x
>
> Attachments: 0001-Fix-memoryInUse-counter-in-BufferPool.patch, 
> 0002-Multiple-fixes-in-BufferPool-and-LongBufferPoolTest.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jenkins history:
> [https://jenkins-cm4.apache.org/job/Cassandra-4.0/50/testReport/junit/org.apache.cassandra.utils.memory/LongBufferPoolTest/testPoolAllocateWithRecyclePartially/history/]
> Fails being run in a loop in CircleCI:
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/844/workflows/945011f4-00ac-4678-89f6-5c0db0a40169/jobs/5008
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-16681) org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky

2022-05-31 Thread Piotr Kolaczkowski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544326#comment-17544326
 ] 

Piotr Kolaczkowski edited comment on CASSANDRA-16681 at 5/31/22 11:18 AM:
--

> I was proposing removing the count entirely, as it is a minor optimisation. 
> Simply use the three fields and their null/non-null status.

I agree, this makes sense as well. So that's actually not that complex as I 
initially thought. The fact we have a fixed number of items (3) and if we 
remove the counter, we could allow "add from one thread, remove from any" and 
probably even volatile would be enough (no CAS needed, because removers would 
never race for the same item, as you can't recycle the same chunk more than 
once).

Well, now I actually like this idea, and you seem to have convinced me. :)

There is also some nice symmetry with chunks, which can also only be allocated 
from by a single thread, yet any thread can return buffers to them.

 

 


was (Author: pkolaczk):
> I was proposing removing the count entirely, as it is a minor optimisation. 
> Simply use the three fields and their null/non-null status.

I agree, this makes sense as well. So that's actually not that complex as I 
initially thought. The fact we have a fixed number of items (3) and if we 
remove the counter, we could allow "add from one thread, remove from any" and 
probably even volatile would be enough (no CAS needed, because removers would 
never race for the same item, as you can't recycle the same chunk more than 
once).

Well, now I actually like this idea, and you seem to have convinced me. :)

There is also some nice symmetry with chunks, which can also only be allocated 
from by a single thread, yet any thread can return buffers to them.

 

 

 

> org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky
> --
>
> Key: CASSANDRA-16681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16681
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Assignee: Piotr Kolaczkowski
>Priority: Normal
> Fix For: 3.11.x, 4.0.x, 4.x
>
> Attachments: 0001-Fix-memoryInUse-counter-in-BufferPool.patch, 
> 0002-Multiple-fixes-in-BufferPool-and-LongBufferPoolTest.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jenkins history:
> [https://jenkins-cm4.apache.org/job/Cassandra-4.0/50/testReport/junit/org.apache.cassandra.utils.memory/LongBufferPoolTest/testPoolAllocateWithRecyclePartially/history/]
> Fails being run in a loop in CircleCI:
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/844/workflows/945011f4-00ac-4678-89f6-5c0db0a40169/jobs/5008
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-16681) org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky

2022-05-31 Thread Piotr Kolaczkowski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544283#comment-17544283
 ] 

Piotr Kolaczkowski edited comment on CASSANDRA-16681 at 5/31/22 10:02 AM:
--

> The amount of time is sort of meaningless here, since nobody is looking at 
> it. 

I was referring to a few people to whom I showed the bug initially by saying 
"here, I have the evidence the MicroQueueOfChunks is being accessed from 
multiple threads" and the response was:

"how so? I looked at the code and it looks correct".

For example here:

>  I can see only two locations invoking {{{}LocalPool.release{}}}: 
> {{LocalBufferPoolAllocator.release()}} and {{{}localPool.onRemoval{}}}. The 
> former invokes on its own private pool, and has only single ownership 
> semantics, and so should be fine. The latter should only be invoked when the 
> thread is terminating? It could in principle be invoked at another time, but 
> I do not think it is. Chunks are released only via this path, or when there 
> are no references to the chunk from application threads. So, if the above is 
> fine this is likely also fine?

;)

And BTW, that was also *my* initial reaction when I saw the exceptions for the 
first time and required many days until we figured this out. Hence I conclude, 
it was far from obvious and I wanted to make the code a bit more explicit so 
the next time noone has to figure this out again.

Anyway, let me get some time to split this patch into just the essential part 
and the nice-to-have cosmetic / simplification parts, because indeed maybe some 
stuff could be left out.


was (Author: pkolaczk):
> The amount of time is sort of meaningless here, since nobody is looking at 
> it. 

I was referring to a few people to whom I showed the bug initially by saying 
"here, I have the evidence the MicroQueueOfChunks is being accessed from 
multiple threads" and the response was: 

"how so? I looked at the code and it looks correct".

For example here:

>  I can see only two locations invoking {{{}LocalPool.release{}}}: 
> {{LocalBufferPoolAllocator.release()}} and {{{}localPool.onRemoval{}}}. The 
> former invokes on its own private pool, and has only single ownership 
> semantics, and so should be fine. The latter should only be invoked when the 
> thread is terminating? It could in principle be invoked at another time, but 
> I do not think it is. Chunks are released only via this path, or when there 
> are no references to the chunk from application threads. So, if the above is 
> fine this is likely also fine?

;)

And BTW, that was also *my* initial reaction when I saw the exceptions for the 
first time and required many days until we figured this out. Hence I conclude, 
it was far from not obvious and wanted to make the code a bit more explicit so 
the next time noone has to figure this out again.

Anyway, let me get some time to split this patch into just the essential part 
and the nice-to-have cosmetic / simplification parts.

> org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky
> --
>
> Key: CASSANDRA-16681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16681
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Assignee: Piotr Kolaczkowski
>Priority: Normal
> Fix For: 3.11.x, 4.0.x, 4.x
>
> Attachments: 0001-Fix-memoryInUse-counter-in-BufferPool.patch, 
> 0002-Multiple-fixes-in-BufferPool-and-LongBufferPoolTest.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jenkins history:
> [https://jenkins-cm4.apache.org/job/Cassandra-4.0/50/testReport/junit/org.apache.cassandra.utils.memory/LongBufferPoolTest/testPoolAllocateWithRecyclePartially/history/]
> Fails being run in a loop in CircleCI:
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/844/workflows/945011f4-00ac-4678-89f6-5c0db0a40169/jobs/5008
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-16681) org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky

2022-05-31 Thread Piotr Kolaczkowski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544219#comment-17544219
 ] 

Piotr Kolaczkowski edited comment on CASSANDRA-16681 at 5/31/22 8:25 AM:
-

# How is that simpler than an internal check for currentThread? It adds a 
parameter to the public API of LocalPool and moves the burden of correctness on 
the caller.
 # It would make the code analysis harder - it is not immediately obvious the 
correct thread would touch the private non-synchronized stuff. In my patch it 
is obvious.
 # I expect the majority of buffers are returned by the same thread which 
allocated it. So if you disable recycling of buffers invoked from tinypool in 
all cases of indirect returns, it would be overly conservative.

BTW point 3 leads me to thinking - do we even really need it that much to 
recycle at all when we notice an owned buffer is completely free? Why not keep 
it for future allocations from that thread? I can imagine eager recycling could 
get us in some pathological edge-case performance problem if the caller 
requests just one buffer, returns it, requests another, returns etc. In this 
case such buffer would constantly travel between the GlobalPool and LocalPool 
which kinda defeats the purpose of local pools.

 


was (Author: pkolaczk):
# How is that simpler than an internal check for currentThread? It adds a 
parameter to the public API of LocalPool and moves the burden of correctness on 
the caller. 
 # It would make the code analysis harder - it is not immediately obvious the 
correct thread would touch the private non-synchronized stuff. In my patch it 
is obvious.
 # I expect the majority of buffers are returned by the same thread which 
allocated it. So if you disable recycling of buffers invoked from tinypool in 
all cases of indirect returns, it looks overly conservative.

BTW point 3 leads me to thinking - do we even really need it that much to 
recycle at all when we notice an owned buffer is completely free? Why not keep 
it for future allocations from that thread? I can imagine eager recycling could 
get us in some pathological edge-case performance problem if the caller 
requests just one buffer, returns it, requests another, returns etc. In this 
case such buffer would constantly travel between the GlobalPool and LocalPool 
which kinda defeats the purpose of local pools.

 

> org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky
> --
>
> Key: CASSANDRA-16681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16681
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Assignee: Piotr Kolaczkowski
>Priority: Normal
> Fix For: 3.11.x, 4.0.x, 4.x
>
> Attachments: 0001-Fix-memoryInUse-counter-in-BufferPool.patch, 
> 0002-Multiple-fixes-in-BufferPool-and-LongBufferPoolTest.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jenkins history:
> [https://jenkins-cm4.apache.org/job/Cassandra-4.0/50/testReport/junit/org.apache.cassandra.utils.memory/LongBufferPoolTest/testPoolAllocateWithRecyclePartially/history/]
> Fails being run in a loop in CircleCI:
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/844/workflows/945011f4-00ac-4678-89f6-5c0db0a40169/jobs/5008
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-16681) org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky

2022-05-31 Thread Piotr Kolaczkowski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544187#comment-17544187
 ] 

Piotr Kolaczkowski edited comment on CASSANDRA-16681 at 5/31/22 7:43 AM:
-

[~benedict] it is all described in the commit message. It answers exactly your 
questions.

Quoting the relevant part of the commit message for your convenience:

1. LocalPool is meant to be managed by a single thread.
Unfortunately there was a code path that allowed
a thread enter the context of the LocalPool owned by another
thread by attempting to recycle a chunk using its
recycler reference. This led to a situation where unsynchronized
data structures were modified from multiple threads.
The patch ensures LocalPool state is modified by the owning thread
only.

The following sequence triggered the bug:
a) The app asks the BufferPool for a tiny chunk on Thread A.
We get the chunk from the thread local LocalPool A, we set
its recycler to the parent LocalPool and owner to its tinyPool.
b) Eventually the tiny chunk gets evicted. Now its owner is set
to null, but the chunk is not free yet, so nothing more happens.
c) In the meantime buffer(s) of the tiny chunk get(s)
transferred to Thread B on the client side.
d) Eventually the client releases the last buffer of the tiny chunk
by calling BufferPool#put, but it does that on Thread B.
e) BufferPool grabs a reference to the thread local LocalPool B and
calls put on it. So far all correct. We are returning the buffer
to the LocalPool B, owned by Thread B.
f) First we call free to mark appropriate freeSlots of the chunk.
It is the last buffer, so all bits turn to ones (freeSlots == -1).
The chunk has been evicted, so its owner == null.
We're not the owner, so we don't need to remove the chunk from
the micro queue (no need because it is not there) and we can
give it back to the pool we got the chunk from.
So we call chunk.tryRecycle.
g) chunk.tryRecycle realizes freeSlots == -1, CAS succeeds,
now freeSlots == 0 and the chunk needs to be recycled.
h) chunk.recycle gets called and calls chunk.recycler.recycle()
underneath. But the chunk was allocated by thread A,
so chunk.recycler points to LocalPool A owned by Thread A.
Here we enter the context of the wrong thread!
i) chunk.recycler.recycle() gets the parent chunk and its slab
and calls put again. Notice we're calling put the second time now.
We're calling it on a LocalPool A, but we are still on Thread B.
Danger!
j) LocalPool A is the owner of chunk's slab we got the buffer
from in step b). So owner == this.
k) Now, we're additionally unlucky, and it turns out this was
the last buffer of the normal chunk as well.
So we have free = 0 and owner == this, the first condition after
free is true, so we remove the chunk from the micro queue.
The micro queue is managed by Thread A, and we're
doing this on Thread B. Here we crash.

 

BTW: You will not see this path in cassandra-3.x branches, because the 
`recycler` thing and tinypool was added on cassandra-4.0. So the bug in the 
BufferPool applies to 4.0 only.

 

> Chunks are released only via this path, or when there are no references to 
> the chunk from application threads

 

That holds only on 3.x.

In 4.x chunks may be released following the `recycler` reference, which doesn't 
guarantee pointing to the LocalPool of the current thread.


was (Author: pkolaczk):
[~benedict] it is all described in the commit message. It answers exactly your 
questions.

Quoting the relevant part of the commit message for your convenience:

1. LocalPool is meant to be managed by a single thread.
Unfortunately there was a code path that allowed
a thread enter the context of the LocalPool owned by another
thread by attempting to recycle a chunk using its
recycler reference. This led to a situation where unsynchronized
data structures were modified from multiple threads.
The patch ensures LocalPool state is modified by the owning thread
only.

The following sequence triggered the bug:
a) The app asks the BufferPool for a tiny chunk on Thread A.
We get the chunk from the thread local LocalPool A, we set
its recycler to the parent LocalPool and owner to its tinyPool.
b) Eventually the tiny chunk gets evicted. Now its owner is set
to null, but the chunk is not free yet, so nothing more happens.
c) In the meantime buffer(s) of the tiny chunk get(s)
transferred to Thread B on the client side.
d) Eventually the client releases the last buffer of the tiny chunk
by calling BufferPool#put, but it does that on Thread B.
e) BufferPool grabs a reference to the thread local LocalPool B and
calls put on it. So far all correct. We are returning the buffer
to the LocalPool B, owned by Thread B.
f) First we call free to mark appropriate freeSlots of the chunk.
It is the last buffer, so all bits turn to ones (freeSlots == -1).
The chunk has been evicted, so its owner == null.
We'r

[jira] [Comment Edited] (CASSANDRA-16681) org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky

2022-05-10 Thread Piotr Kolaczkowski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534228#comment-17534228
 ] 

Piotr Kolaczkowski edited comment on CASSANDRA-16681 at 5/10/22 9:00 AM:
-

I submitted a patch for 4.0+ version.
Now looking into 3.11 branch, but it turns out it is different and it actually 
doesn't have some of the issues I fixed. 
For instance, 3.11 doesn't seem to have the biggest thread-safety issue in the 
MicroQueueOfChunks caused by Chunk#recycler  reference, because it doesn't have 
the recycler field at all (and no tinyPool either). 
It also doesn't seem to have the memory in use bug addressed by my first patch. 

However, it does seem to rely on the same assumptions in the 
LongBufferPoolTest, so the test is probably flaky there as well. 
Do you think it is worth backporting the relevant fixes to 3.11? 


CC [~stefania_alborghetti] [~jasonstack]


was (Author: pkolaczk):
I submitted a patch for 4.0+ version.
Now looking into 3.11 branch, but it turns out it is different and it actually 
doesn't have some of the issues I fixed. 
For instance, 3.11 doesn't seem to have the biggest thread-safety issue in the 
MicroQueueOfChunks caused by Chunk#recycler  reference, because it doesn't have 
the recycler field at all (and no tinyPool either). 
It also doesn't seem to have the memory in use bug addressed by my first patch. 

CC [~stefania_alborghetti] [~jasonstack]

However, it does seem to rely on the same assumptions in the 
LongBufferPoolTest, so the test is probably flaky there as well. 
Do you think it is worth backporting the relevant fixes to 3.11? 

> org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky
> --
>
> Key: CASSANDRA-16681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16681
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Assignee: Piotr Kolaczkowski
>Priority: Normal
> Fix For: 3.11.x, 4.0.x, 4.x
>
> Attachments: 0001-Fix-memoryInUse-counter-in-BufferPool.patch, 
> 0002-Multiple-fixes-in-BufferPool-and-LongBufferPoolTest.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jenkins history:
> [https://jenkins-cm4.apache.org/job/Cassandra-4.0/50/testReport/junit/org.apache.cassandra.utils.memory/LongBufferPoolTest/testPoolAllocateWithRecyclePartially/history/]
> Fails being run in a loop in CircleCI:
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/844/workflows/945011f4-00ac-4678-89f6-5c0db0a40169/jobs/5008
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-16681) org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky

2022-05-10 Thread Piotr Kolaczkowski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534228#comment-17534228
 ] 

Piotr Kolaczkowski edited comment on CASSANDRA-16681 at 5/10/22 9:00 AM:
-

I submitted a patch for 4.0+ version.
Now looking into 3.11 branch, but it turns out it is different and it actually 
doesn't have some of the issues I fixed. 
For instance, 3.11 doesn't seem to have the biggest thread-safety issue in the 
MicroQueueOfChunks caused by Chunk#recycler  reference, because it doesn't have 
the recycler field at all (and no tinyPool either). 
It also doesn't seem to have the memory in use bug addressed by my first patch. 

CC [~stefania_alborghetti] [~jasonstack]

However, it does seem to rely on the same assumptions in the 
LongBufferPoolTest, so the test is probably flaky there as well. 
Do you think it is worth backporting the relevant fixes to 3.11? 


was (Author: pkolaczk):
I submitted a patch for 4.0+ version.
Now looking into 3.11 branch, but it turns out it is different and it actually 
doesn't have some of the issues I fixed. 
For instance, 3.11 doesn't seem to have the biggest thread-safety issue in the 
MicroQueueOfChunks caused by Chunk#recycler  reference, because it doesn't have 
the recycler field at all (and no tinyPool either). 
It also doesn't seem to have the memory in use bug addressed by my first patch. 

However, it does seem to rely on the same assumptions in the 
LongBufferPoolTest, so the test is probably flaky there as well. 
Do you think it is worth backporting the relevant fixes to 3.11? 

> org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky
> --
>
> Key: CASSANDRA-16681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16681
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Assignee: Piotr Kolaczkowski
>Priority: Normal
> Fix For: 3.11.x, 4.0.x, 4.x
>
> Attachments: 0001-Fix-memoryInUse-counter-in-BufferPool.patch, 
> 0002-Multiple-fixes-in-BufferPool-and-LongBufferPoolTest.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jenkins history:
> [https://jenkins-cm4.apache.org/job/Cassandra-4.0/50/testReport/junit/org.apache.cassandra.utils.memory/LongBufferPoolTest/testPoolAllocateWithRecyclePartially/history/]
> Fails being run in a loop in CircleCI:
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/844/workflows/945011f4-00ac-4678-89f6-5c0db0a40169/jobs/5008
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-16681) org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky

2021-06-22 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367233#comment-17367233
 ] 

Brandon Williams edited comment on CASSANDRA-16681 at 6/22/21, 1:27 PM:


FWIW, I can reproduce this on a plain linux machine running java 8 without any 
modification.


was (Author: brandon.williams):
FWIW, I can reproduce this on a plain linux machine runing java 8 without any 
modification.

> org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky
> --
>
> Key: CASSANDRA-16681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16681
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Assignee: Brandon Williams
>Priority: Normal
> Fix For: 4.0, 4.0-rc
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jenkins history:
> [https://jenkins-cm4.apache.org/job/Cassandra-4.0/50/testReport/junit/org.apache.cassandra.utils.memory/LongBufferPoolTest/testPoolAllocateWithRecyclePartially/history/]
> Fails being run in a loop in CircleCI:
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/844/workflows/945011f4-00ac-4678-89f6-5c0db0a40169/jobs/5008
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-16681) org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky

2021-06-17 Thread Adam Holmberg (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365134#comment-17365134
 ] 

Adam Holmberg edited comment on CASSANDRA-16681 at 6/17/21, 9:26 PM:
-

Looking at this more today. I'm wondering if we should review and maybe move 
forward a patch for the race I described above. As mentioned, I could still get 
failures following that, but what I realized today:
 - Failures go from ~dozens per hundred runs to single digit
 - Even though one mode of flakiness fails the same assertion, I now believe it 
may be arriving there by a different mechanism

Is it worth taking a look at those changes in the interest of getting more 
stability, and possibly spinning out investigation looking for more issues?


was (Author: aholmber):
Looking at this more today. I'm wondering if we should review and commit the 
patch for the race I described above. As mentioned, I could still get failures 
following that, but what I realized today:

- Failures go from ~dozens per hundred runs to single digit
- Even though one mode of flakiness fails the same assertion, I now believe it 
may be arriving there by a different mechanism

Is it worth taking a look at those changes in the interest of getting more 
stability, and possibly spinning out investigation looking for more issues?

> org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky
> --
>
> Key: CASSANDRA-16681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16681
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Assignee: Brandon Williams
>Priority: Normal
> Fix For: 4.0, 4.0-rc
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jenkins history:
> [https://jenkins-cm4.apache.org/job/Cassandra-4.0/50/testReport/junit/org.apache.cassandra.utils.memory/LongBufferPoolTest/testPoolAllocateWithRecyclePartially/history/]
> Fails being run in a loop in CircleCI:
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/844/workflows/945011f4-00ac-4678-89f6-5c0db0a40169/jobs/5008
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-16681) org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky

2021-06-16 Thread Adam Holmberg (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17364353#comment-17364353
 ] 

Adam Holmberg edited comment on CASSANDRA-16681 at 6/16/21, 3:50 PM:
-

I haven't bottomed out on this, and I haven't produced a complete solution thus 
far. I'm going to continue looking at it intermittently as I'm able, but 
meanwhile I'll un-assign to make sure I'm not dissuading anyone else from 
taking it up.


was (Author: aholmber):
I haven't bottomed out on this, but I haven't produced a complete solution thus 
far. I'm going to continue looking at it intermittently as I'm able, but 
meanwhile I'll un-assign to make sure I'm not dissuading anyone else from 
taking it up.

> org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky
> --
>
> Key: CASSANDRA-16681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16681
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Assignee: Brandon Williams
>Priority: Normal
> Fix For: 4.0, 4.0-rc
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jenkins history:
> [https://jenkins-cm4.apache.org/job/Cassandra-4.0/50/testReport/junit/org.apache.cassandra.utils.memory/LongBufferPoolTest/testPoolAllocateWithRecyclePartially/history/]
> Fails being run in a loop in CircleCI:
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/844/workflows/945011f4-00ac-4678-89f6-5c0db0a40169/jobs/5008
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-16681) org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky

2021-05-19 Thread Adam Holmberg (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347872#comment-17347872
 ] 

Adam Holmberg edited comment on CASSANDRA-16681 at 5/19/21, 9:18 PM:
-

I think this is a test issue. I think we have a race 
[here|https://github.com/apache/cassandra/blob/9a432418f2277c40a1fe4b64049688d6354ecdca/test/burn/org/apache/cassandra/utils/memory/LongBufferPoolTest.java#L276-L284]
 where, if threads exit after {{doneThreads}} is sampled, the assumptions 
mentioned in the comment are violated.

I haven't assigned or posted a change because I'm still looking at some other 
weirdness around this test and trying to understand how it was intended to work.

If anyone else wants to take a look and corroborate or just take the ticket, 
it's fine by me.


was (Author: aholmber):
I think this is a test issue. I think we have a race 
[here|https://github.com/apache/cassandra/blob/9a432418f2277c40a1fe4b64049688d6354ecdca/test/burn/org/apache/cassandra/utils/memory/LongBufferPoolTest.java#L276-L284]
 where, if threads exit after doneThreads is sampled, the assumptions mentioned 
in the comment are violated.

I haven't assigned or posted a change because I'm still looking at some other 
weirdness around this test and trying to understand how it was intended to work.

If anyone else wants to take a look and corroborate or just take the ticket, 
it's fine by me.

> org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky
> --
>
> Key: CASSANDRA-16681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16681
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 4.0, 4.0-rc
>
>
> Jenkins history:
> [https://jenkins-cm4.apache.org/job/Cassandra-4.0/50/testReport/junit/org.apache.cassandra.utils.memory/LongBufferPoolTest/testPoolAllocateWithRecyclePartially/history/]
> Fails being run in a loop in CircleCI:
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/844/workflows/945011f4-00ac-4678-89f6-5c0db0a40169/jobs/5008
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org